# Layer normalization layer (Ba et al., 2016).

Source: `R/layers-normalization.R`

`layer_layer_normalization.Rd`

Normalize the activations of the previous layer for each given example in a batch independently, rather than across a batch like Batch Normalization. That is, it applies a transformation that keeps the mean activation within each example close to 0 and the activation standard deviation close to 1.
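To make the contrast with batch normalization concrete, the following base R sketch standardizes a small matrix along each axis. It is illustrative only: the helper `standardize()` and the matrix `x` are hypothetical names, no trainable parameters are involved, and `epsilon` is left out for brevity.

```r
# Toy batch: 3 samples (rows) x 4 features (columns).
x <- matrix(c( 1,  2,  3,  4,
              10, 20, 30, 40,
              -1,  0,  1,  2), nrow = 3, byrow = TRUE)

# Hypothetical helper: zero mean, unit (population) variance.
standardize <- function(v) (v - mean(v)) / sqrt(mean((v - mean(v))^2))

# Layer-normalization style: statistics are computed per sample (per row).
# apply() over rows returns one column per row, so transpose back.
layer_style <- t(apply(x, 1, standardize))

# Batch-normalization style: statistics are computed per feature (per column).
batch_style <- apply(x, 2, standardize)

rowMeans(layer_style)  # approximately 0 for every sample
colMeans(batch_style)  # approximately 0 for every feature
```

Rows 1 and 2 differ only by a factor of 10, so they normalize to the same values under the per-sample scheme; the per-feature scheme does not have that property.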

If `scale` or `center` are enabled, the layer will scale the normalized outputs by broadcasting them with a trainable variable `gamma`, and center the outputs by broadcasting with a trainable variable `beta`. `gamma` will default to a ones tensor and `beta` will default to a zeros tensor, so that centering and scaling are no-ops before training has begun.

So, with scaling and centering enabled, the normalization equations are as follows:

Let the intermediate activations for a mini-batch be the `inputs`.

For each sample `x` in a batch of `inputs`, we compute the mean and variance of the sample, normalize each value in the sample (including a small factor `epsilon` for numerical stability), and finally transform the normalized output by `gamma` and `beta`, which are learned parameters:

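```
# Conceptual sketch: `inputs` is a 2D matrix of shape (batch, features);
# `gamma`, `beta`, and `epsilon` are assumed to be defined already.
outputs <- inputs |> apply(1, function(x) {
  # Population variance (divide by n); var() would divide by n - 1.
  x_normalized <- (x - mean(x)) / sqrt(mean((x - mean(x))^2) + epsilon)
  x_normalized * gamma + beta
}) |> t()  # apply() returns one column per sample, so transpose back
```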
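The computation described above can also be written as a small, runnable base R function. This is a minimal sketch rather than the layer's actual implementation: `layer_norm_ref()` is a hypothetical helper, it assumes a 2D input of shape (batch, features) normalized over the last axis, and it uses the population variance.

```r
# Hypothetical reference helper for a 2D input of shape (batch, features).
layer_norm_ref <- function(inputs, gamma, beta, epsilon = 1e-3) {
  mu <- rowMeans(inputs)                # per-sample mean
  sigma2 <- rowMeans((inputs - mu)^2)   # per-sample population variance
  x_normalized <- (inputs - mu) / sqrt(sigma2 + epsilon)
  # gamma and beta hold one value per feature; sweep() broadcasts over rows.
  x_normalized |> sweep(2, gamma, "*") |> sweep(2, beta, "+")
}

inputs <- matrix(c(0, 2, 4, 6,
                   1, 1, 3, 3), nrow = 2, byrow = TRUE)

# With gamma = 1 and beta = 0 the affine step is a no-op.
outputs <- layer_norm_ref(inputs, gamma = rep(1, 4), beta = rep(0, 4))

rowMeans(outputs)    # approximately 0 for every sample
rowMeans(outputs^2)  # slightly below 1, because epsilon enlarges the denominator
```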

`gamma` and `beta` will span the axes of `inputs` specified in `axis`, and this part of the inputs' shape must be fully defined.

For example:

```
library(keras3)

layer <- layer_layer_normalization(axis = c(2, 3, 4))
layer(op_ones(c(5, 20, 30, 40))) |> invisible() # build()
shape(layer$beta)   # spans axes 2, 3, 4 of the input: (20, 30, 40)
shape(layer$gamma)  # same shape as beta
```

Note that other implementations of layer normalization may choose to define `gamma` and `beta` over a separate set of axes from the axes being normalized across. For example, Group Normalization (Wu et al. 2018) with group size of 1 corresponds to a `layer_layer_normalization()` that normalizes across height, width, and channel and has `gamma` and `beta` span only the channel dimension. So, this `layer_layer_normalization()` implementation will not match a `layer_group_normalization()` layer with group size set to 1.
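The difference lies in the shape of the learned parameters, not in the normalization statistics. The base R sketch below illustrates this for a single sample of shape (height, width, channels); all names and values are hypothetical, and the `gamma` values are chosen arbitrarily to make the broadcasting visible.

```r
# One sample of shape (height, width, channels) = (2, 2, 3).
x <- array(seq_len(12), dim = c(2, 2, 3))

# Both schemes normalize across height, width, and channels together.
x_hat <- (x - mean(x)) / sqrt(mean((x - mean(x))^2) + 1e-3)

# Layer-normalization style, with `axis` spanning height, width, and channels:
# one scale per element, so gamma has the same shape as the sample.
gamma_layer <- array(1, dim = dim(x))
out_layer <- x_hat * gamma_layer

# Group-normalization style with a single group: one scale per channel,
# broadcast over height and width.
gamma_group <- c(1, 2, 3)
out_group <- sweep(x_hat, 3, gamma_group, "*")

length(gamma_layer)  # 12 scale parameters
length(gamma_group)  # 3 scale parameters
```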

## Usage

```
layer_layer_normalization(
  object,
  axis = -1L,
  epsilon = 0.001,
  center = TRUE,
  scale = TRUE,
  rms_scaling = FALSE,
  beta_initializer = "zeros",
  gamma_initializer = "ones",
  beta_regularizer = NULL,
  gamma_regularizer = NULL,
  beta_constraint = NULL,
  gamma_constraint = NULL,
  ...
)
```

## Arguments

- object
Object to compose the layer with. A tensor, array, or sequential model.

- axis
Integer or list. The axis or axes to normalize across. Typically, this is the features axis/axes. The left-out axes are typically the batch axis/axes. `-1` is the last dimension in the input. Defaults to `-1`.

- epsilon
Small float added to variance to avoid dividing by zero. Defaults to 1e-3.

- center
If `TRUE`, add offset of `beta` to normalized tensor. If `FALSE`, `beta` is ignored. Defaults to `TRUE`.

- scale
If `TRUE`, multiply by `gamma`. If `FALSE`, `gamma` is not used. When the next layer is linear (also e.g. `layer_activation_relu()`), this can be disabled since the scaling will be done by the next layer. Defaults to `TRUE`.

- rms_scaling
If `TRUE`, `center` and `scale` are ignored, and the inputs are scaled by `gamma` and the inverse square root of the square of all inputs. This is an approximate and faster approach that avoids ever computing the mean of the input.

- beta_initializer
Initializer for the beta weight. Defaults to zeros.

- gamma_initializer
Initializer for the gamma weight. Defaults to ones.

- beta_regularizer
Optional regularizer for the beta weight. `NULL` by default.

- gamma_regularizer
Optional regularizer for the gamma weight. `NULL` by default.

- beta_constraint
Optional constraint for the beta weight. `NULL` by default.

- gamma_constraint
Optional constraint for the gamma weight. `NULL` by default.

- ...
Base layer keyword arguments (e.g. `name` and `dtype`).
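The role of `epsilon` is easiest to see on a sample whose variance is exactly zero. The following base R sketch uses hypothetical variable names and reproduces only the arithmetic, not the layer itself.

```r
x <- c(5, 5, 5, 5)            # constant sample: the variance is exactly 0
centered <- x - mean(x)
variance <- mean(centered^2)

# Without epsilon the normalization divides 0 by 0.
without_eps <- centered / sqrt(variance)

# With epsilon (here the layer's default of 1e-3) the result stays finite.
with_eps <- centered / sqrt(variance + 1e-3)

without_eps  # NaN for every element
with_eps     # 0 for every element
```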

## Value

The return value depends on the value provided for the first argument. If `object` is:

- a `keras_model_sequential()`, then the layer is added to the sequential model (which is modified in place). To enable piping, the sequential model is also returned, invisibly.
- a `keras_input()`, then the output tensor from calling `layer(input)` is returned.
- `NULL` or missing, then a `Layer` instance is returned.
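The three cases above might look as follows in practice. This is a sketch that assumes the keras3 package and a working backend are installed; the input size of 16, the 8 dense units, and the `epsilon` value are arbitrary choices for illustration.

```r
library(keras3)

# 1. Composed with a sequential model: the layer is appended to the model,
#    and the model itself is returned so that piping can continue.
model <- keras_model_sequential(input_shape = c(16)) |>
  layer_dense(units = 8) |>
  layer_layer_normalization()

# 2. Composed with a symbolic input: the output tensor is returned.
inputs <- keras_input(shape = c(16))
outputs <- inputs |> layer_layer_normalization()

# 3. Called without `object`: a Layer instance is returned,
#    which can be applied to a tensor later.
layer <- layer_layer_normalization(epsilon = 1e-5)
outputs_from_instance <- layer(inputs)
```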

## See also

Other normalization layers: `layer_batch_normalization()`, `layer_group_normalization()`, `layer_spectral_normalization()`, `layer_unit_normalization()`

Other layers: `Layer()`, `layer_activation()`, `layer_activation_elu()`, `layer_activation_leaky_relu()`, `layer_activation_parametric_relu()`, `layer_activation_relu()`, `layer_activation_softmax()`, `layer_activity_regularization()`, `layer_add()`, `layer_additive_attention()`, `layer_alpha_dropout()`, `layer_attention()`, `layer_average()`, `layer_average_pooling_1d()`, `layer_average_pooling_2d()`, `layer_average_pooling_3d()`, `layer_batch_normalization()`, `layer_bidirectional()`, `layer_category_encoding()`, `layer_center_crop()`, `layer_concatenate()`, `layer_conv_1d()`, `layer_conv_1d_transpose()`, `layer_conv_2d()`, `layer_conv_2d_transpose()`, `layer_conv_3d()`, `layer_conv_3d_transpose()`, `layer_conv_lstm_1d()`, `layer_conv_lstm_2d()`, `layer_conv_lstm_3d()`, `layer_cropping_1d()`, `layer_cropping_2d()`, `layer_cropping_3d()`, `layer_dense()`, `layer_depthwise_conv_1d()`, `layer_depthwise_conv_2d()`, `layer_discretization()`, `layer_dot()`, `layer_dropout()`, `layer_einsum_dense()`, `layer_embedding()`, `layer_feature_space()`, `layer_flatten()`, `layer_flax_module_wrapper()`, `layer_gaussian_dropout()`, `layer_gaussian_noise()`, `layer_global_average_pooling_1d()`, `layer_global_average_pooling_2d()`, `layer_global_average_pooling_3d()`, `layer_global_max_pooling_1d()`, `layer_global_max_pooling_2d()`, `layer_global_max_pooling_3d()`, `layer_group_normalization()`, `layer_group_query_attention()`, `layer_gru()`, `layer_hashed_crossing()`, `layer_hashing()`, `layer_identity()`, `layer_integer_lookup()`, `layer_jax_model_wrapper()`, `layer_lambda()`, `layer_lstm()`, `layer_masking()`, `layer_max_pooling_1d()`, `layer_max_pooling_2d()`, `layer_max_pooling_3d()`, `layer_maximum()`, `layer_mel_spectrogram()`, `layer_minimum()`, `layer_multi_head_attention()`, `layer_multiply()`, `layer_normalization()`, `layer_permute()`, `layer_random_brightness()`, `layer_random_contrast()`, `layer_random_crop()`, `layer_random_flip()`, `layer_random_rotation()`, `layer_random_translation()`, `layer_random_zoom()`, `layer_repeat_vector()`, `layer_rescaling()`, `layer_reshape()`, `layer_resizing()`, `layer_rnn()`, `layer_separable_conv_1d()`, `layer_separable_conv_2d()`, `layer_simple_rnn()`, `layer_spatial_dropout_1d()`, `layer_spatial_dropout_2d()`, `layer_spatial_dropout_3d()`, `layer_spectral_normalization()`, `layer_string_lookup()`, `layer_subtract()`, `layer_text_vectorization()`, `layer_tfsm()`, `layer_time_distributed()`, `layer_torch_module_wrapper()`, `layer_unit_normalization()`, `layer_upsampling_1d()`, `layer_upsampling_2d()`, `layer_upsampling_3d()`, `layer_zero_padding_1d()`, `layer_zero_padding_2d()`, `layer_zero_padding_3d()`, `rnn_cell_gru()`, `rnn_cell_lstm()`, `rnn_cell_simple()`, `rnn_cells_stack()`