Layer normalization layer (Ba et al., 2016).
Source: R/layers-normalization.R
Normalize the activations of the previous layer for each given example in a batch independently, rather than across a batch like Batch Normalization. That is, the layer applies a transformation that maintains the mean activation within each example close to 0 and the activation standard deviation close to 1.
If scale or center are enabled, the layer will scale the normalized
outputs by broadcasting them with a trainable variable gamma, and center
the outputs by broadcasting with a trainable variable beta. gamma will
default to a ones tensor and beta will default to a zeros tensor, so that
centering and scaling are no-ops before training has begun.
With scaling and centering enabled, the normalization equations are as follows.
Let the intermediate activations for a mini-batch be the inputs.
For each sample x in a batch of inputs, we compute the mean and
variance of the sample, normalize each value in the sample
(including a small factor epsilon for numerical stability),
and finally,
transform the normalized output by gamma and beta,
which are learned parameters:
outputs <- inputs |> apply(1, function(x) {
  x_normalized <- (x - mean(x)) /
                  sqrt(var(x) + epsilon)
  x_normalized * gamma + beta
})

gamma and beta will span the axes of inputs specified in axis, and
this part of the inputs' shape must be fully defined.
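
The per-sample computation described above can be sketched in base R, without Keras. This is a minimal illustration rather than the layer's implementation: the helper name layer_norm_rows is hypothetical, gamma and beta are held at their initial values (ones and zeros), and the variance is taken as the population variance (dividing by n), whereas R's var() divides by n - 1.

```r
# Hypothetical helper: normalize each row (sample) of a matrix over its
# last axis, then apply gamma and beta.
layer_norm_rows <- function(inputs, gamma, beta, epsilon = 1e-3) {
  normalize_one <- function(x) {
    mu <- mean(x)
    sigma2 <- mean((x - mu)^2)  # population variance (assumption, see above)
    (x - mu) / sqrt(sigma2 + epsilon) * gamma + beta
  }
  # apply() returns one column per row of `inputs`, so transpose back
  t(apply(inputs, 1, normalize_one))
}

set.seed(1)
inputs <- matrix(rnorm(12, mean = 5, sd = 3), nrow = 3)  # 3 samples, 4 features
gamma <- rep(1, ncol(inputs))  # initial value of gamma: ones
beta <- rep(0, ncol(inputs))   # initial value of beta: zeros

outputs <- layer_norm_rows(inputs, gamma, beta)
rowMeans(outputs)    # each sample's mean is numerically zero
rowMeans(outputs^2)  # each sample's variance is just below 1 because of epsilon
```

Each row is normalized using only its own statistics, so the result for one sample does not depend on which other samples share the batch.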
For example, with axis = c(2, 3, 4) the weights take the shape of axes 2 to 4 of the input:
layer <- layer_layer_normalization(axis = c(2, 3, 4))
layer(op_ones(c(5, 20, 30, 40))) |> invisible() # build()
shape(layer$beta)
shape(layer$gamma)

Note that other implementations of layer normalization may choose to define
gamma and beta over a separate set of axes from the axes being
normalized across. For example, Group Normalization
(Wu et al. 2018) with group size of 1
corresponds to a layer_layer_normalization() that normalizes across height, width,
and channel and has gamma and beta span only the channel dimension.
So, this layer_layer_normalization() implementation will not match a
layer_group_normalization() layer with group size set to 1.
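
To make the difference concrete, the following base R sketch normalizes a single height x width x channel sample across all three axes and then applies gamma in two ways: spanning every normalized axis (as this layer does) and spanning only the channel axis (as described above for Group Normalization). The array sizes and gamma values are made up for illustration, beta is omitted for brevity, and no Keras code is involved.

```r
set.seed(42)
x <- array(rnorm(2 * 2 * 3), dim = c(2, 2, 3))  # one sample: height, width, channel
epsilon <- 1e-3

# Normalize across height, width, and channel together
mu <- mean(x)
sigma2 <- mean((x - mu)^2)
x_normalized <- (x - mu) / sqrt(sigma2 + epsilon)

# gamma spanning all normalized axes: one value per element
gamma_full <- array(seq_len(12) / 10, dim = c(2, 2, 3))
out_full <- x_normalized * gamma_full

# gamma spanning only the channel axis: one value per channel,
# broadcast over height and width with sweep()
gamma_channel <- c(0.5, 1, 2)
out_channel <- sweep(x_normalized, 3, gamma_channel, "*")

length(gamma_full)     # 12 scale values
length(gamma_channel)  # 3 scale values
```

In this sketch the two variants share the same normalized values and differ only in how many scale parameters they carry, which is why their outputs can diverge once gamma moves away from its initial value of ones.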
Usage
layer_layer_normalization(
  object,
  axis = -1L,
  epsilon = 0.001,
  center = TRUE,
  scale = TRUE,
  rms_scaling = FALSE,
  beta_initializer = "zeros",
  gamma_initializer = "ones",
  beta_regularizer = NULL,
  gamma_regularizer = NULL,
  beta_constraint = NULL,
  gamma_constraint = NULL,
  ...
)
Arguments
- object
Object to compose the layer with. A tensor, array, or sequential model.
- axis
Integer or list. The axis or axes to normalize across. Typically, this is the features axis/axes. The left-out axes are typically the batch axis/axes. -1 is the last dimension in the input. Defaults to -1.
- epsilon
Small float added to variance to avoid dividing by zero. Defaults to 1e-3.
- center
If TRUE, add offset of beta to normalized tensor. If FALSE, beta is ignored. Defaults to TRUE.
- scale
If TRUE, multiply by gamma. If FALSE, gamma is not used. When the next layer is linear (also e.g. layer_activation_relu()), this can be disabled since the scaling will be done by the next layer. Defaults to TRUE.
- rms_scaling
If TRUE, center and scale are ignored, and the inputs are scaled by gamma and the inverse square root of the square of all inputs. This is an approximate and faster approach that avoids ever computing the mean of the input. Note that this isn't equivalent to the computation that the layer_rms_normalization() layer performs. Defaults to FALSE.
- beta_initializer
Initializer for the beta weight. Defaults to zeros.
- gamma_initializer
Initializer for the gamma weight. Defaults to ones.
- beta_regularizer
Optional regularizer for the beta weight. NULL by default.
- gamma_regularizer
Optional regularizer for the gamma weight. NULL by default.
- beta_constraint
Optional constraint for the beta weight. NULL by default.
- gamma_constraint
Optional constraint for the gamma weight. NULL by default.
- ...
Base layer keyword arguments (e.g. name and dtype).
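
The role of epsilon is easiest to see on a sample whose values are all equal, so that its variance is exactly zero. The base R sketch below uses a hypothetical helper, normalize_sample, and the population variance; it illustrates the arithmetic only and is not the layer's implementation.

```r
# Hypothetical helper: normalize one sample with a given epsilon
normalize_sample <- function(x, epsilon) {
  mu <- mean(x)
  sigma2 <- mean((x - mu)^2)  # population variance
  (x - mu) / sqrt(sigma2 + epsilon)
}

constant_sample <- c(2, 2, 2, 2)  # variance is exactly 0

normalize_sample(constant_sample, epsilon = 0)     # 0 / 0: NaN for every element
normalize_sample(constant_sample, epsilon = 1e-3)  # 0 / sqrt(0.001): all zeros
```

A larger epsilon makes the division safer but also shrinks the normalized values of low-variance samples, so it trades numerical stability against exactness of the unit-variance target.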
Value
The return value depends on the value provided for the first argument.
If object is:
- a keras_model_sequential(), then the layer is added to the sequential model (which is modified in place). To enable piping, the sequential model is also returned, invisibly.
- a keras_input(), then the output tensor from calling layer(input) is returned.
- NULL or missing, then a Layer instance is returned.
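
The three cases can be sketched as follows. This assumes the keras3 package and a working backend are installed; the input shape of 10 features is arbitrary and chosen only for illustration.

```r
library(keras3)

# 1. object is a sequential model: the layer is added in place and the
#    model is returned invisibly, so the pipe can continue
model <- keras_model_sequential(input_shape = c(10)) |>
  layer_layer_normalization()

# 2. object is a keras_input(): the output tensor is returned
input <- keras_input(shape = c(10))
output <- input |> layer_layer_normalization()

# 3. object is missing: a Layer instance is returned, to be called later
layer <- layer_layer_normalization()
output2 <- layer(input)
```

The third form is convenient when the same layer instance, and therefore the same gamma and beta weights, should be applied to more than one input.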
See also
Other normalization layers: layer_batch_normalization() layer_group_normalization() layer_rms_normalization() layer_spectral_normalization() layer_unit_normalization()
Other layers: Layer() layer_activation() layer_activation_elu() layer_activation_leaky_relu() layer_activation_parametric_relu() layer_activation_relu() layer_activation_softmax() layer_activity_regularization() layer_add() layer_additive_attention() layer_alpha_dropout() layer_attention() layer_aug_mix() layer_auto_contrast() layer_average() layer_average_pooling_1d() layer_average_pooling_2d() layer_average_pooling_3d() layer_batch_normalization() layer_bidirectional() layer_category_encoding() layer_center_crop() layer_concatenate() layer_conv_1d() layer_conv_1d_transpose() layer_conv_2d() layer_conv_2d_transpose() layer_conv_3d() layer_conv_3d_transpose() layer_conv_lstm_1d() layer_conv_lstm_2d() layer_conv_lstm_3d() layer_cropping_1d() layer_cropping_2d() layer_cropping_3d() layer_cut_mix() layer_dense() layer_depthwise_conv_1d() layer_depthwise_conv_2d() layer_discretization() layer_dot() layer_dropout() layer_einsum_dense() layer_embedding() layer_equalization() layer_feature_space() layer_flatten() layer_flax_module_wrapper() layer_gaussian_dropout() layer_gaussian_noise() layer_global_average_pooling_1d() layer_global_average_pooling_2d() layer_global_average_pooling_3d() layer_global_max_pooling_1d() layer_global_max_pooling_2d() layer_global_max_pooling_3d() layer_group_normalization() layer_group_query_attention() layer_gru() layer_hashed_crossing() layer_hashing() layer_identity() layer_integer_lookup() layer_jax_model_wrapper() layer_lambda() layer_lstm() layer_masking() layer_max_num_bounding_boxes() layer_max_pooling_1d() layer_max_pooling_2d() layer_max_pooling_3d() layer_maximum() layer_mel_spectrogram() layer_minimum() layer_mix_up() layer_multi_head_attention() layer_multiply() layer_normalization() layer_permute() layer_rand_augment() layer_random_brightness() layer_random_color_degeneration() layer_random_color_jitter() layer_random_contrast() layer_random_crop() layer_random_erasing() layer_random_flip() layer_random_gaussian_blur() 
layer_random_grayscale() layer_random_hue() layer_random_invert() layer_random_perspective() layer_random_posterization() layer_random_rotation() layer_random_saturation() layer_random_sharpness() layer_random_shear() layer_random_translation() layer_random_zoom() layer_repeat_vector() layer_rescaling() layer_reshape() layer_resizing() layer_rms_normalization() layer_rnn() layer_separable_conv_1d() layer_separable_conv_2d() layer_simple_rnn() layer_solarization() layer_spatial_dropout_1d() layer_spatial_dropout_2d() layer_spatial_dropout_3d() layer_spectral_normalization() layer_stft_spectrogram() layer_string_lookup() layer_subtract() layer_text_vectorization() layer_tfsm() layer_time_distributed() layer_torch_module_wrapper() layer_unit_normalization() layer_upsampling_1d() layer_upsampling_2d() layer_upsampling_3d() layer_zero_padding_1d() layer_zero_padding_2d() layer_zero_padding_3d() rnn_cell_gru() rnn_cell_lstm() rnn_cell_simple() rnn_cells_stack()