A preprocessing layer to convert raw audio signals to Mel spectrograms.
Source: R/layers-preprocessing.R
layer_mel_spectrogram.Rd
This layer takes a float32/float64 single or batched audio signal as input and computes the Mel spectrogram using the Short-Time Fourier Transform and Mel scaling. The input should be a 1D (unbatched) or 2D (batched) tensor representing audio signals. The output will be a 2D or 3D tensor representing Mel spectrograms.
A spectrogram is an image-like representation that shows the frequency spectrum of a signal over time. It uses the x-axis to represent time, the y-axis to represent frequency, and the value of each pixel to represent intensity. Mel spectrograms are a special type of spectrogram that use the mel scale, which approximates how humans perceive pitch. They are commonly used in speech and music processing tasks such as speech recognition, speaker identification, and music genre classification.
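To make the mel scale concrete, here is a minimal base R sketch of one common Hz-to-mel convention (the HTK-style formula). It is an assumption for illustration only; this page does not state which formula the layer uses internally, and the helper names hz_to_mel and mel_to_hz are hypothetical.

```r
# HTK-style mel conversion: one common convention, assumed here for
# illustration. The layer's internal formula may differ.
hz_to_mel <- function(f_hz) 2595 * log10(1 + f_hz / 700)
mel_to_hz <- function(m) 700 * (10^(m / 2595) - 1)

# Values mirroring the defaults listed under Usage.
min_freq <- 20
max_freq <- 16000 / 2        # sampling_rate / 2
num_mel_bins <- 128

# Points spaced evenly on the mel scale, mapped back to Hz.
mel_points <- seq(hz_to_mel(min_freq), hz_to_mel(max_freq),
                  length.out = num_mel_bins)
hz_points <- mel_to_hz(mel_points)

# Even spacing in mel means progressively wider spacing in Hz,
# so low frequencies get finer resolution than high ones.
range(diff(hz_points))
```

Under this convention, equal steps in mel correspond to widening steps in Hz, which is why mel bins are dense at low frequencies and sparse at high ones.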
Usage
layer_mel_spectrogram(
object,
fft_length = 2048L,
sequence_stride = 512L,
sequence_length = NULL,
window = "hann",
sampling_rate = 16000L,
num_mel_bins = 128L,
min_freq = 20,
max_freq = NULL,
power_to_db = TRUE,
top_db = 80,
mag_exp = 2,
min_power = 1e-10,
ref_power = 1,
...
)
Arguments
- object
Object to compose the layer with. A tensor, array, or sequential model.
- fft_length
Integer, size of the FFT window.
- sequence_stride
Integer, number of samples between successive STFT columns.
- sequence_length
Integer, size of the window used for applying window to each audio frame. If NULL, defaults to fft_length.
- window
String, name of the window function to use. Available values are "hann" and "hamming". If window is a tensor, it will be used directly as the window and its length must be sequence_length. If window is NULL, no windowing is used. Defaults to "hann".
- sampling_rate
Integer, sample rate of the input signal.
- num_mel_bins
Integer, number of mel bins to generate.
- min_freq
Float, minimum frequency of the mel bins.
- max_freq
Float, maximum frequency of the mel bins. If NULL, defaults to sampling_rate / 2.
- power_to_db
If TRUE, convert the power spectrogram to decibels.
- top_db
Float, minimum negative cut-off: the output is thresholded at max(10 * log10(S)) - top_db.
- mag_exp
Float, exponent for the magnitude spectrogram. 1 for magnitude, 2 for power, etc. Default is 2.
- min_power
Float, minimum value for power and ref_power.
- ref_power
Float, the power is scaled relative to it: 10 * log10(S / ref_power).
- ...
For forward/backward compatibility.
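The decibel-related arguments (min_power, ref_power, top_db) interact as the formulas above suggest. The following base R sketch shows one plausible reading of those formulas; it is not the layer's actual implementation, and power_to_db_sketch is a hypothetical helper name.

```r
# A sketch of power-to-decibel scaling built from the formulas in the
# argument descriptions; assumed behaviour, not the layer's source code.
power_to_db_sketch <- function(S, ref_power = 1, min_power = 1e-10,
                               top_db = 80) {
  # Floor both the power values and the reference so log10() never sees 0.
  S_db <- 10 * log10(pmax(S, min_power)) -
    10 * log10(pmax(ref_power, min_power))
  # Clip everything more than top_db below the peak.
  pmax(S_db, max(S_db) - top_db)
}

S <- c(1, 0.1, 1e-12)
power_to_db_sketch(S)
# approximately 0, -10, -80: the smallest value is floored to min_power
# (-100 dB) and then clipped to 80 dB below the peak.
```

With top_db = 80 the dynamic range of the result never exceeds 80 dB, which keeps near-silent regions from dominating the value range.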
Value
The return value depends on the value provided for the first argument. If object is:
- a keras_model_sequential(), then the layer is added to the sequential model (which is modified in place). To enable piping, the sequential model is also returned, invisibly.
- a keras_input(), then the output tensor from calling layer(input) is returned.
- NULL or missing, then a Layer instance is returned.
Examples
Unbatched audio signal
layer <- layer_mel_spectrogram(
num_mel_bins = 64,
sampling_rate = 8000,
sequence_stride = 256,
fft_length = 2048
)
layer(random_uniform(shape = c(16000))) |> shape()
Batched audio signal
layer <- layer_mel_spectrogram(
num_mel_bins = 80,
sampling_rate = 8000,
sequence_stride = 128,
fft_length = 2048
)
layer(random_uniform(shape = c(2, 16000))) |> shape()
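The number of time frames in the shapes above depends on the signal length and sequence_stride. The base R sketch below estimates it under the assumption that the STFT pads the signal by fft_length %/% 2 samples on each side so that frames are centered; this padding behaviour is an assumption not stated on this page, and n_frames_sketch is a hypothetical helper.

```r
# Frame-count arithmetic for a centered STFT (assumed padding scheme).
n_frames_sketch <- function(n_samples, fft_length, sequence_stride) {
  pad <- fft_length %/% 2                  # assumed padding on each side
  padded <- n_samples + 2 * pad
  1 + (padded - fft_length) %/% sequence_stride
}

# Settings from the two examples above.
n_frames_sketch(16000, fft_length = 2048, sequence_stride = 256)
n_frames_sketch(16000, fft_length = 2048, sequence_stride = 128)
```

Halving sequence_stride roughly doubles the number of frames, trading compute and memory for finer time resolution.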
See also
Other preprocessing layers: layer_category_encoding()
layer_center_crop()
layer_discretization()
layer_feature_space()
layer_hashed_crossing()
layer_hashing()
layer_integer_lookup()
layer_normalization()
layer_random_brightness()
layer_random_contrast()
layer_random_crop()
layer_random_flip()
layer_random_rotation()
layer_random_translation()
layer_random_zoom()
layer_rescaling()
layer_resizing()
layer_string_lookup()
layer_text_vectorization()
Other layers: Layer()
layer_activation()
layer_activation_elu()
layer_activation_leaky_relu()
layer_activation_parametric_relu()
layer_activation_relu()
layer_activation_softmax()
layer_activity_regularization()
layer_add()
layer_additive_attention()
layer_alpha_dropout()
layer_attention()
layer_average()
layer_average_pooling_1d()
layer_average_pooling_2d()
layer_average_pooling_3d()
layer_batch_normalization()
layer_bidirectional()
layer_category_encoding()
layer_center_crop()
layer_concatenate()
layer_conv_1d()
layer_conv_1d_transpose()
layer_conv_2d()
layer_conv_2d_transpose()
layer_conv_3d()
layer_conv_3d_transpose()
layer_conv_lstm_1d()
layer_conv_lstm_2d()
layer_conv_lstm_3d()
layer_cropping_1d()
layer_cropping_2d()
layer_cropping_3d()
layer_dense()
layer_depthwise_conv_1d()
layer_depthwise_conv_2d()
layer_discretization()
layer_dot()
layer_dropout()
layer_einsum_dense()
layer_embedding()
layer_feature_space()
layer_flatten()
layer_flax_module_wrapper()
layer_gaussian_dropout()
layer_gaussian_noise()
layer_global_average_pooling_1d()
layer_global_average_pooling_2d()
layer_global_average_pooling_3d()
layer_global_max_pooling_1d()
layer_global_max_pooling_2d()
layer_global_max_pooling_3d()
layer_group_normalization()
layer_group_query_attention()
layer_gru()
layer_hashed_crossing()
layer_hashing()
layer_identity()
layer_integer_lookup()
layer_jax_model_wrapper()
layer_lambda()
layer_layer_normalization()
layer_lstm()
layer_masking()
layer_max_pooling_1d()
layer_max_pooling_2d()
layer_max_pooling_3d()
layer_maximum()
layer_minimum()
layer_multi_head_attention()
layer_multiply()
layer_normalization()
layer_permute()
layer_random_brightness()
layer_random_contrast()
layer_random_crop()
layer_random_flip()
layer_random_rotation()
layer_random_translation()
layer_random_zoom()
layer_repeat_vector()
layer_rescaling()
layer_reshape()
layer_resizing()
layer_rnn()
layer_separable_conv_1d()
layer_separable_conv_2d()
layer_simple_rnn()
layer_spatial_dropout_1d()
layer_spatial_dropout_2d()
layer_spatial_dropout_3d()
layer_spectral_normalization()
layer_string_lookup()
layer_subtract()
layer_text_vectorization()
layer_tfsm()
layer_time_distributed()
layer_torch_module_wrapper()
layer_unit_normalization()
layer_upsampling_1d()
layer_upsampling_2d()
layer_upsampling_3d()
layer_zero_padding_1d()
layer_zero_padding_2d()
layer_zero_padding_3d()
rnn_cell_gru()
rnn_cell_lstm()
rnn_cell_simple()
rnn_cells_stack()