First, we process the WAV file through a chain of functions to obtain a log-mel spectrogram that we feed into the model.
You can find very good explanations of audio signal processing for machine learning on the YouTube channel Audio Signal Processing for Machine Learning.
The Spectrogram helper class is called with parameters describing the audio signal:
class Spectrogram(nn.Module):
    def __init__(self, n_fft=2048, hop_length=None, win_length=None,
                 window='hann', center=True, pad_mode='reflect', power=2.0,
                 freeze_parameters=True):
n_fft = window size applied to each frame of the sound signal
hop_length = number of samples the window slides along the sound signal between successive frames
window = window function applied ('hann')
This class calls the short-time Fourier transform (STFT) to compute and return the spectrogram.
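The signature above matches the torchlibrosa implementation of Spectrogram; assuming that package, a minimal usage sketch could look like the following (hop_length=512 and the input length are illustrative choices, not values taken from the model):

import torch
from torchlibrosa.stft import Spectrogram

spectrogram_extractor = Spectrogram(n_fft=2048, hop_length=512, win_length=2048,
                                    window='hann', center=True, pad_mode='reflect',
                                    power=2.0, freeze_parameters=True)

waveform = torch.randn(4, 32000)        # (batch_size, samples), e.g. 1 s of audio at 32 kHz
spec = spectrogram_extractor(waveform)  # (batch_size, 1, time_steps, n_fft // 2 + 1)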
Figure 1: Example of a feature-extraction process based on the short-time Fourier transform (STFT). The power spectral density P_s(m, k) of each signal frame s(m, k) is computed using the STFT. The feature vector x_m is constructed by extracting six components from the lower subband (15.1375–20.508 Hz) and six components from the upper subband (38.574–84.961 Hz). The discrete frequencies are obtained with a sampling frequency of 250 Hz and a frame length of 512 samples.
↓
Spectrogram
↓
The spectrogram is then passed to LogmelFilterBank to obtain a log-mel spectrogram:
class LogmelFilterBank(nn.Module):
    # excerpt: the mel filter bank is built and stored as a (frozen) parameter
    self.melW = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels,
                                    fmin=fmin, fmax=fmax).T
    self.melW = nn.Parameter(torch.Tensor(self.melW))
power_to_db(self, input)  # uses ref=1.0, amin=1e-10, top_db=80.0
power_to_db converts a power spectrogram (amplitude squared) to decibel (dB) units. This computes the scaling 10 * log10(S / ref) in a numerically stable way.
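As an illustration, here is a minimal sketch of the log-mel computation described above, with assumed example values for sr, n_mels, fmin and fmax (the actual class wraps this in an nn.Module and keeps melW as a frozen parameter):

import torch
import librosa

sr, n_fft, n_mels, fmin, fmax = 32000, 2048, 64, 50, 14000   # assumed example values
ref, amin, top_db = 1.0, 1e-10, 80.0

melW = torch.Tensor(librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels,
                                        fmin=fmin, fmax=fmax).T)   # (n_fft // 2 + 1, n_mels)

def power_to_db(mel_spec):
    # 10 * log10(S / ref), clamped for numerical stability, then limited to top_db
    log_spec = 10.0 * torch.log10(torch.clamp(mel_spec, min=amin))
    log_spec -= 10.0 * torch.log10(torch.tensor(max(amin, ref)))
    return torch.clamp(log_spec, min=log_spec.max().item() - top_db)

spec = torch.rand(4, 1, 63, n_fft // 2 + 1)       # power spectrogram from the previous step
logmel = power_to_db(torch.matmul(spec, melW))    # (batch_size, 1, time_steps, n_mels)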
Logmel spectrogram
↓
Batch Normalization 2D
↓
Optional SpecAugmentation (not used in our case)
SpecAugmentation(
    time_drop_width=64,
    time_stripes_num=2,
    freq_drop_width=8,
    freq_stripes_num=2)
Reference paper: SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition.
SpecAugmentation instantiates a time dropper and a frequency dropper; the actual masking is performed by the DropStripes module, which zeroes out random stripes along the time and frequency axes.
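For intuition, a simplified sketch of the stripe-dropping idea is shown below (this is not the exact library code: the real module draws stripes per example and only applies them in training mode):

import torch

def drop_stripes(x, dim, drop_width, stripes_num):
    # x: (batch_size, channels, time_steps, mel_bins); dim=2 -> time, dim=3 -> frequency
    total_width = x.shape[dim]
    for _ in range(stripes_num):
        distance = torch.randint(low=0, high=drop_width, size=(1,)).item()
        begin = torch.randint(low=0, high=total_width - distance, size=(1,)).item()
        if dim == 2:
            x[:, :, begin:begin + distance, :] = 0.0
        else:
            x[:, :, :, begin:begin + distance] = 0.0
    return x

x = torch.rand(4, 1, 500, 64)                              # (batch_size, 1, time_steps, mel_bins)
x = drop_stripes(x, dim=2, drop_width=64, stripes_num=2)   # time stripes
x = drop_stripes(x, dim=3, drop_width=8, stripes_num=2)    # frequency stripes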
↓
At this point we have x of shape (batch_size, 1, time_steps, mel_bins) and frames_num.
↓
The second dimension of x is expanded from 1 to 3 channels and x is fed to DenseNet121, which is used as a feature extractor (a sketch of this step is shown after the frequency aggregation below).
DenseNet121 reference paper: Densely Connected Convolutional Networks
Figure 2: DenseNet architecture
↓
After passing through DenseNet, the frequency axis is aggregated by taking the mean:
# Aggregate in frequency axis
x = torch.mean(x, dim=3)
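Putting the channel expansion, DenseNet feature extraction and frequency aggregation together, a minimal sketch could look like this (assuming torchvision's densenet121 features; the pretrained weights actually used are not shown here):

import torch
import torchvision

densenet_features = torchvision.models.densenet121(weights=None).features   # feature extractor only

x = torch.rand(4, 1, 500, 64)      # (batch_size, 1, time_steps, mel_bins)
x = x.expand(-1, 3, -1, -1)        # expand the channel dimension from 1 to 3
x = densenet_features(x)           # (batch_size, 1024, time_steps', mel_bins')
x = torch.mean(x, dim=3)           # aggregate in frequency axis -> (batch_size, 1024, time_steps')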
↓
Max and average pooling are applied along the time axis, and their outputs are added:
x1 = F.max_pool1d(x, kernel_size=3, stride=1, padding=1)
x2 = F.avg_pool1d(x, kernel_size=3, stride=1, padding=1)
x = x1 + x2
↓
Dropout p = 0.5
↓
FC1
↓
ReLU
↓
Dropout p = 0.5
↓
Attention Block: classification for each segment
(clipwise_output, norm_att, segmentwise_output) = self.att_block(x)
(source: Non-local Neural Networks)
Weights are initialized with Xavier initialization. The attention block uses 1D convolution layers with a kernel size of 1 (att() and cla()). It then outputs the self-attention map and the classification:
def forward(self, x):
    # x: (n_samples, n_in, n_time)
    # calculate the self-attention map along the time axis:
    norm_att = torch.softmax(torch.tanh(self.att(x)), dim=-1)
    # calculate the segment-wise classification result:
    cla = self.nonlinear_transform(self.cla(x))
    # attention aggregation is performed to get the clip-wise prediction:
    x = torch.sum(norm_att * cla, dim=2)
    return x, norm_att, cla
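Based on this description, one way such an attention block could be assembled is sketched below; the sigmoid used as nonlinear_transform and the input/output sizes are assumptions for illustration:

import torch
import torch.nn as nn

class AttBlock(nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        # 1D convolutions with kernel size 1 act as per-time-step linear layers
        self.att = nn.Conv1d(n_in, n_out, kernel_size=1, stride=1, padding=0, bias=True)
        self.cla = nn.Conv1d(n_in, n_out, kernel_size=1, stride=1, padding=0, bias=True)
        nn.init.xavier_uniform_(self.att.weight)
        nn.init.xavier_uniform_(self.cla.weight)

    def nonlinear_transform(self, x):
        return torch.sigmoid(x)

    def forward(self, x):
        # x: (n_samples, n_in, n_time)
        norm_att = torch.softmax(torch.tanh(self.att(x)), dim=-1)
        cla = self.nonlinear_transform(self.cla(x))
        x = torch.sum(norm_att * cla, dim=2)
        return x, norm_att, cla

att_block = AttBlock(n_in=1024, n_out=24)   # n_out = number of classes (example value)
clipwise_output, norm_att, segmentwise_output = att_block(torch.rand(4, 1024, 15))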
Each segment prediction stands for several frame predictions, so interpolate() and pad_framewise_output() help to recover the frame-level time resolution (not used in our case).
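As an illustration, here is a sketch of what such helpers typically do (assumed implementations, not necessarily the exact code used):

import torch

def interpolate(x, ratio):
    # x: (batch_size, time_steps, classes_num) -> (batch_size, time_steps * ratio, classes_num)
    batch_size, time_steps, classes_num = x.shape
    upsampled = x[:, :, None, :].repeat(1, 1, ratio, 1)
    return upsampled.reshape(batch_size, time_steps * ratio, classes_num)

def pad_framewise_output(framewise_output, frames_num):
    # repeat the last frame so the output covers exactly frames_num frames
    pad = framewise_output[:, -1:, :].repeat(1, frames_num - framewise_output.shape[1], 1)
    return torch.cat((framewise_output, pad), dim=1)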
↓
The clip-wise prediction is saved in the output dictionary.