Considerations for a speech enhancement model #502
Replies: 4 comments 12 replies
-
Thanks for your detailed post.
Even the single_src functions?
What do you mean exactly? Compared to having a time-domain loss? What is the order of magnitude of "faster"?
Could you find the error message so that we can try to fix it? We don't have extensive testing on it, and it would be great if it were usable.
Great! Which mask are you using exactly, and what is your loss function then? Do you still input only the magnitude to the masker?
This is an unexpected result. I suspect it might be due to only the magnitude being fed to the network.
Is this true for all maskers? Have you tried overfitting on a very small part of the data? Regarding your dataset, I'm interested in the clean speech and noise generation. Have you validated that it helps the enhancement performance compared to using only the 40 hrs of clean speech/noise pairs? If yes, do you have numbers? FYI, I think there is some high sample rate noise here, but I'm not sure about its quality.
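To make the overfitting check concrete, something along these lines is what I have in mind (purely illustrative: the random tensors and the linear layer are stand-ins for your actual dataset and masker):

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Dummy (mixture, clean) pairs standing in for your real dataset.
full_set = TensorDataset(torch.randn(64, 1600), torch.randn(64, 1600))
tiny_set = Subset(full_set, list(range(8)))     # keep only a handful of examples
loader = DataLoader(tiny_set, batch_size=4, shuffle=True)

model = torch.nn.Linear(1600, 1600)             # placeholder for your network
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

for epoch in range(500):                        # train until the loss collapses
    for mix, clean in loader:
        opt.zero_grad()
        loss_fn(model(mix), clean).backward()
        opt.step()
```

If your real model cannot drive the training loss close to zero on a handful of examples, I'd suspect the model/loss wiring rather than the amount or quality of the data.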
-
FWIW, I've been using Asteroid to train 24 kHz models for a while now and it works great. I've mostly been using DCUNet, DCCRN, and CRUSE (not yet contributed to Asteroid). Can you say why you lack clean speech data? There are multiple open and free 44/48 kHz clean speech datasets, and noise datasets as well.
-
By the way, I believe none of your issues has anything to do with sample rate; maybe we should adapt the title accordingly so as not to confuse future readers.
-
Firstly, thank you for providing such a useful library. Having datasets, models, and recipes that share terminology and building blocks can only benefit the source separation and enhancement fields.
I am particularly interested in high sample rate (44.1/48 kHz) applications and have been trying to adapt the existing models with varying success. For my application, both the speech and the noise signals are of interest. As a starting point, I've decided to recreate a model by Wichern and Lukin (paper), which seems relatively simple.
My understanding of it (a rough code sketch follows the list):
Encoder: STFT (2048 Hann window, 512 stride)
Masker: The magnitude of the spectrum feeds two layers of 256-unit BLSTM, producing a speech magnitude mask with a sigmoid activation. The noise mask is the difference (1 - speech mask).
Decoder: The masked magnitudes are combined with the input phase, and then ISTFT back to audio.
Loss: MSE of the masked magnitude spectrum, plus a tunable crosstalk parameter.
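Concretely, here is a rough sketch of what I have in mind, in plain PyTorch rather than Asteroid's encoder/masker/decoder classes. The tensor shapes, and especially my reading of the crosstalk term, are assumptions on my part rather than the paper's exact formulation:

```python
import torch
import torch.nn as nn

N_FFT, HOP = 2048, 512
WINDOW = torch.hann_window(N_FFT)


class MagMaskEnhancer(nn.Module):
    def __init__(self, n_freq=N_FFT // 2 + 1, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(n_freq, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_freq)

    def forward(self, wav):
        # Encoder: STFT with a 2048-point Hann window and a 512-sample stride.
        spec = torch.stft(wav, N_FFT, hop_length=HOP, window=WINDOW,
                          return_complex=True)            # (batch, freq, frames)
        mag, phase = spec.abs(), spec.angle()

        # Masker: magnitude -> 2-layer, 256-unit BLSTM -> sigmoid speech mask;
        # the noise mask is constrained to be the difference (1 - speech mask).
        x, _ = self.blstm(mag.transpose(1, 2))            # (batch, frames, freq) in
        speech_mask = torch.sigmoid(self.proj(x)).transpose(1, 2)
        noise_mask = 1.0 - speech_mask

        # Decoder: masked magnitudes recombined with the noisy phase, then ISTFT.
        est_mags = (speech_mask * mag, noise_mask * mag)
        wavs = [torch.istft(torch.polar(m, phase), N_FFT, hop_length=HOP,
                            window=WINDOW, length=wav.shape[-1]) for m in est_mags]
        return torch.stack(wavs, dim=1), est_mags


def masked_mag_loss(est_mags, ref_wavs, alpha=0.1):
    # MSE between each masked magnitude and its reference magnitude, plus a
    # crosstalk term weighted by alpha. The sign and exact form of the crosstalk
    # term are my interpretation: here it pushes each estimate away from the
    # *other* source's reference magnitude.
    ref_mags = [torch.stft(ref_wavs[:, i], N_FFT, hop_length=HOP, window=WINDOW,
                           return_complex=True).abs() for i in range(2)]
    mse = nn.functional.mse_loss
    fidelity = mse(est_mags[0], ref_mags[0]) + mse(est_mags[1], ref_mags[1])
    crosstalk = mse(est_mags[0], ref_mags[1]) + mse(est_mags[1], ref_mags[0])
    return fidelity - alpha * crosstalk


# Usage on dummy 48 kHz data:
model = MagMaskEnhancer()
noisy = torch.randn(4, 48000)                     # batch of 1 s mixtures
(speech_and_noise, est_mags) = model(noisy)
loss = masked_mag_loss(est_mags, torch.randn(4, 2, 48000))
```

Constraining the noise mask to 1 - speech mask is what I meant by "the noise mask is the difference": the two magnitude estimates always sum back to the mixture magnitude.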
One particular challenge is that I am unable to get separated training data in the target domain, so the speech and noise signals have been created using another separation model. I am doing some things to mitigate this:
What I've learnt:
What I'm looking to improve:
Restricting the noise mask to be the difference signal has improved the separation, but it's still not perfect. There seems to be a roughness in the speech output, and the noise output is still modulated by the speech to some extent. I have a few ideas of where to go, but I'd really welcome some thoughts and suggestions from the experts!