
Wavesplit 2021 #454 (Open)

popcornell wants to merge 2 commits into master

Conversation

@popcornell (Collaborator) commented Feb 24, 2021

Should work now with oracle embeddings. I made a separate pull request because it is faster.
See also the previous pull request from last year: #70.
Many thanks to Neil (@lienz) again.
Help from anyone is very welcome, as I am currently very GPU-constrained (and time-constrained, too).

mixtures, oracle_s, oracle_ids = batch
b, n_spk, frames = oracle_s.size()

# spk_vectors = self.model.get_speaker_vectors(mixtures)
@popcornell (Collaborator, Author):

Here and in the validation steps I use oracle embeddings for now, with no speaker stack.
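
(For readers skimming the thread, a minimal sketch of what "oracle embeddings" means here, assuming a learnable nn.Embedding table indexed by the oracle speaker IDs; the name spk_embedding_table is illustrative, not the PR's actual attribute:)

import torch.nn as nn

# Hypothetical table of per-speaker vectors, indexed by dataset speaker ID
# (101 speakers in WSJ2Mix, 512-dim vectors as in the model config below).
spk_embedding_table = nn.Embedding(num_embeddings=101, embedding_dim=512)

# Instead of running the speaker stack on the mixture:
# spk_vectors = self.model.get_speaker_vectors(mixtures)
# look the vectors up from the oracle IDs: (b, n_spk) -> (b, n_spk, 512)
spk_vectors = spk_embedding_table(oracle_ids)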

model = Wavesplit(
conf["masknet"]["n_src"],
{"embed_dim": 512},
{"embed_dim": 512, "spk_vec_dim": 512, "n_repeats": 4, "return_all_layers": False},
@popcornell (Collaborator, Author):

If anyone wants to experiment with this, here is where you can change the hyperparameters.

nondefault_nsrc:
sample_rate: 8000
mode: min
segment: 1.0
@popcornell (Collaborator, Author):

1.0 seconds, or 0.75 as in the paper, is enough.

@popcornell added the enhancement (New feature or request) label on Feb 24, 2021
@mpariente (Collaborator):

I'll review after @JorisCos

@popcornell (Collaborator, Author):

It would be cool if someone could try to run the training with the full system rather than with oracle embeddings. The review can wait until the full system has been trained and the performance is decent.

@JorisCos (Collaborator) left a comment:

It was a very nice surprise to see a new Wavesplit PR for Asteroid, thanks @popcornell.
I made my review with general comments and questions. Aren't we missing the eval script and the tests?

Comment on lines +147 to +153
if __name__ == "__main__":
a = WHAMID(
"/media/sam/bx500/wavesplit/asteroid/egs/wham/wavesplit/data/wav8k/min/tt", "sep_clean"
)

for i in a:
print(i[-1])
@JorisCos (Collaborator):

To be removed

Comment on lines +83 to +93
if __name__ == "__main__":
parser = argparse.ArgumentParser("WHAM data preprocessing")
parser.add_argument(
"--in_dir", type=str, default=None, help="Directory path of wham including tr, cv and tt"
)
parser.add_argument(
"--out_dir", type=str, default=None, help="Directory path to put output files"
)
args = parser.parse_args()
print(args)
preprocess(args)
@JorisCos (Collaborator):

I think we should create a def main(args) at the beginning of the file, put the parser arguments at the beginning as well, and call preprocess inside main(args). It's more user friendly: we can see the arguments and the function being called directly, without scrolling.
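
(For instance, a sketch of that layout, reusing the parser from the snippet above; only preprocess is assumed to exist in the current file:)

import argparse

parser = argparse.ArgumentParser("WHAM data preprocessing")
parser.add_argument(
    "--in_dir", type=str, default=None, help="Directory path of wham including tr, cv and tt"
)
parser.add_argument(
    "--out_dir", type=str, default=None, help="Directory path to put output files"
)


def main(args):
    # Single visible entry point: the only thing this script does is preprocess.
    preprocess(args)


if __name__ == "__main__":
    args = parser.parse_args()
    print(args)
    main(args)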

Comment on lines +102 to +107
# exp normalize trick
# with torch.no_grad():
# b = torch.max(distances, dim=1, keepdim=True)[0]
# out = -distance_utt + b.squeeze(1) - torch.log(torch.exp(-distances + b).sum(1))
# return out.sum(1)

@JorisCos (Collaborator):

Remove?
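
(For context, the "exp normalize trick" in that commented-out block is the standard log-sum-exp stabilization. If the computation is kept anywhere, current PyTorch already provides it built in; a minimal equivalent of the max-shift part:)

import torch

def neg_logsumexp(distances):
    # Stable -log(sum_k exp(-distances[:, k, :])) over the speaker dimension,
    # without the manual max-shift of the commented-out block.
    return -torch.logsumexp(-distances, dim=1)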

Comment on lines +224 to +229
# testing exp normalize average
# distances = torch.ones((1, 101, 4000))
# with torch.no_grad():
# b = torch.max(distances, dim=1, keepdim=True)[0]
# out = b.squeeze(1) - torch.log(torch.exp(-distances + b).sum(1))
# out2 = - torch.log(torch.exp(-distances).sum(1))
@JorisCos (Collaborator):

Remove?

from kmeans_pytorch import kmeans, kmeans_predict


class Conv1DBlock(nn.Module):
@JorisCos (Collaborator):

Don't you think we should make Wavesplit part of Asteroid itself, not just the WHAM recipes?

@JorisCos (Collaborator) commented Oct 7, 2021

Just letting you know that I am currently working on the recipe to run some experiments.
Hopefully, the results will be as expected and we will finally merge this 🚀

@lminer commented Oct 15, 2021

@JorisCos Does that mean there's a more current version of this branch somewhere? Would be nice to be able to take a look if possible.

@wangshuo182:

It seems to work well with oracle embeddings (the score improved to 18.5 dB on the WSJ-2mix validation set after 50 epochs). But when the two stacks are jointly trained, the separation stack yields almost the same signals as the mixture, and the SI-SDR metric tends to zero. Could anyone who has tried the complete pipeline please share their results? @popcornell @JorisCos

@popcornell (Collaborator, Author):

That's very interesting to know!
Unfortunately, all I have is here on GitHub. Maybe Joris has more up-to-date code.

Do you think the degradation is due to overfitting on the training speaker IDs?
It may be. In the paper they use techniques like speaker dropout to mitigate that.
WSJ2Mix is small in terms of speaker diversity, after all, and reasonable speaker ID extraction usually requires tons of diversity, e.g. VoxCeleb.
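
(For illustration only, a hypothetical sketch of one simple form of speaker dropout on the speaker vectors; the paper's exact formulation may differ:)

import torch

def speaker_dropout(spk_vectors, p=0.1):
    # With probability p, replace each (b, n_spk, dim) speaker vector with
    # Gaussian noise, so the separation stack cannot rely on memorized IDs.
    drop = torch.rand(spk_vectors.shape[:2], device=spk_vectors.device) < p
    return torch.where(drop.unsqueeze(-1), torch.randn_like(spk_vectors), spk_vectors)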

@lminer commented Jun 5, 2022 via email

@TCord commented Jun 8, 2022

It seems to work well with oracle embeddings (the score improved to 18.5 dB on the WSJ-2mix validation set after 50 epochs). But when the two stacks are jointly trained, the separation stack yields almost the same signals as the mixture, and the SI-SDR metric tends to zero. Could anyone who has tried the complete pipeline please share their results? @popcornell @JorisCos

Hi, I also tried to run some experiments with Wavesplit (albeit in our own framework) in the past. I think a stagnating training of the speaker stack might result from two things:

  1. Missing shuffling of speaker IDs during training:
    The target speaker IDs need to be shuffled every time, so that the model really needs the embeddings of the speaker stack to solve the permutation problem at the output. I haven't checked all of the code, but at least at first glance I have not seen this shuffling of the target IDs (see the sketch after this comment).
  2. The latent dimension of the speaker stack is too high:
    Without dynamic mixing (DM), the WSJ2mix dataset has 101 speakers, while the latent dimension is set to 256 in the corresponding paper. So, without DM (which yields 285 speakers from WSJ0+WSJ1), there is no information bottleneck that would force the model to generalize to unseen speakers. Even then, the generalization tricks that @popcornell mentioned (centroid dropout / mixup) should still be necessary for the model to really generalize to unseen speakers.
    Reducing the latent dimension to e.g. 64 and seeing whether it learns something then might be worth trying.

But even then, all of this should not prevent the model from at least overfitting to the training set.
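
(A minimal sketch of the shuffling in point 1 above, assuming oracle sources oracle_s of shape (batch, n_spk, frames) and matching oracle_ids of shape (batch, n_spk), as in the training step earlier in this PR; the function name is illustrative:)

import torch

def shuffle_speaker_targets(oracle_s, oracle_ids):
    # Draw an independent speaker permutation per batch element and apply it
    # to sources and IDs jointly, so the target order is no longer predictable.
    b, n_spk = oracle_ids.shape
    perms = torch.stack([torch.randperm(n_spk) for _ in range(b)])  # (b, n_spk)
    batch_idx = torch.arange(b).unsqueeze(1)                        # (b, 1)
    return oracle_s[batch_idx, perms], oracle_ids[batch_idx, perms]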

@popcornell (Collaborator, Author) commented Jun 8, 2022

Do they use shuffling in the paper? It sounds like a very smart thing to do, but they don't seem to use it. There is no shuffling here, and it would be great to add, because it surely prevents the model from being lazy and memorizing the speakers.
Your point on the dimension of the embedding is valid. I don't think the dynamic mixing they describe in the paper also uses WSJ1. If it does, the results in the paper are not comparable with previous works, since they use additional data (then one might as well throw in LibriMix, which has more speaker diversity!). If it uses only WSJ0, then you always have 101 speakers, even with DM. This exacerbates the problem for sure.

Most of the code here is also from the first version of the paper, where there were not many augmentations on the speaker stack (no speaker dropout, for example; maybe only Gaussian noise?). I did not implement these augmentations.
Maybe Neil Zeghidour still has some other hidden tricks to make the model generalize better. @TCord were you able to successfully replicate it to some decent degree?

@popcornell (Collaborator, Author):

@lminer did you use VoxCeleb?

@TCord commented Jun 8, 2022

If I remember correctly, they also used label shuffling in the paper. In my experiments, I did not use the architecture proposed in the paper, but a Conv-TasNet as the separation stack (i.e., I added an additional encoder/decoder layer) and reduced the total number of layers. With this I was able to train the model, but it did not improve upon the performance of a Conv-TasNet.
My conjecture was that a sample-wise resolution, as in the paper, is necessary to obtain good results and provides the most significant improvements. I think the DPRNN papers also showed that choosing a very small window size and frame advance in the encoder further improves separation on anechoic data.
By employing an additional speaker stack jointly with a Conv-TasNet, the permutation problem could be solved through the speaker stack, but it did not improve separation performance over a plain Conv-TasNet. As training the full-size Wavesplit model without any additional encoder/decoder layer takes a massive amount of GPU memory, I dropped these experiments afterwards.

@popcornell (Collaborator, Author):

I have observed the same, actually. Also, according to https://arxiv.org/abs/2202.00733, the use of speaker ID info does in fact not really help.

@lminer commented Jun 8, 2022

@popcornell I used my own private dataset.
