Wavesplit 2021 #454
base: master
Conversation
mixtures, oracle_s, oracle_ids = batch
b, n_spk, frames = oracle_s.size()

# spk_vectors = self.model.get_speaker_vectors(mixtures)
Here and in the validation steps I use oracle embeddings for now, and no speaker stack.
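For concreteness, a minimal sketch of that idea, using hypothetical names (`spk_embeddings`, `model.separate`) rather than this PR's actual API: the speaker stack is bypassed and the separation stack is conditioned on embeddings looked up from the ground-truth speaker IDs.

import torch
from torch import nn

class OracleEmbeddingTraining(nn.Module):
    def __init__(self, n_speakers, embed_dim=512):
        super().__init__()
        # One learnable embedding per training speaker (the "oracle" table).
        self.spk_embeddings = nn.Embedding(n_speakers, embed_dim)

    def common_step(self, batch, model):
        mixtures, oracle_s, oracle_ids = batch  # (B, T), (B, n_spk, T), (B, n_spk) long
        spk_vectors = self.spk_embeddings(oracle_ids)  # (B, n_spk, embed_dim)
        # No speaker stack: condition the separation stack directly.
        est_sources = model.separate(mixtures, spk_vectors)  # hypothetical call
        return est_sources, oracle_s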
model = Wavesplit(
    conf["masknet"]["n_src"],
    {"embed_dim": 512},
    {"embed_dim": 512, "spk_vec_dim": 512, "n_repeats": 4, "return_all_layers": False},
If anyone wants to experiment with this, here is where you can change the hyperparameters.
nondefault_nsrc:
sample_rate: 8000
mode: min
segment: 1.0
1.0 second, or 0.75 as in the paper, is enough.
I'll review after @JorisCos.
It would be cool if someone could try running the training with the full system rather than oracle embeddings. You can wait for review until the full system has been trained and performance is decent.
It was a very nice surprise to see a new Wavesplit PR for Asteroid, thanks @popcornell.
I made my review with general comments and questions. Aren't we missing the eval script and the tests?
if __name__ == "__main__":
    a = WHAMID(
        "/media/sam/bx500/wavesplit/asteroid/egs/wham/wavesplit/data/wav8k/min/tt", "sep_clean"
    )

    for i in a:
        print(i[-1])
To be removed
if __name__ == "__main__":
    parser = argparse.ArgumentParser("WHAM data preprocessing")
    parser.add_argument(
        "--in_dir", type=str, default=None, help="Directory path of wham including tr, cv and tt"
    )
    parser.add_argument(
        "--out_dir", type=str, default=None, help="Directory path to put output files"
    )
    args = parser.parse_args()
    print(args)
    preprocess(args)
I think we should create a def main(args) at the beginning of the file, put the parser arguments at the beginning as well, and call preprocess inside main(args).
It's more user-friendly: you can directly see the arguments and the function that is called, without scrolling. A sketch is below.
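A sketch of the suggested layout, reusing the existing preprocess function (assumed to be defined in the same file); the parser lives at module level so the arguments are visible at a glance:

import argparse

parser = argparse.ArgumentParser("WHAM data preprocessing")
parser.add_argument(
    "--in_dir", type=str, default=None, help="Directory path of wham including tr, cv and tt"
)
parser.add_argument(
    "--out_dir", type=str, default=None, help="Directory path to put output files"
)

def main(args):
    print(args)
    preprocess(args)

if __name__ == "__main__":
    main(parser.parse_args())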
# exp normalize trick
# with torch.no_grad():
#     b = torch.max(distances, dim=1, keepdim=True)[0]
# out = -distance_utt + b.squeeze(1) - torch.log(torch.exp(-distances + b).sum(1))
# return out.sum(1)
Remove ?
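For reference, a self-contained version of the trick the commented block implements, i.e. a numerically stable -log sum_i exp(-d_i): shift so the largest exponent is zero, then undo the shift outside the log. Note that for exp(-d) the safe shift is the minimum distance, whereas the commented version shifts by the maximum, which can still overflow for large spreads.

import torch

def neg_logsumexp(distances):
    # distances: (batch, n_embeddings, frames); reduce over dim=1.
    m = distances.min(dim=1, keepdim=True)[0]
    out = m.squeeze(1) - torch.log(torch.exp(-(distances - m)).sum(1))
    # PyTorch's built-in stable reduction gives the same result:
    # out = -torch.logsumexp(-distances, dim=1)
    return out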
# testing exp normalize average
# distances = torch.ones((1, 101, 4000))
# with torch.no_grad():
#     b = torch.max(distances, dim=1, keepdim=True)[0]
# out = b.squeeze(1) - torch.log(torch.exp(-distances + b).sum(1))
# out2 = -torch.log(torch.exp(-distances).sum(1))
Remove ?
from kmeans_pytorch import kmeans, kmeans_predict


class Conv1DBlock(nn.Module):
Don't you think we should make Wavesplit part of Asteroid itself, not just the WHAM recipes?
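As an aside, since the snippet above imports kmeans_pytorch: a minimal sketch of how that library is typically used, here for the (assumed) inference-time step of clustering per-frame speaker vectors into one centroid per source. Shapes and names are illustrative, not this PR's API.

import torch
from kmeans_pytorch import kmeans

frames, embed_dim, n_src = 4000, 512, 2
frame_vectors = torch.randn(frames, embed_dim)  # stand-in for speaker-stack outputs

cluster_ids, centroids = kmeans(
    X=frame_vectors, num_clusters=n_src, distance="euclidean", device=torch.device("cpu")
)
# centroids: (n_src, embed_dim); at test time these would condition the separation stack.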
Just letting you know that I am currently working on the recipe to run some experiments.
@JorisCos Does that mean there's a more current version of this branch somewhere? Would be nice to be able to take a look if possible.
It seems to work well with oracle embeddings (the score improved to 18.5 dB on the WSJ-2mix validation set after 50 epochs). But when the two stacks are jointly trained, the separation stack yields almost the same signals as the mixture, and the SI-SDR metric tends toward zero. Could anyone who has tried the complete pipeline please share their results? @popcornell @JorisCos
That's very interesting to know! Do you think the degradation is due to overfitting on the training speaker IDs?
I tried to run this implementation on a dataset with around 60,000 speakers and the speaker stack loss never changed. Could there be a bug somewhere?

(In reply to @popcornell, Jun 5, 2022: "That's very interesting to know! Unfortunately all I have is here on GitHub; maybe Joris has more up-to-date code. Do you think the degradation is due to overfitting on the training speaker IDs? It may be. In the paper they use some things like speaker dropout to mitigate that. WSJ2Mix is small regarding speaker diversity after all, and for reasonable speaker ID extraction you usually need tons of diversity, e.g. VoxCeleb.")
Hi, I also tried to run some experiments with Wavesplit (albeit in our own framework) in the past. I think stagnating training of the speaker stack might result from a couple of things.
But even then, none of this should prevent the model from at least overfitting to the training set.
Do they use shuffling in the paper? It sounds like a very smart thing to do, but they don't seem to use it. There is no shuffling here, and it would be great to add, because it surely prevents the model from being lazy and memorizing the speakers. Most of the code here is also from the first version of the paper, where there were not many augmentations on the speaker stack (no speaker dropout, for example; maybe only Gaussian noise?). I did not implement these augmentations.
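For illustration, a hedged sketch of what the two augmentations discussed above might look like; this is my reading of the idea, not code from the paper or this PR, and the names and dropout probability are made up.

import torch

def shuffle_speaker_labels(oracle_s, oracle_ids):
    # oracle_s: (batch, n_spk, time); oracle_ids: (batch, n_spk).
    # Apply the same random permutation to sources and IDs (one permutation
    # per batch, for simplicity) so the model cannot memorize a fixed output
    # position per speaker.
    perm = torch.randperm(oracle_s.size(1))
    return oracle_s[:, perm], oracle_ids[:, perm]

def speaker_dropout(spk_vectors, p=0.1):
    # spk_vectors: (batch, n_spk, embed_dim); zero out each speaker's vector
    # with probability p, one possible interpretation of "speaker dropout".
    keep = torch.rand(spk_vectors.shape[:2], device=spk_vectors.device) > p
    return spk_vectors * keep.unsqueeze(-1).to(spk_vectors.dtype)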
@lminer did you use VoxCeleb?
If I remember correctly, they also used label shuffling in the paper. In my experiments, I did not use the architecture as proposed in the paper, but a Conv-TasNet as the separation stack (i.e., I added an additional encoder/decoder layer) and reduced the total number of layers. With that setup, I was able to train the model, but it did not improve upon the performance of a plain Conv-TasNet.
I have actually observed the same. Also, according to https://arxiv.org/abs/2202.00733, the use of speaker ID info does in fact not really help.
@popcornell I used my own private dataset.
Should work now with oracle embeddings. I made a separate pull request because it is faster.
See also the previous pull request from last year: #70.
Many thanks to Neil (@lienz) again.
Help from anyone is very welcome, as I am currently very GPU-constrained (and time-constrained).