
Questions about the model performance #3

Open
YiyuLuo opened this issue Jul 23, 2019 · 9 comments

Comments


YiyuLuo commented Jul 23, 2019

Hi! Thank you very much for your great work!
I've also been working on this paper recently, but training a model exactly as described in the paper takes a long time, and I haven't gotten good results so far.
I noticed that you changed some layers of the network. I'm also wondering whether it's possible to build a smaller model while keeping the performance. Could you please tell me how your modified model performs?

@RemiRigal

Hi @YiyuLuo,

I'm not the author of this repository, but I'm currently implementing a PyTorch version of the network described in the paper.

Regarding the size of the model, I think it is reasonable to shrink some layers. An interesting part of the paper is the ablation study (Table 6): it shows that some parts of the model, such as the fully connected layers, contribute little to the final performance.
Considering only the magnitude mask for the audio is also quite relevant: it halves the input size, and the performance loss is small.
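Not from this repo, but as a rough illustration of the factor-of-2 point above: a complex mask needs two channels (real and imaginary) per time-frequency bin, while a magnitude mask needs only one. A minimal NumPy sketch, where the 100-frame, 257-bin spectrogram shape is an illustrative assumption, not a value from any implementation:

```python
import numpy as np

# Illustrative STFT-like complex spectrogram: (time, freq).
rng = np.random.default_rng(0)
spec = rng.standard_normal((100, 257)) + 1j * rng.standard_normal((100, 257))

# Complex-mask representation: stack real and imaginary parts -> 2 channels.
complex_feats = np.stack([spec.real, spec.imag], axis=-1)   # shape (100, 257, 2)

# Magnitude-only representation: a single channel.
magnitude_feats = np.abs(spec)[..., np.newaxis]             # shape (100, 257, 1)

# The magnitude representation is half the size of the complex one.
print(complex_feats.size // magnitude_feats.size)  # → 2
```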

That said, I think that reducing the three FC layers to a single FC layer of 100 units (as done by @mayurnewase) may not be sufficient to retain the full complexity of the output masks.
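For a concrete sense of the trade-off, here is a back-of-the-envelope parameter count for the two FC-head configurations. The input and output dimensions (400 BLSTM features in, a 257 × 2 × 2 complex mask for 2 speakers out) are illustrative assumptions, not values taken from this repository:

```python
def fc_params(dims):
    """Total weights + biases for a chain of fully connected layers
    whose successive widths are given by `dims`."""
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))

blstm_out = 400           # assumed BLSTM output width
mask_out = 257 * 2 * 2    # freq bins * (real, imag) * 2 speakers = 1028

paper_head = fc_params([blstm_out, 600, 600, 600, mask_out])
small_head = fc_params([blstm_out, 100, mask_out])

print(f"three 600-unit FC layers: {paper_head:,} params")  # → 1,579,628
print(f"one 100-unit FC layer:    {small_head:,} params")  # → 143,928
```

The single 100-unit layer cuts the head's parameters by roughly a factor of 10, which is exactly why it risks being too small a bottleneck for the masks.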

@YiyuLuo
Author

YiyuLuo commented Nov 9, 2019

> I think that reducing the three FC layers to a single FC layer of 100 units (as done by @mayurnewase) may not be sufficient to retain the full complexity of the output masks.

Thanks for your reply!
I tried dropping the three FC layers in an audio-only model with 2 speakers. However, the performance was poor, unlike the results reported in the paper.

@RemiRigal

> I tried dropping the three FC layers in an audio-only model with 2 speakers. However, the performance was poor, unlike the results reported in the paper.

What was the size of your three FC layers?

@YiyuLuo
Author

YiyuLuo commented Nov 12, 2019

> What was the size of your three FC layers?

The same as in the paper: 600 units each.

@RemiRigal

> The same as in the paper: 600 units each.

How much of the AVSpeech dataset did you use? I don't get results as good as those in the paper, but they are quite satisfying, and I use a lighter model trained on only 15% of their dataset.

@YiyuLuo
Author

YiyuLuo commented Nov 13, 2019

> How much of the AVSpeech dataset did you use?

Due to some policy reasons, the AVSpeech dataset is not available to me. I used the GRID dataset instead, about 20,000 speech clips in total.

@RemiRigal

RemiRigal commented Nov 14, 2019

> Due to some policy reasons, the AVSpeech dataset is not available to me. I used the GRID dataset instead, about 20,000 speech clips in total.

I'm still able to download the AVSpeech dataset from this page. Is the website unavailable for you?

@YiyuLuo
Author

YiyuLuo commented Nov 14, 2019

The website itself is available, but mainland China can't access YouTube.

@saarthak-kapse

> I'm not the author of this repository, but I'm currently implementing a PyTorch version of the network described in the paper.

Hey, I am also using PyTorch but I'm not able to get good results. Can you help me? Could I get your Gmail address so that I can discuss it with you?
