
Poor results with Korean language dataset - Is Enhance limited to English? #44

Open
jumoney-git opened this issue Aug 21, 2024 · 5 comments


@jumoney-git

Hello,

I'm a beginner with AI, and this is my first attempt at training the Resemble-AI Enhance model.

I've recently completed training using a dataset of 1 million Korean voice samples. However, when I ran my trained model, the results were extremely disappointing.

This leads me to ask:

  1. Is it possible that Enhance doesn't work well with Korean, despite being trained on Korean data?
  2. Is Enhance designed to work only with English?

I would greatly appreciate any insights or guidance on this matter. If Enhance is indeed language-specific, it would be helpful to know if there are plans to support other languages in the future.

Thank you for your time and assistance.

@jaehyun-ko

  1. Which database did you use for fg/bg/rir respectively?
  2. How does it sound when you only listen to the denoised result?

@jumoney-git

Hello. It's so nice to meet a fellow Korean!
First of all, thank you so much for taking an interest in my question.

  1. Which database did you use for fg/bg/rir respectively?
    FG: I used https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=208, taking 100,000 audio files of voice data with lengths ranging from 1 to 3 seconds.
    RIR: I used https://github.com/RoyJames/room-impulse-responses from the example in the README.md, converting all wav files to npy files.
    BG: I used noise files from MUSAN.
  2. How does it sound when you only listen to the denoised result?
    At 1,000,000 steps, the noise was removed well in about 8 out of 10 samples.
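For reference, the wav-to-npy RIR conversion mentioned above could be sketched roughly as below. This is a minimal sketch, not code from the Enhance repo: it assumes 16-bit PCM mono WAV files, and the normalization and output layout are my own choices.

```python
# Hedged sketch: convert a directory of RIR .wav files to .npy arrays.
# Assumes 16-bit PCM mono WAVs; paths and scaling are illustrative.
import wave
from pathlib import Path

import numpy as np


def wav_to_npy(wav_path: Path, out_dir: Path) -> Path:
    """Read one 16-bit PCM mono WAV and save it as float32 in [-1, 1]."""
    with wave.open(str(wav_path), "rb") as w:
        frames = w.readframes(w.getnframes())
    audio = np.frombuffer(frames, dtype=np.int16).astype(np.float32) / 32768.0
    out_path = out_dir / (wav_path.stem + ".npy")
    np.save(out_path, audio)
    return out_path


def convert_dir(wav_dir: Path, out_dir: Path) -> int:
    """Convert every .wav in wav_dir; returns the number of files written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    return sum(1 for p in sorted(wav_dir.glob("*.wav")) if wav_to_npy(p, out_dir))
```

A real RIR set may contain multi-channel or 24-bit files, so it is worth validating a few converted arrays by ear or by plotting before training.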

If you have any good tips for creating a Korean-specific model, I would greatly appreciate your help.

[The denoise stage log]

Reading hparams from config/denoiser.yaml
Found 100000 audio files in data/fg
Found 99990 foreground files and 930 background files
Found 10 foreground files and 930 background files
Train set: 99990 samples - Val set: 10 samples
Unable to find latest file at runs/denoiser/ds/G/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.

Training from step 0 to step 1000000

{
"step": 999998,
"G/losses/l1": 0.02123,
"G/loss": 0.02123,
"G/lr": 1e-05,
"G/grad_norm": 0.02765241637825966,
"elapsed_time": 0.3925
}
{
"step": 999999,
"G/losses/l1": 0.03506,
"G/loss": 0.03506,
"G/lr": 1e-05,
"G/grad_norm": 0.014176322147250175,
"elapsed_time": 0.3913
}
{
"step": 1000000,
"G/losses/l1": 0.04795,
"G/loss": 0.04795,
"G/lr": 1e-05,
"G/grad_norm": 0.020992033183574677,
"elapsed_time": 0.3915
}
Saved checkpoint to runs/denoiser/ds/G
Training finished
Saved checkpoint to runs/denoiser/ds/G

@jaehyun-ko

I am also conducting experiments with a similar dataset (48 kHz input). It seems necessary to monitor the convergence of Stage 1/2 during training; I have added wandb progress logging in my fork of the repository. When training with more challenging LPF/BPF conditions than the original, it seems to learn well.

It may also be necessary to check whether mean(Z) in the LCFM converges to 0 properly.
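A mean(Z) check like the one above could be tracked with a small monitor during training. This is a hedged sketch, not code from either repo: the class name, the EMA smoothing, and the decay value are all my own illustrative choices, and `z` stands in for whatever latent tensor LCFM produces.

```python
# Hedged sketch: track whether mean(Z) drifts toward 0 during training.
# The EMA smoothing and decay value are illustrative, not from the repo.
import numpy as np


class LatentMeanMonitor:
    """Exponential moving average of mean(Z), updated once per step."""

    def __init__(self, decay: float = 0.99):
        self.decay = decay
        self.ema = None  # no estimate until the first update

    def update(self, z: np.ndarray) -> float:
        """Fold the batch's mean(Z) into the running estimate and return it."""
        m = float(z.mean())
        self.ema = m if self.ema is None else self.decay * self.ema + (1 - self.decay) * m
        return self.ema
```

Logging the returned value each step (e.g., alongside the existing wandb metrics) makes it easy to see whether the latent mean settles near 0 or drifts.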

Adding a Korean sample:

step_00030000_010

@jaehyun-ko

I would also like to ask whether you have experimented only with the denoiser, or whether you have results from the other stages as well.

The datasets I am using are OLKAVS(Clean Audio Only), MUSAN(Noise), and BUT(RIR).

@jaehyun-ko

jaehyun-ko commented Sep 13, 2024

The Resemble-Enhance model itself seems to produce good results, so some configs may be mismatched in your reproduction experiment.
All discriminators should converge to Real == Fake (a value of 1, 1).
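That convergence criterion could be checked with a quick pass over logged discriminator scores. A minimal sketch, with the function name and tolerance being my own illustrative choices: "converged" here means the discriminator scores real and fake samples near the same value (~1), i.e., it can no longer tell them apart.

```python
# Hedged sketch: check whether logged discriminator outputs have converged
# to Real == Fake == ~1. The tolerance is illustrative, not from the repo.
def discriminators_converged(real_scores, fake_scores, tol=0.1):
    """True if both average scores sit within tol of 1.0."""
    avg_real = sum(real_scores) / len(real_scores)
    avg_fake = sum(fake_scores) / len(fake_scores)
    return abs(avg_real - 1.0) < tol and abs(avg_fake - 1.0) < tol
```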
