CLVP checkpoint? #1

Open
PiotrDabkowski opened this issue Jul 25, 2022 · 8 comments

Comments

@PiotrDabkowski

Thanks for the great project! I think it can be super useful, and if some papers pick it up and show it works well it can become the new FID for Audio :)

Would it be possible to upload the CLVP checkpoint?

Thanks!

@xanguera

Hi, any update on this?

@neonbjb
Owner

neonbjb commented Jan 29, 2023

Hey there, CLVP is the same one that is used in github.com/tortoise-tts

I uploaded a copy of that here: https://huggingface.co/jbetker/tts-scores-clvp/tree/main

@xanguera

Thanks a lot @neonbjb for such a quick answer. I got the CLVP model from your Hugging Face link, but it does not appear to be the checkpoint this code expects. I am getting the error below.

    cv_metric = CLVPMetric(device='cpu')
  File "/Users/xanguera/software/tts-scores/.venv/lib/python3.10/site-packages/tts_scores/clvp.py", line 359, in __init__
    self.model.load_state_dict(sd)
  File "/Users/xanguera/software/tts-scores/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for CLVP:
	Missing key(s) in state_dict: "text_pos_emb.weight", "text_transformer.layers.layers.0.0.scale", "text_transformer.layers.layers.0.0.fn.norm.weight", "text_transformer.layers.layers.0.0.fn.norm.bias", "text_transformer.layers.layers.0.0.fn.fn.to_qkv.weight", "text_transformer.layers.layers.0.0.fn.fn.to_out.0.weight", "text_transformer.layers.layers.0.0.fn.fn.to_out.0.bias", "text_transformer.layers.layers.0.1.scale", "text_transformer.layers.layers.0.1.fn.norm.weight", "text_transformer.layers.layers.0.1.fn.norm.bias", "text_transformer.layers.layers.0.1.fn.fn.net.0.weight", "text_transformer.layers.layers.0.1.fn.fn.net.0.bias", "text_transformer.layers.layers.0.1.fn.fn.net.3.weight", "text_transformer.layers.layers.0.1.fn.fn.net.3.bias", "text_transformer.layers.layers.1.0.scale", "text_transformer.layers.layers.1.0.fn.norm.weight", "text_transformer.layers.layers.1.0.fn.norm.bias", "text_transformer.layers.layers.1.0.fn.fn.to_qkv.weight", "text_transformer.layers.layers.1.0.fn.fn.to_out.0.weight", "text_transformer.layers.layers.1.0.fn.fn.to_out.0.bias", "text_transformer.layers.layers.1.1.scale", "text_transformer.layers.layers.1.1.fn.norm.weight", "text_transformer.layers.layers.1.1.fn.norm.bias", "text_transformer.layers.layers.1.1.fn.fn.net.0.weight", "text_transformer.layers.layers.1.1.fn.fn.net.0.bias", "text_transformer.layers.layers.1.1.fn.fn.net.3.weight", "text_transformer.layers.layers.1.1.fn.fn.net.3.bias", "text_transformer.layers.layers.2.0.scale", "text_transformer.layers.layers.2.0.fn.norm.weight", "text_transformer.layers.layers.2.0.fn.norm.bias", "text_transformer.layers.layers.2.0.fn.fn.to_qkv.weight", "text_transformer.layers.layers.2.0.fn.fn.to_out.0.weight", "text_transformer.layers.layers.2.0.fn.fn.to_out.0.bias", "text_transformer.layers.layers.2.1.scale", "text_transformer.layers.layers.2.1.fn.norm.weight", "text_transformer.layers.layers.2.1.fn.norm.bias", "text_transformer.layers.layers.2.1.fn.fn.net.0.weight", 
"text_transformer.layers.layers.2.1.fn.fn.net.0.bias", "text_transformer.layers.layers.2.1.fn.fn.net.3.weight", "text_transformer.layers.layers.2.1.fn.fn.net.3.bias", "text_transformer.layers.layers.3.0.scale", "text_transformer.layers.layers.3.0.fn.norm.weight", "text_transformer.layers.layers.3.0.fn.norm.bias", "text_transformer.layers.layers.3.0.fn.fn.to_qkv.weight", "text_transformer.layers.layers.3.0.fn.fn.to_out.0.weight", "text_transformer.layers.layers.3.0.fn.fn.to_out.0.bias", "text_transformer.layers.layers.3.1.scale", "text_transformer.layers.layers.3.1.fn.norm.weight", "text_transformer.layers.layers.3.1.fn.norm.bias", "text_transformer.layers.layers.3.1.fn.fn.net.0.weight", "text_transformer.layers.layers.3.1.fn.fn.net.0.bias", "text_transformer.layers.layers.3.1.fn.fn.net.3.weight", "text_transformer.layers.layers.3.1.fn.fn.net.3.bias", "text_transformer.layers.layers.4.0.scale", "text_transformer.layers.layers.4.0.fn.norm.weight", "text_transformer.layers.layers.4.0.fn.norm.bias", "text_transformer.layers.layers.4.0.fn.fn.to_qkv.weight", "text_transformer.layers.layers.4.0.fn.fn.to_out.0.weight", "text_transformer.layers.layers.4.0.fn.fn.to_out.0.bias", "text_transformer.layers.layers.4.1.scale", "text_transformer.layers.layers.4.1.fn.norm.weight", "text_transformer.layers.layers.4.1.fn.norm.bias", "text_transformer.layers.layers.4.1.fn.fn.net.0.weight", "text_transformer.layers.layers.4.1.fn.fn.net.0.bias", "text_transformer.layers.layers.4.1.fn.fn.net.3.weight", "text_transformer.layers.layers.4.1.fn.fn.net.3.bias", "text_transformer.layers.layers.5.0.scale", "text_transformer.layers.layers.5.0.fn.norm.weight", "text_transformer.layers.layers.5.0.fn.norm.bias", "text_transformer.layers.layers.5.0.fn.fn.to_qkv.weight", "text_transformer.layers.layers.5.0.fn.fn.to_out.0.weight", "text_transformer.layers.layers.5.0.fn.fn.to_out.0.bias", "text_transformer.layers.layers.5.1.scale", "text_transformer.layers.layers.5.1.fn.norm.weight", 
"text_transformer.layers.layers.5.1.fn.norm.bias", "text_transformer.layers.layers.5.1.fn.fn.net.0.weight", "text_transformer.layers.layers.5.1.fn.fn.net.0.bias", "text_transformer.layers.layers.5.1.fn.fn.net.3.weight", "text_transformer.layers.layers.5.1.fn.fn.net.3.bias", "text_transformer.layers.layers.6.0.scale", "text_transformer.layers.layers.6.0.fn.norm.weight", "text_transformer.layers.layers.6.0.fn.norm.bias", "text_transformer.layers.layers.6.0.fn.fn.to_qkv.weight", "text_transformer.layers.layers.6.0.fn.fn.to_out.0.weight", "text_transformer.layers.layers.6.0.fn.fn.to_out.0.bias", "text_transformer.layers.layers.6.1.scale", "text_transformer.layers.layers.6.1.fn.norm.weight", "text_transformer.layers.layers.6.1.fn.norm.bias", "text_transformer.layers.layers.6.1.fn.fn.net.0.weight", "text_transformer.layers.layers.6.1.fn.fn.net.0.bias", "text_transformer.layers.layers.6.1.fn.fn.net.3.weight", "text_transformer.layers.layers.6.1.fn.fn.net.3.bias", "text_transformer.layers.layers.7.0.scale", "text_transformer.layers.layers.7.0.fn.norm.weight", "text_transformer.layers.layers.7.0.fn.norm.bias", "text_transformer.layers.layers.7.0.fn.fn.to_qkv.weight", "text_transformer.layers.layers.7.0.fn.fn.to_out.0.weight", "text_transformer.layers.layers.7.0.fn.fn.to_out.0.bias", "text_transformer.layers.layers.7.1.scale", "text_transformer.layers.layers.7.1.fn.norm.weight", "text_transformer.layers.layers.7.1.fn.norm.bias", "text_transformer.layers.layers.7.1.fn.fn.net.0.weight", "text_transformer.layers.layers.7.1.fn.fn.net.0.bias", "text_transformer.layers.layers.7.1.fn.fn.net.3.weight", "text_transformer.layers.layers.7.1.fn.fn.net.3.bias", "speech_enc.weight", "speech_enc.bias", "speech_pos_emb.weight", "speech_transformer.layers.layers.0.0.scale", "speech_transformer.layers.layers.0.0.fn.norm.weight", "speech_transformer.layers.layers.0.0.fn.norm.bias", "speech_transformer.layers.layers.0.0.fn.fn.to_qkv.weight", 
"speech_transformer.layers.layers.0.0.fn.fn.to_out.0.weight", "speech_transformer.layers.layers.0.0.fn.fn.to_out.0.bias", "speech_transformer.layers.layers.0.1.scale", "speech_transformer.layers.layers.0.1.fn.norm.weight", "speech_transformer.layers.layers.0.1.fn.norm.bias", "speech_transformer.layers.layers.0.1.fn.fn.net.0.weight", "speech_transformer.layers.layers.0.1.fn.fn.net.0.bias", "speech_transformer.layers.layers.0.1.fn.fn.net.3.weight", "speech_transformer.layers.layers.0.1.fn.fn.net.3.bias", "speech_transformer.layers.layers.1.0.scale", "speech_transformer.layers.layers.1.0.fn.norm.weight", "speech_transformer.layers.layers.1.0.fn.norm.bias", "speech_transformer.layers.layers.1.0.fn.fn.to_qkv.weight", "speech_transformer.layers.layers.1.0.fn.fn.to_out.0.weight", "speech_transformer.layers.layers.1.0.fn.fn.to_out.0.bias", "speech_transformer.layers.layers.1.1.scale", "speech_transformer.layers.layers.1.1.fn.norm.weight", "speech_transformer.layers.layers.1.1.fn.norm.bias", "speech_transformer.layers.layers.1.1.fn.fn.net.0.weight", "speech_transformer.layers.layers.1.1.fn.fn.net.0.bias", "speech_transformer.layers.layers.1.1.fn.fn.net.3.weight", "speech_transformer.layers.layers.1.1.fn.fn.net.3.bias", "speech_transformer.layers.layers.2.0.scale", "speech_transformer.layers.layers.2.0.fn.norm.weight", "speech_transformer.layers.layers.2.0.fn.norm.bias", "speech_transformer.layers.layers.2.0.fn.fn.to_qkv.weight", "speech_transformer.layers.layers.2.0.fn.fn.to_out.0.weight", "speech_transformer.layers.layers.2.0.fn.fn.to_out.0.bias", "speech_transformer.layers.layers.2.1.scale", "speech_transformer.layers.layers.2.1.fn.norm.weight", "speech_transformer.layers.layers.2.1.fn.norm.bias", "speech_transformer.layers.layers.2.1.fn.fn.net.0.weight", "speech_transformer.layers.layers.2.1.fn.fn.net.0.bias", "speech_transformer.layers.layers.2.1.fn.fn.net.3.weight", "speech_transformer.layers.layers.2.1.fn.fn.net.3.bias", "speech_transformer.layers.layers.3.0.scale", 
"speech_transformer.layers.layers.3.0.fn.norm.weight", "speech_transformer.layers.layers.3.0.fn.norm.bias", "speech_transformer.layers.layers.3.0.fn.fn.to_qkv.weight", "speech_transformer.layers.layers.3.0.fn.fn.to_out.0.weight", "speech_transformer.layers.layers.3.0.fn.fn.to_out.0.bias", "speech_transformer.layers.layers.3.1.scale", "speech_transformer.layers.layers.3.1.fn.norm.weight", "speech_transformer.layers.layers.3.1.fn.norm.bias", "speech_transformer.layers.layers.3.1.fn.fn.net.0.weight", "speech_transformer.layers.layers.3.1.fn.fn.net.0.bias", "speech_transformer.layers.layers.3.1.fn.fn.net.3.weight", "speech_transformer.layers.layers.3.1.fn.fn.net.3.bias", "speech_transformer.layers.layers.4.0.scale", "speech_transformer.layers.layers.4.0.fn.norm.weight", "speech_transformer.layers.layers.4.0.fn.norm.bias", "speech_transformer.layers.layers.4.0.fn.fn.to_qkv.weight", "speech_transformer.layers.layers.4.0.fn.fn.to_out.0.weight", "speech_transformer.layers.layers.4.0.fn.fn.to_out.0.bias", "speech_transformer.layers.layers.4.1.scale", "speech_transformer.layers.layers.4.1.fn.norm.weight", "speech_transformer.layers.layers.4.1.fn.norm.bias", "speech_transformer.layers.layers.4.1.fn.fn.net.0.weight", "speech_transformer.layers.layers.4.1.fn.fn.net.0.bias", "speech_transformer.layers.layers.4.1.fn.fn.net.3.weight", "speech_transformer.layers.layers.4.1.fn.fn.net.3.bias", "speech_transformer.layers.layers.5.0.scale", "speech_transformer.layers.layers.5.0.fn.norm.weight", "speech_transformer.layers.layers.5.0.fn.norm.bias", "speech_transformer.layers.layers.5.0.fn.fn.to_qkv.weight", "speech_transformer.layers.layers.5.0.fn.fn.to_out.0.weight", "speech_transformer.layers.layers.5.0.fn.fn.to_out.0.bias", "speech_transformer.layers.layers.5.1.scale", "speech_transformer.layers.layers.5.1.fn.norm.weight", "speech_transformer.layers.layers.5.1.fn.norm.bias", "speech_transformer.layers.layers.5.1.fn.fn.net.0.weight", 
"speech_transformer.layers.layers.5.1.fn.fn.net.0.bias", "speech_transformer.layers.layers.5.1.fn.fn.net.3.weight", "speech_transformer.layers.layers.5.1.fn.fn.net.3.bias", "speech_transformer.layers.layers.6.0.scale", "speech_transformer.layers.layers.6.0.fn.norm.weight", "speech_transformer.layers.layers.6.0.fn.norm.bias", "speech_transformer.layers.layers.6.0.fn.fn.to_qkv.weight", "speech_transformer.layers.layers.6.0.fn.fn.to_out.0.weight", "speech_transformer.layers.layers.6.0.fn.fn.to_out.0.bias", "speech_transformer.layers.layers.6.1.scale", "speech_transformer.layers.layers.6.1.fn.norm.weight", "speech_transformer.layers.layers.6.1.fn.norm.bias", "speech_transformer.layers.layers.6.1.fn.fn.net.0.weight", "speech_transformer.layers.layers.6.1.fn.fn.net.0.bias", "speech_transformer.layers.layers.6.1.fn.fn.net.3.weight", "speech_transformer.layers.layers.6.1.fn.fn.net.3.bias", "speech_transformer.layers.layers.7.0.scale", "speech_transformer.layers.layers.7.0.fn.norm.weight", "speech_transformer.layers.layers.7.0.fn.norm.bias", "speech_transformer.layers.layers.7.0.fn.fn.to_qkv.weight", "speech_transformer.layers.layers.7.0.fn.fn.to_out.0.weight", "speech_transformer.layers.layers.7.0.fn.fn.to_out.0.bias", "speech_transformer.layers.layers.7.1.scale", "speech_transformer.layers.layers.7.1.fn.norm.weight", "speech_transformer.layers.layers.7.1.fn.norm.bias", "speech_transformer.layers.layers.7.1.fn.fn.net.0.weight", "speech_transformer.layers.layers.7.1.fn.fn.net.0.bias", "speech_transformer.layers.layers.7.1.fn.fn.net.3.weight", "speech_transformer.layers.layers.7.1.fn.fn.net.3.bias", "speech_transformer.layers.layers.8.0.scale", "speech_transformer.layers.layers.8.0.fn.norm.weight", "speech_transformer.layers.layers.8.0.fn.norm.bias", "speech_transformer.layers.layers.8.0.fn.fn.to_qkv.weight", "speech_transformer.layers.layers.8.0.fn.fn.to_out.0.weight", "speech_transformer.layers.layers.8.0.fn.fn.to_out.0.bias", "speech_transformer.layers.layers.8.1.scale", 
"speech_transformer.layers.layers.8.1.fn.norm.weight", "speech_transformer.layers.layers.8.1.fn.norm.bias", "speech_transformer.layers.layers.8.1.fn.fn.net.0.weight", "speech_transformer.layers.layers.8.1.fn.fn.net.0.bias", "speech_transformer.layers.layers.8.1.fn.fn.net.3.weight", "speech_transformer.layers.layers.8.1.fn.fn.net.3.bias", "speech_transformer.layers.layers.9.0.scale", "speech_transformer.layers.layers.9.0.fn.norm.weight", "speech_transformer.layers.layers.9.0.fn.norm.bias", "speech_transformer.layers.layers.9.0.fn.fn.to_qkv.weight", "speech_transformer.layers.layers.9.0.fn.fn.to_out.0.weight", "speech_transformer.layers.layers.9.0.fn.fn.to_out.0.bias", "speech_transformer.layers.layers.9.1.scale", "speech_transformer.layers.layers.9.1.fn.norm.weight", "speech_transformer.layers.layers.9.1.fn.norm.bias", "speech_transformer.layers.layers.9.1.fn.fn.net.0.weight", "speech_transformer.layers.layers.9.1.fn.fn.net.0.bias", "speech_transformer.layers.layers.9.1.fn.fn.net.3.weight", "speech_transformer.layers.layers.9.1.fn.fn.net.3.bias".
	Unexpected key(s) in state_dict: "cond_emb.0.weight", "cond_emb.0.bias", "cond_emb.1.weight", "cond_emb.1.bias", "conditioning_transformer.transformer.attn_layers.layers.0.0.0.g", "conditioning_transformer.transformer.attn_layers.layers.0.1.to_q.weight", "conditioning_transformer.transformer.attn_layers.layers.0.1.to_k.weight", "conditioning_transformer.transformer.attn_layers.layers.0.1.to_v.weight", "conditioning_transformer.transformer.attn_layers.layers.0.1.to_out.weight", "conditioning_transformer.transformer.attn_layers.layers.0.1.to_out.bias", "conditioning_transformer.transformer.attn_layers.layers.1.0.0.g", "conditioning_transformer.transformer.attn_layers.layers.1.1.net.0.proj.weight", "conditioning_transformer.transformer.attn_layers.layers.1.1.net.0.proj.bias", "conditioning_transformer.transformer.attn_layers.layers.1.1.net.3.weight", "conditioning_transformer.transformer.attn_layers.layers.1.1.net.3.bias", "conditioning_transformer.transformer.attn_layers.layers.2.0.0.g", "conditioning_transformer.transformer.attn_layers.layers.2.1.to_q.weight", "conditioning_transformer.transformer.attn_layers.layers.2.1.to_k.weight", "conditioning_transformer.transformer.attn_layers.layers.2.1.to_v.weight", "conditioning_transformer.transformer.attn_layers.layers.2.1.to_out.weight", "conditioning_transformer.transformer.attn_layers.layers.2.1.to_out.bias", "conditioning_transformer.transformer.attn_layers.layers.3.0.0.g", "conditioning_transformer.transformer.attn_layers.layers.3.1.net.0.proj.weight", "conditioning_transformer.transformer.attn_layers.layers.3.1.net.0.proj.bias", "conditioning_transformer.transformer.attn_layers.layers.3.1.net.3.weight", "conditioning_transformer.transformer.attn_layers.layers.3.1.net.3.bias", "conditioning_transformer.transformer.attn_layers.layers.4.0.0.g", "conditioning_transformer.transformer.attn_layers.layers.4.1.to_q.weight", "conditioning_transformer.transformer.attn_layers.layers.4.1.to_k.weight", 
"conditioning_transformer.transformer.attn_layers.layers.4.1.to_v.weight", "conditioning_transformer.transformer.attn_layers.layers.4.1.to_out.weight", "conditioning_transformer.transformer.attn_layers.layers.4.1.to_out.bias", "conditioning_transformer.transformer.attn_layers.layers.5.0.0.g", "conditioning_transformer.transformer.attn_layers.layers.5.1.net.0.proj.weight", "conditioning_transformer.transformer.attn_layers.layers.5.1.net.0.proj.bias", "conditioning_transformer.transformer.attn_layers.layers.5.1.net.3.weight", "conditioning_transformer.transformer.attn_layers.layers.5.1.net.3.bias", "conditioning_transformer.transformer.attn_layers.layers.6.0.0.g", "conditioning_transformer.transformer.attn_layers.layers.6.1.to_q.weight", "conditioning_transformer.transformer.attn_layers.layers.6.1.to_k.weight", "conditioning_transformer.transformer.attn_layers.layers.6.1.to_v.weight", "conditioning_transformer.transformer.attn_layers.layers.6.1.to_out.weight", "conditioning_transformer.transformer.attn_layers.layers.6.1.to_out.bias", "conditioning_transformer.transformer.attn_layers.layers.7.0.0.g", "conditioning_transformer.transformer.attn_layers.layers.7.1.net.0.proj.weight", "conditioning_transformer.transformer.attn_layers.layers.7.1.net.0.proj.bias", "conditioning_transformer.transformer.attn_layers.layers.7.1.net.3.weight", "conditioning_transformer.transformer.attn_layers.layers.7.1.net.3.bias", "conditioning_transformer.transformer.attn_layers.rotary_pos_emb.inv_freq", "conditioning_transformer.transformer.norm.weight", "conditioning_transformer.transformer.norm.bias", "conditioning_transformer.pre_combiner.0.weight", "conditioning_transformer.pre_combiner.0.bias", "conditioning_transformer.pre_combiner.1.norm.weight", "conditioning_transformer.pre_combiner.1.norm.bias", "conditioning_transformer.pre_combiner.1.qkv.weight", "conditioning_transformer.pre_combiner.1.qkv.bias", "conditioning_transformer.pre_combiner.1.proj_out.weight", 
"conditioning_transformer.pre_combiner.1.proj_out.bias", "conditioning_transformer.pre_combiner.2.weight", "conditioning_transformer.pre_combiner.2.bias", "speech_emb.weight", "speech_emb.bias", "text_transformer.transformer.attn_layers.layers.0.0.0.g", "text_transformer.transformer.attn_layers.layers.0.0.0.scale_shift_process.weight", "text_transformer.transformer.attn_layers.layers.0.0.0.scale_shift_process.bias", "text_transformer.transformer.attn_layers.layers.0.1.to_q.weight", "text_transformer.transformer.attn_layers.layers.0.1.to_k.weight", "text_transformer.transformer.attn_layers.layers.0.1.to_v.weight", "text_transformer.transformer.attn_layers.layers.0.1.to_out.weight", "text_transformer.transformer.attn_layers.layers.0.1.to_out.bias", "text_transformer.transformer.attn_layers.layers.1.0.0.g", "text_transformer.transformer.attn_layers.layers.1.0.0.scale_shift_process.weight", "text_transformer.transformer.attn_layers.layers.1.0.0.scale_shift_process.bias", "text_transformer.transformer.attn_layers.layers.1.1.net.0.proj.weight", "text_transformer.transformer.attn_layers.layers.1.1.net.0.proj.bias", "text_transformer.transformer.attn_layers.layers.1.1.net.3.weight", "text_transformer.transformer.attn_layers.layers.1.1.net.3.bias", "text_transformer.transformer.attn_layers.layers.2.0.0.g", "text_transformer.transformer.attn_layers.layers.2.0.0.scale_shift_process.weight", "text_transformer.transformer.attn_layers.layers.2.0.0.scale_shift_process.bias", "text_transformer.transformer.attn_layers.layers.2.1.to_q.weight", "text_transformer.transformer.attn_layers.layers.2.1.to_k.weight", "text_transformer.transformer.attn_layers.layers.2.1.to_v.weight", "text_transformer.transformer.attn_layers.layers.2.1.to_out.weight", "text_transformer.transformer.attn_layers.layers.2.1.to_out.bias", "text_transformer.transformer.attn_layers.layers.3.0.0.g", "text_transformer.transformer.attn_layers.layers.3.0.0.scale_shift_process.weight", 
"text_transformer.transformer.attn_layers.layers.3.0.0.scale_shift_process.bias", "text_transformer.transformer.attn_layers.layers.3.1.net.0.proj.weight", "text_transformer.transformer.attn_layers.layers.3.1.net.0.proj.bias", "text_transformer.transformer.attn_layers.layers.3.1.net.3.weight", "text_transformer.transformer.attn_layers.layers.3.1.net.3.bias", "text_transformer.transformer.attn_layers.layers.4.0.0.g", "text_transformer.transformer.attn_layers.layers.4.0.0.scale_shift_process.weight", "text_transformer.transformer.attn_layers.layers.4.0.0.scale_shift_process.bias", "text_transformer.transformer.attn_layers.layers.4.1.to_q.weight", "text_transformer.transformer.attn_layers.layers.4.1.to_k.weight", "text_transformer.transformer.attn_layers.layers.4.1.to_v.weight", "text_transformer.transformer.attn_layers.layers.4.1.to_out.weight", "text_transformer.transformer.attn_layers.layers.4.1.to_out.bias", "text_transformer.transformer.attn_layers.layers.5.0.0.g", "text_transformer.transformer.attn_layers.layers.5.0.0.scale_shift_process.weight", "text_transformer.transformer.attn_layers.layers.5.0.0.scale_shift_process.bias", "text_transformer.transformer.attn_layers.layers.5.1.net.0.proj.weight", "text_transformer.transformer.attn_layers.layers.5.1.net.0.proj.bias", "text_transformer.transformer.attn_layers.layers.5.1.net.3.weight", "text_transformer.transformer.attn_layers.layers.5.1.net.3.bias", "text_transformer.transformer.attn_layers.layers.6.0.0.g", "text_transformer.transformer.attn_layers.layers.6.0.0.scale_shift_process.weight", "text_transformer.transformer.attn_layers.layers.6.0.0.scale_shift_process.bias", "text_transformer.transformer.attn_layers.layers.6.1.to_q.weight", "text_transformer.transformer.attn_layers.layers.6.1.to_k.weight", "text_transformer.transformer.attn_layers.layers.6.1.to_v.weight", "text_transformer.transformer.attn_layers.layers.6.1.to_out.weight", "text_transformer.transformer.attn_layers.layers.6.1.to_out.bias", 
"text_transformer.transformer.attn_layers.layers.7.0.0.g", "text_transformer.transformer.attn_layers.layers.7.0.0.scale_shift_process.weight", "text_transformer.transformer.attn_layers.layers.7.0.0.scale_shift_process.bias", "text_transformer.transformer.attn_layers.layers.7.1.net.0.proj.weight", "text_transformer.transformer.attn_layers.layers.7.1.net.0.proj.bias", "text_transformer.transformer.attn_layers.layers.7.1.net.3.weight", "text_transformer.transformer.attn_layers.layers.7.1.net.3.bias", "text_transformer.transformer.attn_layers.layers.8.0.0.g", "text_transformer.transformer.attn_layers.layers.8.0.0.scale_shift_process.weight", "text_transformer.transformer.attn_layers.layers.8.0.0.scale_shift_process.bias", "text_transformer.transformer.attn_layers.layers.8.1.to_q.weight", "text_transformer.transformer.attn_layers.layers.8.1.to_k.weight", "text_transformer.transformer.attn_layers.layers.8.1.to_v.weight", "text_transformer.transformer.attn_layers.layers.8.1.to_out.weight", "text_transformer.transformer.attn_layers.layers.8.1.to_out.bias", "text_transformer.transformer.attn_layers.layers.9.0.0.g", "text_transformer.transformer.attn_layers.layers.9.0.0.scale_shift_process.weight", "text_transformer.transformer.attn_layers.layers.9.0.0.scale_shift_process.bias", "text_transformer.transformer.attn_layers.layers.9.1.net.0.proj.weight", "text_transformer.transformer.attn_layers.layers.9.1.net.0.proj.bias", "text_transformer.transformer.attn_layers.layers.9.1.net.3.weight", "text_transformer.transformer.attn_layers.layers.9.1.net.3.bias", "text_transformer.transformer.attn_layers.layers.10.0.0.g", "text_transformer.transformer.attn_layers.layers.10.0.0.scale_shift_process.weight", "text_transformer.transformer.attn_layers.layers.10.0.0.scale_shift_process.bias", "text_transformer.transformer.attn_layers.layers.10.1.to_q.weight", "text_transformer.transformer.attn_layers.layers.10.1.to_k.weight", "text_transformer.transformer.attn_layers.layers.10.1.to_v.weight", 
"text_transformer.transformer.attn_layers.layers.10.1.to_out.weight", "text_transformer.transformer.attn_layers.layers.10.1.to_out.bias", "text_transformer.transformer.attn_layers.layers.11.0.0.g", "text_transformer.transformer.attn_layers.layers.11.0.0.scale_shift_process.weight", "text_transformer.transformer.attn_layers.layers.11.0.0.scale_shift_process.bias", "text_transformer.transformer.attn_layers.layers.11.1.net.0.proj.weight", "text_transformer.transformer.attn_layers.layers.11.1.net.0.proj.bias", "text_transformer.transformer.attn_layers.layers.11.1.net.3.weight", "text_transformer.transformer.attn_layers.layers.11.1.net.3.bias", "text_transformer.transformer.attn_layers.layers.12.0.0.g", "text_transformer.transformer.attn_layers.layers.12.0.0.scale_shift_process.weight", "text_transformer.transformer.attn_layers.layers.12.0.0.scale_shift_process.bias", "text_transformer.transformer.attn_layers.layers.12.1.to_q.weight", "text_transformer.transformer.attn_layers.layers.12.1.to_k.weight", "text_transformer.transformer.attn_layers.layers.12.1.to_v.weight", "text_transformer.transformer.attn_layers.layers.12.1.to_out.weight", "text_transformer.transformer.attn_layers.layers.12.1.to_out.bias", "text_transformer.transformer.attn_layers.layers.13.0.0.g", "text_transformer.transformer.attn_layers.layers.13.0.0.scale_shift_process.weight", "text_transformer.transformer.attn_layers.layers.13.0.0.scale_shift_process.bias", "text_transformer.transformer.attn_layers.layers.13.1.net.0.proj.weight", "text_transformer.transformer.attn_layers.layers.13.1.net.0.proj.bias", "text_transformer.transformer.attn_layers.layers.13.1.net.3.weight", "text_transformer.transformer.attn_layers.layers.13.1.net.3.bias", "text_transformer.transformer.attn_layers.layers.14.0.0.g", "text_transformer.transformer.attn_layers.layers.14.0.0.scale_shift_process.weight", "text_transformer.transformer.attn_layers.layers.14.0.0.scale_shift_process.bias", 
"text_transformer.transformer.attn_layers.layers.14.1.to_q.weight", "text_transformer.transformer.attn_layers.layers.14.1.to_k.weight", "text_transformer.transformer.attn_layers.layers.14.1.to_v.weight", "text_transformer.transformer.attn_layers.layers.14.1.to_out.weight", "text_transformer.transformer.attn_layers.layers.14.1.to_out.bias", "text_transformer.transformer.attn_layers.layers.15.0.0.g", "text_transformer.transformer.attn_layers.layers.15.0.0.scale_shift_process.weight", "text_transformer.transformer.attn_layers.layers.15.0.0.scale_shift_process.bias", "text_transformer.transformer.attn_layers.layers.15.1.net.0.proj.weight", "text_transformer.transformer.attn_layers.layers.15.1.net.0.proj.bias", "text_transformer.transformer.attn_layers.layers.15.1.net.3.weight", "text_transformer.transformer.attn_layers.layers.15.1.net.3.bias", "text_transformer.transformer.attn_layers.rotary_pos_emb.inv_freq", "text_transformer.transformer.norm.weight", "text_transformer.transformer.norm.bias", "text_transformer.pre_combiner.0.weight", "text_transformer.pre_combiner.0.bias", "text_transformer.pre_combiner.1.norm.weight", "text_transformer.pre_combiner.1.norm.bias", "text_transformer.pre_combiner.1.qkv.weight", "text_transformer.pre_combiner.1.qkv.bias", "text_transformer.pre_combiner.1.proj_out.weight", "text_transformer.pre_combiner.1.proj_out.bias", "text_transformer.pre_combiner.2.weight", "text_transformer.pre_combiner.2.bias", "speech_transformer.transformer.attn_layers.layers.0.0.0.g", "speech_transformer.transformer.attn_layers.layers.0.1.to_q.weight", "speech_transformer.transformer.attn_layers.layers.0.1.to_k.weight", "speech_transformer.transformer.attn_layers.layers.0.1.to_v.weight", "speech_transformer.transformer.attn_layers.layers.0.1.to_out.weight", "speech_transformer.transformer.attn_layers.layers.0.1.to_out.bias", "speech_transformer.transformer.attn_layers.layers.1.0.0.g", "speech_transformer.transformer.attn_layers.layers.1.1.net.0.proj.weight", 
"speech_transformer.transformer.attn_layers.layers.1.1.net.0.proj.bias", "speech_transformer.transformer.attn_layers.layers.1.1.net.3.weight", "speech_transformer.transformer.attn_layers.layers.1.1.net.3.bias", "speech_transformer.transformer.attn_layers.layers.2.0.0.g", "speech_transformer.transformer.attn_layers.layers.2.1.to_q.weight", "speech_transformer.transformer.attn_layers.layers.2.1.to_k.weight", "speech_transformer.transformer.attn_layers.layers.2.1.to_v.weight", "speech_transformer.transformer.attn_layers.layers.2.1.to_out.weight", "speech_transformer.transformer.attn_layers.layers.2.1.to_out.bias", "speech_transformer.transformer.attn_layers.layers.3.0.0.g", "speech_transformer.transformer.attn_layers.layers.3.1.net.0.proj.weight", "speech_transformer.transformer.attn_layers.layers.3.1.net.0.proj.bias", "speech_transformer.transformer.attn_layers.layers.3.1.net.3.weight", "speech_transformer.transformer.attn_layers.layers.3.1.net.3.bias", "speech_transformer.transformer.attn_layers.layers.4.0.0.g", "speech_transformer.transformer.attn_layers.layers.4.1.to_q.weight", "speech_transformer.transformer.attn_layers.layers.4.1.to_k.weight", "speech_transformer.transformer.attn_layers.layers.4.1.to_v.weight", "speech_transformer.transformer.attn_layers.layers.4.1.to_out.weight", "speech_transformer.transformer.attn_layers.layers.4.1.to_out.bias", "speech_transformer.transformer.attn_layers.layers.5.0.0.g", "speech_transformer.transformer.attn_layers.layers.5.1.net.0.proj.weight", "speech_transformer.transformer.attn_layers.layers.5.1.net.0.proj.bias", "speech_transformer.transformer.attn_layers.layers.5.1.net.3.weight", "speech_transformer.transformer.attn_layers.layers.5.1.net.3.bias", "speech_transformer.transformer.attn_layers.layers.6.0.0.g", "speech_transformer.transformer.attn_layers.layers.6.1.to_q.weight", "speech_transformer.transformer.attn_layers.layers.6.1.to_k.weight", "speech_transformer.transformer.attn_layers.layers.6.1.to_v.weight", 
"speech_transformer.transformer.attn_layers.layers.6.1.to_out.weight", "speech_transformer.transformer.attn_layers.layers.6.1.to_out.bias", "speech_transformer.transformer.attn_layers.layers.7.0.0.g", "speech_transformer.transformer.attn_layers.layers.7.1.net.0.proj.weight", "speech_transformer.transformer.attn_layers.layers.7.1.net.0.proj.bias", "speech_transformer.transformer.attn_layers.layers.7.1.net.3.weight", "speech_transformer.transformer.attn_layers.layers.7.1.net.3.bias", "speech_transformer.transformer.attn_layers.layers.8.0.0.g", "speech_transformer.transformer.attn_layers.layers.8.1.to_q.weight", "speech_transformer.transformer.attn_layers.layers.8.1.to_k.weight", "speech_transformer.transformer.attn_layers.layers.8.1.to_v.weight", "speech_transformer.transformer.attn_layers.layers.8.1.to_out.weight", "speech_transformer.transformer.attn_layers.layers.8.1.to_out.bias", "speech_transformer.transformer.attn_layers.layers.9.0.0.g", "speech_transformer.transformer.attn_layers.layers.9.1.net.0.proj.weight", "speech_transformer.transformer.attn_layers.layers.9.1.net.0.proj.bias", "speech_transformer.transformer.attn_layers.layers.9.1.net.3.weight", "speech_transformer.transformer.attn_layers.layers.9.1.net.3.bias", "speech_transformer.transformer.attn_layers.layers.10.0.0.g", "speech_transformer.transformer.attn_layers.layers.10.1.to_q.weight", "speech_transformer.transformer.attn_layers.layers.10.1.to_k.weight", "speech_transformer.transformer.attn_layers.layers.10.1.to_v.weight", "speech_transformer.transformer.attn_layers.layers.10.1.to_out.weight", "speech_transformer.transformer.attn_layers.layers.10.1.to_out.bias", "speech_transformer.transformer.attn_layers.layers.11.0.0.g", "speech_transformer.transformer.attn_layers.layers.11.1.net.0.proj.weight", "speech_transformer.transformer.attn_layers.layers.11.1.net.0.proj.bias", "speech_transformer.transformer.attn_layers.layers.11.1.net.3.weight", 
"speech_transformer.transformer.attn_layers.layers.11.1.net.3.bias", "speech_transformer.transformer.attn_layers.layers.12.0.0.g", "speech_transformer.transformer.attn_layers.layers.12.1.to_q.weight", "speech_transformer.transformer.attn_layers.layers.12.1.to_k.weight", "speech_transformer.transformer.attn_layers.layers.12.1.to_v.weight", "speech_transformer.transformer.attn_layers.layers.12.1.to_out.weight", "speech_transformer.transformer.attn_layers.layers.12.1.to_out.bias", "speech_transformer.transformer.attn_layers.layers.13.0.0.g", "speech_transformer.transformer.attn_layers.layers.13.1.net.0.proj.weight", "speech_transformer.transformer.attn_layers.layers.13.1.net.0.proj.bias", "speech_transformer.transformer.attn_layers.layers.13.1.net.3.weight", "speech_transformer.transformer.attn_layers.layers.13.1.net.3.bias", "speech_transformer.transformer.attn_layers.layers.14.0.0.g", "speech_transformer.transformer.attn_layers.layers.14.1.to_q.weight", "speech_transformer.transformer.attn_layers.layers.14.1.to_k.weight", "speech_transformer.transformer.attn_layers.layers.14.1.to_v.weight", "speech_transformer.transformer.attn_layers.layers.14.1.to_out.weight", "speech_transformer.transformer.attn_layers.layers.14.1.to_out.bias", "speech_transformer.transformer.attn_layers.layers.15.0.0.g", "speech_transformer.transformer.attn_layers.layers.15.1.net.0.proj.weight", "speech_transformer.transformer.attn_layers.layers.15.1.net.0.proj.bias", "speech_transformer.transformer.attn_layers.layers.15.1.net.3.weight", "speech_transformer.transformer.attn_layers.layers.15.1.net.3.bias", "speech_transformer.transformer.attn_layers.rotary_pos_emb.inv_freq", "speech_transformer.transformer.norm.weight", "speech_transformer.transformer.norm.bias", "speech_transformer.pre_combiner.0.weight", "speech_transformer.pre_combiner.0.bias", "speech_transformer.pre_combiner.1.norm.weight", "speech_transformer.pre_combiner.1.norm.bias", "speech_transformer.pre_combiner.1.qkv.weight", 
"speech_transformer.pre_combiner.1.qkv.bias", "speech_transformer.pre_combiner.1.proj_out.weight", "speech_transformer.pre_combiner.1.proj_out.bias", "speech_transformer.pre_combiner.2.weight", "speech_transformer.pre_combiner.2.bias".
	size mismatch for text_emb.weight: copying a param with shape torch.Size([256, 512]) from checkpoint, the shape in current model is torch.Size([148, 512]).
	size mismatch for to_text_latent.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
	size mismatch for to_speech_latent.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
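
For anyone hitting a similar error, here is a minimal sketch of how the mismatch between a checkpoint and a model definition could be summarized before calling `load_state_dict`. The helper `diff_state_dicts` is hypothetical and not part of tts-scores. It works on plain `{name: shape}` mappings, and the shapes given for the missing and unexpected keys are placeholders, since the log above does not show them.

```python
def diff_state_dicts(checkpoint_shapes, model_shapes):
    """Compare two {param_name: shape} mappings (hypothetical helper).

    Returns (missing, unexpected, mismatched):
      missing    - names the model expects but the checkpoint lacks
      unexpected - names in the checkpoint that the model does not define
      mismatched - names present in both, mapped to (checkpoint_shape, model_shape)
    """
    ckpt_keys = set(checkpoint_shapes)
    model_keys = set(model_shapes)
    missing = sorted(model_keys - ckpt_keys)
    unexpected = sorted(ckpt_keys - model_keys)
    mismatched = {
        name: (tuple(checkpoint_shapes[name]), tuple(model_shapes[name]))
        for name in sorted(ckpt_keys & model_keys)
        if tuple(checkpoint_shapes[name]) != tuple(model_shapes[name])
    }
    return missing, unexpected, mismatched


# Toy input. The two mismatched shapes are copied from the error above; the
# shapes of the missing and unexpected keys are placeholders (assumptions).
# With real tensors the mappings could be built as
# {k: tuple(v.shape) for k, v in state_dict.items()}.
checkpoint = {
    "text_emb.weight": (256, 512),
    "to_text_latent.weight": (1024, 1024),
    "speech_transformer.transformer.norm.weight": (1,),  # placeholder shape
}
model = {
    "text_emb.weight": (148, 512),
    "to_text_latent.weight": (512, 512),
    "text_pos_emb.weight": (1,),  # placeholder shape
}

missing, unexpected, mismatched = diff_state_dicts(checkpoint, model)
print("missing:", missing)
print("unexpected:", unexpected)
for name, (ckpt_shape, model_shape) in mismatched.items():
    print(f"size mismatch for {name}: checkpoint {ckpt_shape}, model {model_shape}")
```

A summary like this makes it easier to see whether the checkpoint belongs to a different architecture (many missing and unexpected keys, as here) or only differs in a few dimensions.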

@neonbjb
Owner

neonbjb commented Feb 1, 2023

Hey there,
Sorry about that! After you sent this message I realized that I had a bunch of uncommitted changes left in my local tts-scores repo. I've submitted those changes, and I believe they should fix the above issue.

@xanguera

xanguera commented Feb 2, 2023

Thanks @neonbjb, it now works perfectly.
A couple of questions/comments:

  • Question: For the CLVP and Frechet distances you convert the audio to 22 kHz before computing the mel spectrogram from it, but for wav2vec the audio needs to be at 16 kHz, as this is the rate the model was trained on. Is there any reason for the conversion to 22 kHz?
  • Comment: fd and clvp/wav2vec have different parameter requirements. If you're keen, I can send you a PR to standardize them.
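
To illustrate the sample-rate point, here is a minimal sketch of choosing a target rate per metric and resampling to it. Everything in it is an assumption made for illustration: the `TARGET_SR` table, the exact value 22050 for "22K", and the pure-Python linear-interpolation resampler, which has no anti-aliasing filter and only stands in for a proper library resampler. It is not how tts-scores does it.

```python
# Hypothetical per-metric target rates (assumption: "22K" means 22050 Hz).
TARGET_SR = {"clvp": 22050, "frechet": 22050, "wav2vec": 16000}


def resample_linear(samples, orig_sr, target_sr):
    """Resample a mono signal by linear interpolation.

    Illustration only: no low-pass filtering is applied, so downsampling
    real audio this way can alias.
    """
    if orig_sr == target_sr:
        return list(samples)
    n_out = int(len(samples) * target_sr / orig_sr)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * orig_sr / target_sr   # position in the input, in samples
        lo = int(pos)
        hi = min(lo + 1, last)          # clamp at the final sample
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def prepare_for_metric(samples, orig_sr, metric):
    """Return (resampled_samples, sample_rate) for the given metric name."""
    target_sr = TARGET_SR[metric]
    return resample_linear(samples, orig_sr, target_sr), target_sr


# One second of silence at 22050 Hz, prepared for the wav2vec metric.
one_second = [0.0] * 22050
audio_16k, sr = prepare_for_metric(one_second, 22050, "wav2vec")
print(len(audio_16k), sr)
```

Keeping the rate in one table per metric is also one way the differing parameter requirements mentioned above could be standardized.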

@neonbjb
Owner

neonbjb commented Feb 3, 2023 via email

@fakerybakery

Hi,
I see there's a CLVP 2 checkpoint now in the Tortoise repo. Should we use that over the original one?

@neonbjb
Owner

neonbjb commented Dec 19, 2023

I recommend just removing the CLVP scores altogether. wav2vec Intelligibility has much better signal.
