Guided synthesis - API Improvement #376

Merged: 44 commits, Apr 10, 2022


Patchethium
Contributor

@Patchethium Patchethium commented Mar 21, 2022

Content

Following #252: while digging into the GUI part, I found that some parts of the API design were not well thought out, so I improved them.

AudioQuery now has a new optional section, guidedInfo, at the root. It contains all the information needed for guided synthesis and is passed directly to the engine.

{
    "accent_phrases": [
       "..."
    ],
    "guidedInfo": {
        "enabled": true,
        "audioPath": "/home/.../sample.wav",
        "normalize": true,
        "precise": true
    }
}

As shown above, I replaced the uploaded file with a string holding the full path to the file, which lets me drop the form data and use a simpler design. As a result, guided_synthesis and guided_accent_phrases now have exactly the same interface as the synthesis API, which I think will make GUI development much easier.

More details can be found in the README changes.

PS: Given how the GUI uses it, guided_accent_phrases doesn't actually work like accent_phrases. That's why I removed most of the parameters, as well as text and is_kana.
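A minimal sketch of what this design could look like from the client side, using only the Python standard library. The field names come from the JSON above; the class name `GuidedInfo`, the helper names, the `speaker` query parameter, the `/guided_synthesis` URL layout, and the host and port are assumptions made for illustration, not the PR's actual code. The request is only built, never sent.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional
from urllib.parse import urlencode
from urllib.request import Request


@dataclass
class GuidedInfo:
    """Mirrors the guidedInfo fields from the JSON above (hypothetical class name)."""

    enabled: bool
    audioPath: str  # full path to the guide audio, as a plain string
    normalize: bool
    precise: bool


def with_guided_info(
    audio_query: Dict[str, Any], info: Optional[GuidedInfo]
) -> Dict[str, Any]:
    """Return a copy of audio_query with the optional guidedInfo section at the root."""
    query = dict(audio_query)
    if info is not None:
        query["guidedInfo"] = asdict(info)
    return query


def build_guided_synthesis_request(
    base_url: str, speaker: int, audio_query: Dict[str, Any]
) -> Request:
    """Build (but do not send) a plain JSON POST; no multipart form data is needed."""
    # Assumption: the endpoint takes the speaker as a query parameter.
    url = f"{base_url}/guided_synthesis?{urlencode({'speaker': speaker})}"
    body = json.dumps(audio_query).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return Request(url, data=body, headers=headers, method="POST")


# Placeholder path and address, for illustration only.
example_query = with_guided_info(
    {"accent_phrases": []},
    GuidedInfo(enabled=True, audioPath="/path/to/sample.wav", normalize=True, precise=True),
)
example_request = build_guided_synthesis_request("http://localhost:50021", 1, example_query)
```

Because the guide audio is referenced by path rather than uploaded, the whole call is a single JSON body. The trade-off is that the path has to be readable by the engine process.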

@github-actions

github-actions bot commented Mar 21, 2022

Coverage Result

Open result

| Name | Stmts | Miss | Cover |
| --- | --- | --- | --- |
| `voicevox_engine/__init__.py` | 1 | 0 | 100% |
| `voicevox_engine/acoustic_feature_extractor.py` | 75 | 0 | 100% |
| `voicevox_engine/dev/synthesis_engine/__init__.py` | 2 | 0 | 100% |
| `voicevox_engine/dev/synthesis_engine/mock.py` | 40 | 4 | 90% |
| `voicevox_engine/experimental/__init__.py` | 0 | 0 | 100% |
| `voicevox_engine/experimental/guided_extractor.py` | 128 | 94 | 27% |
| `voicevox_engine/experimental/julius4seg/__init__.py` | 0 | 0 | 100% |
| `voicevox_engine/experimental/julius4seg/converter.py` | 298 | 295 | 1% |
| `voicevox_engine/experimental/julius4seg/sp_inserter.py` | 116 | 89 | 23% |
| `voicevox_engine/full_context_label.py` | 162 | 3 | 98% |
| `voicevox_engine/kana_parser.py` | 86 | 1 | 99% |
| `voicevox_engine/model.py` | 163 | 7 | 96% |
| `voicevox_engine/mora_list.py` | 4 | 0 | 100% |
| `voicevox_engine/part_of_speech_data.py` | 5 | 0 | 100% |
| `voicevox_engine/preset/Preset.py` | 12 | 0 | 100% |
| `voicevox_engine/preset/PresetLoader.py` | 34 | 1 | 97% |
| `voicevox_engine/preset/__init__.py` | 3 | 0 | 100% |
| `voicevox_engine/synthesis_engine/__init__.py` | 5 | 0 | 100% |
| `voicevox_engine/synthesis_engine/core_wrapper.py` | 167 | 132 | 21% |
| `voicevox_engine/synthesis_engine/make_synthesis_engines.py` | 52 | 43 | 17% |
| `voicevox_engine/synthesis_engine/synthesis_engine.py` | 181 | 69 | 62% |
| `voicevox_engine/synthesis_engine/synthesis_engine_base.py` | 68 | 9 | 87% |
| `voicevox_engine/user_dict.py` | 98 | 10 | 90% |
| `voicevox_engine/utility/__init__.py` | 3 | 0 | 100% |
| `voicevox_engine/utility/connect_base64_waves.py` | 35 | 3 | 91% |
| `voicevox_engine/utility/engine_root.py` | 9 | 2 | 78% |
| TOTAL | 1747 | 762 | 56% |

Member

@Hiroshiba Hiroshiba left a comment

Reading the README you wrote, I noticed that guided synthesis works at the frame level!
This has one upside and two downsides.

The upside is, of course, the higher resolution of the input. It is undefined behavior for the model, but users will be happy with it.

The first downside is that the VOICEVOX UI, which works at the mora level, cannot be used to fine-tune the pitch or length. Users will have to re-create the guide audio instead.

The second downside is that it may become unavailable in the future. In fact, I am currently developing a decoder model for high quality, which does not allow frame-level F0 input.

Since we are still in the experimental stage, I think either keeping the frame level or changing to the mora level would be fine. But we have to choose one of them...
Sorry for noticing this so late...

If we change to the mora level, the code would become very straightforward, because we would just create an API that gets the AccentPhrase from the guide audio.
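To make the resolution trade-off concrete, here is a small illustrative sketch of collapsing a frame-level F0 contour into one pitch value per mora. It is not the engine's code: the function name, the use of a plain average, and the convention that an F0 of 0 marks an unvoiced frame are all assumptions chosen for the example.

```python
from typing import List


def frames_to_mora_pitch(
    frame_f0: List[float], mora_frame_counts: List[int]
) -> List[float]:
    """Average the voiced-frame F0 values inside each mora.

    frame_f0: one F0 value per frame (0.0 is assumed to mean unvoiced).
    mora_frame_counts: how many consecutive frames belong to each mora.
    Returns one pitch per mora; 0.0 if the mora has no voiced frames.
    """
    if sum(mora_frame_counts) != len(frame_f0):
        raise ValueError("frame counts must cover the F0 contour exactly")

    pitches: List[float] = []
    start = 0
    for count in mora_frame_counts:
        voiced = [f0 for f0 in frame_f0[start:start + count] if f0 > 0.0]
        pitches.append(sum(voiced) / len(voiced) if voiced else 0.0)
        start += count
    return pitches
```

Whatever movement the contour has inside a mora is lost after this step, which is the resolution being traded away. In return, a per-mora value is something a mora-level UI can adjust.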

@Patchethium
Contributor Author

Patchethium commented Mar 29, 2022

> In fact, I am currently developing a decoder model for high quality, which does not allow frame-level F0 input.

Interesting, I wonder how you'd handle the alignment.

Since I've already completed the GUI part, I don't feel like giving up on a working feature. I suggest we keep this feature until the new architecture is introduced into this repository. However that turns out, there will be a breaking change anyway, and we can remove this API at that point.

@Hiroshiba
Member

Ok, I understand!

Member

@Hiroshiba Hiroshiba left a comment

Sorry for the wait!

Review threads (outdated, resolved):
voicevox_engine/model.py
README.md
voicevox_engine/synthesis_engine/synthesis_engine.py

@Patchethium
Contributor Author

Should be okay now.

@Patchethium
Contributor Author

Another week has passed. How is it going?

Member

@Hiroshiba Hiroshiba left a comment

LGTM!!

Sorry to keep you waiting!

@Hiroshiba Hiroshiba merged commit 9af4963 into VOICEVOX:master Apr 10, 2022
@Patchethium
Contributor Author

Great! You may also want to check out the PR on the GUI side so we can finish this feature for the next release.
