Guided synthesis - API Improvement #376
Conversation
Resolved merge conflicts in: .gitignore, voicevox_engine/dev/synthesis_engine/mock.py
Resolved merge conflicts in: run.py, voicevox_engine/dev/synthesis_engine/mock.py, voicevox_engine/synthesis_engine/synthesis_engine.py, voicevox_engine/synthesis_engine/synthesis_engine_base.py
Reading the README you wrote, I noticed that guided synthesis operates at the frame level!
This has one advantage and two disadvantages.
The advantage is, of course, the higher resolution of the input. It is undefined behavior for the model, but users will be happy with it.
The first disadvantage is that the VOICEVOX UI (which works at the mora level) cannot be used to fine-tune the pitch or length. Users will have to re-create the guide audio instead.
The second disadvantage is that it may become unavailable in the future. In fact, I am currently developing a higher-quality decoder model, and it does not accept frame-level F0 input.
Since we are still at the experimental stage, I think either keeping the frame level or switching to the mora level would be fine, but we have to choose one of them...
Sorry for noticing this so late...
If we switch to the mora level, the code would become very straightforward, because we would add an API that extracts the AccentPhrases from the guide audio.
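To make the frame-level vs. mora-level trade-off above concrete, here is a minimal sketch of how a frame-level F0 contour could be collapsed to mora-level values, losing the extra resolution in the process. The function name, the per-mora frame counts, and the unvoiced-frame convention are all illustrative assumptions, not part of the engine:

```python
# Hypothetical sketch: reducing a frame-level F0 contour (as used by the
# current guided synthesis) to mora-level values that a mora-level UI
# could edit. Names here are illustrative, not the real API.

def mora_level_f0(frame_f0, mora_frame_counts):
    """Average the frame-level F0 over the frames belonging to each mora.

    frame_f0:          per-frame F0 values in Hz, with 0.0 for unvoiced frames
    mora_frame_counts: number of frames assigned to each mora by the aligner
    """
    result = []
    index = 0
    for count in mora_frame_counts:
        frames = frame_f0[index:index + count]
        voiced = [f for f in frames if f > 0.0]
        # A mora with no voiced frames (e.g. a devoiced vowel) keeps F0 = 0
        result.append(sum(voiced) / len(voiced) if voiced else 0.0)
        index += count
    return result
```

Whatever detail the guide audio carried within a single mora is averaged away here, which is exactly why the frame level feels higher-resolution to users.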
Interesting; I wonder how you'd handle the alignment. Since I've already completed the GUI part, I don't feel like giving up on a working feature. I suggest we keep this feature until the new architecture is introduced into this repository. By the way, however it turns out, there will be a breaking change at which we can remove this API.
Ok, I understand!
Sorry for the wait!
Should be okay now.
Another week has passed; how is it going now?
LGTM!!
Sorry to keep you waiting!
Great. You may also want to check out the PR on the GUI side so we can finish this feature for the next release.
Description
Following #252, as I dug into the GUI part I found that some parts of the API design were clearly not clever enough, and improved them.
Now the `AudioQuery` has a new optional section `guidedInfo` at its root, containing all the information needed for guided synthesis; it is passed directly to the engine. As shown above, I replaced the uploaded file with the full path to the file as a string, so I could get rid of the form data and use a simpler design. As a result, `guided_synthesis` and `guided_accent_phrases` now have exactly the same interface as the `synthesis` API, which I think will ease GUI development a lot. More details can be found in the changes to the README.
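As a sketch of the design described here: only the optional `guidedInfo` section and the path-as-string replacement for the form upload come from this PR; the concrete field names below are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of the AudioQuery extension described in this PR.
# Only `guided_info` being an optional root-level section and the guide
# audio being referenced by a plain file-path string are taken from the
# PR text; the remaining field names are illustrative assumptions.

@dataclass
class GuidedInfo:
    audio_path: str  # full path to the guide audio, replacing the form upload


@dataclass
class AudioQuery:
    # ... existing accent-phrase / mora fields omitted ...
    guided_info: Optional[GuidedInfo] = None  # absent for normal synthesis


query = AudioQuery()
assert query.guided_info is None  # normal synthesis sends no guide info
```

Because the section is optional, a plain `synthesis`-style query stays valid unchanged, which is what makes the guided endpoints interchangeable with `synthesis` on the GUI side.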
PS: Considering its usage in the GUI, `guided_accent_phrases` doesn't actually work like `accent_phrases`. That is why I removed most of the parameters, as well as `text` and `is_kana`.
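A minimal sketch of the resulting difference between the two endpoints' request parameters. Only the removal of `text` and `is_kana` from the guided variant comes from this PR; the field names themselves are hypothetical:

```python
# Hypothetical sketch contrasting the two endpoints' inputs after this PR.
# Parameter names are illustrative assumptions, not the actual schema.

def accent_phrases_params(text: str, speaker: int, is_kana: bool = False):
    """Plain accent_phrases: phrases are derived from the given text."""
    return {"text": text, "speaker": speaker, "is_kana": is_kana}


def guided_accent_phrases_params(audio_path: str, speaker: int):
    """Guided variant: no text / is_kana; phrases come from the guide audio."""
    return {"audio_path": audio_path, "speaker": speaker}
```

The guided variant needs no text input at all, which is the asymmetry the PS above is pointing out.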