Guided Synthesis #252
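I think this is an amazing achievement!!!!!! First, regarding Julius throwing errors. The low quality is probably because the pitch extraction method differs from what VOICEVOX core expects. Have a great year!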
Happy New Year! After changing the normalization algorithm, the result turned out better than I expected. Check out this example: example1.mp4
I also implemented the accent phrases part, which gives a pretty decent result as well: usage.mp4, example2.mp4
As for the forced alignment part, unfortunately, switching to your fork doesn't seem to help much 😢. Julius still throws exceptions fairly often, but I'm starting to consider that acceptable, since that's just how unreliable its ASR is. I added simple error handling that tells the user to try a different audio file when Julius crashes; I guess that's enough in practice until someone kind enough to improve this part comes along 🙏
As a result, I'm marking this PR as ready for review. Feel free to bring up any questions.
Hmm, the changes are indeed very large, which makes this hard to review. Would it be possible to split julius4seg out as an independent library?
Oh, again? Buhhhh... No, I don't think I'm capable of doing that, neither making it a Git submodule nor creating a Python module that can be downloaded and installed with pip. Excluding FastAPI's Form() from flake8 took me almost an hour before I gave up, and I just don't want to jump into another rabbit hole and end up spending hours figuring out tons of configuration. The julius4seg folder has had literally NOTHING changed since I copied it from the original repository, so you can simply ignore it when reviewing, which halves the workload. If you want me to split the two APIs (guided synthesis and guided accent phrases) into separate pull requests, I can do that in five minutes.
It's been a week, how's it going now?
run.py
Outdated
except ParseKanaError:
    print(traceback.format_exc())
    raise HTTPException(
        status_code=500,
I think using 422 instead of 500 for the status code is better.
ref #91
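The reasoning behind this suggestion, as I read it: 422 says the request arrived intact but its content could not be processed (here, text that fails kana parsing), whereas 500 says the server itself is at fault. Below is a framework-free sketch of that mapping. `ParseKanaError` is redefined locally as a stand-in, and `parse_kana_stub` and `handle_request` are hypothetical names, not part of the engine.

```python
from http import HTTPStatus
from typing import Tuple


class ParseKanaError(Exception):
    """Local stand-in for the engine's kana parsing error."""


def parse_kana_stub(text: str) -> str:
    # Hypothetical validator: only rejects empty input.
    if not text:
        raise ParseKanaError("text is empty")
    return text


def handle_request(text: str) -> Tuple[int, str]:
    """Map a client-caused parse failure to 422 and anything else to 500."""
    try:
        return HTTPStatus.OK, parse_kana_stub(text)
    except ParseKanaError as err:
        # The client sent input we cannot process: a 4xx error, not a server fault.
        return HTTPStatus.UNPROCESSABLE_ENTITY, str(err)
    except Exception:
        # Unexpected failures remain server errors.
        return HTTPStatus.INTERNAL_SERVER_ERROR, "internal error"
```

In a FastAPI handler the same idea would amount to passing 422 as the `status_code` of the raised `HTTPException`.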
    self,
    query: AudioQuery,
    speaker_id: int,
    audio_file: Optional[IO],
[QUESTION] Why is the audio_file argument set to Optional?
I think an error will occur if audio_file is None.
https://docs.python.org/3/library/typing.html#typing.Optional
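To illustrate the question: `Optional[IO]` tells callers that `None` is an accepted value, so the function body then has to handle `None` explicitly. If the file is always required, dropping `Optional` states that in the signature. A minimal sketch with hypothetical function names (reading four bytes is just a placeholder for real processing):

```python
import io
from typing import IO, Optional


def read_header_optional(audio_file: Optional[IO[bytes]]) -> bytes:
    # Optional means None may be passed, so it must be checked before use.
    if audio_file is None:
        raise ValueError("audio_file is required")
    return audio_file.read(4)


def read_header_required(audio_file: IO[bytes]) -> bytes:
    # No Optional: a type checker flags callers that pass None.
    return audio_file.read(4)


header = read_header_required(io.BytesIO(b"RIFF0000"))
```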
No, I don't think I'm capable to do that, neither to make it a GitHub submodule nor creating a python module that can be downloaded and installed by pip.
Okay, I get it.
So, let's define the guided synthesis feature as an experimental feature for now!
Please create a directory like voicevox_engine/experimental and move guided_extractor.py and julius4seg into this directory.
Once they're merged, we can keep improving them to make them cooler!
voicevox_engine/guided_extractor.py
Outdated
def get_normalize_scale(engine, kana: str, f0: np.ndarray, speaker_id: int):
    f0_avg = _no_nan(np.average(f0[f0 != 0]))
    predicted_phrases, _ = parse_kana(kana, False)
    engine.replace_mora_data(predicted_phrases, speaker_id=speaker_id)
    pitch_list = []
    for phrase in predicted_phrases:
        for mora in phrase.moras:
            pitch_list.append(mora.pitch)
    pitch_list = np.array(pitch_list, dtype=np.float64)
    predicted_avg = _no_nan(np.average(pitch_list[pitch_list != 0]))
    return predicted_avg / f0_avg
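Actually generating the target speaker's average pitch once is an interesting idea!
Is the intent here to compute the target speaker's average pitch and match the input speaker's pitch to that average?
The accurate way to match pitch is not to scale by the averages but to add the difference between them.
So the correct calculation is to return predicted_avg - f0_avg here and have the caller do pitch += diff.
Naming the function something like get_pitch_diff would also be cooler.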
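A minimal sketch of the additive approach described in this comment, using only the standard library. The name `get_pitch_diff` comes from the comment; `voiced_mean` and `shift_pitch` are hypothetical helpers, and treating a value of 0 as "unvoiced" follows the `!= 0` filtering in the snippet under review.

```python
from statistics import fmean
from typing import List, Sequence


def voiced_mean(values: Sequence[float]) -> float:
    """Average over voiced frames only; 0 marks an unvoiced frame."""
    voiced = [v for v in values if v != 0]
    return fmean(voiced) if voiced else 0.0


def get_pitch_diff(predicted_pitch: Sequence[float], f0: Sequence[float]) -> float:
    """Offset that moves the average of f0 onto the predicted average."""
    return voiced_mean(predicted_pitch) - voiced_mean(f0)


def shift_pitch(f0: Sequence[float], diff: float) -> List[float]:
    """Apply pitch += diff to voiced frames, leaving unvoiced frames at 0."""
    return [v + diff if v != 0 else 0.0 for v in f0]


predicted = [5.0, 0.0, 6.0]  # target speaker's predicted pitch (made-up values)
extracted = [4.0, 0.0, 5.0]  # pitch extracted from the input audio (made-up values)
diff = get_pitch_diff(predicted, extracted)
shifted = shift_pitch(extracted, diff)
```

Unlike scaling, adding a constant keeps the shape of the input contour intact and only moves its average.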
def guided_accent_phrases(
    self,
    query: AudioQuery,
I don't think this function needs query; list[AccentPhrase] should be enough. That way the caller can skip the work of creating a query.
kana can be created from list[AccentPhrase].
def create_kana(accent_phrases: List[AccentPhrase]) -> str:
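To show why the accent phrases alone are enough to rebuild the kana text, here is a heavily simplified sketch. The `Mora` and `AccentPhrase` dataclasses are cut-down stand-ins, not the engine's models, and `create_kana_sketch` is a hypothetical name. The sketch assumes an apostrophe follows the accented mora (1-based `accent`), "/" separates phrases, and "、" separates phrases around a pause. The real `create_kana` covers more cases than this.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Mora:
    text: str  # katakana for one mora


@dataclass
class AccentPhrase:
    moras: List[Mora]
    accent: int  # assumed 1-based position of the accented mora
    has_pause: bool = False  # simplified stand-in for a trailing pause


def create_kana_sketch(accent_phrases: List[AccentPhrase]) -> str:
    parts: List[str] = []
    for i, phrase in enumerate(accent_phrases):
        for position, mora in enumerate(phrase.moras, start=1):
            parts.append(mora.text)
            if position == phrase.accent:
                parts.append("'")  # mark the accent nucleus
        if i < len(accent_phrases) - 1:
            # Separate phrases with a pause mark or a plain delimiter.
            parts.append("、" if phrase.has_pause else "/")
    return "".join(parts)


phrases = [
    AccentPhrase([Mora(t) for t in "コンニチワ"], accent=5, has_pause=True),
    AccentPhrase([Mora(t) for t in "テスト"], accent=1),
]
kana = create_kana_sketch(phrases)
```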
run.py
Outdated
@@ -206,6 +207,63 @@ def accent_phrases(
        enable_interrogative=enable_interrogative,
    )


@app.post(
    "/guided_accent_phrase",
    response_model=AudioQuery,
List[AccentPhrase] looks like the correct type here.
Line 161 in bdf712f
response_model=List[AccentPhrase],
run.py
Outdated
def guided_accent_phrase(
    kana: str = Form(...),  # noqa: B008
    speaker_id: int = Form(...),  # noqa: B008
    normalize: int = Form(...),  # noqa: B008
    audio_file: UploadFile = File(...),  # noqa: B008
):
kana is used here to mean "text in AquesTalk notation", not "hiragana".
Matching the format of the other APIs should make this easier for users.
Please align it with the existing API's format, like this:
def guided_accent_phrase(
    text: str,
    speaker: int,
    is_kana: bool = False,
    enable_interrogative: bool = enable_interrogative_query_param(),  # noqa B008,
    audio_file: UploadFile = File(...),  # noqa: B008
):
# Conflicts: # .gitignore # voicevox_engine/dev/synthesis_engine/mock.py
# Conflicts: # run.py # voicevox_engine/dev/synthesis_engine/mock.py # voicevox_engine/synthesis_engine/synthesis_engine.py # voicevox_engine/synthesis_engine/synthesis_engine_base.py
Should be okay now.
Sorry, I'm a little busy this week, so please wait until next week...!
(Four new characters will be added. :->)
Okay, I'll be working on the GUI in the meantime.
That's good, but I'm a bit concerned about the speed at which characters are being added. TTS with characters is itself a niche market, and too many products flooding in may upset the balance in which customers take the time to accept a new character... Just a thought.
@Patchethium san
I see...
Sorry to keep you waiting, I've reviewed it!
Should be okay now
LGTM!!!
Thank you for the README as well!!
I'm really looking forward to the GUI implementation!
Contents
As discussed in #231, I got julius4segment working for the forced alignment, and it seems to work at least some of the time. Here's an example (audio included): guided_good.mp4
However, most of the time it just throws various exceptions and produces poorly synthesized voices. I feel it's necessary to share this progress and maybe get some help from developers who are more familiar with audio and signal processing.
I have some problems here:
scipy to resample audio in order to match Julius' 16 kHz requirement, but sometimes it fails in testing while a version resampled in Audacity from the same file works, which is pretty confusing. guided_bad.mp4
Don't know what's going on here...
4. I'm using a simple min-max to normalize the f0 extracted, I guess there should be some better methods...
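For reference, min-max normalization of an f0 contour could look roughly like the sketch below. This is not the code from this PR: the function name is hypothetical, the target range is an assumption (the description does not say which range the values are mapped to), and treating 0 as an unvoiced frame is also assumed.

```python
from typing import List, Sequence


def min_max_normalize(
    f0: Sequence[float], target_min: float, target_max: float
) -> List[float]:
    """Linearly map voiced f0 values onto [target_min, target_max]."""
    voiced = [v for v in f0 if v != 0]
    if not voiced:
        return [0.0 for _ in f0]
    lo, hi = min(voiced), max(voiced)
    if hi == lo:
        # A flat contour has no range to stretch; place it mid-range.
        mid = (target_min + target_max) / 2
        return [mid if v != 0 else 0.0 for v in f0]
    scale = (target_max - target_min) / (hi - lo)
    return [target_min + (v - lo) * scale if v != 0 else 0.0 for v in f0]


normalized = min_max_normalize([100.0, 0.0, 150.0, 200.0], 5.0, 6.0)
```

One known weakness of min-max scaling is that a single outlier frame sets the whole range, which may be one reason to look for a better method.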
Currently I've only implemented the second method I mentioned in #231; hopefully the first one will perform better. Until I get that done, I'll keep this PR as a WIP.
Issue
#231