Two questions about DiffVC #31

huangf79 · 2023-09-14T03:06:42Z

Hello, thank you for sharing this excellent work. After briefly browsing the code, I have two questions:
(1) What is x_ref used for? During training it seems to be a different fragment of the same mel-spectrogram as x. Which part of the paper does it correspond to?
(2) Why do we need a weighted sum of mean and x? Does this mean that reverse diffusion at inference starts from the weighted mean_x?
I'm new to diffusion models and don't fully understand the theory in the paper, so apologies if these are basic questions.

li1jkdaw · 2024-08-23T17:37:00Z

Hi!

The speaker encoder takes x_ref (a different fragment of the same mel-spectrogram as x) as an additional input to the trainable speaker conditioning network, denoted g_t(Y) in the paper. Table 1 compares different inputs to this network.
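To make "a different fragment of the same mel-spectrogram" concrete, here is a minimal sketch of cropping two fragments from one utterance. It is not the repository's data loader: the function name sample_two_fragments, the fragment length, and the (n_mels, n_frames) layout are assumptions chosen for illustration.

```python
import numpy as np

def sample_two_fragments(mel, frag_len, rng):
    """Crop two independently positioned fragments from one mel-spectrogram.

    mel is assumed to have shape (n_mels, n_frames). The two start frames
    are drawn independently, so the fragments may overlap or even coincide.
    """
    n_frames = mel.shape[1]
    if n_frames < frag_len:
        raise ValueError("mel-spectrogram is shorter than the fragment length")
    start_x = int(rng.integers(0, n_frames - frag_len + 1))
    start_ref = int(rng.integers(0, n_frames - frag_len + 1))
    x = mel[:, start_x:start_x + frag_len]          # fragment that gets diffused
    x_ref = mel[:, start_ref:start_ref + frag_len]  # fragment for speaker conditioning
    return x, x_ref

# Hypothetical sizes: 80 mel bins, 400 frames, 128-frame fragments.
rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 400))
x, x_ref = sample_two_fragments(mel, 128, rng)
```

Both crops come from the same speaker, so x_ref carries speaker information without necessarily being the same frames as x.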
Yes, reverse diffusion starts from mean_x = self.decoder.compute_diffused_mean(x, x_mask, mean, 1.0), which is in fact very close to mean because t=1.0 in this case. Here mean is the "average voice" mel-spectrogram, denoted X^{bar} in the paper.
During training, the weighted sum of mean and x is needed because it comes from the forward diffusion (see formula (3) in the paper); at the final time t=1.0, forward diffusion ends in the prior N(X^{bar}, I).
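The weighted sum can be sketched numerically as follows. This is an illustration, not the repository's compute_diffused_mean: it assumes a mean-reverting forward diffusion with a linear noise schedule, and the names diffused_mean and diffused_variance as well as the beta_min and beta_max defaults are assumptions.

```python
import numpy as np

def beta_integral(t, beta_min=0.05, beta_max=20.0):
    """Integral of a linear schedule beta(s) = beta_min + s * (beta_max - beta_min) over [0, t]."""
    return beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2

def diffused_mean(x0, mean, t, beta_min=0.05, beta_max=20.0):
    """Mean of x_t given x0: a weighted sum of x0 and the average-voice mean."""
    w = np.exp(-0.5 * beta_integral(t, beta_min, beta_max))  # weight on x0
    return w * x0 + (1.0 - w) * mean

def diffused_variance(t, beta_min=0.05, beta_max=20.0):
    """Variance of x_t given x0 under the same assumptions."""
    return 1.0 - np.exp(-beta_integral(t, beta_min, beta_max))

x0 = np.ones((80, 128))     # stand-in for a target mel-spectrogram fragment
mean = np.zeros((80, 128))  # stand-in for the average-voice mel-spectrogram

at_start = diffused_mean(x0, mean, 0.0)  # equals x0: nothing has been diffused yet
at_end = diffused_mean(x0, mean, 1.0)    # almost equal to mean
```

With these assumed schedule values the weight on x0 at t=1.0 is below 0.01 and the variance is essentially 1, which is why mean_x is very close to mean and the forward process ends near N(X^{bar}, I).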
