Hello, thank you for sharing this excellent work. After briefly browsing the code, I have two questions:
(1) What is x_ref used for? During training it seems to be a different fragment of the same mel-spectrogram as x. Which part of the paper does it correspond to?
(2) Why do we need to perform a weighted summation of mean and x? Does this mean that the reverse diffusion during inference starts from the weighted mean_x?
I'm new to diffusion models and don't quite understand the theory in the paper, so sorry if I asked some stupid questions.
The speaker encoder uses this x_ref (a different fragment of the same mel-spectrogram as x) as an additional input to the trainable speaker conditioning network, denoted by g_t(Y) in the paper. Different inputs to this network are compared in Table 1.
Yes, reverse diffusion starts from mean_x = self.decoder.compute_diffused_mean(x, x_mask, mean, 1.0), which is in fact very close to mean, because t=1.0 in this case. Here mean is the "average voice" mel-spectrogram, denoted by \bar{X} in the paper.
During training, the weighted summation of mean and x is necessary because it comes from the forward diffusion (see formula (3) in the paper): the mean of the noised sample at time t is a weighted combination of x and mean. At the final time t=1.0 the forward diffusion ends in the prior N(\bar{X}, I).
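To make the weighted summation concrete, here is a minimal sketch, not the repository's code. It assumes the forward process is a mean-reverting diffusion that pulls x toward mean under a linear noise schedule; the function names and the schedule endpoints BETA_MIN and BETA_MAX are illustrative assumptions. Under those assumptions the weight on x decays from 1 at t=0 to almost 0 at t=1.0, which is why the diffused mean at t=1.0 is very close to mean and the variance is very close to 1.

```python
import math

# Assumed linear noise schedule: beta_t = BETA_MIN + (BETA_MAX - BETA_MIN) * t.
# These endpoint values are illustrative assumptions, not taken from the thread.
BETA_MIN = 0.05
BETA_MAX = 20.0


def cumulative_beta(t, beta_min=BETA_MIN, beta_max=BETA_MAX):
    """Integral of the assumed linear schedule beta_s over s in [0, t]."""
    return beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2


def diffused_mean(x0, mean, t, beta_min=BETA_MIN, beta_max=BETA_MAX):
    """Mean of the noised sample at time t: a weighted sum of x0 and mean.

    Works on plain floats; since the weight is a scalar, NumPy arrays or
    tensors of matching shape can be passed for x0 and mean as well.
    """
    w = math.exp(-0.5 * cumulative_beta(t, beta_min, beta_max))
    return w * x0 + (1.0 - w) * mean


def diffused_variance(t, beta_min=BETA_MIN, beta_max=BETA_MAX):
    """Per-element variance of the noised sample at time t."""
    return 1.0 - math.exp(-cumulative_beta(t, beta_min, beta_max))


if __name__ == "__main__":
    x0, mean = 2.0, -1.0  # toy scalar stand-ins for mel-spectrogram values
    for t in (0.0, 0.5, 1.0):
        print(t, diffused_mean(x0, mean, t), diffused_variance(t))
```

With these assumed schedule values, the weight on x at t=1.0 is below one percent, so starting the reverse diffusion from the diffused mean at t=1.0 is nearly the same as starting from mean itself.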