
[Question] about intersperse function. #75

Open

chep0k opened this issue Sep 4, 2023 · 2 comments

Comments

@chep0k commented Sep 4, 2023

Hi!
During preprocessing, when add_blank is True in hparams, the intersperse function (here) inserts an index that is out of the vocabulary bounds (item=len(symbols)) between each pair of adjacent tokens.
My first guess was that this token plays the role of a pause between tokens, since no pause token is present in the vocabulary, so during training all pauses shift onto this token.
Then, as its name states, I treated it as a blank token, needed to absorb all the "noise" between adjacent tokens so that the other tokens represent cleaner phonemes. I also thought it might be used to learn the transition from one phoneme to another, which is not part of either adjacent phoneme but a separate segment of its own. But if so, why is it a single shared token for all gaps?
So, what is the real purpose of this blank token?
This question is addressed mainly to the authors, but any guesses are welcome.
Thanks in advance.
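For context, an intersperse helper of this kind is commonly written as a short list manipulation; the sketch below is my reconstruction of the behaviour described above (a blank id placed between, before, and after every token), not necessarily the repository's exact code:

```python
def intersperse(lst, item):
    """Insert `item` between every pair of elements of `lst`,
    and also at the start and end of the result."""
    # Allocate a list filled with the blank item; for n input tokens
    # the output has 2n + 1 slots.
    result = [item] * (len(lst) * 2 + 1)
    # Place the original tokens at the odd positions 1, 3, 5, ...
    result[1::2] = lst
    return result

# Example: token ids [5, 6, 7] with blank id 0
print(intersperse([5, 6, 7], 0))  # [0, 5, 0, 6, 0, 7, 0]
```

Note that the output length is always 2·len(lst) + 1, so the blank also pads both ends of the sequence, not just the gaps between tokens.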

@chnk58hoang

Can someone explain the real purpose of the intersperse function? I'm a little confused by it.

@chep0k
Author

chep0k commented Sep 9, 2024

> Can someone explain the real purpose of the intersperse function? I'm a little confused by it.

For as long as I have been working with Grad-TTS, I have treated the interspersed token (the item argument of that function) as a kind of "space" token: it is inserted between every two adjacent phonemes and denotes the amount of "silence" between them that the model should learn to pronounce. Thus each non-"space" token is filled only with sound immediately relevant to that token, while all pauses, skips, and spaces are delegated to the "space" token. Moreover, with noisy data, irrelevant background buzz can be absorbed into these tokens, purging it from the actual phonemes. Otherwise, that is, if this "space" token were omitted, all the noise and silence would have no choice but to be memorised as part of the actual phoneme tokens, contaminating them.
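One practical consequence of item=len(symbols) is worth noting: the blank id is one past the last valid symbol index, so the model's embedding table must be sized len(symbols) + 1 when add_blank is enabled. A minimal sketch, assuming a hypothetical symbols list (the names here are illustrative, not the repository's actual API):

```python
# Hypothetical toy vocabulary; in the real repo this comes from the
# text/symbols module.
symbols = ["a", "b", "c"]

def intersperse(lst, item):
    """Insert `item` between every pair of elements, and at both ends."""
    result = [item] * (len(lst) * 2 + 1)
    result[1::2] = lst
    return result

# The blank id equals len(symbols), i.e. an index outside the vocabulary,
# so the embedding layer needs len(symbols) + 1 rows to cover it.
blank_id = len(symbols)
sequence = [0, 1, 2]  # e.g. token ids for "a b c"
print(intersperse(sequence, blank_id))  # [3, 0, 3, 1, 3, 2, 3]
```

Because the blank is a single shared id, every gap maps to the same embedding; whatever acoustic content the gaps carry (pauses, transitions, background noise) is pooled into that one learned vector rather than split across per-gap tokens.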
