What is the reason of filtering "_" and "~" symbols? #125

Open
kremnik opened this issue Dec 19, 2023 · 0 comments

kremnik commented Dec 19, 2023

Hi @rafaelvalle

In this line:

return s in _symbol_to_id and s is not '_' and s is not '~'

you filter out the "_" and "~" symbols.
Also, a common piece of advice for improving alignment map convergence is to add special symbols to the start and end of every sentence. Usually these symbols are exactly "_" and "~" (e.g. _What is your name?~). But you filter out exactly these symbols and never add them anywhere in the code.
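To make the convention concrete, here is a minimal sketch of wrapping a sentence in boundary symbols before it is converted to symbol IDs. The helper name `add_boundary_symbols` and the constants `BOS` and `EOS` are hypothetical and are not taken from the repository.

```python
BOS = '_'  # assumed start-of-sentence marker
EOS = '~'  # assumed end-of-sentence marker


def add_boundary_symbols(sentence, bos=BOS, eos=EOS):
    # Wrap the stripped sentence in explicit start and end markers.
    return f"{bos}{sentence.strip()}{eos}"


print(add_boundary_symbols("What is your name?"))  # _What is your name?~
```

For the markers to reach the model, both symbols would also need an ID in the symbol table and would have to pass the keep-symbol filter.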

Interestingly, you have included the "_" symbol here:

_special = '_@©°½—₩€$'

but then filter it out anyway during sentence preprocessing.

So the questions are:

  1. What is the reason for filtering out the "_" and "~" symbols?
  2. Why don't you use them as start and end symbols in sentences?

kremnik closed this as completed Dec 19, 2023
kremnik reopened this Dec 19, 2023
kremnik changed the title from 'What is the reason of filtering "_" and "~"?' to 'What is the reason of filtering "_" and "~" symbols?' Dec 19, 2023