Detect source language with langdetect package #37

Open
awalker88 opened this issue Apr 22, 2021 · 5 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@awalker88

The langdetect package has worked well for me in the past for language detection problems. How would you feel about allowing users to pass 'auto' as an option for source? I could see some pros and cons:

Pros

  • Users don't need to be able to recognize a language to translate
  • Eliminates pre-classification of languages if your dataset contains multiple languages

Cons

I'm a little new to open source but I would love to contribute 🙂 Of course, if you feel this doesn't fit this package's mission that's totally understandable.

@xhluca
Owner

xhluca commented Apr 23, 2021

Hey, langdetect is cool! However, there seem to be many options for language detection, including fasttext and langid.py. Each option has its own accuracy (none of them is 100% accurate) and speed, so I feel it might be difficult for the end user to choose.

Also, since we are now using m2m100 by default, it might confuse users who try to auto-detect a language that's not supported by the chosen detection algorithm (but is available in m2m100).
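To make the mismatch concern concrete, here is a minimal sketch of a check that could sit between detection and translation. The helper name `to_model_code` and the `supported_codes` argument are hypothetical; the set of codes would have to come from whichever translation model is in use. It also assumes that some detectors return region-tagged codes such as `zh-cn`, which are reduced to their base code before the lookup.

```python
def to_model_code(detected_code, supported_codes):
    """Map a detector's output to a code the translation model accepts.

    Hypothetical helper: `supported_codes` is supplied by the caller and
    should list the language codes of the translation model in use.
    """
    # Reduce region-tagged codes such as "zh-cn" to their base code "zh".
    base = detected_code.lower().split("-")[0]
    if base not in supported_codes:
        raise ValueError(
            f"Detected language '{detected_code}' is not supported "
            "by the translation model."
        )
    return base
```

A caller could wrap the detector's output in `to_model_code(...)` and pass the result as `source`, so that an unsupported language fails with a clear message instead of a confusing translation error.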

I think a good option would be to start with a section in the user guide showing how to use any (or all) of the language detection libraries. Then from there, we could build a util function along the lines of:

src = dlt.lang.detect(source_text, backend="fasttext")  # or backend="langdetect" or backend="langid"
mt.translate(source_text, source=src,...)

This would throw an error asking the user to install the corresponding library if the backend they requested is not installed.

@awalker88
Author

Those are some good points. I agree it would be confusing to have the library detect a language but not be able to translate it. I'll look into writing something that could potentially be put into the user guide.

@xhluca
Owner

xhluca commented Apr 27, 2021

Thank you. Once we have something in the user guide I'd welcome another PR that'd update dlt.utils or dlt.lang as well, if you wish!

@banyous

banyous commented Oct 13, 2021

Hi, are there any updates on this issue? Is there any hint on how to make the source language auto-detected?

@xhluca
Owner

xhluca commented Oct 16, 2021

@banyous Feel free to contribute a section in the user guide about using language detection, and from there, if we feel a wrapper around fasttext would make life easier, then I'm happy to welcome a PR to add language detection to dlt.utils or dlt.lang

I think this is a decent starting point: https://fasttext.cc/docs/en/language-identification.html

xhluca added the help wanted, enhancement, and good first issue labels on Jan 13, 2022