Tokenizer Internationalization - German #4

Open
clusterfudge opened this issue Jan 8, 2016 · 11 comments

Comments

@clusterfudge
Collaborator

We should test whether the EnglishTokenizer implementation is sufficient for German and, if not, add an additional tokenizer. EnglishTokenizer is based on the Porter stemmer.

@freundTech

I had a quick look at the tokenizer, and it looks like it should work with German if abbreviations_list gets adjusted and clitics get removed.

Formal German doesn't have any clitics, so clitic handling won't be needed (German dialects are a different topic ;) )
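
As a rough illustration of the adjustment described above, here is a minimal sketch of a tokenizer that keeps a German abbreviation list and has no clitic handling at all. It does not reproduce Adapt's EnglishTokenizer; the names `GERMAN_ABBREVIATIONS` and `tokenize_german` are hypothetical, and the abbreviation list is deliberately incomplete.

```python
import re

# Hypothetical, deliberately incomplete list of German abbreviations.
# A real list would need review by fluent speakers.
GERMAN_ABBREVIATIONS = {"z.b.", "bzw.", "usw.", "ca.", "dr.", "nr.", "str."}


def tokenize_german(text):
    """Split text into tokens, keeping listed abbreviations intact.

    There is no clitic step: formal German has no forms like "don't"
    that would need to be split apart.
    """
    tokens = []
    for chunk in text.split():
        if chunk.lower() in GERMAN_ABBREVIATIONS:
            # Keep the abbreviation, including its dots, as one token.
            tokens.append(chunk)
            continue
        # Separate runs of word characters from single punctuation marks.
        # In Python 3, \w matches umlauts and ß as well.
        tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens
```

For example, `tokenize_german("Schalte z.B. das Licht aus.")` keeps "z.B." as one token and splits the final period off "aus".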

@clusterfudge
Collaborator Author

Awesome! Happy to review any pull requests. I don't have a process in place for reviewing localizations in languages I don't understand, so this will likely be a bit of a process.

@hinzundcode

hinzundcode commented May 18, 2016

In German there are "compound verbs" like "ausschalten" (to turn off) or "herunterladen" (to download), and you have to separate them when you conjugate them. For example:

"Schalte das Licht aus" (turn the light off)
"Lade die Datei herunter" (download the file)

Now I want to define an Intent that listens for "Schalte das Licht aus" and "würdest du bitte das Licht ausschalten" (would you please turn off the light), so I'd like to define "ausschalten" as an entity and also "schalte" + "aus". Is it possible to combine two separate words into a single entity with the current version of adapt?
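
One possible workaround, independent of whether Adapt itself can do this, is to normalize the utterance before it reaches the intent parser. The sketch below is not an Adapt feature: `SEPARABLE_VERBS` and `join_separable_verbs` are hypothetical names, the table is hand-written, and it assumes the simple imperative pattern from the examples (finite verb first, separated prefix last).

```python
# Hypothetical mapping from (finite verb form, separated prefix) to infinitive.
SEPARABLE_VERBS = {
    ("schalte", "aus"): "ausschalten",
    ("schalte", "ein"): "einschalten",
    ("lade", "herunter"): "herunterladen",
}


def join_separable_verbs(tokens):
    """Replace a finite verb and its clause-final prefix with the infinitive.

    Only handles the pattern "<verb> ... <prefix>"; any other token
    sequence is returned unchanged.
    """
    if len(tokens) < 2:
        return list(tokens)
    key = (tokens[0].lower(), tokens[-1].lower())
    infinitive = SEPARABLE_VERBS.get(key)
    if infinitive is None:
        return list(tokens)
    # Drop the trailing prefix and put the infinitive where the verb stood.
    return [infinitive] + list(tokens[1:-1])
```

With this, "Schalte das Licht aus" becomes `["ausschalten", "das", "Licht"]`, so a single "ausschalten" entity could cover both phrasings. Sentences that already contain the infinitive pass through untouched.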

@timaschew

Any news on this?

@clusterfudge
Collaborator Author

Hey guys, sorry, this totally fell off my radar. From @hinzundcode's post, it sounds like the English tokenizer may not be sufficient for his case. If so, we'd need a couple of things to get this working:

  1. a German tokenizer with associated tests
  2. documentation for upstream systems on how to consume the new tokenizer (I'm largely thinking of @mycroft here)

I don't have the expertise to work on the former, so we'd be looking for support from the community here.

@acidjunk

acidjunk commented Aug 9, 2019

The same is true for Dutch. #13

@acidjunk

acidjunk commented Aug 9, 2019

All the other languages seem stuck; Spanish seems to have progressed the most. #5
But when I look at the current src there is only English, without any hooks or guidance on how to start translating. I can very easily translate the English strings in https://github.com/MycroftAI/adapt/blob/master/adapt/tools/text/tokenizer.py but multilingual support seems to be missing, judging by the layout of folders and files.

Shouldn't there be a way to see that tokenizer.py is the EN variant? (Apart from the EnglishTokenizer class, the rest of the file consists of top-level variables whose contents would need translating.)

E.g. I would expect something like: adapt/tools/text/en/tokenizer.py and adapt/tools/text/de/tokenizer.py

More than happy to create a PR with an NL tokenizer (and I might even be able to help with a German version), but without multilingual support it feels a bit useless. I might be missing some essential design clue, so any pointers in the right direction are appreciated.
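
To make the idea of per-language hooks concrete, here is a minimal sketch of a lookup keyed by language code. Nothing here exists in Adapt: `TOKENIZERS`, `register_tokenizer`, `get_tokenizer`, and both sketch classes are hypothetical, and the whitespace splitting only stands in for real language-specific logic.

```python
# Hypothetical registry mapping language codes to tokenizer classes.
TOKENIZERS = {}


def register_tokenizer(lang):
    """Class decorator that files a tokenizer under a language code."""
    def decorator(cls):
        TOKENIZERS[lang] = cls
        return cls
    return decorator


@register_tokenizer("en")
class EnglishTokenizerSketch:
    def tokenize(self, text):
        # Placeholder: a real tokenizer would handle clitics, abbreviations, etc.
        return text.split()


@register_tokenizer("de")
class GermanTokenizerSketch:
    def tokenize(self, text):
        # Placeholder: a real tokenizer would use German abbreviations.
        return text.split()


def get_tokenizer(lang):
    """Return a new tokenizer instance for lang, or raise ValueError."""
    try:
        return TOKENIZERS[lang]()
    except KeyError:
        raise ValueError(f"No tokenizer registered for language {lang!r}") from None
```

In a layout like the one suggested above, each language module would register its own class, and an unsupported code such as "nl" would fail loudly until a tokenizer is contributed.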

@acidjunk

Also not a single word about translating this in the docs: https://mycroft.ai/documentation/adapt/

Not sure how to continue, without (community) support.

@clusterfudge : Would be nice to at least remove the "READY" labels as they are somewhat confusing.

@ghost

ghost commented Mar 1, 2020

I'm also interested in a German tokenizer as I'm localizing one of my applications to German at the moment. I'm forced to disable intent parsing for the German version, which is a pity because it does add value to my product.

Looks like this project is dead, though. Any alternatives? @acidjunk, could you solve it?

@clusterfudge
Collaborator Author

Sorry for the delayed followup here; removing the READY label is probably a good call, @acidjunk .

In order to hit READY, what we likely need is a well-specified interface for Tokenizer. There's also likely a chunk of project-management work on my part to lay out the work for each language. I'll attempt to put something like that up in the next week.
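
As a starting point for discussion, such an interface could be as small as the sketch below. This is an assumption about what the contract might look like, not an agreed design: the abstract `Tokenizer` class and the `WhitespaceTokenizer` example are hypothetical names.

```python
from abc import ABC, abstractmethod
from typing import List


class Tokenizer(ABC):
    """Hypothetical contract every language-specific tokenizer would meet."""

    @abstractmethod
    def tokenize(self, text: str) -> List[str]:
        """Return the tokens of text in their original order."""


class WhitespaceTokenizer(Tokenizer):
    """Trivial implementation, useful only as a conformance example."""

    def tokenize(self, text: str) -> List[str]:
        return text.split()
```

A class that subclasses `Tokenizer` without implementing `tokenize` cannot be instantiated, which gives each language contribution a simple, testable target.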

One field on this tracking table will indicate whether or not bag-of-words classification works for the language in question. This will require language fluency and a good comprehension of bag-of-words confirmation. If anyone feels like they meet the criteria for this, feel free to speak up!

@acidjunk

I'm not following Mycroft stuff very actively anymore, mostly due to the lack of delivery of the MkII: my domotica has already been completely controllable by Siri in Dutch for the last 2 years.

I'm fluent in English, Dutch and German, so if there is work with a well-defined scope for improving language support, I can help. Just shoot.


5 participants