Arabic language support #7146
-
Hi! There is some very basic support for Arabic, implemented here: https://github.com/explosion/spaCy/tree/master/spacy/lang/ar If you haven't seen it yet, this thread by Ines highlights the different steps and possible enhancements to improve the support for a particular language within spaCy. We're always very happy to receive community contributions from native speakers, as the core spaCy team only speaks so many languages ;-)
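For reference, a minimal sketch of trying out that existing language data; the sample sentence is only an illustration, any Arabic text works:

```python
import spacy

# Create a blank Arabic pipeline: it picks up the basic language data
# (tokenizer exceptions, stop words, punctuation rules) from spacy/lang/ar.
nlp = spacy.blank("ar")

# Illustrative sentence ("Hello, world").
doc = nlp("مرحبا بالعالم")
print([token.text for token in doc])
```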
-
I'm willing to prototype a spaCy language model for Arabic (MSA)

1. Identified the UD treebank to use for training the models
Probably the best UD data would be the Arabic-NYUAD Treebank; the annotation is licensed as CC BY-SA 4.0, but to get the complete data you need to be a member of the LDC Consortium or to negotiate a specific agreement. I'm not in a position to do that; possibly you can, at explosion.ai.

2. First attempt at training the models ran into a problem of poor-quality tokenization
The most serious warning refers to the tokenization:
3. Finding a tokenizer compatible with the analysis in the training set
4. Complying with the conservative (non-destructive) tokenization requirement
5. Some details
6. Getting a misalignment exception [DELETED]
7. Cannot explain the reason for the exception [DELETED]
8. Some references
9. The full exception traceback [DELETED]
10. The config.cfg configuration
11. Results of debug data with the introduction of the new tokenizer (a hedged sketch of how a custom tokenizer can be registered for use from the config follows this list)
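As background for items 3, 4 and 10, this is roughly how a custom tokenizer can be registered so that a config.cfg can refer to it. The factory name "my_arabic_tokenizer" and the whitespace-only tokenizer below are placeholders following the pattern in the spaCy documentation, not the actual Arabic tokenizer discussed in this thread:

```python
import spacy
from spacy.tokens import Doc


class WhitespaceLikeTokenizer:
    """Placeholder tokenizer: splits the text on whitespace only."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split()
        spaces = [True] * len(words)
        if words:
            spaces[-1] = False  # no trailing space after the last token
        return Doc(self.vocab, words=words, spaces=spaces)


# Register a tokenizer factory under a name that config.cfg can reference
# via [nlp.tokenizer] @tokenizers = "my_arabic_tokenizer".
@spacy.registry.tokenizers("my_arabic_tokenizer")
def create_my_arabic_tokenizer():
    def make_tokenizer(nlp):
        return WhitespaceLikeTokenizer(nlp.vocab)
    return make_tokenizer
```

With a registration like this placed in a module passed via the --code option, spacy train (and, as far as I know, spacy debug data) can resolve the custom tokenizer named in the config.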
-
I'm willing to prototype a spaCy language model for Arabic (MSA) - continued

12. Not able to train the models (for context, the usual way of launching training from a config is sketched after this list)
13. On the time performance
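For context on point 12, this is roughly how training from a config is usually launched from Python (spaCy v3.2+); the paths below are placeholders, not the actual files used in this thread:

```python
from spacy.cli.train import train

# Launch training from the config file; all paths here are placeholders.
train(
    "config.cfg",
    output_path="output",
    overrides={
        "paths.train": "corpus/ar_padt-ud-train.spacy",
        "paths.dev": "corpus/ar_padt-ud-dev.spacy",
    },
)
```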
-
(Please note that issue #13248, Cannot train Arabic models with a custom tokenizer, refers to this discussion.)

15. Rewriting the custom tokenizer in Cython
I realized that my pure Python version of the custom tokenizer wasn't feeding the vocabulary with the strings associated with the generated tokens (see the sketch at the end of this comment).
16. The problem related to parser training persists
If I restore the full configuration, which included the complete pipeline, I again get the exception encountered previously, which concerns the parser.
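Coming back to point 15, the vocabulary mechanism I was missing can be illustrated roughly as follows; the tokens are illustrative, not taken from the corpus, and the snippet only shows the generic spaCy behaviour, not my tokenizer:

```python
from spacy.lang.ar import Arabic
from spacy.tokens import Doc

nlp = Arabic()
words = ["مثال", "بسيط"]   # illustrative tokens ("a simple example")
spaces = [True, False]

# Building the Doc from a list of words interns each string in the shared
# StringStore, so later pipeline components can look the token texts up again.
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Token texts produced at a lower level (e.g. in Cython code) can also be
# registered explicitly in the vocabulary's string store:
for word in words:
    nlp.vocab.strings.add(word)

print([nlp.vocab.strings[word] for word in words])  # the interned hash values
```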
-
Hello gtoffoli, I want to use some Python code for text analysis, and it uses en_core_web_lg. In your post you mentioned https://github.com/gtoffoli/spacy-cameltokenizer. How can I use it so that it acts as something like ar_core_web_lg? Or is it for a totally different purpose? Thank you
-
Hi getData123, no, it isn't for a different purpose: that is exactly what I'm aiming at. I'm working on this thread only part-time. Also, during the last few weeks I took a break for a few reasons; among them:
Please note that I don't know the Arabic language; I'm just currently learning a bit of it.
-
Recap of my comments above and some updates

For a few months now, in my spare time, I have been trying to develop a tokenizer for the Arabic language, limited to MSA (Modern Standard Arabic), that is useful for training the basic spaCy pipeline. I'm willing to investigate further whether I can develop a viable Arabic tokenizer, given the resources available to me (the annotated corpus) for training the language model and the constraints posed by the spaCy architecture.

The spaCy requirements
The annotated corpus
Below is a small excerpt from the training set, file ar_padt-ud-train.conllu:
(Note that the "vocalized" forms of the Arabic tokens in the last two lines, 5 and 6, provided as the attribute Vform, differ slightly, while the non-vocalized forms, just after the numeric ids, are identical in this case.) As you can see, the tokenization performed by the annotators of the Arabic-PADT corpus is "destructive" in the spaCy sense: the word in the line labeled 5-6, of length 3, is split into two tokens (lines 5 and 6) whose surface text (the first field after the numeric id) has a total length of 4. I've just started to convert to the spaCy declarative style, in the form of "tokenizer_exceptions", some rules that I had previously formulated procedurally, in Python code. This is a small excerpt of the module tokenizer_exceptions.py:
You can see that I'm trying to tokenize in non-destructive mode the word in the corpus line labeled 5-6 (compare it with my first rule), confident that the downstream learning algorithms will exploit the information provided for the preposition min (ORTH and NORM), even though the token text (ORTH) differs, because it is truncated, from its form when the preposition is not fused with a pronoun. If I run the native spaCy tokenizer with just the extension above (three rules), the rules work; that is, the words matching their keys are each correctly split into two tokens. In the case of the first rule, the first token is the preposition min in its truncated form and the second one is the pronoun man.

(1) (from #12247)
(2) (from https://spacy.io/usage/linguistic-features)
(3) my current tokenizer
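To make the first rule concrete, here is a hedged sketch of a non-destructive special case along the lines described above; the Arabic forms and attribute values are my reconstruction of the rule as described in the text, not the actual content of tokenizer_exceptions.py:

```python
from spacy.lang.ar import Arabic
from spacy.symbols import ORTH, NORM

nlp = Arabic()

# Hypothetical special case for the fused form "ممن" (min + man): the ORTH
# values must concatenate back to the original surface text, so the
# preposition keeps only its truncated form "م", while NORM records the
# full form "من" for downstream components.
nlp.tokenizer.add_special_case(
    "ممن",
    [
        {ORTH: "م", NORM: "من"},   # truncated preposition min
        {ORTH: "من", NORM: "من"},  # pronoun man
    ],
)

doc = nlp("ممن")
print([(token.text, token.norm_) for token in doc])
```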
-
I've just added the README to the package https://github.com/gtoffoli/spacy-cameltokenizer
-
Thank you very much, really good news; I will try it.
-
I've just published on GitHub the package https://github.com/gtoffoli/spacy-ar_core_news_md with a tentative README.
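Assuming the package from that repository is installed in the same environment (for instance with pip), it should load like any other spaCy pipeline package; the package name is taken from the repository name and the sample sentence is only an illustration:

```python
import spacy

# Assumes the ar_core_news_md package built from the repository above has
# been installed; the name is inferred from the repository, not verified here.
nlp = spacy.load("ar_core_news_md")

doc = nlp("نص تجريبي")  # illustrative input: "a test text"
print([(token.text, token.pos_) for token in doc])
```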
-
Are there any plans to provide support for the Arabic language in the near future?
We are ready and eager to support any effort to make it happen!