Arabic language support #7146
-
Hi! There is some very basic support for Arabic, implemented here: https://github.com/explosion/spaCy/tree/master/spacy/lang/ar If you haven't seen it yet, this thread by Ines highlights the different steps and possible enhancements to improve the support for a particular language within spaCy. We're always very happy to receive community contributions from native speakers, as the core spaCy team only speaks so many languages ;-)
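For reference, a minimal sketch of trying out that existing language data; the sample sentence is only an illustration, any Arabic text works:

```python
import spacy

# Create a blank Arabic pipeline: it picks up the basic language data
# (tokenizer exceptions, stop words, punctuation rules) from spacy/lang/ar.
nlp = spacy.blank("ar")

# Illustrative sentence ("Hello, world").
doc = nlp("مرحبا بالعالم")
print([token.text for token in doc])
```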
-
I'm willing to prototype a spaCy language model for Arabic (MSA)

1. Identified the UD treebank to use for training the models
Probably the best UD data would be the Arabic-NYUAD Treebank; the annotation is licensed as CC BY-SA 4.0, but to get the complete data you need to be a member of the LDC Consortium or to negotiate a specific agreement. I'm not in a position to do that; possibly you can, at explosion.ai.

2. First attempt at training the models ran into a problem of poor-quality tokenization
The most serious warning refers to the tokenization:
3. Finding a tokenizer compatible with the analysis in the training set
4. Complying with the conservative (non-destructive) tokenization requirement
5. Some details
6. Getting a misalignment exception [DELETED]
7. Cannot explain the reason for the exception [DELETED]
8. Some references
9. The full exception traceback [DELETED]
10. The config.cfg configuration
11. Results of debug data with the introduction of the new tokenizer (a hedged sketch of how a custom tokenizer can be registered for use from the config follows this list)
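As background for items 3, 4 and 10, this is roughly how a custom tokenizer can be registered so that a config.cfg can refer to it. The factory name "my_arabic_tokenizer" and the whitespace-only tokenizer below are placeholders following the pattern in the spaCy documentation, not the actual Arabic tokenizer discussed in this thread:

```python
import spacy
from spacy.tokens import Doc


class WhitespaceLikeTokenizer:
    """Placeholder tokenizer: splits the text on whitespace only."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split()
        spaces = [True] * len(words)
        if words:
            spaces[-1] = False  # no trailing space after the last token
        return Doc(self.vocab, words=words, spaces=spaces)


# Register a tokenizer factory under a name that config.cfg can reference
# via [nlp.tokenizer] @tokenizers = "my_arabic_tokenizer".
@spacy.registry.tokenizers("my_arabic_tokenizer")
def create_my_arabic_tokenizer():
    def make_tokenizer(nlp):
        return WhitespaceLikeTokenizer(nlp.vocab)
    return make_tokenizer
```

With a registration like this placed in a module passed via the --code option, spacy train (and, as far as I know, spacy debug data) can resolve the custom tokenizer named in the config.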
-
I'm willing to prototype a spaCy language model for Arabic (MSA) - continued

12. Not able to train the models (for context, the usual way of launching training from a config is sketched after this list)
13. On the time performance
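For context on point 12, this is roughly how training from a config is usually launched from Python (spaCy v3.2+); the paths below are placeholders, not the actual files used in this thread:

```python
from spacy.cli.train import train

# Launch training from the config file; all paths here are placeholders.
train(
    "config.cfg",
    output_path="output",
    overrides={
        "paths.train": "corpus/ar_padt-ud-train.spacy",
        "paths.dev": "corpus/ar_padt-ud-dev.spacy",
    },
)
```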
-
(Please note that issue #13248, Cannot train Arabic models with a custom tokenizer, refers to this discussion.)

15. Rewriting the custom tokenizer in Cython
I realized that my pure Python version of the custom tokenizer wasn't feeding the vocabulary with the strings associated with the generated tokens (see the sketch at the end of this comment).
16. The problem related to parser training persists
If I restore the full configuration, which included the complete pipeline, I again get the exception encountered previously, which concerns the parser.
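Coming back to point 15, the vocabulary mechanism I was missing can be illustrated roughly as follows; the tokens are illustrative, not taken from the corpus, and the snippet only shows the generic spaCy behaviour, not my tokenizer:

```python
from spacy.lang.ar import Arabic
from spacy.tokens import Doc

nlp = Arabic()
words = ["مثال", "بسيط"]   # illustrative tokens ("a simple example")
spaces = [True, False]

# Building the Doc from a list of words interns each string in the shared
# StringStore, so later pipeline components can look the token texts up again.
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Token texts produced at a lower level (e.g. in Cython code) can also be
# registered explicitly in the vocabulary's string store:
for word in words:
    nlp.vocab.strings.add(word)

print([nlp.vocab.strings[word] for word in words])  # the interned hash values
```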
-
Hello gtoffoli, I want to use some Python code for text analysis, and it uses en_core_web_lg. In your post you mentioned https://github.com/gtoffoli/spacy-cameltokenizer. How can I use it so that it acts as something like ar_core_web_lg? Or is it for a totally different purpose? Thank you
-
Hi getData123, no, it isn't for a different purpose: that is exactly what I'm aiming at. I'm working on this thread only part-time. Also, during the last few weeks I took a break for a few reasons; among them:
Please note that I don't know the Arabic language; I'm just currently learning a bit of it.
-
Recap of my comments above and some updates

For a few months now, in my spare time, I have been trying to develop a tokenizer for the Arabic language, limited to MSA (Modern Standard Arabic), that is useful for training the basic spaCy pipeline. I'm willing to investigate further whether I can develop a viable Arabic tokenizer, given the resources available to me (the annotated corpus) for training the language model and the constraints posed by the spaCy architecture.

The spaCy requirements
The annotated corpus
Below is a small excerpt from the training set, file ar_padt-ud-train.conllu:
(Note that the "vocalized" forms of the Arabic tokens in the last two lines, 5 and 6, provided as the attribute Vform, differ slightly, while the non-vocalized forms, just after the numeric ids, are identical in this case.) As you can see, the tokenization performed by the annotators of the Arabic-PADT corpus is "destructive" in the spaCy sense: the word in the line labeled 5-6, of length 3, is split into two tokens (lines 5 and 6) whose surface text (the first field after the numeric id) has a total length of 4. I've just started to convert to the spaCy declarative style, in the form of "tokenizer_exceptions", some rules that I had previously formulated procedurally, in Python code. This is a small excerpt of the module tokenizer_exceptions.py:
You can see that I'm trying to tokenize in non-destructive mode the word in the corpus line labeled 5-6 (compare it with my first rule), confident that the downstream learning algorithms will exploit the information provided for the preposition min (ORTH and NORM), even though the token text (ORTH) differs, because it is truncated, from its form when the preposition is not fused with a pronoun. If I run the native spaCy tokenizer with just the extension above (three rules), the rules work; that is, the words matching their keys are each correctly split into two tokens. In the case of the first rule, the first token is the preposition min in its truncated form and the second one is the pronoun man.

(1) (from #12247)
(2) (from https://spacy.io/usage/linguistic-features)
(3) my current tokenizer
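To make the first rule concrete, here is a hedged sketch of a non-destructive special case along the lines described above; the Arabic forms and attribute values are my reconstruction of the rule as described in the text, not the actual content of tokenizer_exceptions.py:

```python
from spacy.lang.ar import Arabic
from spacy.symbols import ORTH, NORM

nlp = Arabic()

# Hypothetical special case for the fused form "ممن" (min + man): the ORTH
# values must concatenate back to the original surface text, so the
# preposition keeps only its truncated form "م", while NORM records the
# full form "من" for downstream components.
nlp.tokenizer.add_special_case(
    "ممن",
    [
        {ORTH: "م", NORM: "من"},   # truncated preposition min
        {ORTH: "من", NORM: "من"},  # pronoun man
    ],
)

doc = nlp("ممن")
print([(token.text, token.norm_) for token in doc])
```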
-
I've just added the README to the package https://github.com/gtoffoli/spacy-cameltokenizer
-
Thank you very much, really good news; I will try it.
-
I've just published on GitHub the package https://github.com/gtoffoli/spacy-ar_core_news_md with a tentative README.
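Assuming the package from that repository is installed in the same environment (for instance with pip), it should load like any other spaCy pipeline package; the package name is taken from the repository name and the sample sentence is only an illustration:

```python
import spacy

# Assumes the ar_core_news_md package built from the repository above has
# been installed; the name is inferred from the repository, not verified here.
nlp = spacy.load("ar_core_news_md")

doc = nlp("نص تجريبي")  # illustrative input: "a test text"
print([(token.text, token.pos_) for token in doc])
```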
-
Are there any plans to provide support for the Arabic language in the near future?
We are ready and eager to support any effort to make it happen!