
Need to make Arabic language model and Arabic NER work under spaCy #6

Open
YanLiang1102 opened this issue Apr 27, 2018 · 2 comments

@YanLiang1102
Collaborator

YanLiang1102 commented Apr 27, 2018

Things to do:

Train Arabic language Model

  1. we need stopwords
  2. infix, prefix, and suffix rules
  3. may include the lemmatizer that Khaled has so far. --Khaled
  4. collect Arabic wiki articles and, together with our LexisNexis Arabic data, use gensim to train the word vectors needed to train the parser and tagger with spaCy. --Yan
  5. implement the necessary classes and get the language model trained!

Train Arabic NER Model

Using OntoNotes together with the Prodigy data we have, we should be able to get about 66K records of training data. We need to write a customized NER model for Arabic in spaCy and get it trained.
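A minimal training loop for a custom NER component could be sketched like this. This uses the spaCy v3 API (the v2 API current when this issue was opened used `GoldParse` instead of `Example`), and the training sentence, offsets, and labels are toy placeholders, not the OntoNotes/Prodigy data:

```python
# Sketch: train a blank Arabic pipeline's NER component on toy data.
# Labels and the example sentence are placeholders, not project data.
import spacy
from spacy.training import Example

nlp = spacy.blank("ar")          # blank Arabic pipeline
ner = nlp.add_pipe("ner")
ner.add_label("PERS")
ner.add_label("LOC")

# (text, {"entities": [(start_char, end_char, label)]})
TRAIN_DATA = [
    ("Khaled lives in Amman",
     {"entities": [(0, 6, "PERS"), (16, 21, "LOC")]}),
]

optimizer = nlp.initialize()
for _ in range(20):
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)
```

With the real 66K records, this loop would shuffle and minibatch the examples rather than iterate one at a time.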

@ahalterman @khaledJabr

@YanLiang1102
Collaborator Author

YanLiang1102 commented Apr 27, 2018

spaCy tasks

Tokenizer

For Khaled

  • need to hand-define prefixes, suffixes, "infixes"
  • need to define tokenizer exceptions
  • need stopwords (https://github.com/mohataher/arabic-stop-words/blob/master/list.txt)
  • see whether an existing deterministic lemmatizer exists (or is possible...)
  • need example sentences

spaCy questions:

  • does the tokenizer go only left to right?
  • how do you update the prefixes and suffixes? (or is it better not to at all...?) They live in spacy/spacy/lang/punctuation.py; custom ones live in e.g. spacy/spacy/lang/da/punctuation.py
  • trainable tokenizer? (#2220). Skip for now, come back later.

Conceptual:

  • do you split off prepositions? When you tokenize?
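Updating the prefixes at runtime can be sketched as below. The two Arabic prefixes added here (the conjunction "و" and the definite article "ال") are illustrative guesses, not the project's final rule set:

```python
# Sketch: extend a blank Arabic pipeline's tokenizer prefixes at runtime.
# The added prefixes are illustrative, not a vetted Arabic rule set.
import spacy
from spacy.util import compile_prefix_regex

nlp = spacy.blank("ar")

# Defaults.prefixes holds the regex pieces from lang/punctuation.py;
# append our hand-defined Arabic prefixes and recompile.
prefixes = list(nlp.Defaults.prefixes) + ["و", "ال"]
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search

doc = nlp("والكتاب")  # prefix rules now split the leading و and ال
print([t.text for t in doc])
```

The same pattern works for suffixes (`compile_suffix_regex` / `suffix_search`) and infixes (`compile_infix_regex` / `infix_finditer`); for a proper language module these lists would instead live in a `spacy/lang/ar/punctuation.py`.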

init model?

Do you need to do this first, or does it just create the folders?
(It seems to just be a nice way of setting things up.)

pre-defined embeddings and frequencies

  • Need to make word-frequency files and vectors.
  • what are the word frequencies for?
  • what format does the spacy-dev-resources code expect for text files?
  • how do you get vectors for your actual tokens? The init stuff uses gensim's tokenizer, not spaCy's. Mismatches between word_freqs.py and init-model: explosion/spaCy#1794 (comment)
  • put all Gigaword, LexisNexis, and Arabic Wikipedia into text files
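The word-frequency step can be as simple as counting tokens over those text files. Note the output layout used here (one `count<TAB>word` line per type) is an assumption to be checked against the spacy-dev-resources scripts, per the format question above:

```python
# Sketch: build a word-frequency file from pre-tokenized text files.
# The output format (count<TAB>word per line) is an assumption; verify
# it against spacy-dev-resources / init-model before relying on it.
from collections import Counter

def count_words(paths):
    """Count whitespace-separated tokens across all given files."""
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())
    return counts

def write_freqs(counts, out_path):
    """Write one 'count<TAB>word' line per type, most frequent first."""
    with open(out_path, "w", encoding="utf-8") as out:
        for word, freq in counts.most_common():
            out.write(f"{freq}\t{word}\n")
```

This also answers the Gigaword/LexisNexis/Wikipedia point: once everything is in plain tokenized text files, the same `count_words` pass covers all three sources.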

NER and dependency parse

  • Can you train a model on e.g. named entities, and then later train it just on dependencies?
    Or does that mess stuff up? Can you train with missing values for some of the fields?
  • Get UD Arabic working for spaCy.
  • Have coders make unit tests?

Other stuff

"SHOULD I EVER UPDATE THE GLOBAL DATA?"
(link is broken)
create_tokenizer: what is the "vocab Vocab" argument?

