
Need to make Arabic language model and Arabic NER work under spaCy #6

Open
YanLiang1102 opened this issue Apr 27, 2018 · 2 comments

@YanLiang1102
Collaborator

YanLiang1102 commented Apr 27, 2018

Things to do:

Train Arabic language Model

  1. we need stopwords
  2. infix, prefix, and suffix rules
  3. may include the lemmatizer that Khaled has so far. --Khaled
  4. collect Arabic wiki articles and, together with our LexisNexis Arabic data, use gensim to train the word vectors needed to train the parser and tagger with spaCy. --Yan
  5. implement the necessary classes and get the language model trained!

Train Arabic NER Model

Using OntoNotes together with the Prodigy data we have, we should be able to get about 66K records of training data. We need to write a customized NER model for Arabic in spaCy and get it trained.
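A minimal training loop for a custom NER component could be sketched like this. This uses the spaCy v3 API (the v2 API current when this issue was opened used `GoldParse` instead of `Example`), and the training sentence, offsets, and labels are toy placeholders, not the OntoNotes/Prodigy data:

```python
# Sketch: train a blank Arabic pipeline's NER component on toy data.
# Labels and the example sentence are placeholders, not project data.
import spacy
from spacy.training import Example

nlp = spacy.blank("ar")          # blank Arabic pipeline
ner = nlp.add_pipe("ner")
ner.add_label("PERS")
ner.add_label("LOC")

# (text, {"entities": [(start_char, end_char, label)]})
TRAIN_DATA = [
    ("Khaled lives in Amman",
     {"entities": [(0, 6, "PERS"), (16, 21, "LOC")]}),
]

optimizer = nlp.initialize()
for _ in range(20):
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)
```

With the real 66K records, this loop would shuffle and minibatch the examples rather than iterate one at a time.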

@ahalterman @khaledJabr

@YanLiang1102
Collaborator Author

YanLiang1102 commented Apr 27, 2018

spaCy tasks

Tokenizer

For Khaled

  • need to hand-define prefixes, suffixes, "infixes"
  • need to define tokenizer exceptions
  • need stopwords (https://github.com/mohataher/arabic-stop-words/blob/master/list.txt)
  • see whether an existing deterministic lemmatizer exists (or is possible...)
  • need example sentences

spaCy questions:

  • does the tokenizer go only left to right?
  • how do you update the prefixes and suffixes? (or is it better not to at all...?) They live in spacy/spacy/lang/punctuation.py; custom ones live in e.g. spacy/spacy/lang/da/punctuation.py
  • trainable tokenizer? (#2220). Skip for now, come back later.

Conceptual:

  • do you split off prepositions? When you tokenize?
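Updating the prefixes at runtime can be sketched as below. The two Arabic prefixes added here (the conjunction "و" and the definite article "ال") are illustrative guesses, not the project's final rule set:

```python
# Sketch: extend a blank Arabic pipeline's tokenizer prefixes at runtime.
# The added prefixes are illustrative, not a vetted Arabic rule set.
import spacy
from spacy.util import compile_prefix_regex

nlp = spacy.blank("ar")

# Defaults.prefixes holds the regex pieces from lang/punctuation.py;
# append our hand-defined Arabic prefixes and recompile.
prefixes = list(nlp.Defaults.prefixes) + ["و", "ال"]
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search

doc = nlp("والكتاب")  # prefix rules now split the leading و and ال
print([t.text for t in doc])
```

The same pattern works for suffixes (`compile_suffix_regex` / `suffix_search`) and infixes (`compile_infix_regex` / `infix_finditer`); for a proper language module these lists would instead live in a `spacy/lang/ar/punctuation.py`.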

init model?

Do you need to do this first, or does it just create the folders?
(It seems to just be a nice way of setting things up.)

pre-defined embeddings and frequencies

  • Need to make word-frequency files and vectors.
  • what are the word frequencies for?
  • what format does the spacy-dev-resources code expect for text files?
  • how do you get vectors for your actual tokens? The init stuff uses gensim's tokenizer, not spaCy's. Mismatches between word_freqs.py and init-model: explosion/spaCy#1794 (comment)
  • put all Gigaword, LexisNexis, and Arabic Wikipedia into text files
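The word-frequency step can be as simple as counting tokens over those text files. Note the output layout used here (one `count<TAB>word` line per type) is an assumption to be checked against the spacy-dev-resources scripts, per the format question above:

```python
# Sketch: build a word-frequency file from pre-tokenized text files.
# The output format (count<TAB>word per line) is an assumption; verify
# it against spacy-dev-resources / init-model before relying on it.
from collections import Counter

def count_words(paths):
    """Count whitespace-separated tokens across all given files."""
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())
    return counts

def write_freqs(counts, out_path):
    """Write one 'count<TAB>word' line per type, most frequent first."""
    with open(out_path, "w", encoding="utf-8") as out:
        for word, freq in counts.most_common():
            out.write(f"{freq}\t{word}\n")
```

This also answers the Gigaword/LexisNexis/Wikipedia point: once everything is in plain tokenized text files, the same `count_words` pass covers all three sources.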

NER and dependency parse

  • Can you train a model on e.g. named entities, and then later train it just on dependencies?
    Or does that mess stuff up? Can you train with missing values for some of the fields?
  • Get UD Arabic working for spaCy.
  • Have coders make unit tests?

Other stuff

"SHOULD I EVER UPDATE THE GLOBAL DATA?"
(link is broken)
create_tokenizer: what is the "vocab Vocab" argument?

