
Arabic data issue and potential fixes #15

Open

khaledJabr opened this issue Jul 2, 2018 · 5 comments


khaledJabr commented Jul 2, 2018

I had a chance to look at the training data we are using for this, and there are two main issues with it:

  1. The training data includes diacritics. Diacritics are extra short vowels added to Arabic words to help with pronunciation and to differentiate the meanings of two or more words that otherwise match, which is usually needed at the lemma level. Diacritics are not used in modern Arabic writing, including our news sources and the data we collected from the coders using the Prodigy interface. I suspect this might be one of the things hurting the NER model. One really important thing to check here is whether the word embeddings we are using were trained on data with diacritics or not (a sketch of such a check follows the code below). I don't have a clear answer for how this has affected or could have affected our training, but my main intuition is that normalizing/standardizing our data as much as we can is always a good thing.

  2. Aside from the diacritics, I have noticed that most (if not all) of the tokens (the actual tokens, the ones stored as orth) have an extra space at the end, and a lot of them have weird extra characters. Here are some examples:

'orth': 'ال{ِسْتِسْلامُ '
'orth': '-مُعالَجَةِ '
'orth': '-{ِعْتِباراتِ- '

Although many of these have a NER label of O, I still think they are worth fixing. Here is how I would go about fixing both issues (there are other ways, but this is the first thing that comes to mind):

import re
import pyarabic.araby as araby

text = '-آمِلَةً '
no_diacritics = araby.strip_tashkeel(text)            # remove all diacritics
just_arabic_text = re.sub(r'\W+', '', no_diacritics)  # remove everything but the word itself
                                                      # (assumes there is only one word in orth)
just_arabic_text


Output:

آملة
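
To check the embeddings question from point 1 empirically, we could measure vocabulary coverage before and after stripping diacritics. A minimal sketch, assuming the vectors are in word2vec/fastText text format; the path embeddings.vec and the sample orths are hypothetical stand-ins for whatever we actually use:

import pyarabic.araby as araby

def load_vocab(path):
    # First whitespace-separated field of each line is the word; a
    # fastText-style "count dim" header would add one junk entry,
    # which is harmless for a rough coverage estimate.
    with open(path, encoding='utf-8') as f:
        return {line.split(' ', 1)[0] for line in f}

def coverage(tokens, vocab):
    # Fraction of tokens that have a pretrained vector.
    return sum(1 for t in tokens if t in vocab) / len(tokens)

vocab = load_vocab('embeddings.vec')  # hypothetical path to our vectors
orths = ['ال{ِسْتِسْلامُ ', '-مُعالَجَةِ ', '-آمِلَةً ']
stripped = [araby.strip_tashkeel(t).strip() for t in orths]

print('raw coverage:     ', coverage(orths, vocab))
print('stripped coverage:', coverage(stripped, vocab))

If coverage jumps after stripping, that would confirm the embeddings were trained on undiacritized text.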

One last thing: do we have a key or table somewhere that lists the labels we are using in our big NER dataset (the combined one)?

YanLiang1102 commented

Do you mean what the label could be? @khaledJabr

YanLiang1102 commented

#10
@khaledJabr check that issue; all the NER classes are listed there.


ahalterman commented Jul 13, 2018

It sounds like the thing to do is to rerun one of the simple models with some simple changes to the Arabic text:

  1. remove the diacritics
  2. remove leading/trailing spaces
  3. remove other junk like hyphens.

Khaled's code above does all of that, so I think we should run it over all the orths, retrain the model, and see how it goes (we should get much better word embedding coverage after doing that). A sketch of that pass is below.
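
A minimal sketch of that pass, assuming the JSON files are a list of records whose tokens carry orth fields as in the snippets above (the actual schema may differ; the field names here are assumptions):

import re
import json
import pyarabic.araby as araby

def clean_orth(text):
    # Khaled's fix: strip diacritics, then drop everything but the word.
    no_diacritics = araby.strip_tashkeel(text)
    return re.sub(r'\W+', '', no_diacritics)

with open('cleaned_combined_removed.json', encoding='utf-8') as f:
    data = json.load(f)

# 'tokens' and 'orth' are assumed field names.
for record in data:
    for token in record.get('tokens', []):
        token['orth'] = clean_orth(token['orth'])

with open('cleaned_combined_stripped.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)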


YanLiang1102 commented Jul 16, 2018

[screenshot: exception raised after cleaning only the tokens]

Only fixing the "orth" tokens will throw an exception: the algorithm looks at the positions of those tokens, so if we strip the extra space and other junk from a token but the original text does not change with it, we run into this error. I looked into the training data; it does not store start and end indices, but it still uses them somehow, so when we delete the "useless" stuff, it stops working.
@khaledJabr @ahalterman
Khaled, would you like to jump in and clean the raw text? Since I cannot read Arabic, I will not be able to do this myself, so I will point you to it. The data you need to clean is on hanover:

training:
/home/yan/arabicNER/nerdata/cleaned_combined_removed.json
eval data:
/home/yan/arabicNER/nerdata/ar_eval_all_cleaned_removed.json
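
Given the position problem above, one way around it is to clean the raw text and the tokens with the same function, so that any offsets recomputed from the text still line up with the orths. A minimal sketch; the 'raw'/'tokens' record layout is an assumption for illustration, not the actual schema:

import re
import pyarabic.araby as araby

def clean_text(text):
    # Strip diacritics but keep whitespace so token boundaries survive,
    # then drop stray punctuation such as the leading/trailing hyphens.
    no_diacritics = araby.strip_tashkeel(text)
    return re.sub(r'[^\w\s]+', '', no_diacritics)

# 'raw' and 'tokens' are assumed field names for illustration.
record = {'raw': '-آمِلَةً مُعالَجَةِ ',
          'tokens': [{'orth': '-آمِلَةً '}, {'orth': 'مُعالَجَةِ '}]}

record['raw'] = clean_text(record['raw']).strip()
for tok in record['tokens']:
    tok['orth'] = clean_text(tok['orth']).strip()

print(record)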

YanLiang1102 commented

Here is the code I have for cleaning up the tokens, if you want to take a look @khaledJabr: https://github.com/oudalab/Arabic-NER/blob/master/explore_traingdata.ipynb
