
Arabic data issue and potential fixes #15

Open

khaledJabr opened this issue Jul 2, 2018 · 5 comments


khaledJabr commented Jul 2, 2018

I had a chance to look at the training data we are using for this, and there are two main issues with it:

  1. The training data includes diacritics. Diacritics are extra short vowels added to Arabic words to help with pronunciation and to differentiate the meanings of two or more words that otherwise match, which is usually needed at the lemma level. Diacritics are not used in modern Arabic writing, including our news sources and the data we collected from the coders using the Prodigy interface. I suspect this might be one of the things hurting the NER model. One really important thing to check here is whether the word embeddings we are using were trained on data with diacritics or not (a sketch of such a check follows the code below). I don't have a clear answer for how this has affected or could have affected our training, but my main intuition is that normalizing/standardizing our data as much as we can is always a good thing.

  2. Aside from the diacritics, I have noticed that most (if not all) of the tokens (the actual tokens, the ones stored as orth) have an extra space at the end, and a lot of them have weird extra characters. Here are some examples:

'orth': 'ال{ِسْتِسْلامُ '
'orth': '-مُعالَجَةِ '
'orth': '-{ِعْتِباراتِ- '

Although many of these have a NER label of O, I still think they are worth fixing. Here is how I would go about fixing both issues (there are other ways, but this is the first thing that comes to mind):

import re
import pyarabic.araby as araby

text = '-آمِلَةً '
no_diacritics = araby.strip_tashkeel(text)            # remove all diacritics
just_arabic_text = re.sub(r'\W+', '', no_diacritics)  # remove everything but the word itself
                                                      # (assumes there is only one word in orth)
just_arabic_text


Output:

آملة
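
To check the embeddings question from point 1 empirically, we could measure vocabulary coverage before and after stripping diacritics. A minimal sketch, assuming the vectors are in word2vec/fastText text format; the path embeddings.vec and the sample orths are hypothetical stand-ins for whatever we actually use:

import pyarabic.araby as araby

def load_vocab(path):
    # First whitespace-separated field of each line is the word; a
    # fastText-style "count dim" header would add one junk entry,
    # which is harmless for a rough coverage estimate.
    with open(path, encoding='utf-8') as f:
        return {line.split(' ', 1)[0] for line in f}

def coverage(tokens, vocab):
    # Fraction of tokens that have a pretrained vector.
    return sum(1 for t in tokens if t in vocab) / len(tokens)

vocab = load_vocab('embeddings.vec')  # hypothetical path to our vectors
orths = ['ال{ِسْتِسْلامُ ', '-مُعالَجَةِ ', '-آمِلَةً ']
stripped = [araby.strip_tashkeel(t).strip() for t in orths]

print('raw coverage:     ', coverage(orths, vocab))
print('stripped coverage:', coverage(stripped, vocab))

If coverage jumps after stripping, that would confirm the embeddings were trained on undiacritized text.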

One last thing: do we have a key or table somewhere that lists the labels we are using in our big NER dataset (the combined one)?

YanLiang1102 commented

Do you mean what the label could be? @khaledJabr

YanLiang1102 commented

#10
@khaledJabr check that issue; all the NER classes are listed there.


ahalterman commented Jul 13, 2018

It sounds like the thing to do is to rerun one of the simple models with some simple changes to the Arabic text:

  1. remove the diacritics
  2. remove leading/trailing spaces
  3. remove other junk like hyphens.

Khaled's code above does all of that, so I think we should run it over all the orths, retrain the model, and see how it goes (we should get much better word embedding coverage after doing that). A sketch of that pass is below.
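
A minimal sketch of that pass, assuming the JSON files are a list of records whose tokens carry orth fields as in the snippets above (the actual schema may differ; the field names here are assumptions):

import re
import json
import pyarabic.araby as araby

def clean_orth(text):
    # Khaled's fix: strip diacritics, then drop everything but the word.
    no_diacritics = araby.strip_tashkeel(text)
    return re.sub(r'\W+', '', no_diacritics)

with open('cleaned_combined_removed.json', encoding='utf-8') as f:
    data = json.load(f)

# 'tokens' and 'orth' are assumed field names.
for record in data:
    for token in record.get('tokens', []):
        token['orth'] = clean_orth(token['orth'])

with open('cleaned_combined_stripped.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)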


YanLiang1102 commented Jul 16, 2018

[screenshot: exception raised after cleaning only the tokens]

Only fixing the "orth" tokens will throw an exception: the algorithm looks at the positions of those tokens, so if we strip the extra space and other junk from a token but the original text does not change with it, we run into this error. I looked into the training data; it does not store start and end indices, but it still uses them somehow, so when we delete the "useless" stuff, it stops working.
@khaledJabr @ahalterman
Khaled, would you like to jump in and clean the raw text? Since I cannot read Arabic, I will not be able to do this myself, so I will point you to it. The data you need to clean is on hanover:

training:
/home/yan/arabicNER/nerdata/cleaned_combined_removed.json
eval data:
/home/yan/arabicNER/nerdata/ar_eval_all_cleaned_removed.json
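
Given the position problem above, one way around it is to clean the raw text and the tokens with the same function, so that any offsets recomputed from the text still line up with the orths. A minimal sketch; the 'raw'/'tokens' record layout is an assumption for illustration, not the actual schema:

import re
import pyarabic.araby as araby

def clean_text(text):
    # Strip diacritics but keep whitespace so token boundaries survive,
    # then drop stray punctuation such as the leading/trailing hyphens.
    no_diacritics = araby.strip_tashkeel(text)
    return re.sub(r'[^\w\s]+', '', no_diacritics)

# 'raw' and 'tokens' are assumed field names for illustration.
record = {'raw': '-آمِلَةً مُعالَجَةِ ',
          'tokens': [{'orth': '-آمِلَةً '}, {'orth': 'مُعالَجَةِ '}]}

record['raw'] = clean_text(record['raw']).strip()
for tok in record['tokens']:
    tok['orth'] = clean_text(tok['orth']).strip()

print(record)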

YanLiang1102 commented

Here is the code I have for cleaning up the tokens, if you want to take a look @khaledJabr: https://github.com/oudalab/Arabic-NER/blob/master/explore_traingdata.ipynb
