Arabic data issue and potential fixes #15
Comments
Do you mean what label it could be? @khaledJabr
#10
It sounds like the thing to do is to rerun one of the simple models with some simple changes to the Arabic text:
Khaled's code above does all that, so I think we should run that over all the data.
Well, fixing only the `orth` tokens will throw an exception, since the algorithm looks at the positions of those tokens: if we strip the extra space or other stray characters from the token but the original text does not change with it, we run into this error. I looked into the training data, and it does not store the start and end indices, but it still uses them somehow, so when we delete the "useless" stuff, it does not work.

Here is the code I have for cleaning up the tokens, if you want to take a look @khaledJabr: https://github.com/oudalab/Arabic-NER/blob/master/explore_traingdata.ipynb
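To make the mismatch concrete, here is a minimal sketch (the record layout with `text`, `orth`, `start`, and `end` fields is an assumption about the Prodigy-style data, not something confirmed in this thread; the actual training files apparently don't store offsets explicitly, but the same problem shows up wherever positions are derived from the raw text):

```python
def misaligned_tokens(record: dict) -> list:
    """Return tokens whose stored offsets no longer slice out their own text."""
    text = record["text"]
    return [
        tok for tok in record["tokens"]
        if text[tok["start"]:tok["end"]] != tok["orth"]
    ]

# Hypothetical record, for illustration only.
record = {
    "text": "قال الرئيس ",
    "tokens": [
        {"orth": "قال", "start": 0, "end": 3},
        {"orth": "الرئيس ", "start": 4, "end": 11},  # note the trailing space
    ],
}

# Cleaning only the token string, without touching record["text"], is exactly
# the situation that raises the exception during training:
record["tokens"][1]["orth"] = record["tokens"][1]["orth"].strip()
print(misaligned_tokens(record))  # -> the second token no longer lines up
```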
I had the chance to look at the training data we are using for this, and there are two main issues with the data:
The training data includes diacritics. Diacritics are extra short-vowel marks added to Arabic words to help with pronunciation and with differentiating the meanings of two or more otherwise identical words, and they are usually only needed at the lemma level. Diacritics are not used in modern Arabic writing, which includes our news sources and the data we collected from the coders through the Prodigy interface. I suspect this might be one of the things hurting the NER model. One really important thing to check here is whether the word embeddings we are using were trained on data with diacritics or not. I don't have a clear answer for how this has or could have affected our training, but my main intuition is that normalizing/standardizing our data as much as we can is always a good thing.
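For illustration, the standard Arabic diacritic marks sit in a small Unicode block, so a minimal stripping pass could look like the sketch below (the exact character set, including tatweel, is my assumption and should be matched to whatever preprocessing the embeddings were trained with):

```python
import re

# Tashkeel marks U+064B-U+0652 (fathatan through sukun), the superscript
# alef U+0670, and the tatweel/kashida U+0640. The exact set to strip is an
# assumption; adjust it to match the word embeddings' preprocessing.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")

def strip_diacritics(text: str) -> str:
    return DIACRITICS.sub("", text)

print(strip_diacritics("كَتَبَ"))  # -> "كتب"
```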
Aside from the diacritics, I have noticed that most (if not all) of the tokens (the actual tokens, the ones stored as `orth`) have an extra space at the end, and a lot of them have weird extra characters. Here are some examples:

Although many of these have a NER label of `O`, I still think they are worth fixing. Here is how I would go about fixing both issues (there are other ways, but this is the first thing that comes to mind):

One last thing: do we have a key or a table somewhere that lists the labels we are using in our big NER dataset (the combined one)?
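As for fixing both issues, one rough sketch of a cleanup pass (this is only an illustration of the general idea, not necessarily the approach intended above; the record layout with `text` and `orth` fields is assumed) is to run the same normalization over the raw text and over every stored token so the two stay consistent:

```python
import re

# Assumed stray characters: zero-width space, LTR/RTL marks, and the BOM,
# which often survive web scraping; plus the diacritic range from the
# previous sketch. Both sets are assumptions to be refined against the data.
STRAY = re.compile(r"[\u200b\u200e\u200f\ufeff]")
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")

def normalize(s: str) -> str:
    """Strip diacritics, stray characters, and surrounding whitespace."""
    s = DIACRITICS.sub("", s)
    s = STRAY.sub("", s)
    return s.strip()

def clean_record(record: dict) -> dict:
    # Hypothetical layout: {"text": ..., "tokens": [{"orth": ...}, ...]}
    cleaned = dict(record)
    cleaned["text"] = normalize(record["text"])
    cleaned["tokens"] = [
        {**tok, "orth": normalize(tok["orth"])} for tok in record["tokens"]
    ]
    return cleaned
```

Any character positions derived from the raw text would still have to be recomputed after a pass like this, per the alignment issue discussed earlier in the thread.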