The goal is to perform stance and category detection on Arabic tweets. Each tweet is classified as positive, negative, or neutral with regard to the COVID-19 vaccine. The second task is to classify the same tweets by category; there are 7 categories (Personal, News, etc.).
This is the project for the Natural Language Processing course taught to juniors at CUFE in 2023. We ranked 3rd in our class.
We defined our own tokenizer, which splits on whitespace and punctuation.
Cleaning punctuation from tokens.
Removing tashkeel (diacritics).
Replacing variants of Arabic letters with their base letters (ؤ to و, أ to ا, and so on).
Removing all extra Arabic characters (like المد).
Removing any non-Arabic characters.
Defining our own set of stop words consisting of 2- and 3-character words (so as not to lose much semantics), then removing them.
Stemming the output tokens with the NLTK stemmer.
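The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the Unicode ranges, the normalisation map, and the stop-word set are assumptions chosen for the example, and treating tatweel as one of the "extra characters" is also an assumption. Stemming is left out to keep the sketch dependency-free (NLTK's `ISRIStemmer` is one Arabic stemmer that could fill that step).

```python
import re

# Arabic diacritics (tashkeel, U+064B..U+0652) plus tatweel (U+0640).
# Treating tatweel as an "extra character" is an assumption of this sketch.
TASHKEEL = re.compile(r"[\u064B-\u0652\u0640]")

# Anything that is neither an Arabic letter (U+0621..U+064A) nor whitespace.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")

# Illustrative normalisation map (letter variant -> base letter).
NORMALISE = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0622": "\u0627",  # alef with madda -> alef
    "\u0624": "\u0648",  # waw with hamza -> waw
    "\u0626": "\u064A",  # yeh with hamza -> yeh
    "\u0649": "\u064A",  # alef maksura -> yeh
})

# Hypothetical stop words (2 and 3 characters), written in normalised form.
STOP_WORDS = {
    "\u0641\u064A",        # fi
    "\u0645\u0646",        # min
    "\u0639\u0646",        # an
    "\u0639\u0644\u064A",  # ala (after normalisation)
}


def preprocess(tweet):
    """Return cleaned, normalised tokens for one tweet."""
    text = TASHKEEL.sub("", tweet)      # drop diacritics and tatweel
    text = text.translate(NORMALISE)    # map letter variants to base letters
    text = NON_ARABIC.sub(" ", text)    # punctuation, digits, Latin -> space
    tokens = text.split()               # whitespace tokenisation
    return [t for t in tokens if t not in STOP_WORDS]
```

Because non-Arabic characters are replaced by spaces before splitting, punctuation acts as a token boundary, which matches the "split on whitespace and punctuation" behaviour described earlier.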
We created 11 variants of our dataset, each using downsampling, upsampling, or both, with either equalised stance classes or equalised category classes. We also kept the original dataset without any up- or downsampling.
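As a rough illustration of how one such variant could be built, the sketch below equalises class sizes by random downsampling (without replacement) and random upsampling (with replacement). The function name `rebalance`, the `target` size, and the row layout are hypothetical; the actual variants may have been produced differently.

```python
import random
from collections import defaultdict


def rebalance(rows, label_of, target, seed=0):
    """Return a copy of rows in which every class has exactly `target` items.

    rows     : list of examples (any type)
    label_of : function mapping a row to its class label
    target   : desired number of rows per class (hypothetical parameter)
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    balanced = []
    for items in by_class.values():
        if len(items) >= target:
            # Downsample: draw `target` rows without replacement.
            balanced.extend(rng.sample(items, target))
        else:
            # Upsample: keep all rows and draw the shortfall with replacement.
            balanced.extend(items + rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced
```

Calling it once with the stance label and once with the category label as `label_of` would give the two kinds of equalised variants mentioned above.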
Feature Extraction:
Classical features:
TF-IDF
N-gram
In some experiments we applied SMOTE to oversample minority-class feature vectors, aiming to reduce overfitting.
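To make the two classical features concrete, here is a small dependency-free sketch of word n-grams and textbook TF-IDF weights. It uses the plain `tf * log(N / df)` formula; library implementations typically apply smoothing and normalisation, so their numbers will differ. The function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs):
    """docs: list of term lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights
```

A term that occurs in every document gets weight 0 under this formula, which is why very common words contribute little to the feature vector.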
Trainable features/embeddings:
Word2Vec: CBOW using gensim models.
Word2Vec: Skip-Gram using gensim models.
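The difference between the two Word2Vec variants lies in how training examples are framed. The sketch below only generates those examples and trains nothing; the function names and the `window` value are illustrative. In gensim's `Word2Vec`, the `sg` parameter selects the variant (`sg=0` for CBOW, `sg=1` for Skip-Gram).

```python
def cbow_pairs(tokens, window):
    """CBOW framing: predict the centre word from its surrounding context."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens, window):
    """Skip-Gram framing: predict each context word from the centre word."""
    return [
        (target, neighbour)
        for context, target in cbow_pairs(tokens, window)
        for neighbour in context
    ]
```

CBOW produces one example per position, while Skip-Gram produces one example per (centre, neighbour) pair, so it yields more training examples from the same text.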
Classical Classifiers:
Multinomial Naive Bayes
SVM Linear
Random Forest
Logistic Regression
Hyperparameters were tuned with GridSearchCV (for example, for Random Forest).
We ran 10-fold cross-validation with Random Forest, but the model was overfitting.
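Conceptually, a cross-validated grid search tries every parameter combination and averages a validation score over the folds. The sketch below shows that loop in plain Python; it is not scikit-learn's implementation, and the `evaluate` callback (which would train a model on the training indices and score it on the validation indices) is a hypothetical placeholder.

```python
from itertools import product


def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


def grid_search(param_grid, evaluate, n_samples, k=10):
    """Return (best_params, best_mean_score) over all parameter combinations.

    evaluate(params, train_idx, val_idx) -> validation score (hypothetical).
    """
    best_params, best_score = None, float("-inf")
    keys = list(param_grid)
    for values in product(*(param_grid[key] for key in keys)):
        params = dict(zip(keys, values))
        scores = [
            evaluate(params, train, val)
            for train, val in kfold_indices(n_samples, k)
        ]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

Comparing the mean validation score with the score on the training folds is one simple way to spot the kind of overfitting mentioned above.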
Deep Learning models:
PyTorch RNN
AraBERT transformer (using transfer learning to train the last fully connected layer)
XLM-RoBERTa (using transfer learning to train the last fully connected layer)
QARiB
MARBERT
Final models:
For both stance and category detection, we ended up using two AraBERT models trained on different variants of the dataset (Train 3 for stance and Train 10 for category).
They were chosen because they gave the best F1-scores, and especially the best macro-average scores. They yielded nearly the best accuracy while actually learning the dataset and predicting from the tweet contents, rather than predicting a single class and plainly overfitting.
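The reason macro-averaged F1 is a useful check here can be seen in a small sketch: a model that always predicts the majority class can reach high accuracy while its macro F1 stays low, because every class counts equally in the average. The labels below are made-up toy data, not results from the project.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy data: 8 neutral, 1 positive, 1 negative; the model always says "neutral".
y_true = ["neutral"] * 8 + ["positive", "negative"]
y_pred = ["neutral"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred))
```

In this toy case the accuracy is 0.8, but two of the three classes have an F1 of 0, which pulls the macro average down sharply.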
Best Results

AraBERT with stance

| Train data | F1-score (per class) | Accuracy | Macro avg |
|---|---|---|---|
| Train 3 | 0.57, 0.90, 0.53 | 0.81 | 0.66 |

AraBERT with category

| Train data | Accuracy | Macro avg |
|---|---|---|
| Train 10 | 0.588 | 0.438 |
| Train 11 | 0.522 | 0.392 |
Extra Results

Features: TF-IDF

| Classifier | Naive Bayes | Logistic Regression | SVM linear | Random Forest (100) | Random Forest (300) |
|---|---|---|---|---|---|
| Stance | 69.6 | 67.6 | 67.4 | 69.3 | 71.0 |
| Category | 56.6 | 63.08 | 65.0 | 65.3 | 64.5 |
Grid search with Random Forest for stance: 0.67 accuracy, F1-scores of 0.36, 0.4, and 0.79, and a macro avg of 0.52.