Sentiment Analysis with BERT and RoBERTa #933

Open · wants to merge 4 commits into base: main
Binary file added .DS_Store
Binary file not shown.
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@

Sentiment Analysis Model/.DS_Store
Sentiment Analysis Model/.DS_Store
Binary file added Sentiment Analysis Model/.DS_Store
Binary file not shown.
Binary file added Sentiment Analysis Model/Dataset/.DS_Store
Binary file not shown.
5,001 changes: 5,001 additions & 0 deletions Sentiment Analysis Model/Dataset/Test.csv

Large diffs are not rendered by default.

43 changes: 42 additions & 1 deletion Sentiment Analysis Model/Model/README.md
@@ -2,32 +2,49 @@

## PROJECT TITLE

Sentiment Analysis Model
Sentiment Analysis Model: Sentiment Analysis with BERT/RoBERTa

## GOAL

The main goal of this project is to analyse the iMDB review sentiment.
The main goal of this project is to perform sentiment analysis, classifying textual data into positive or negative sentiments.

## DATASET

The dataset used for this project can be found at [link to dataset](https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format/).
The dataset consists of three CSV files:

- Train.csv: training data with two columns, text and label
- Test.csv: testing data with two columns, text and label
- Valid.csv: validation data with two columns, text and label

The label column contains binary values (0 or 1), representing the sentiment of each text sample.
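A minimal loading sketch with pandas; the paths assume the Dataset/ layout added in this PR, so treat them as an assumption and adjust to your checkout:

```python
import pandas as pd

# Paths follow the folder structure added in this PR (an assumption, not verified).
df_train = pd.read_csv("Sentiment Analysis Model/Dataset/Train.csv")
df_test = pd.read_csv("Sentiment Analysis Model/Dataset/Test.csv")
df_valid = pd.read_csv("Sentiment Analysis Model/Dataset/Valid.csv")

# Each split has two columns: text (the review) and label (0 = negative, 1 = positive).
print(df_train.shape, df_train["label"].value_counts().to_dict())
```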

## DESCRIPTION

This project aims to perform a sentiment analysis on the iMDB reviews. This sentiment analysis is done using Simple Neural Network, CNN and RNN (LSTM).
This project utilizes pre-trained BERT and RoBERTa models from the Hugging Face transformers library to classify sentiment. The models are fine-tuned on a custom dataset for binary sentiment classification.
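As a minimal sketch of how these checkpoints are typically loaded with the transformers library (the PR's exact training script may differ):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Either checkpoint named in this project can be swapped in here.
checkpoint = "bert-base-uncased"  # or "roberta-base"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2,  # binary sentiment: 0 = negative, 1 = positive
)
```

`AutoModelForSequenceClassification` attaches a fresh, randomly initialized classification head on top of the pre-trained encoder; fine-tuning is what trains that head (and, typically, the encoder weights with it).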



## WHAT I HAD DONE

1. Data collection: From the link of the dataset given above.
2. Data preprocessing: Done stopword removal and word embedding.
3. Model selection: Chose Simple Neural Network, CNN and RNN (LSTM) for better sentiment analysis.
4. Comparative analysis: Compared the accuracy score of all the techniques.
1. Data Collection: Prepared the Train.csv, Test.csv, and Valid.csv datasets.
2. Data Preprocessing: Tokenized the text using the BERT tokenizer.
3. Model Selection: Fine-tuned BERT (bert-base-uncased) and RoBERTa (roberta-base) for binary sentiment classification.
4. Training: Trained the models using the Trainer API from Hugging Face (see the sketch below).
5. Evaluation: Evaluated model performance using metrics such as accuracy, precision, recall, and F1 score.
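A condensed sketch of steps 2–4, reusing `df_train`/`df_valid` and the `tokenizer`/`model` from the sketches above; the hyperparameters shown are illustrative assumptions, not values stated in this PR:

```python
import torch
from torch.utils.data import Dataset
from transformers import Trainer, TrainingArguments

# Assumes `tokenizer`, `model`, `df_train`, and `df_valid` from the earlier sketches.

class ReviewDataset(Dataset):
    """Wraps tokenized text plus labels in the format the Trainer expects."""

    def __init__(self, texts, labels, tokenizer, max_length=256):
        self.encodings = tokenizer(
            list(texts), truncation=True, padding=True, max_length=max_length
        )
        self.labels = list(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

train_ds = ReviewDataset(df_train["text"], df_train["label"], tokenizer)
valid_ds = ReviewDataset(df_valid["text"], df_valid["label"], tokenizer)

args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=2,              # illustrative; the PR does not state this
    per_device_train_batch_size=16,  # illustrative; tune to your GPU memory
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
)
trainer.train()
```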

## MODELS USED

1. Simple Neural Network
2. Convolutional Neural Network
3. Long short term memory Neural Network (Recurrent Neural Network)
4. BERT (Bidirectional Encoder Representations from Transformers)
5. RoBERTa (A Robustly Optimized BERT Pretraining Approach)


## LIBRARIES NEEDED
@@ -40,6 +57,12 @@ The following libraries are required to run this project:
- pandas==1.5.0
- matplotlib==3.6.0
- tensorflow==2.6.0
- Python 3.8+
- PyTorch
- Transformers
- Pandas
- scikit-learn


## VISUALIZATION
![image](https://github.com/achrekarom12/DL-Simplified/assets/88442486/5ecbe61f-d183-4463-b766-41fb9c72d1b0)
@@ -53,6 +76,12 @@ The evaluation metrics I used to assess the models:
- Accuracy
- Loss
- Confusion Matrix
The evaluation metrics used to assess the models:

- Accuracy
- F1 Score
- Precision
- Recall
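A sketch of how these four metrics can be wired into the Trainer via a compute_metrics callback built on scikit-learn (the PR's exact implementation may differ):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Convert the Trainer's raw logits into the four reported metrics."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Passing compute_metrics=compute_metrics to the Trainer makes it report these values at every evaluation.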

## RESULTS
Results on Val dataset:
@@ -62,6 +91,13 @@ Results on Val dataset:
| SNN | 0.7178 | 0.9930 |
| CNN | 0.8472 | 0.8535 |
| LSTM | 0.8634 | 0.4711 |
Results on the validation and test datasets:

| Model   | Accuracy | Precision | Recall | F1 Score |
| ------- | -------- | --------- | ------ | -------- |
| BERT    | 0.885    | 0.872     | 0.881  | 0.877    |
| RoBERTa | 0.893    | 0.882     | 0.890  | 0.886    |



#### Confusion Matrix:
![image](https://github.com/achrekarom12/DL-Simplified/assets/88442486/88dc73b8-20f1-41d6-8827-c5af3abe9c3f)
@@ -78,3 +114,8 @@ Based on results we can draw following conclusions:
3. LSTM (Long Short-Term Memory) achieved the highest accuracy among the three models with 86.34% and a loss of 0.4711. LSTM's ability to retain and process information from longer sequences helped it capture the temporal dependencies in the text data, resulting in improved sentiment analysis performance.

In summary, the LSTM model outperformed the SNN and CNN models in terms of accuracy on the IMDB sentiment analysis task. This suggests that the sequential nature of the text data is crucial for accurately predicting sentiment.
Based on the results, the following conclusions can be drawn:

BERT achieved a solid performance with 88.5% accuracy. Its balance between precision and recall shows it effectively captures patterns in the sentiment data.

RoBERTa slightly outperformed BERT with an accuracy of 89.3%. This edge is consistent with RoBERTa's more robust pretraining recipe (more data, longer training, dynamic masking), which tends to improve generalization.
54 changes: 41 additions & 13 deletions Sentiment Analysis Model/Model/Sentiment Analysis Model.ipynb
@@ -16,9 +16,25 @@
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"execution_count": 9,
"metadata": {
"jupyter": {
"is_executing": true
}
},
"outputs": [
{
"ename": "ModuleNotFoundError",
"evalue": "No module named 'keras.preprocessing.text'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[9], line 9\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mseaborn\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01msns\u001b[39;00m\n\u001b[1;32m 7\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mnltk\u001b[39;00m\n\u001b[0;32m----> 9\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mkeras\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mpreprocessing\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mtext\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m one_hot, Tokenizer\n\u001b[1;32m 10\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mkeras\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mpreprocessing\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01msequence\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m pad_sequences\n\u001b[1;32m 11\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mkeras\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mmodels\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m Sequential\n",
"\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'keras.preprocessing.text'"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
@@ -45,9 +61,26 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 10,
"metadata": {},
"outputs": [],
"outputs": [
{
"ename": "FileNotFoundError",
"evalue": "[Errno 2] No such file or directory: '../Dataset/Train.csv'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[10], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m df_train \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mread_csv(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m../Dataset/Train.csv\u001b[39m\u001b[38;5;124m'\u001b[39m)\n\u001b[1;32m 2\u001b[0m df_test \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mread_csv(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m../Dataset/Test.csv\u001b[39m\u001b[38;5;124m'\u001b[39m)\n\u001b[1;32m 3\u001b[0m df_valid \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mread_csv(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m../Dataset/Valid.csv\u001b[39m\u001b[38;5;124m'\u001b[39m)\n",
"File \u001b[0;32m/opt/anaconda3/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1026\u001b[0m, in \u001b[0;36mread_csv\u001b[0;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)\u001b[0m\n\u001b[1;32m 1013\u001b[0m kwds_defaults \u001b[38;5;241m=\u001b[39m _refine_defaults_read(\n\u001b[1;32m 1014\u001b[0m dialect,\n\u001b[1;32m 1015\u001b[0m delimiter,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 1022\u001b[0m dtype_backend\u001b[38;5;241m=\u001b[39mdtype_backend,\n\u001b[1;32m 1023\u001b[0m )\n\u001b[1;32m 1024\u001b[0m kwds\u001b[38;5;241m.\u001b[39mupdate(kwds_defaults)\n\u001b[0;32m-> 1026\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m _read(filepath_or_buffer, kwds)\n",
"File \u001b[0;32m/opt/anaconda3/lib/python3.12/site-packages/pandas/io/parsers/readers.py:620\u001b[0m, in \u001b[0;36m_read\u001b[0;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[1;32m 617\u001b[0m _validate_names(kwds\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mnames\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m))\n\u001b[1;32m 619\u001b[0m \u001b[38;5;66;03m# Create the parser.\u001b[39;00m\n\u001b[0;32m--> 620\u001b[0m parser \u001b[38;5;241m=\u001b[39m TextFileReader(filepath_or_buffer, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwds)\n\u001b[1;32m 622\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m chunksize \u001b[38;5;129;01mor\u001b[39;00m iterator:\n\u001b[1;32m 623\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m parser\n",
"File \u001b[0;32m/opt/anaconda3/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1620\u001b[0m, in \u001b[0;36mTextFileReader.__init__\u001b[0;34m(self, f, engine, **kwds)\u001b[0m\n\u001b[1;32m 1617\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mhas_index_names\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m kwds[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mhas_index_names\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n\u001b[1;32m 1619\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mhandles: IOHandles \u001b[38;5;241m|\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[0;32m-> 1620\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_engine \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_make_engine(f, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mengine)\n",
"File \u001b[0;32m/opt/anaconda3/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1880\u001b[0m, in \u001b[0;36mTextFileReader._make_engine\u001b[0;34m(self, f, engine)\u001b[0m\n\u001b[1;32m 1878\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mb\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m mode:\n\u001b[1;32m 1879\u001b[0m mode \u001b[38;5;241m+\u001b[39m\u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mb\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m-> 1880\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mhandles \u001b[38;5;241m=\u001b[39m get_handle(\n\u001b[1;32m 1881\u001b[0m f,\n\u001b[1;32m 1882\u001b[0m mode,\n\u001b[1;32m 1883\u001b[0m encoding\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mencoding\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m),\n\u001b[1;32m 1884\u001b[0m compression\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcompression\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m),\n\u001b[1;32m 1885\u001b[0m memory_map\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmemory_map\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mFalse\u001b[39;00m),\n\u001b[1;32m 1886\u001b[0m is_text\u001b[38;5;241m=\u001b[39mis_text,\n\u001b[1;32m 1887\u001b[0m errors\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mencoding_errors\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mstrict\u001b[39m\u001b[38;5;124m\"\u001b[39m),\n\u001b[1;32m 1888\u001b[0m storage_options\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mstorage_options\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m),\n\u001b[1;32m 1889\u001b[0m )\n\u001b[1;32m 1890\u001b[0m \u001b[38;5;28;01massert\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mhandles \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m 1891\u001b[0m f \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mhandles\u001b[38;5;241m.\u001b[39mhandle\n",
"File \u001b[0;32m/opt/anaconda3/lib/python3.12/site-packages/pandas/io/common.py:873\u001b[0m, in \u001b[0;36mget_handle\u001b[0;34m(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)\u001b[0m\n\u001b[1;32m 868\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(handle, \u001b[38;5;28mstr\u001b[39m):\n\u001b[1;32m 869\u001b[0m \u001b[38;5;66;03m# Check whether the filename is to be opened in binary mode.\u001b[39;00m\n\u001b[1;32m 870\u001b[0m \u001b[38;5;66;03m# Binary mode does not support 'encoding' and 'newline'.\u001b[39;00m\n\u001b[1;32m 871\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m ioargs\u001b[38;5;241m.\u001b[39mencoding \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mb\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m ioargs\u001b[38;5;241m.\u001b[39mmode:\n\u001b[1;32m 872\u001b[0m \u001b[38;5;66;03m# Encoding\u001b[39;00m\n\u001b[0;32m--> 873\u001b[0m handle \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mopen\u001b[39m(\n\u001b[1;32m 874\u001b[0m handle,\n\u001b[1;32m 875\u001b[0m ioargs\u001b[38;5;241m.\u001b[39mmode,\n\u001b[1;32m 876\u001b[0m encoding\u001b[38;5;241m=\u001b[39mioargs\u001b[38;5;241m.\u001b[39mencoding,\n\u001b[1;32m 877\u001b[0m errors\u001b[38;5;241m=\u001b[39merrors,\n\u001b[1;32m 878\u001b[0m newline\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m 879\u001b[0m )\n\u001b[1;32m 880\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 881\u001b[0m \u001b[38;5;66;03m# Binary mode\u001b[39;00m\n\u001b[1;32m 882\u001b[0m handle \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mopen\u001b[39m(handle, ioargs\u001b[38;5;241m.\u001b[39mmode)\n",
"\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: '../Dataset/Train.csv'"
]
}
],
"source": [
"df_train = pd.read_csv('../Dataset/Train.csv')\n",
"df_test = pd.read_csv('../Dataset/Test.csv')\n",
@@ -1576,7 +1609,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10.10 ('work')",
"display_name": "base",
"language": "python",
"name": "python3"
},
@@ -1590,14 +1623,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
"version": "3.12.4"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "f6becb90ea85e501e0c5dc0cf472a45ace99c50f8d1426b3da8c341d18623653"
}
}
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2