Sentiment Analysis with BERT and RoBERTa #933

Open · wants to merge 4 commits into base: main
Binary file added .DS_Store
Binary file not shown.
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@

Sentiment Analysis Model/.DS_Store
Sentiment Analysis Model/.DS_Store
Binary file added Sentiment Analysis Model/.DS_Store
Binary file not shown.
Binary file added Sentiment Analysis Model/Dataset/.DS_Store
Binary file not shown.
5,001 changes: 5,001 additions & 0 deletions Sentiment Analysis Model/Dataset/Test.csv

Large diffs are not rendered by default.

43 changes: 42 additions & 1 deletion Sentiment Analysis Model/Model/README.md
@@ -2,32 +2,49 @@

## PROJECT TITLE

Sentiment Analysis Model
Sentiment Analysis Model: Sentiment Analysis with BERT/RoBERTa

## GOAL

The main goal of this project is to analyse the iMDB review sentiment.
The main goal of this project is to perform sentiment analysis, classifying textual data into positive or negative sentiments.

## DATASET

The dataset used for this project can be found at [link to dataset](https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format/).
The dataset consists of three CSV files:

- Train.csv: training data with two columns, text and label
- Test.csv: testing data with two columns, text and label
- Valid.csv: validation data with two columns, text and label

The label column contains binary values (0 or 1), representing the sentiment of each text sample.
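A minimal loading sketch with pandas; the paths assume the Dataset/ layout added in this PR, so treat them as an assumption and adjust to your checkout:

```python
import pandas as pd

# Paths follow the folder structure added in this PR (an assumption, not verified).
df_train = pd.read_csv("Sentiment Analysis Model/Dataset/Train.csv")
df_test = pd.read_csv("Sentiment Analysis Model/Dataset/Test.csv")
df_valid = pd.read_csv("Sentiment Analysis Model/Dataset/Valid.csv")

# Each split has two columns: text (the review) and label (0 = negative, 1 = positive).
print(df_train.shape, df_train["label"].value_counts().to_dict())
```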

## DESCRIPTION

This project aims to perform a sentiment analysis on the iMDB reviews. This sentiment analysis is done using Simple Neural Network, CNN and RNN (LSTM).
This project utilizes pre-trained BERT and RoBERTa models from the Hugging Face transformers library to classify sentiment. The models are fine-tuned on a custom dataset for binary sentiment classification.
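As a minimal sketch of how these checkpoints are typically loaded with the transformers library (the PR's exact training script may differ):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Either checkpoint named in this project can be swapped in here.
checkpoint = "bert-base-uncased"  # or "roberta-base"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2,  # binary sentiment: 0 = negative, 1 = positive
)
```

`AutoModelForSequenceClassification` attaches a fresh, randomly initialized classification head on top of the pre-trained encoder; fine-tuning is what trains that head (and, typically, the encoder weights with it).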



## WHAT I HAD DONE

1. Data collection: From the link of the dataset given above.
2. Data preprocessing: Done stopword removal and word embedding.
3. Model selection: Chose Simple Neural Network, CNN and RNN (LSTM) for better sentiment analysis.
4. Comparative analysis: Compared the accuracy score of all the techniques.
1. Data Collection: Prepared the Train.csv, Test.csv, and Valid.csv datasets.
2. Data Preprocessing: Tokenized the text using the BERT tokenizer.
3. Model Selection: Fine-tuned BERT (bert-base-uncased) and RoBERTa (roberta-base) for binary sentiment classification.
4. Training: Trained the models using the Trainer API from Hugging Face (see the sketch below).
5. Evaluation: Evaluated model performance using metrics such as accuracy, precision, recall, and F1 score.
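A condensed sketch of steps 2–4, reusing `df_train`/`df_valid` and the `tokenizer`/`model` from the sketches above; the hyperparameters shown are illustrative assumptions, not values stated in this PR:

```python
import torch
from torch.utils.data import Dataset
from transformers import Trainer, TrainingArguments

# Assumes `tokenizer`, `model`, `df_train`, and `df_valid` from the earlier sketches.

class ReviewDataset(Dataset):
    """Wraps tokenized text plus labels in the format the Trainer expects."""

    def __init__(self, texts, labels, tokenizer, max_length=256):
        self.encodings = tokenizer(
            list(texts), truncation=True, padding=True, max_length=max_length
        )
        self.labels = list(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

train_ds = ReviewDataset(df_train["text"], df_train["label"], tokenizer)
valid_ds = ReviewDataset(df_valid["text"], df_valid["label"], tokenizer)

args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=2,              # illustrative; the PR does not state this
    per_device_train_batch_size=16,  # illustrative; tune to your GPU memory
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
)
trainer.train()
```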

## MODELS USED

1. Simple Neural Network
2. Convolutional Neural Network
3. Long short term memory Neural Network (Recurrent Neural Network)
4. BERT (Bidirectional Encoder Representations from Transformers)
5. RoBERTa (A Robustly Optimized BERT Pretraining Approach)


## LIBRARIES NEEDED
@@ -40,6 +57,12 @@ The following libraries are required to run this project:
- pandas==1.5.0
- matplotlib==3.6.0
- tensorflow==2.6.0
- Python 3.8+
- PyTorch
- Transformers
- Pandas
- scikit-learn


## VISUALIZATION
![image](https://github.com/achrekarom12/DL-Simplified/assets/88442486/5ecbe61f-d183-4463-b766-41fb9c72d1b0)
@@ -53,6 +76,12 @@ The evaluation metrics I used to assess the models:
- Accuracy
- Loss
- Confusion Matrix
The evaluation metrics used to assess the models:

- Accuracy
- F1 Score
- Precision
- Recall
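A sketch of how these four metrics can be wired into the Trainer via a compute_metrics callback built on scikit-learn (the PR's exact implementation may differ):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Convert the Trainer's raw logits into the four reported metrics."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Passing compute_metrics=compute_metrics to the Trainer makes it report these values at every evaluation.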

## RESULTS
Results on Val dataset:
@@ -62,6 +91,13 @@ Results on Val dataset:
| SNN | 0.7178 | 0.9930 |
| CNN | 0.8472 | 0.8535 |
| LSTM | 0.8634 | 0.4711 |
Results on the validation and test datasets:

| Model   | Accuracy | Precision | Recall | F1 Score |
| ------- | -------- | --------- | ------ | -------- |
| BERT    | 0.885    | 0.872     | 0.881  | 0.877    |
| RoBERTa | 0.893    | 0.882     | 0.890  | 0.886    |



#### Confusion Matrix:
![image](https://github.com/achrekarom12/DL-Simplified/assets/88442486/88dc73b8-20f1-41d6-8827-c5af3abe9c3f)
@@ -78,3 +114,8 @@ Based on results we can draw following conclusions:
3. LSTM (Long Short-Term Memory) achieved the highest accuracy among the three models with 86.34% and a loss of 0.4711. LSTM's ability to retain and process information from longer sequences helped it capture the temporal dependencies in the text data, resulting in improved sentiment analysis performance.

In summary, the LSTM model outperformed the SNN and CNN models in terms of accuracy on the IMDB sentiment analysis task. This suggests that the sequential nature of the text data is crucial for accurately predicting sentiment.
Based on the results, the following conclusions can be drawn:

BERT achieved a solid performance with 88.5% accuracy. Its balance between precision and recall shows it effectively captures patterns in the sentiment data.

RoBERTa slightly outperformed BERT with an accuracy of 89.3%. This edge is consistent with RoBERTa's more robust pretraining recipe (more data, longer training, dynamic masking), which tends to improve generalization.
54 changes: 41 additions & 13 deletions Sentiment Analysis Model/Model/Sentiment Analysis Model.ipynb
@@ -16,9 +16,25 @@
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"execution_count": 9,
"metadata": {
"jupyter": {
"is_executing": true
}
},
"outputs": [
{
"ename": "ModuleNotFoundError",
"evalue": "No module named 'keras.preprocessing.text'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[9], line 9\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mseaborn\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01msns\u001b[39;00m\n\u001b[1;32m 7\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mnltk\u001b[39;00m\n\u001b[0;32m----> 9\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mkeras\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mpreprocessing\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mtext\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m one_hot, Tokenizer\n\u001b[1;32m 10\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mkeras\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mpreprocessing\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01msequence\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m pad_sequences\n\u001b[1;32m 11\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mkeras\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mmodels\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m Sequential\n",
"\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'keras.preprocessing.text'"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
@@ -45,9 +61,26 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 10,
"metadata": {},
"outputs": [],
"outputs": [
{
"ename": "FileNotFoundError",
"evalue": "[Errno 2] No such file or directory: '../Dataset/Train.csv'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[10], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m df_train \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mread_csv(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m../Dataset/Train.csv\u001b[39m\u001b[38;5;124m'\u001b[39m)\n\u001b[1;32m 2\u001b[0m df_test \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mread_csv(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m../Dataset/Test.csv\u001b[39m\u001b[38;5;124m'\u001b[39m)\n\u001b[1;32m 3\u001b[0m df_valid \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mread_csv(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m../Dataset/Valid.csv\u001b[39m\u001b[38;5;124m'\u001b[39m)\n",
"File \u001b[0;32m/opt/anaconda3/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1026\u001b[0m, in \u001b[0;36mread_csv\u001b[0;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)\u001b[0m\n\u001b[1;32m 1013\u001b[0m kwds_defaults \u001b[38;5;241m=\u001b[39m _refine_defaults_read(\n\u001b[1;32m 1014\u001b[0m dialect,\n\u001b[1;32m 1015\u001b[0m delimiter,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 1022\u001b[0m dtype_backend\u001b[38;5;241m=\u001b[39mdtype_backend,\n\u001b[1;32m 1023\u001b[0m )\n\u001b[1;32m 1024\u001b[0m kwds\u001b[38;5;241m.\u001b[39mupdate(kwds_defaults)\n\u001b[0;32m-> 1026\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m _read(filepath_or_buffer, kwds)\n",
"File \u001b[0;32m/opt/anaconda3/lib/python3.12/site-packages/pandas/io/parsers/readers.py:620\u001b[0m, in \u001b[0;36m_read\u001b[0;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[1;32m 617\u001b[0m _validate_names(kwds\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mnames\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m))\n\u001b[1;32m 619\u001b[0m \u001b[38;5;66;03m# Create the parser.\u001b[39;00m\n\u001b[0;32m--> 620\u001b[0m parser \u001b[38;5;241m=\u001b[39m TextFileReader(filepath_or_buffer, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwds)\n\u001b[1;32m 622\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m chunksize \u001b[38;5;129;01mor\u001b[39;00m iterator:\n\u001b[1;32m 623\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m parser\n",
"File \u001b[0;32m/opt/anaconda3/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1620\u001b[0m, in \u001b[0;36mTextFileReader.__init__\u001b[0;34m(self, f, engine, **kwds)\u001b[0m\n\u001b[1;32m 1617\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mhas_index_names\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m kwds[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mhas_index_names\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n\u001b[1;32m 1619\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mhandles: IOHandles \u001b[38;5;241m|\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[0;32m-> 1620\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_engine \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_make_engine(f, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mengine)\n",
"File \u001b[0;32m/opt/anaconda3/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1880\u001b[0m, in \u001b[0;36mTextFileReader._make_engine\u001b[0;34m(self, f, engine)\u001b[0m\n\u001b[1;32m 1878\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mb\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m mode:\n\u001b[1;32m 1879\u001b[0m mode \u001b[38;5;241m+\u001b[39m\u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mb\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m-> 1880\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mhandles \u001b[38;5;241m=\u001b[39m get_handle(\n\u001b[1;32m 1881\u001b[0m f,\n\u001b[1;32m 1882\u001b[0m mode,\n\u001b[1;32m 1883\u001b[0m encoding\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mencoding\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m),\n\u001b[1;32m 1884\u001b[0m compression\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcompression\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m),\n\u001b[1;32m 1885\u001b[0m memory_map\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmemory_map\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mFalse\u001b[39;00m),\n\u001b[1;32m 1886\u001b[0m is_text\u001b[38;5;241m=\u001b[39mis_text,\n\u001b[1;32m 1887\u001b[0m errors\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mencoding_errors\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mstrict\u001b[39m\u001b[38;5;124m\"\u001b[39m),\n\u001b[1;32m 1888\u001b[0m storage_options\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mstorage_options\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m),\n\u001b[1;32m 1889\u001b[0m )\n\u001b[1;32m 1890\u001b[0m \u001b[38;5;28;01massert\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mhandles \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m 1891\u001b[0m f \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mhandles\u001b[38;5;241m.\u001b[39mhandle\n",
"File \u001b[0;32m/opt/anaconda3/lib/python3.12/site-packages/pandas/io/common.py:873\u001b[0m, in \u001b[0;36mget_handle\u001b[0;34m(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)\u001b[0m\n\u001b[1;32m 868\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(handle, \u001b[38;5;28mstr\u001b[39m):\n\u001b[1;32m 869\u001b[0m \u001b[38;5;66;03m# Check whether the filename is to be opened in binary mode.\u001b[39;00m\n\u001b[1;32m 870\u001b[0m \u001b[38;5;66;03m# Binary mode does not support 'encoding' and 'newline'.\u001b[39;00m\n\u001b[1;32m 871\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m ioargs\u001b[38;5;241m.\u001b[39mencoding \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mb\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m ioargs\u001b[38;5;241m.\u001b[39mmode:\n\u001b[1;32m 872\u001b[0m \u001b[38;5;66;03m# Encoding\u001b[39;00m\n\u001b[0;32m--> 873\u001b[0m handle \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mopen\u001b[39m(\n\u001b[1;32m 874\u001b[0m handle,\n\u001b[1;32m 875\u001b[0m ioargs\u001b[38;5;241m.\u001b[39mmode,\n\u001b[1;32m 876\u001b[0m encoding\u001b[38;5;241m=\u001b[39mioargs\u001b[38;5;241m.\u001b[39mencoding,\n\u001b[1;32m 877\u001b[0m errors\u001b[38;5;241m=\u001b[39merrors,\n\u001b[1;32m 878\u001b[0m newline\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m 879\u001b[0m )\n\u001b[1;32m 880\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 881\u001b[0m \u001b[38;5;66;03m# Binary mode\u001b[39;00m\n\u001b[1;32m 882\u001b[0m handle \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mopen\u001b[39m(handle, ioargs\u001b[38;5;241m.\u001b[39mmode)\n",
"\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: '../Dataset/Train.csv'"
]
}
],
"source": [
"df_train = pd.read_csv('../Dataset/Train.csv')\n",
"df_test = pd.read_csv('../Dataset/Test.csv')\n",
@@ -1576,7 +1609,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10.10 ('work')",
"display_name": "base",
"language": "python",
"name": "python3"
},
@@ -1590,14 +1623,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
"version": "3.12.4"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "f6becb90ea85e501e0c5dc0cf472a45ace99c50f8d1426b3da8c341d18623653"
}
}
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2