PosiText is a machine learning project aimed at developing a binary classifier that can accurately identify and classify expressions of happiness within textual data using natural language processing (NLP) techniques. The project utilizes the "Emotion Detection from Text" dataset from Kaggle and leverages Python libraries such as Pandas, Scikit-learn, and NLTK to preprocess the data and build the model.
- Text Preprocessing: Tokenization, stop word removal, and TF-IDF vectorization for feature extraction.
- Binary Classification: A Logistic Regression model trained to distinguish between happy and non-happy text.
- Performance Metrics: Comprehensive evaluation using accuracy, precision, recall, and F1-score.
The dataset used for this project is the "Emotion Detection from Text" dataset from Kaggle, consisting of tweets labeled with various emotional states. After filtering, 13,847 entries were used for training and testing.
-
Data Preparation:
- Filtering and preprocessing of the dataset.
- Feature extraction using TF-IDF vectorization.
-
Model Development:
- Model selection and training using Logistic Regression.
- Data split into training (80%) and testing (20%) sets.
-
Evaluation:
- Model performance assessed using accuracy, precision, recall, and F1-score.
The model achieved an overall accuracy of 72%, with the following performance on key metrics:
- Happiness Precision: 0.70
- Happiness Recall: 0.42
- Happiness F1-Score: 0.52
- Non-Happiness F1-Score: 0.81
The model demonstrated higher recall for non-happiness, indicating it is more effective at identifying non-happiness compared to happiness.
- Clone the Repository:
git clone https://github.com/username/PosiText.git cd PosiText
- Install Dependencies:
Make sure you have Python installed, then run:
pip install -r requirements.txt
- Run the Model:
Execute the main script to train and evaluate the model:
python main.py
- Visualize Results:
Use the chart script to visualize performance metrics:
python chart.py
- Python: Pandas, NLTK, Scikit-learn
- NLP: TF-IDF Vectorization
- Model: Logistic Regression
- Improve happiness detection by exploring more advanced models such as neural networks.
- Expand the dataset to include more balanced classes.
- Explore other emotions beyond happiness.
- Afsana Bhuiyan - LinkedIn