RNN Spam Emails Detection

This project implements a Recurrent Neural Network (RNN) model to detect spam emails. The model is trained on text data and leverages the sequential nature of RNNs to effectively classify emails as either spam or not spam. The project highlights how deep learning techniques can be applied to natural language processing (NLP) tasks.

Project Overview

With the growing volume of email traffic, spam detection has become an important defense against phishing, fraud, and unwanted advertisements. This project builds a Recurrent Neural Network (RNN) model to classify emails as either spam or not spam using a dataset of labeled email texts. RNNs suit NLP tasks because they process text as a sequence; the LSTM layers used here help the network retain context across longer spans of text, which plain RNNs tend to lose.

This project focuses on:

- Preprocessing text data for use in an RNN model.
- Building and training an RNN architecture (with LSTM layers) for spam classification.
- Evaluating the performance of the model.

Dataset

The dataset used in this project contains labeled email texts categorized as either "spam" or "ham" (non-spam). It can be sourced from common spam email datasets such as the Enron Email Dataset or SpamAssassin.

Classes:
- Spam: Unsolicited emails, often containing phishing attempts or advertisements.
- Ham: Regular emails that are not considered spam.
Features: Text content of emails, which is processed to feed into the RNN model.

Data Preprocessing

Text data requires extensive preprocessing to make it suitable for use in machine learning models:

- Tokenization: Emails are split into tokens (words).
- Lowercasing: All text is converted to lowercase to standardize the data.
- Stopword Removal: Common words like "the", "is", and "in" are removed as they do not contribute significantly to the classification task.
- Stemming/Lemmatization: Words are reduced to their root forms to further reduce the vocabulary size.
- Padding and Truncating: Each email text is padded or truncated to a fixed length to ensure uniform input size for the RNN.
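The steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual preprocessing code: the stopword list, the reserved indices `PAD` and `OOV`, the helper names, and the `max_len` value are all assumptions made for the example. Stemming/lemmatization is left out to keep the sketch dependency-free; NLTK provides stemmers that could be applied to each token.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); NLTK ships a fuller one.
STOPWORDS = {"the", "is", "in", "a", "an", "and", "to", "of"}
PAD, OOV = 0, 1  # reserved indices (assumption): 0 pads, 1 marks unknown words


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_vocab(token_lists, max_words=10000):
    """Map the most frequent tokens to integer indices starting at 2."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return {w: i + 2 for i, (w, _) in enumerate(counts.most_common(max_words))}


def encode(tokens, vocab, max_len=8):
    """Convert tokens to indices, then truncate or right-pad to max_len."""
    ids = [vocab.get(t, OOV) for t in tokens][:max_len]
    return ids + [PAD] * (max_len - len(ids))


emails = ["Win a FREE prize in the lottery!", "Is the meeting in room 4?"]
token_lists = [tokenize(e) for e in emails]
vocab = build_vocab(token_lists)
encoded = [encode(tokens, vocab) for tokens in token_lists]
```

Every row of `encoded` has the same length, which is what the embedding layer of the model expects as input.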

Modeling

Recurrent Neural Network (RNN)

An RNN model is built using Long Short-Term Memory (LSTM) layers, which are effective at capturing long-term dependencies in sequential data like text.

- Embedding Layer: Transforms word indices into dense vectors of fixed size, creating word embeddings.
- LSTM Layers: Learn patterns and dependencies in the sequence of words.
- Dense Layers: Map the learned features to the output, which is a binary classification (spam or not spam).
- Activation: The output layer uses a sigmoid activation function to produce a probability score for the binary classification.
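A model with these layers could look roughly like the Keras sketch below. The layer sizes, vocabulary size, sequence length, optimizer, and the `build_model` helper are assumptions chosen for illustration; the notebook's actual hyperparameters may differ.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_model(vocab_size=10000, max_len=100, embed_dim=64, lstm_units=64):
    """Embedding -> LSTM -> Dense -> sigmoid binary classifier (illustrative sizes)."""
    model = keras.Sequential([
        keras.Input(shape=(max_len,)),                       # padded word indices
        layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
        layers.LSTM(lstm_units),                             # final hidden state only
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),               # P(spam)
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_model()
dummy_batch = np.zeros((2, 100), dtype="int32")  # two all-padding "emails"
probs = model.predict(dummy_batch, verbose=0)    # shape (2, 1), values in [0, 1]
```

Binary cross-entropy pairs naturally with the single sigmoid output; a probability above a chosen threshold (commonly 0.5) would be labeled spam.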

Libraries Used:

TensorFlow / Keras
NumPy
NLTK (for text preprocessing)
Matplotlib (for visualization)

Evaluation

The performance of the RNN model is evaluated using the following metrics:

- Accuracy: The percentage of all emails that were classified correctly.
- Precision: The share of emails classified as spam that were actually spam.
- Recall: The share of actual spam emails that were correctly identified.
- F1-Score: The harmonic mean of precision and recall, giving a balance between the two.
- Confusion Matrix: The counts of true positives, false positives, true negatives, and false negatives.
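To make the definitions concrete, here is a small dependency-free sketch that derives each metric from the confusion-matrix counts. The function names, the 0.5 threshold, and the sample labels and scores are made up for illustration and are not the project's results.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn), treating label 1 as spam and 0 as ham."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical sigmoid outputs and true labels for six emails.
scores = [0.9, 0.2, 0.7, 0.4, 0.1, 0.8]
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5
metrics = classification_metrics(y_true, y_pred)
```

For spam filtering, precision and recall are often more informative than accuracy alone, because ham usually outnumbers spam and a false positive (a real email sent to the spam folder) tends to be costly.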

Installation

To run this project on your local machine, follow the steps below:

Clone the repository:

```
git clone https://github.com/3m0r9/RNN-Spam-Emails-Detection.git
```

Navigate to the project directory:
```
cd RNN-Spam-Emails-Detection
```

Create a virtual environment and activate it:

```
python3 -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate
```

Install the necessary dependencies:
```
pip install -r requirements.txt
```

Usage

Ensure the dataset (Enron or SpamAssassin) is downloaded and placed in the data/ directory.

Preprocess the dataset:

```
python preprocess_data.py --input data/emails.csv --output data/processed_emails.csv
```

Train the RNN model:

```
python train_model.py --input data/processed_emails.csv
```

Evaluate the model:

```
python evaluate_model.py --input data/processed_emails.csv
```

Results

The RNN model achieved the following results:

- Accuracy: 95% on the test set.
- Precision: 94%.
- Recall: 93%.
- F1-Score: 93.5%.

Further details, including confusion matrices and classification reports, are available in the results/ directory.

Contributors

Imran Abu Libda - 3m0r9

License

This project is licensed under the MIT License. See the LICENSE file for details.

Let's Connect

GitHub - 3m0r9
LinkedIn - Imran Abu Libda
Email - [email protected]
Medium - Imran Abu Libda
