This repository contains scripts for English text classification with models such as TextCNN, TextRNN, TextRCNN, and DPCNN. The code covers dataset preparation, vocabulary construction, and model training and evaluation.
- Python 3.x
- PyTorch
- NumPy
- Pandas
- tqdm
- Gensim (for handling word embeddings)
- Other dependencies listed in requirements.txt
./fasttext
    wiki-news-300d-1M.vec
./glove
    glove.6B.50d.txt
./GoogleNews-vectors-negative300
    GoogleNews-vectors-negative300.bin
./datasets
    vocab.pkl
    labelled_newscatcher_dataset.csv
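
The GloVe and fastText files listed above are plain text, with one token per line followed by its vector components. The following is a minimal sketch of reading that format, not the repository's own loader: `load_text_vectors` is a hypothetical helper, and the sketch assumes space-separated values and skips the `<count> <dim>` header line that fastText `.vec` files start with.

```python
from typing import Dict, List


def load_text_vectors(path: str) -> Dict[str, List[float]]:
    """Read GloVe/fastText-style text vectors into a dict (hypothetical helper)."""
    vectors: Dict[str, List[float]] = {}
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle):
            parts = line.rstrip().split(" ")
            # fastText .vec files begin with a "<count> <dim>" header; skip it.
            if line_no == 0 and len(parts) == 2:
                continue
            token, values = parts[0], parts[1:]
            if not values:
                continue  # ignore blank or malformed lines
            vectors[token] = [float(v) for v in values]
    return vectors
```

The binary GoogleNews file cannot be read this way; Gensim's `KeyedVectors.load_word2vec_format(path, binary=True)` is one way to load it.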
- tool.py: Utility functions for cleaning special characters and contractions.
- train_eval.py: Script for training and evaluating the models.
- run.py: Main script for running the project.
- TextRNN.py: The TextRNN model from the paper "Recurrent Neural Network for Text Classification with Multi-Task Learning".
- DPCNN.py: The DPCNN model from the paper "Deep Pyramid Convolutional Neural Networks for Text Categorization".
- README.md: Project documentation.
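
As an illustration of the kind of cleaning that tool.py is described as performing, here is a minimal sketch under stated assumptions. The function name `clean_text`, the small `CONTRACTIONS` table, and the choice to lowercase the text are all hypothetical and may differ from what the repository actually does.

```python
import re

# Small illustrative mapping (assumption); a real table would be much longer.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i'm": "i am",
}


def clean_text(text: str) -> str:
    """Lowercase, expand a few contractions, and drop special characters."""
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace anything else with a space
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace


print(clean_text("I can't believe it's 50% off!!"))
```

Expanding contractions before stripping punctuation matters: if the apostrophe were removed first, "can't" would become "can t" and no longer match the table.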
1.1 Dataset Structure
- train.csv: CSV file containing training data.
- val.csv: CSV file containing validation data.
- test.csv: CSV file containing test data.
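
One way to produce three such files from a single labelled dataset is a seeded shuffle followed by two cuts. The sketch below is an assumption about how a split could work, not a description of data_split.py; `split_rows`, the 10% validation and test fractions, and the seed are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_rows(rows: Sequence, val_frac: float = 0.1, test_frac: float = 0.1,
               seed: int = 42) -> Tuple[List, List, List]:
    """Shuffle rows and cut them into train/val/test lists (hypothetical helper)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


train, val, test = split_rows(range(100))
print(len(train), len(val), len(test))
```

Each returned list can then be written out with the `csv` module or with pandas `DataFrame.to_csv`.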
1.2 Building Vocabulary
Run the following scripts in sequence to build the vocabulary:
python data_split.py
python dataset_preprocessing.py
python extracting_pre-trained_word_vectors.py
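
For readers unfamiliar with vocabulary construction, the sketch below shows the general idea: count tokens, keep the frequent ones, and assign each an integer id. It is an illustration under assumptions, not the repository's code; `build_vocab`, `save_vocab`, whitespace tokenisation, the `<UNK>` and `<PAD>` tokens, and the default limits are all hypothetical.

```python
import pickle
from collections import Counter
from typing import Dict, Iterable

UNK, PAD = "<UNK>", "<PAD>"  # assumed special tokens


def build_vocab(texts: Iterable[str], max_size: int = 10000,
                min_freq: int = 1) -> Dict[str, int]:
    """Map the most frequent whitespace tokens to integer ids."""
    counts = Counter(token for text in texts for token in text.split())
    kept = [tok for tok, freq in counts.most_common(max_size) if freq >= min_freq]
    vocab = {tok: idx for idx, tok in enumerate(kept)}
    # Reserve the last two ids for unknown words and padding.
    vocab[UNK] = len(vocab)
    vocab[PAD] = len(vocab)
    return vocab


def save_vocab(vocab: Dict[str, int], path: str) -> None:
    """Serialise the mapping with pickle, e.g. to a file such as vocab.pkl."""
    with open(path, "wb") as handle:
        pickle.dump(vocab, handle)
```

At lookup time, any token missing from the mapping would be replaced by the id of `<UNK>`, for example `vocab.get(token, vocab["<UNK>"])`.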
2.1 Configuration
Set the model and embedding type with command-line arguments to the train.py script:
python train.py --model TextCNN --embedding pre_trained
or
python train.py --model DPCNN --embedding pre_trained
These examples train the TextCNN and DPCNN models with pre-trained word embeddings. Adjust the arguments to suit your requirements.
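
A script that accepts the `--model` and `--embedding` flags shown above could parse them with `argparse`. This is a minimal sketch, not the actual contents of train.py: the `parse_args` helper is hypothetical, and `random` as a second embedding option is an assumption, since only `pre_trained` appears in this document.

```python
import argparse
from typing import List, Optional

MODELS = ["TextCNN", "TextRNN", "TextRCNN", "DPCNN"]
EMBEDDINGS = ["pre_trained", "random"]  # "random" is an assumed alternative


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the model and embedding choices from the command line."""
    parser = argparse.ArgumentParser(description="Train a text classifier")
    parser.add_argument("--model", required=True, choices=MODELS,
                        help="model architecture to train")
    parser.add_argument("--embedding", default="pre_trained", choices=EMBEDDINGS,
                        help="how the embedding layer is initialised")
    return parser.parse_args(argv)


args = parse_args(["--model", "DPCNN", "--embedding", "pre_trained"])
print(args.model, args.embedding)
```

Restricting both flags with `choices` makes a misspelled model name fail immediately with a usage message instead of surfacing later during training.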
This project is licensed under the MIT License - see the LICENSE file for details.