This repository contains the source code and resources for a binary classification project that detects AI-generated text. The project is based on a Kaggle competition and combines several classical machine learning models with a fine-tuned DistilRoBERTa model.
The repository is organized as follows:

- `data/`: Pre-processed and post-processed training and test datasets in CSV format. The training data can be augmented with custom-generated synthetic data. The provided test set is a placeholder and should be replaced for actual use.
- `model_checkpoints/`: Checkpoints of the trained models.
- `EDA.ipynb`: Jupyter notebook for exploratory data analysis of the training set.
- `generate_synthetic_essays.ipynb`: Notebook for generating synthetic training data with Mistral-7B-Instruct.
- `data_processing.py`: Processes the training and test sets, tokenizes and vectorizes the texts, and saves the resulting sparse matrices as NPZ files in the `data/` folder (a sketch of this step follows this list).
- `optuna.ipynb`: Hyperparameter optimization for the classical ML models (Ridge, Multinomial Naive Bayes, SVM, and XGBoost), with visualizations of optimization history and parameter importance.
- `classical_models_training.py`: Trains the four classical ML models and saves them as `.pkl` files in the `model_checkpoints/` folder.
- `distilroberta_training.py`: Fine-tunes the pre-trained DistilRoBERTa-base model on the training set and saves the checkpoint to the `model_checkpoints/` folder.
- `inference.py`: Loads the trained classical ML models and DistilRoBERTa and ensembles them using weights to make predictions on the test set.
- `pseudo_labeling.py`: Implements pseudo-labeling, leveraging confident predictions to further improve accuracy.
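For orientation, here is a minimal sketch of what the vectorization step can look like, assuming scikit-learn's `TfidfVectorizer`. The file names (`train.csv`, `test.csv`, `*_vectors.npz`), the `text` column, and all vectorizer parameters are illustrative assumptions, not necessarily what `data_processing.py` actually uses:

```python
# Sketch of the tokenize/vectorize/save-NPZ step (file and column names assumed).
import pandas as pd
from scipy.sparse import save_npz
from sklearn.feature_extraction.text import TfidfVectorizer

train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

# Word n-gram TF-IDF; the real script may use different settings.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True)
X_train = vectorizer.fit_transform(train["text"])
X_test = vectorizer.transform(test["text"])  # fit on train only, transform test

save_npz("data/train_vectors.npz", X_train)
save_npz("data/test_vectors.npz", X_test)
```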
- Clone the repository:

  ```bash
  git clone https://github.com/q-xZzz/ai-text-detection.git
  ```

- Install the dependencies:

  ```bash
  pip install -r requirements.txt
  ```
To get the project up and running, follow these steps:
- (optional) Exploratory Data Analysis: open `EDA.ipynb` in Jupyter Notebook or JupyterLab to explore the training dataset.
- (optional) Synthetic Training Data: use `generate_synthetic_essays.ipynb` to create additional synthetic data for training (a generation sketch follows this list).
- Data Processing: run `python data_processing.py` to tokenize and vectorize the datasets and save the processed data for training and testing.
- (optional) Hyperparameter Optimization: launch `optuna.ipynb` to find optimal hyperparameters for the classical ML models (an Optuna sketch follows this list).
- Model Training:
  - For the classical ML models, execute `python classical_models_training.py`.
  - For DistilRoBERTa, run `python distilroberta_training.py` (a fine-tuning sketch follows this list).
- Inference: use `python inference.py` to load the trained models, ensemble them, and make predictions on the test set (an ensembling sketch follows this list).
- Pseudo Labeling: apply pseudo-labeling with `pseudo_labeling.py` to further refine accuracy by leveraging confident predictions (a pseudo-labeling sketch follows this list).
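For reference, here is one way synthetic essays can be generated with Mistral-7B-Instruct through the Hugging Face `transformers` pipeline. The checkpoint ID, prompt, and sampling parameters are assumptions, not what `generate_synthetic_essays.ipynb` necessarily uses (the Mistral checkpoints are gated, so downloading may require accepting the license on Hugging Face):

```python
# Sketch of synthetic essay generation (checkpoint ID and prompt are assumptions).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed checkpoint
    device_map="auto",
)

# Mistral-Instruct expects [INST] ... [/INST] formatting around the instruction.
prompt = "[INST] Write a short persuasive essay about renewable energy. [/INST]"
out = generator(prompt, max_new_tokens=400, do_sample=True, temperature=0.8)
print(out[0]["generated_text"])
```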
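As a reference for the kind of search `optuna.ipynb` runs, here is a minimal Optuna loop tuning one of the classical models (Multinomial Naive Bayes). The search space, scoring metric, and the `generated` label column are assumptions:

```python
# Sketch of Optuna tuning for one classical model (search space is illustrative).
import optuna
import pandas as pd
from scipy.sparse import load_npz
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

X = load_npz("data/train_vectors.npz")          # assumed file name
y = pd.read_csv("data/train.csv")["generated"]  # assumed label column

def objective(trial):
    # Log-uniform search over the smoothing parameter.
    alpha = trial.suggest_float("alpha", 1e-3, 10.0, log=True)
    model = MultinomialNB(alpha=alpha)
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```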
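Below is a minimal sketch of DistilRoBERTa fine-tuning with the Hugging Face `Trainer`, assuming a `text` column and a binary `generated` label column; all hyperparameters are illustrative, not the project's actual settings:

```python
# Sketch of DistilRoBERTa fine-tuning (column names and hyperparameters assumed).
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilroberta-base", num_labels=2)

df = pd.read_csv("data/train.csv").rename(columns={"generated": "labels"})
dataset = Dataset.from_pandas(df[["text", "labels"]])

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="model_checkpoints/distilroberta",  # checkpoint destination
    num_train_epochs=2,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

# Passing the tokenizer enables dynamic padding via the default data collator.
Trainer(model=model, args=args, train_dataset=dataset,
        tokenizer=tokenizer).train()
```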
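`inference.py` is described as ensembling the models using weights; a common way to do that is a weighted average of each model's positive-class probabilities. The sketch below illustrates the idea, with placeholder weights rather than the values the project uses:

```python
# Sketch of weighted probability averaging (weights are placeholders).
import numpy as np

def ensemble(prob_lists, weights):
    """Weighted average of per-model positive-class probabilities."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    return np.average(np.vstack(prob_lists), axis=0, weights=weights)

# e.g. probabilities from NB, SVM, XGBoost, Ridge, and DistilRoBERTa:
# final = ensemble([p_nb, p_svm, p_xgb, p_ridge, p_roberta],
#                  weights=[0.1, 0.15, 0.2, 0.15, 0.4])
```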
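Finally, a common form of the pseudo-labeling that `pseudo_labeling.py` performs is to take test examples whose predicted probability is extreme (near 0 or 1), add them to the training set with their predicted labels, and retrain. The sketch below illustrates that idea; the threshold and data layout are assumptions:

```python
# Sketch of confidence-thresholded pseudo-labeling (threshold is an assumption).
import numpy as np
from scipy.sparse import vstack

def add_pseudo_labels(X_train, y_train, X_test, test_probs, threshold=0.99):
    """Fold confident test predictions back into the training set."""
    # Keep only rows predicted with very high or very low probability.
    confident = (test_probs >= threshold) | (test_probs <= 1 - threshold)
    X_aug = vstack([X_train, X_test[confident]])
    y_aug = np.concatenate([y_train,
                            (test_probs[confident] >= 0.5).astype(int)])
    return X_aug, y_aug
```

The models can then be retrained on the augmented set and re-ensembled for the final predictions.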