This project provides a Python pipeline that automates binary or multi-class classification on an input data matrix, supplied as a standard Feature x Instance matrix in a .csv file.
The project consists of multiple Python scripts, each responsible for one step of the machine learning pipeline. The main script runs them in order, so the whole workflow is automated.

Preprocessing:
- Handles missing values and string conversion for the main input file.
- User input: classification data in .csv format.
- Output: preprocessed data matrix.
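
The preprocessing step described above can be illustrated with a small, dependency-free sketch. It is not the project's actual script: the function name `clean_feature`, the set of missing-value markers, the mean imputation, and the integer encoding of strings are all assumptions chosen for illustration.

```python
from statistics import mean

# Assumed missing-value markers; the real script may recognise others.
MISSING = {"", "NA", "NaN", "nan"}


def is_number(text):
    """Return True if `text` parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def clean_feature(values):
    """Clean one feature, i.e. one row of a Feature x Instance matrix.

    Numeric features: missing cells are replaced by the mean of the observed cells.
    String features: each distinct string gets an integer code; missing cells become -1.
    """
    observed = [v.strip() for v in values if v.strip() not in MISSING]
    if observed and all(is_number(v) for v in observed):
        fill = mean(float(v) for v in observed)
        return [float(v) if v.strip() not in MISSING else fill for v in values]
    codes = {label: i for i, label in enumerate(sorted(set(observed)))}
    return [codes.get(v.strip(), -1) for v in values]


numeric = clean_feature(["1.0", "", "3.0"])
categorical = clean_feature(["red", "blue", "NA", "red"])
```

Here `numeric` has its gap filled with the mean of the two observed values, and `categorical` maps "blue" and "red" to integer codes in alphabetical order.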

Normalization/Standardization:
- Normalizes or standardizes the input data.
- User input: type of normalization or standardization.
- Output: normalized or standardized data matrix.
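
As a rough illustration of the two options, the sketch below applies min-max normalization and z-score standardization to the values of a single feature. The function names and the `method` strings are hypothetical, and the project's script may offer other scaling types.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def scale(values, method):
    """Dispatch on an assumed user choice: 'minmax' or 'zscore'."""
    methods = {"minmax": min_max, "zscore": z_score}
    if method not in methods:
        raise ValueError(f"unknown scaling method: {method!r}")
    return methods[method](values)


normalized = scale([2.0, 4.0, 6.0], "minmax")
standardized = scale([1.0, 2.0, 3.0], "zscore")
```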

Feature Selection:
- Performs feature selection on the preprocessed data.
- User input: feature selection method and its parameters.
- Output: data matrix containing only the selected features.
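
The README does not say which feature selection methods are supported, so the sketch below uses variance thresholding purely as a stand-in. The function name `select_by_variance` and the dictionary layout (feature name mapped to its values across instances) are assumptions.

```python
from statistics import pvariance


def select_by_variance(features, threshold=0.0):
    """Keep features whose population variance across instances exceeds `threshold`.

    `features` maps a feature name to its list of values, one value per instance.
    """
    return {
        name: values
        for name, values in features.items()
        if pvariance(values) > threshold
    }


matrix = {
    "constant": [1.0, 1.0, 1.0],  # carries no information, variance 0
    "varying": [0.0, 1.0, 2.0],
}
selected = select_by_variance(matrix)
```

With the default threshold, only features that actually vary across instances survive.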

Cross-Validation Script:
- Runs cross-validation on the data.
- User input: number of folds for cross-validation.
- Output: cross-validated performance metrics.
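
To show what the number of folds controls, here is a minimal k-fold splitter. It is a sketch under assumptions: the name `k_fold_indices`, the fixed shuffle seed, and the plain (non-stratified) split are illustrative choices, not necessarily what the project's script does.

```python
import random


def k_fold_indices(n_instances, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_instances:
        raise ValueError("k must be between 2 and the number of instances")
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(10, 5))
```

Each instance appears in exactly one test fold, and the metrics from the k rounds would then be averaged.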

Machine Learning Modeling Script:
- Trains the classification model.
- User input: classification algorithm and its hyperparameters.
- Output: trained machine learning model.
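
The supported algorithms are not listed here, so the following sketch uses a nearest-centroid classifier only as a simple stand-in for "train a model, then predict". The function names are hypothetical. Note that `X` holds one list of feature values per instance, i.e. the transpose of the Feature x Instance input layout.

```python
from collections import defaultdict
from math import dist


def fit_centroids(X, y):
    """Return the mean feature vector of each class label."""
    grouped = defaultdict(list)
    for row, label in zip(X, y):
        grouped[label].append(row)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict(centroids, X):
    """Assign each instance to the class with the closest centroid (Euclidean)."""
    return [
        min(centroids, key=lambda label: dist(centroids[label], row))
        for row in X
    ]


X_train = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
y_train = ["a", "a", "b", "b"]
model = fit_centroids(X_train, y_train)
predictions = predict(model, [[0.2, 0.1], [5.0, 6.0]])
```

The same code handles binary and multi-class problems, since it keeps one centroid per distinct label.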

Prediction Script:
- Evaluates the model's predictive performance on a blind (held-out) dataset.
- Output: accuracy and other performance metrics on the blind dataset.
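
A minimal sketch of how accuracy and confusion counts could be computed from blind-set predictions follows. The function names are hypothetical, and the project's script may report further metrics.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs; works for any number of classes."""
    return Counter(zip(y_true, y_pred))


y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
acc = accuracy(y_true, y_pred)
confusion = confusion_counts(y_true, y_pred)
```

The confusion counts are the raw material for a heatmap and for per-class metrics such as precision and recall.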

Main Script:
- Orchestrates the execution of the above scripts.
- User input: file path of the input data matrix (.csv).
- Output: a PDF file containing plots, heatmaps, and performance metrics.
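
Conceptually, the orchestration amounts to passing the output of each step to the next one. The sketch below shows that idea with plain functions. `run_pipeline` and the toy steps are hypothetical, and the real main script may instead launch separate scripts and prompt for input between them.

```python
def run_pipeline(data, steps):
    """Apply each (name, function) step in order, feeding each output to the next."""
    completed = []
    for name, step in steps:
        data = step(data)
        completed.append(name)  # record which steps ran, e.g. for the report
    return data, completed


# Toy stand-ins for the real pipeline stages.
steps = [
    ("double", lambda xs: [2 * x for x in xs]),
    ("drop_zero", lambda xs: [x for x in xs if x != 0]),
]
result, completed = run_pipeline([0, 1, 2], steps)
```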

Clone the repository:
git clone https://github.com/yourusername/automated-ml-classification.git
cd automated-ml-classification

Install dependencies:
pip install fpdf

Run the main script:
python main.py

Follow the prompts to provide input options for each step of the pipeline.

Check the output PDF file for performance metrics and plots.
- Ensure that the input data matrix is in the required format (Feature x Instance matrix in .csv format).
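
One way to check the format before running the pipeline is sketched below: it verifies that every row of the CSV has the same number of cells and shows how to transpose between feature-major and instance-major layouts. The function names are hypothetical, and treating the first cell of each row as a feature name is an assumption about the file layout.

```python
import csv
import io


def load_matrix(text):
    """Parse CSV text and verify that all rows have the same number of cells."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        raise ValueError("the input matrix is empty")
    width = len(rows[0])
    for number, row in enumerate(rows, start=1):
        if len(row) != width:
            raise ValueError(
                f"row {number} has {len(row)} cells, expected {width}"
            )
    return rows


def transpose(rows):
    """Swap rows and columns, e.g. Feature x Instance to Instance x Feature."""
    return [list(column) for column in zip(*rows)]


# Assumed layout: one feature per row, first cell is the feature name.
rows = load_matrix("f1,1,2\nf2,3,4\n")
by_instance = transpose(rows)
```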
Feel free to contribute, report issues, or suggest improvements. Happy classifying!