The goal of this project is to develop a Python-based program that processes, visualizes, and classifies data from a dataset containing indicators of breast cancer patients. The project involves exploring various Machine Learning models to predict (diagnose) whether a patient is likely to have breast cancer.
- Data Extraction and Visualization: Extract and visualize key characteristics from a dataset of breast cancer indicators.
- Model Development: Develop and compare multiple classification models (SVM, Random Forest, ANN/DNN) to predict whether a patient is likely to have breast cancer.
- Performance Evaluation: Utilize appropriate metrics (e.g., Accuracy, Precision, Recall) to evaluate the performance of the models and determine the best-performing classifier.
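As a rough illustration of the metrics named above, the sketch below derives Accuracy, Precision, and Recall from confusion-matrix counts in plain Python. It assumes the malignant class is the positive class and is encoded as `1`. The function names `confusion_counts` and `basic_metrics` are hypothetical and are not taken from the project's code.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def basic_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall as a dictionary."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    return {
        # Share of all predictions that are correct.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Share of predicted positives that are truly positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of true positives that the model found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Assumed encoding: 1 = malignant, 0 = benign (labels here are made up).
print(basic_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

In a diagnostic setting, Recall on the malignant class is often weighted heavily, since a missed malignant lesion is usually costlier than a false alarm.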
The dataset used in this project is named `bcdr_f01_features.csv`, which contains 44 variables: 16 integer fields, 27 real (float) fields, and 1 string field. The dataset provides indicators collected from breast cancer patients. Here's a brief description of each variable:
| Column Name | Description |
|---|---|
| `patient_id` | Identifier for each patient. |
| `study_id` | Identifier for each study associated with a patient. |
| `series` | Series number within the study. |
| `lesion_id` | Identifier for each lesion within a study. |
| `segmentation_id` | Identifier for each segmentation of a lesion. |
| `image_view` | The view in which the image was taken (e.g., craniocaudal, mediolateral). |
| `mammography_type` | Type of mammography used (e.g., screening or diagnostic). |
| `mammography_nodule` | Indicates the presence of a nodule (binary). |
| `mammography_calcification` | Indicates the presence of calcification (binary). |
| `mammography_microcalcification` | Indicates the presence of microcalcifications (binary). |
| `mammography_axillary_adenopathy` | Indicates the presence of axillary adenopathy (binary). |
| `mammography_architectural_distortion` | Indicates the presence of architectural distortion (binary). |
| `mammography_stroma_distortion` | Indicates the presence of stromal distortion (binary). |
| `age` | Age of the patient. |
| `density` | Breast density category. |
| `i_mean` | Mean intensity value of the image. |
| `i_std_dev` | Standard deviation of the image intensity. |
| `i_maximum` | Maximum intensity value in the image. |
| `i_minimum` | Minimum intensity value in the image. |
| `i_kurtosis` | Kurtosis of the image intensity distribution. |
| `i_skewness` | Skewness of the image intensity distribution. |
| `s_area` | Area of the segmented region. |
| `s_perimeter` | Perimeter of the segmented region. |
| `s_x_center_mass` | X-coordinate of the center of mass of the segmented region. |
| `s_y_center_mass` | Y-coordinate of the center of mass of the segmented region. |
| `s_circularity` | Circularity of the segmented region. |
| `s_elongation` | Elongation of the segmented region. |
| `s_form` | Form factor of the segmented region. |
| `s_solidity` | Solidity of the segmented region. |
| `s_extent` | Extent (ratio of area to bounding box area) of the segmented region. |
| `t_energ` | Texture energy of the segmented region. |
| `t_contr` | Texture contrast of the segmented region. |
| `t_corr` | Texture correlation of the segmented region. |
| `t_sosvh` | Sum of squares variance of the texture. |
| `t_homo` | Texture homogeneity of the segmented region. |
| `t_savgh` | Sum average of the texture. |
| `t_svarh` | Sum variance of the texture. |
| `t_senth` | Sum entropy of the texture. |
| `t_entro` | Entropy of the texture. |
| `t_dvarh` | Difference variance of the texture. |
| `t_denth` | Difference entropy of the texture. |
| `t_inf1h` | First information measure of correlation. |
| `t_inf2h` | Second information measure of correlation. |
| `classification` | Classification of the lesion as either "Malign" (malignant) or "Benign". |
The target variable in the dataset is `classification`, which indicates whether a patient's lesion is "Malign" (malignant) or "Benign".
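To make the role of the target column concrete, here is a minimal sketch that reads a CSV with pandas and encodes `classification` as a numeric label. The rows in `SAMPLE_CSV` are made up for illustration and are not values from `bcdr_f01_features.csv`. The mapping `Benign -> 0`, `Malign -> 1`, the whitespace stripping, and the names `load_and_encode` and `target` are assumptions, not the project's actual implementation.

```python
import io

import pandas as pd

# Made-up rows for illustration only; not values from the real dataset.
SAMPLE_CSV = """patient_id,age,s_area,classification
3,61,120.5,Malign
1,47,80.0,Benign
2,55,,Benign
"""


def load_and_encode(source):
    """Read a CSV and add a numeric `target` column (assumed mapping)."""
    df = pd.read_csv(source)
    # Defensive: strip stray whitespace around the string labels.
    df["classification"] = df["classification"].str.strip()
    # Assumed encoding: Benign -> 0, Malign -> 1.
    df["target"] = df["classification"].map({"Benign": 0, "Malign": 1})
    return df


df = load_and_encode(io.StringIO(SAMPLE_CSV))
# Class balance, similar in spirit to what a summary command might report.
print(df["classification"].value_counts())
```

With the real file, `load_and_encode("data/bcdr_f01_features.csv")` would be the equivalent call, assuming the file sits in the `/data` directory.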
The program implements a command-line interface (CLI) with the following functionalities:
- `LOAD`: Load a specified dataset file and display summarized information.
- `LOADF`: Load the provided `bcdr_f01_features.csv` file and display summarized information.
- `CLEAR`: Clear the loaded data from memory.
- `QUIT`: Exit the program.
- `DESCRIBE`: Provide a statistical summary of the numerical data and count benign and malignant cases.
- `SORT`: Sort the data by patient ID, handle missing values, and encode the classification labels.
- `CORRELATION`: Remove irrelevant features and visualize correlations using a heatmap.
- `SPLITSCALE`: Split the dataset into training and test sets and scale features.
- `SVM`: Train and test a Support Vector Machine (SVM) classifier (already fine-tuned with Random Search).
- `RANDOMFOREST`: Train and test a Random Forest classifier (already fine-tuned with Random Search).
- `ANN`: Train and test an Artificial Neural Network (ANN) classifier (already fine-tuned with Random Search).
- `METRICS`: Evaluate and display classification performance metrics (Confusion Matrix, Accuracy, Precision, Recall) for all the models developed.
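One common way to structure a command set like the one above is a dispatch table that maps command names to handler functions. The sketch below covers only a few of the commands and uses a string as a stand-in for the loaded table. All names (`build_dispatcher`, `dispatch`, the handler functions, the returned messages) are hypothetical, and this is not how `main.py` is necessarily written.

```python
def build_dispatcher():
    """Return (dispatch, state) for a tiny command interpreter."""
    state = {"data": None, "running": True}

    def load(path):
        if not path:
            return "Usage: LOAD <file>"
        state["data"] = path  # stand-in for the loaded table
        return f"Loaded {path}"

    def clear(_arg):
        state["data"] = None
        return "Data cleared"

    def describe(_arg):
        # Commands that need data refuse to run before a load.
        if state["data"] is None:
            return "No data loaded; run LOAD or LOADF first"
        return f"Summary of {state['data']}"

    def quit_(_arg):
        state["running"] = False
        return "Bye"

    commands = {
        "LOAD": load,
        "LOADF": lambda _arg: load("bcdr_f01_features.csv"),
        "CLEAR": clear,
        "DESCRIBE": describe,
        "QUIT": quit_,
    }

    def dispatch(line):
        # Split "COMMAND argument" and look the command up case-insensitively.
        name, _, arg = line.strip().partition(" ")
        handler = commands.get(name.upper())
        if handler is None:
            return f"Unknown command: {name}"
        return handler(arg.strip())

    return dispatch, state


dispatch, state = build_dispatcher()
for line in ["DESCRIBE", "LOADF", "DESCRIBE", "CLEAR", "QUIT"]:
    print(dispatch(line))
```

Keeping the loaded data in shared state lets later commands check that their prerequisites have run.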
- /scripts: Contains the main script `main.py` that implements the above functionalities.
- /data: Directory to store datasets.
- /results: Directory to save results for each run, such as visualizations and model outputs.
- /logs: Directory to save logs for each run, including prints, error messages and code details.
- Dockerfile: Instructions to containerize the application.
- requirements.txt: List of Python dependencies.
- README.md: This file.
The code is written in Python and requires the following libraries:
- Pandas
- Requests
- NumPy
- Scikit-learn
- Matplotlib
- Seaborn
- Clone the repository:

      git clone https://github.com/HugoTex98/Breast-Cancer-Prediction.git
      cd Breast-Cancer-Prediction

- Create a virtual environment and activate it:

      python3 -m venv venv
      source venv/bin/activate  # On Windows use `venv\Scripts\activate`

- Install the required dependencies:

      pip install -r requirements.txt

To build and run the Docker container for this project, follow the steps below:
First, navigate to the root directory of the project (where `Dockerfile` is located) and run the following command to build the Docker image:
`docker build -t breast-cancer-prediction .`
This command builds the Docker image using the instructions in the Dockerfile and tags it as breast-cancer-prediction.
Once the image is built, you can run the Docker container using the following command:
`docker run -it --rm breast-cancer-prediction`
- `-it`: Runs the container in interactive mode, allowing you to interact with the terminal inside the container.
- `--rm`: Automatically removes the container once it stops running, keeping your environment clean.
- `breast-cancer-prediction`: The name of the Docker image you built.
Since the project is not a web application and does not expose any ports, the output from the script will be directly visible in the terminal where the container runs. If the script generates output files, they will be saved in the container's file system.
To stop the container while it’s running, press Ctrl + C in the terminal where the container is running.
Run the program using the command-line interface:
`python main.py`
Follow the prompts to load data, process it, explore the features, and use Machine Learning models to predict Breast Cancer.
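The workflow behind those prompts (split, scale, tune, evaluate) can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the project's code: it uses synthetic data from `make_classification` instead of the real dataset, and the parameter grid, `n_iter`, `cv`, the `recall` scoring choice, and the 75/25 split are all assumed values.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the feature table (assumption: binary target).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hold out a stratified test set before any scaling is fitted.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The scaler sits inside the pipeline, so it is fitted on training folds only.
pipeline = make_pipeline(StandardScaler(), SVC())

# Random search over a small, assumed hyperparameter space.
search = RandomizedSearchCV(
    pipeline,
    param_distributions={
        "svc__C": [0.1, 1, 10, 100],
        "svc__gamma": ["scale", 0.01, 0.1],
    },
    n_iter=5,
    cv=3,
    scoring="recall",
    random_state=0,
)
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
print("Best parameters:", search.best_params_)
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, zero_division=0))
print("Recall:", recall_score(y_test, y_pred, zero_division=0))
```

Placing the scaler inside the pipeline avoids leaking test-set statistics into training, which is the main reason to scale after splitting.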
In the future, I plan to implement the following improvements and features to enhance the functionality and performance of this project:
- Expand Model Selection:
  - Integrate additional machine learning models, such as Gradient Boosting Machines (GBM) or XGBoost, to compare performance with the current models.
- Model Comparison module:
  - Implement a module inside the `Metrics` program to identify the best model according to the metric most relevant to the use case.
- Hyperparameter Optimization:
  - Implement a module that uses Optuna to optimize the performance of the classifiers.
- Data Augmentation:
  - Apply oversampling techniques such as SMOTE to enlarge the training set and potentially improve model accuracy.
- Containerization and Deployment:
  - Refine the Docker setup to allow for easy deployment on cloud platforms like AWS or Azure, including CI/CD pipeline integration.
- Testing and Validation:
  - Implement unit tests and continuous integration (CI) pipelines to ensure code quality and detect potential issues before deployment (possibly with MLflow for model monitoring).
- AutoML:
  - Possibly adopt an AutoML library such as PyCaret to speed up modelling.