Table of Contents
- Introduction
- 1.1 Program Overview
- Data Source
- 2.1 Dataset Overview
- 2.2 Accessing the Dataset
- Task Specifications
- 3.1 Data Management
- 3.1.1 Data Acquisition
- 3.1.2 Exploratory Data Analysis (EDA)
- 3.1.3 Data Preprocessing
- 3.2 Model Engineering
- 3.2.1 Dataset Splitting
- 3.2.2 Model Architecture
- 3.2.3 Model Training and Validation
- 3.3 Evaluation and Analysis
- 3.3.1 Performance Testing
- 3.3.2 Metrics Reporting
- 3.4 Conclusion and Future Work
This project was done as part of the "Bytes of Intelligence: Data Science and AI Internship Program," an innovative platform designed to propel aspiring data scientists and AI enthusiasts to the forefront of technological advancement and real-world problem-solving. The program provides a comprehensive learning experience in data science and AI through workshops, challenges, and mentorship.
The dataset for the Cassava Leaf Disease Classification Challenge is a comprehensive collection of annotated images representing various common diseases affecting cassava plants, one of the most crucial crop resources in tropical and subtropical regions. It includes thousands of high-resolution images categorized into several disease classes, as well as a category for healthy leaves, organized into 5 folders, one per class. Some of the data is mislabeled: inside one class folder, one may find images belonging to another class. The dataset is also highly imbalanced; most of the images belong to Cassava Mosaic Disease (CMD).
The dataset is hosted on Kaggle, a popular platform for data science competitions and collaborative projects.
The data was downloaded from Kaggle.
Conduct an in-depth EDA to understand the dataset's characteristics:
- Distribution of Classes: There were 5 classes:
  - 'Cassava Mosaic Disease (CMD)': 10526
  - 'Healthy': 2061
  - 'Cassava Green Mottle (CGM)': 1909
  - 'Cassava Brown Streak Disease (CBSD)': 1751
  - 'Cassava Bacterial Blight (CBB)': 870
- Image Quality and Variability: Most of the images were high resolution, with an 800×600 shape.
- Data Insights: Some of the data is mislabeled: inside one class folder, one may find images belonging to another class. The dataset is highly imbalanced; most images belong to Cassava Mosaic Disease (CMD).
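The severity of the imbalance can be quantified directly from the class counts reported above (a small sketch in plain Python; the counts are those from the EDA):

```python
# Class counts reported in the EDA.
class_counts = {
    "Cassava Mosaic Disease (CMD)": 10526,
    "Healthy": 2061,
    "Cassava Green Mottle (CGM)": 1909,
    "Cassava Brown Streak Disease (CBSD)": 1751,
    "Cassava Bacterial Blight (CBB)": 870,
}

total = sum(class_counts.values())
for name, count in class_counts.items():
    print(f"{name}: {count} ({count / total:.1%})")

# Imbalance ratio: majority class (CMD) vs. minority class (CBB).
ratio = max(class_counts.values()) / min(class_counts.values())
print(f"Imbalance ratio: {ratio:.1f}x")  # 12.1x
```

CMD alone accounts for about 61% of the 17117 images, which is why accuracy alone is a misleading metric for this dataset.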
Preparation of the dataset for modeling: The dataset was turned into a pandas DataFrame holding the labels and image data. The data was then balanced, and the model was tested on both the balanced and the unbalanced data.
As the data was imbalanced, augmentation slightly increased performance, at the cost of a large amount of additional training time.
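The balancing step is not detailed here; one common approach for such a labeled image list is random undersampling of the larger classes. The sketch below assumes records stored as (image_path, label) tuples; the `undersample` helper and the `per_class` cap are illustrative assumptions, not the project's actual code:

```python
import random
from collections import defaultdict

def undersample(records, per_class, seed=42):
    """Randomly keep at most `per_class` samples from each class.

    `records` is a list of (image_path, label) tuples.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[1]].append(rec)
    balanced = []
    for label, recs in by_label.items():
        rng.shuffle(recs)          # shuffle so the kept subset is random
        balanced.extend(recs[:per_class])
    return balanced

# Toy example: 5 'CMD' images vs. 2 'Healthy' images.
toy = [(f"img{i}.jpg", "CMD") for i in range(5)] + \
      [(f"img{i}.jpg", "Healthy") for i in range(5, 7)]
balanced = undersample(toy, per_class=2)
# Each class now contributes at most 2 samples.
```

Undersampling discards majority-class images, which matches the observation later that the balanced set (8000 images) is smaller than the unbalanced one (13693).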
- Image Resizing: Images were resized to 256×256 to standardize input sizes.
- Rescaling: Pixel values were normalized to the 0 to 1 range.
- Augmentation: RandomFlip, RandomRotation, RandomZoom, and RandomContrast layers were also applied.
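The resizing and rescaling steps can be sketched with NumPy (a simplified stand-in: the project presumably used tf.keras utilities, and the nearest-neighbour resize below is only illustrative; the Keras augmentation layers named above are omitted):

```python
import numpy as np

def resize_nearest(img, size=(256, 256)):
    """Naive nearest-neighbour resize (a stand-in for a real resize op)."""
    h, w = img.shape[:2]
    rows = np.arange(size[0]) * h // size[0]   # source row for each output row
    cols = np.arange(size[1]) * w // size[1]   # source column for each output column
    return img[rows][:, cols]

def rescale(img_uint8):
    """Map uint8 pixel values [0, 255] to float32 values in [0, 1]."""
    return img_uint8.astype(np.float32) / 255.0

# Toy 800x600 image, matching the typical resolution noted in the EDA.
img = np.random.randint(0, 256, size=(600, 800, 3), dtype=np.uint8)
out = rescale(resize_nearest(img))
```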
Tests were done on both a custom model and a pretrained model. In the custom model, Convolutional and MaxPooling layers were used repeatedly in the first part; in the second part, dense and dropout layers were used after flattening.
The dataset was divided into three subsets. The train dataset provided by Kaggle was split into two sets:
- Training Set: The largest portion, used to train the model; 80% of the Kaggle train dataset was used for training.
- Validation Set: Used to tune model parameters and prevent overfitting; 20% of the Kaggle train dataset was used for validation.
- Test Set: The test dataset provided by Kaggle was reserved for evaluating the model's performance on unseen data.
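The 80/20 split of the Kaggle train set can be sketched in plain Python (the seed and helper name are illustrative assumptions, not the project's actual code):

```python
import random

def train_val_split(records, val_frac=0.2, seed=42):
    """Shuffle records and split them into (train, validation) lists."""
    recs = list(records)
    random.Random(seed).shuffle(recs)   # fixed seed for a reproducible split
    n_val = int(len(recs) * val_frac)
    return recs[n_val:], recs[:n_val]

samples = [f"img{i}.jpg" for i in range(100)]
train, val = train_val_split(samples)
print(len(train), len(val))  # 80 20
```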
Layer Structure:
- In the first section, 7 convolutional layers were used, each followed by a MaxPooling layer; the numbers of filters were 32, 64, and 128, and the kernel size was (3,3).
- In the second part, 2 dense layers were used after flattening, each followed by a dropout layer. The numbers of neurons were 256 and 128, and dropout rates of 35% and 30% were applied to prevent overfitting.
- Activation Functions: The "Softmax" activation function was used in the last layer; "ReLU" was used in all other layers.
- Transfer Learning: A pretrained EfficientNetB0 model was used to compare performance.
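A quick check of the arithmetic behind this stack: assuming padded ("same") 3×3 convolutions that preserve spatial size, each of the 7 pooling steps halves the resolution, so a 256×256 input reaches 2×2 before flattening:

```python
size = 256
for _ in range(7):
    # A padded 3x3 convolution keeps the size; 2x2 max pooling halves it.
    size //= 2
print(size)               # 2: spatial size before flattening
print(size * size * 128)  # 512: features fed into the dense layers (128 final filters)
```

This also shows why 7 pooling stages is roughly the maximum for a 256×256 input: one more halving would leave a 1×1 feature map.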
The model was trained on both balanced and unbalanced data for 50 epochs.
- Training and validation accuracy of EfficientNetB0 on the balanced dataset
- Training and validation accuracy of the custom model on the balanced dataset
- Training and validation accuracy of the custom model on the unbalanced dataset
- Training and validation loss of EfficientNetB0 on the balanced dataset
- Training and validation loss of the custom model on the balanced dataset
- Training and validation loss of the custom model on the unbalanced dataset
- Validation and test accuracy of EfficientNetB0 on the balanced dataset
- Validation and test accuracy of the custom model on the balanced dataset
- Validation and test accuracy of the custom model on the unbalanced dataset
- Confusion matrix of EfficientNetB0 on the balanced dataset
- Confusion matrix of the custom model on the balanced dataset
- Confusion matrix of the custom model on the unbalanced dataset
- Image predictions by EfficientNetB0
- Image predictions by the custom model trained on the balanced dataset
- Image predictions by the custom model trained on the unbalanced dataset
Though the custom model trained on the unbalanced dataset seems to perform better on accuracy, the confusion matrices and image predictions show that the custom model trained on the balanced dataset actually performs better: it predicts all classes, whereas the model trained on the unbalanced dataset predicts mostly Cassava Mosaic Disease (CMD). On the other hand, the unbalanced dataset provided 13693 trainable images compared to 8000 in the balanced set. If the custom model trained on the balanced dataset were trained for more epochs and with more data, it would surely perform better.
Certainly, the pretrained model performed better than the custom model; because of my limited knowledge, my custom model is quite simple.
- Precision, recall, and F1 score of EfficientNetB0
- Precision, recall, and F1 score of the custom model trained on the balanced dataset
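Per-class precision, recall, and F1 can all be derived from a confusion matrix. A minimal computation is sketched below; the toy 2×2 matrix is illustrative, not the project's actual results:

```python
def per_class_metrics(cm):
    """Compute (precision, recall, F1) per class from a confusion matrix.

    cm[i][j] = number of samples of true class i predicted as class j.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp   # column sum minus diagonal
        fn = sum(cm[k]) - tp                        # row sum minus diagonal
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics.append((precision, recall, f1))
    return metrics

# Toy 2-class confusion matrix (not the project's actual results).
cm = [[8, 2],
      [1, 9]]
metrics = per_class_metrics(cm)
```

Per-class recall is the metric that exposes the failure mode described above: a model that predicts mostly CMD scores near-zero recall on the minority classes even when overall accuracy looks good.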
Though the performance was not satisfactory, as this was my first project I am happy with it. My custom model was designed to reach the best possible performance by fine-tuning the number of neurons, the number of layers, and the dropout percentage, but for various reasons the performance was average.
However, the dataset and model performance were visualized nicely, and the performance of EfficientNetB0 was quite similar to that of the custom model trained on the unbalanced dataset. The dataset was imbalanced and my sampling technique was not up to date. I hope my future work will be satisfactory.