Join our 2024 Data Science and AI Internship Program at Bytes of Intelligence. Gain practical skills, mentorship, and real-world experience in the dynamic fields of data science and artificial intelligence, preparing you for future success in tech.

abdullahsakib/Data-Science-and-AI-Internship-Program-2024

 
 


Cassava Leaf Disease Classification by Custom Model


Table of Contents

  1. Introduction
    • 1.1 Program Overview
  2. Data Source
    • 2.1 Dataset Overview
    • 2.2 Accessing the Dataset
  3. Task Specifications
    • 3.1 Data Management
      • 3.1.1 Data Acquisition
      • 3.1.2 Exploratory Data Analysis (EDA)
      • 3.1.3 Data Preprocessing
    • 3.2 Model Engineering
      • 3.2.1 Dataset Splitting
      • 3.2.2 Model Architecture
      • 3.2.3 Model Training and Validation
    • 3.3 Evaluation and Analysis
      • 3.3.1 Performance Testing
      • 3.3.2 Metrics Reporting
    • 3.4 Conclusion and Future Work

1. Introduction

1.1 Program Overview

This project was completed as part of the "Bytes of Intelligence: Data Science and AI Internship Program", an innovative platform designed to propel aspiring data scientists and AI enthusiasts to the forefront of technological advancement and real-world problem-solving. The program provides a comprehensive learning experience in data science and AI through workshops, challenges, and mentorship.

2. Data Source

2.1 Dataset Overview

The dataset for the Cassava Leaf Disease Classification Challenge is a comprehensive collection of annotated images representing various common diseases affecting cassava plants, one of the most crucial crop resources in tropical and subtropical regions. It includes thousands of high-resolution images categorized into several disease classes, as well as a category for healthy leaves. The data is organized into 5 folders, one per class. Some images are mislabeled: a folder named for one class may contain images that belong to another. The dataset is highly imbalanced, and most of the images are Cassava Mosaic Disease (CMD).

2.2 Accessing the Dataset

The dataset is hosted on Kaggle, a popular platform for data science competitions and collaborative projects.

3. Task Specifications

3.1 Data Management

3.1.1 Data Acquisition

The data was downloaded from Kaggle.

3.1.2 Exploratory Data Analysis (EDA)

An in-depth EDA was conducted to understand the dataset's characteristics:

  • Distribution of Classes: There were 5 classes:
    • Cassava Mosaic Disease (CMD): 10526
    • Healthy: 2061
    • Cassava Green Mottle (CGM): 1909
    • Cassava Brown Streak Disease (CBSD): 1751
    • Cassava Bacterial Blight (CBB): 870

[Figure: distri_data]

  • Image Quality and Variability: Most of the images were high resolution, with a size of 800*600 pixels.

  • Data Insights: Some images are mislabeled: a folder named for one class may contain images that belong to another. The dataset is highly imbalanced, and most of the images are Cassava Mosaic Disease (CMD).

[Figure: data show]
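The degree of imbalance can be read directly off the class counts listed above. The short sketch below only restates those counts and derives each class's share and the ratio between the largest and smallest class; the variable names are illustrative and not taken from the notebook.

```python
# Class counts as reported in the EDA list above.
class_counts = {
    "Cassava Mosaic Disease (CMD)": 10526,
    "Healthy": 2061,
    "Cassava Green Mottle (CGM)": 1909,
    "Cassava Brown Streak Disease (CBSD)": 1751,
    "Cassava Bacterial Blight (CBB)": 870,
}

total = sum(class_counts.values())

# Fraction of the whole dataset that each class represents.
shares = {name: count / total for name, count in class_counts.items()}

# How many times larger the biggest class is than the smallest one.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"total images: {total}")
print(f"largest / smallest class: {imbalance_ratio:.1f}x")
```

CMD alone accounts for roughly 61% of the 17117 images, about twelve times the size of the smallest class (CBB).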

3.1.3 Data Preprocessing

Preparation of the dataset for modeling: The dataset was converted into a pandas DataFrame containing the labels and image data. The data was then balanced, and the model was tested on both the balanced and the unbalanced data.

Because the data was imbalanced, augmentation improved performance slightly, but at the cost of a much longer training time.

  • Image Resizing: Images were resized to 256*256 to standardize image sizes.
  • Rescaling: Pixel values were normalized to the range 0 to 1.
  • Augmentation: RandomFlip, RandomRotation, RandomZoom, and RandomContrast were also applied.
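The README does not say how the data was balanced, so the following is only a minimal sketch of one common approach: resample every class to the same target size, undersampling large classes and oversampling small ones with replacement. The function name `balance_by_resampling`, the `target_per_class` parameter, and the toy file names are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_by_resampling(samples, target_per_class, seed=0):
    """Return a list in which every label appears exactly target_per_class times.

    samples is a list of (image_path, label) pairs. This is an illustrative
    approach, not necessarily the one used in the notebook.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append((path, label))

    balanced = []
    for items in by_label.values():
        if len(items) >= target_per_class:
            # Large class: undersample without replacement.
            balanced.extend(rng.sample(items, target_per_class))
        else:
            # Small class: keep every item, then oversample with replacement.
            balanced.extend(items)
            balanced.extend(rng.choices(items, k=target_per_class - len(items)))
    rng.shuffle(balanced)
    return balanced


# Toy data standing in for the real (path, label) table.
toy = [(f"cmd_{i}.jpg", "CMD") for i in range(50)]
toy += [(f"cbb_{i}.jpg", "CBB") for i in range(5)]

balanced = balance_by_resampling(toy, target_per_class=20)
counts = Counter(label for _, label in balanced)
print(counts)
```

Oversampling with replacement duplicates images, so with this approach it is safer to split off the validation data before balancing; otherwise copies of the same image can end up in both sets.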

3.2 Model Engineering

Tests were done on both a custom model and a pretrained model.

The custom model mostly consists of repeated convolutional and max-pooling layers. In the second part, dense and dropout layers are used after flattening.

[Figure: model2]

3.2.1 Dataset Splitting

The data was used as three subsets. The training dataset provided by Kaggle was divided into two sets, and the Kaggle test dataset was kept separate:

  • Training Set: The largest portion, used to train the model. 80% of the training dataset provided by Kaggle was used for training.
  • Validation Set: Used to tune model parameters and prevent overfitting. 20% of the training dataset provided by Kaggle was used for validation.
  • Test Set: The test dataset provided by Kaggle was reserved for evaluating the model's performance on unseen data.
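As a rough illustration of the 80/20 split described above, the sketch below shuffles a list of image indices and cuts it at 80%. The helper `train_val_split`, the seed, and the use of plain integer ids are assumptions made for this example; the notebook may use a library utility or a stratified split instead.

```python
import random


def train_val_split(items, train_fraction=0.8, seed=42):
    """Shuffle items and split them into (train, validation) lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 17117 is the sum of the five class counts reported in the EDA section.
image_ids = list(range(17117))
train_ids, val_ids = train_val_split(image_ids)
print(len(train_ids), len(val_ids))
```

With 17117 images this gives 13693 training and 3424 validation images, which agrees with the 13693 unbalanced training images mentioned in the evaluation section.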

3.2.2 Model Architecture

Layer Structure:

  • In the first section, 7 convolutional layers were used, each followed by a max-pooling layer (7 in total). The numbers of filters were 32, 64 and 128, and the kernel size was (3,3).

[Figure: first section of the custom model]

  • In the second part, after flattening, 2 dense layers were used, each followed by a dropout layer. The dense layers had 256 and 128 units, and dropout rates of 35% and 30% were applied to prevent overfitting.

[Figure: second section of the custom model]

  • Activation Functions: The "ReLU" activation function was used in every layer except the last one, which uses "Softmax".

  • Transfer Learning: A pretrained EfficientNetB0 model was used to compare the performance.
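To see how the 256*256 input shrinks through the convolutional section, the sketch below traces only the spatial size through 7 convolution plus max-pooling blocks. It assumes stride-1 convolutions with a 3*3 kernel, 2*2 pooling with stride 2, alternating conv and pool layers, and 128 filters in the last convolution; none of these details are confirmed by the README, and the helper names are made up for this example.

```python
def conv_out(size, kernel=3, padding="same"):
    """Spatial size after a stride-1 convolution."""
    if padding == "same":
        return size
    return size - kernel + 1  # "valid" padding


def pool_out(size, pool=2):
    """Spatial size after max pooling with stride equal to the pool size."""
    return size // pool


def trace_spatial_size(input_size=256, blocks=7, padding="same"):
    """Return the feature-map side length at the input and after each conv+pool block."""
    sizes = [input_size]
    size = input_size
    for block in range(1, blocks + 1):
        size = conv_out(size, padding=padding)
        if size < 1:
            raise ValueError(f"block {block}: input is smaller than the kernel")
        size = pool_out(size)
        sizes.append(size)
    return sizes


sizes = trace_spatial_size()
print(sizes)

# Assumed 128 filters in the last convolution (the largest value listed above).
flattened_units = sizes[-1] * sizes[-1] * 128
print(flattened_units)
```

Under these assumptions the feature map is 2*2 after the seventh block, so the flatten layer feeds 512 values into the first dense layer. With "valid" padding the trace raises an error at the seventh convolution, because its 2*2 input is smaller than the 3*3 kernel, so seven such blocks on a 256*256 input only work with padded convolutions or a different layer arrangement.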

3.2.3 Model Training and Validation

The models were trained on both balanced and unbalanced data for 50 epochs.

  • Training and validation accuracy by EfficientNetB0 on balanced dataset

[Figure: acc_g_eff]

  • Training and validation accuracy by custom model on balanced dataset

[Figure: custom_acc]

  • Training and validation accuracy by custom model on unbalanced dataset

[Figure: unb_acc_train]

  • Training and validation loss by EfficientNetB0 on balanced dataset

[Figure: loss_g_eff]

  • Training and validation loss by custom model on balanced dataset

[Figure: custom_loss]

  • Training and validation loss by custom model on unbalanced dataset

[Figure: unb_loss]

3.3 Evaluation and Analysis

  • Validation and test accuracy by EfficientNetB0 on balanced dataset

[Figure: accuracy_effic]

  • Validation and test accuracy by custom model on balanced dataset

[Figure: acbb]

  • Validation and test accuracy by custom model on unbalanced dataset

[Figure: accunb]

3.3.1 Performance Testing

  • Confusion matrix by EfficientNetB0 on balanced dataset

[Figure: con_eff]

  • Confusion matrix by custom model on balanced dataset

[Figure: cm_b]

  • Confusion matrix by custom model on unbalanced dataset

[Figure: cmub]

  • Image prediction by EfficientNetB0

[Figure: pred_eff]

  • Image prediction by custom model trained on balanced dataset

[Figure: Custom_pred]

  • Image prediction by custom model trained on unbalanced dataset

[Figure: pred_unb]

Although the custom model trained on the unbalanced dataset appears to perform better on accuracy, the confusion matrices and image predictions show that the custom model trained on the balanced dataset actually performs better. The custom model trained on the balanced dataset predicts all classes, whereas the custom model trained on the unbalanced dataset predicts only Cassava Mosaic Disease (CMD). In addition, the unbalanced training set had 13693 images compared with 8000 for the balanced one. If the custom model trained on the balanced dataset were trained for more epochs and on more data, it would likely perform better.

As expected, the pretrained model performed better than the custom model. Because of my limited experience, my custom model is quite simple.
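The gap between accuracy and actual usefulness on imbalanced data can be illustrated with a hypothetical classifier that always answers CMD. The sketch below uses the class counts from the EDA section as if they were the evaluation set, which is an assumption made purely for illustration; the numbers are not results from the notebook.

```python
# Class counts from the EDA section, used here as a stand-in evaluation set.
class_counts = {
    "Cassava Mosaic Disease (CMD)": 10526,
    "Healthy": 2061,
    "Cassava Green Mottle (CGM)": 1909,
    "Cassava Brown Streak Disease (CBSD)": 1751,
    "Cassava Bacterial Blight (CBB)": 870,
}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# A model that always predicts the majority class is right on every
# majority-class image and wrong on everything else.
baseline_accuracy = class_counts[majority_class] / total

# Its recall is 1.0 for the majority class and 0.0 for all other classes.
recalls = {name: 1.0 if name == majority_class else 0.0 for name in class_counts}
macro_recall = sum(recalls.values()) / len(recalls)

print(f"always-{majority_class} accuracy: {baseline_accuracy:.1%}")
print(f"macro-averaged recall: {macro_recall:.1%}")
```

Such a model would reach about 61% accuracy while recognising only one of the five classes (20% macro recall), which is why the confusion matrix and per-class metrics are more informative than accuracy alone here.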

3.3.2 Metrics Reporting

  • Precision, recall, and F1 score of EfficientNetB0

[Figure: pre, re, f1 effic]

  • Precision, recall, and F1 score of custom model trained on balanced dataset

[Figure: matb]

  • Precision, recall, and F1 score of custom model trained on unbalanced dataset

[Figure: matunb]
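For reference, per-class precision, recall, and F1 can be derived from a confusion matrix as sketched below. The function `per_class_metrics` and the small 3-class matrix are made up for illustration and are not the project's results; the notebook most likely used a library report instead.

```python
def per_class_metrics(confusion, labels):
    """Compute (precision, recall, f1) per class.

    confusion[i][j] is the number of samples whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    n = len(labels)
    metrics = {}
    for k, label in enumerate(labels):
        true_positive = confusion[k][k]
        predicted_as_k = sum(confusion[i][k] for i in range(n))  # column sum
        actually_k = sum(confusion[k])                           # row sum
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Hypothetical confusion matrix (rows: true class, columns: predicted class).
labels = ["CMD", "Healthy", "CBB"]
confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [1, 1, 3],
]

metrics = per_class_metrics(confusion, labels)
for label, (precision, recall, f1) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A class that the model never predicts gets precision, recall, and F1 of 0 here, which makes a collapse onto a single class visible even when overall accuracy looks acceptable.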

3.4 Conclusion and Future Work

Although the performance was not satisfactory, I am happy with it as my first project. I tried to get the best performance from my custom model by tuning the number of neurons, the number of layers, and the dropout rates, but for various reasons the performance was average.

However, the dataset and model performance were visualized clearly, and the performance of EfficientNetB0 was quite similar to that of the custom model trained on the unbalanced dataset. The dataset was imbalanced and my sampling technique was not up to date. I hope my future work will be more satisfactory.
