
Datasets and scripts for a journal paper titled 'Using Machine Learning to Link Climate, Phylogeny and Leaf Area in Eucalypts Through a 50-fold Expansion of Current Leaf Trait Datasets'.

KarinaGuo/Machine_Learning_in_Computer_Vision_for_Leaf_Traits

This repository includes the key scripts and data files used for the journal paper 'Using Machine Learning to Link Climate, Phylogeny and Leaf Area in Eucalypts Through a 50-fold Expansion of Current Leaf Trait Datasets', and has been archived by release on Zenodo. File paths, file names, and similar settings may need to be changed when running on your local device. Do let me know if anything is missing or isn't working.

/Code - includes code used for the methods (e.g. creating the models and determining their accuracy) and for the analysis of the resulting dataset

/Data - includes files used in code

/leaf_BigLeaf_QC - datasets used to analyse the quality control of our machine learning model, as in Supplementary Information D

/Conda Environments - includes the .yml files of the conda environments used. These list the software dependencies needed to reproduce the environments used to train, test and apply the machine learning models, and to extract traits from the binary masks of the model predictions.

Leaf segmentation model

Preparing the data

As this was integrated into a cycle of optimisation, labels of annotated herbarium images for training, validating and testing were changed when two or more labels were merged into one. These annotated images were then trimmed to the bounding box (BB) as stated in the protocol, then converted to a COCO file format.

The conda environment 'labelme' was used for this process

Updating labels. Where test_labels is a directory of the initial unchanged annotated images, test_labels_updatedlabs is the output directory, and testmap.csv is a dictionary that indicates which old labels map to which new labels. testmap.log is an unused argument and is currently an empty placeholder.

python /home/botml/code/py/updating_labels.py /data/botml/test_labels/ /data/botml/test_labels_updatedlabs/ /data/botml/leaf_dimension/EIGHT_DuplSeven_BS20_ExtTrain/testmap.csv /data/botml/NINE_DuplSeven_BS20_L100_ExtTrain/fb2_vnoUM/testmap.log

An example of testmap.csv, where the labels Leaf90 and Leaf100UM are converted to Leaf100

Leaf90 Leaf100UM
Leaf100 Leaf100
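
The remapping itself amounts to rewriting the label field of every shape in each labelme JSON file. Below is a minimal sketch, assuming the mapping file is whitespace-delimited with the old labels on the first line and their replacements on the second (as in the example above); the actual updating_labels.py may differ.

```python
import json
from pathlib import Path

def load_label_map(map_path):
    """First line: old labels; second line: the labels they become."""
    old, new = [line.split() for line in Path(map_path).read_text().splitlines()[:2]]
    return dict(zip(old, new))

def update_labels(in_dir, out_dir, label_map):
    """Rewrite the 'label' field of every shape in each labelme JSON file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for jf in Path(in_dir).glob("*.json"):
        data = json.loads(jf.read_text())
        for shape in data.get("shapes", []):
            # Labels absent from the map are passed through unchanged.
            shape["label"] = label_map.get(shape["label"], shape["label"])
        (out / jf.name).write_text(json.dumps(data))
```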

Trimming the annotated images to the bounding boxes. Where train_labels_updatedlabs is the input directory of annotated images and train_labels_trimmed is the output directory. --focalbox is the label indicating the bounding box, and --classes lists the classes to include in the output.

python /home/botml/code/py/cut_focal_box.py /data/botml/train_labels_updatedlabs/ /data/botml/train_labels_trimmed/ --focalbox BB --classes Leaf100

A portion of these train labels were then moved to validation (a random 20% of all annotated input data). The input data for training, validating and testing were then converted to a COCO file format.

Converting file formats to COCO. Where /data is the input directory. --output is the output file. --classes is the desired training label to be included. --polyORbb is whether the annotation is a polygon or a bounding box.

python /home/botml/code/py/lm2coco.py /data/botml/leaf_dimension/ELEVEN_DuplTen_ExtSheets/data/ --output /data/botml/leaf_dimension/ELEVEN_DuplTen_ExtSheets/data.json --classes 'Leaf100' --polyORbb 'poly'

python /home/botml/code/py/lm2coco.py /data/botml/leaf_dimension/ELEVEN_DuplTen_ExtSheets/validation/ --output /data/botml/leaf_dimension/ELEVEN_DuplTen_ExtSheets/validation.json --classes 'Leaf100' --polyORbb 'poly'

python /home/botml/code/py/lm2coco.py /data/botml/leaf_dimension/ELEVEN_DuplTen_ExtSheets/test/ --output /data/botml/leaf_dimension/ELEVEN_DuplTen_ExtSheets/test.json --classes 'Leaf100' --polyORbb 'poly'
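
For orientation, the core of a labelme-to-COCO conversion looks roughly like the sketch below (polygon annotations only, with area computed via the shoelace formula). This illustrates the format change; it is not the actual lm2coco.py.

```python
import json
from pathlib import Path

def shoelace_area(points):
    """Polygon area via the shoelace formula."""
    n = len(points)
    s = sum(points[i][0] * points[(i + 1) % n][1] - points[(i + 1) % n][0] * points[i][1]
            for i in range(n))
    return abs(s) / 2.0

def labelme_to_coco(in_dir, classes):
    """Collect polygon shapes with a wanted label into a COCO-style dict."""
    coco = {"images": [], "annotations": [],
            "categories": [{"id": i + 1, "name": c} for i, c in enumerate(classes)]}
    cat_id = {c: i + 1 for i, c in enumerate(classes)}
    ann_id = 1
    for img_id, jf in enumerate(sorted(Path(in_dir).glob("*.json")), start=1):
        data = json.loads(jf.read_text())
        coco["images"].append({"id": img_id, "file_name": data.get("imagePath", jf.stem),
                               "height": data.get("imageHeight"), "width": data.get("imageWidth")})
        for shape in data.get("shapes", []):
            # Skip unwanted labels and non-polygon shapes (e.g. the BB rectangle).
            if shape["label"] not in cat_id or shape.get("shape_type", "polygon") != "polygon":
                continue
            pts = shape["points"]
            xs, ys = [p[0] for p in pts], [p[1] for p in pts]
            coco["annotations"].append({
                "id": ann_id, "image_id": img_id, "category_id": cat_id[shape["label"]],
                "segmentation": [[v for p in pts for v in p]],
                "bbox": [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)],
                "area": shoelace_area(pts), "iscrowd": 0})
            ann_id += 1
    return coco
```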

At this stage, the working directory should include the following. If any are not present, please make an empty directory, or download it from the data directory in this repository if available:

  • /coco_eval
  • /data
  • /validation
  • /test
  • /pred
  • /code
  • data.json
  • validation.json
  • test.json
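
Any missing directories can be scaffolded in one go, for example:

```python
from pathlib import Path

# Create any missing working-directory entries; existing ones are left untouched.
# The three .json files are produced by the lm2coco.py calls above.
for d in ["coco_eval", "data", "validation", "test", "pred", "code"]:
    Path(d).mkdir(exist_ok=True)
```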

Training, validating, and testing the model

The conda environment 'pytorch' was used for this process

Edit the following Python scripts according to your setup. This includes changing training_path and the desired training parameters for the iteration.

Training & validating the model. The following variables are used in the script below and will likely need to be changed for your setup.

  • training_path: The path to the training data.
  • training_name: The name of the training directory.
  • validation_name: The name of the validation directory.
  • out_dir: The directory where the trained model will be saved.
  • out_yaml: The name of the YAML file that will be used to save the trained model.
  • in_yaml: The path to the YAML file that contains the model architecture.
  • in_weights: The path to the weights file that will be used to initialize the model.
  • in_yaml_zoo: A boolean value that indicates whether to use the model architecture from the Zoo.
  • in_weights_zoo: A boolean value that indicates whether to use the weights from the Zoo.
  • ims_per_batch: The number of images per batch.
  • base_lr: The base learning rate.
  • max_iter: The maximum number of iterations.
  • num_classes: The number of classes.

python train_leaf.py

The trained model will be saved in the /model directory
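
As a rough guide to how these variables map onto a configuration (assuming a Detectron2 setup, which the zoo/YAML variables and the /model/d2 directory suggest), a sketch is shown below. The model choice, paths and parameter values are illustrative only; train_leaf.py in this repository is the authoritative version.

```python
import os
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Hypothetical training_path; point this at your working directory.
training_path = "/data/botml/leaf_dimension/ELEVEN_DuplTen_ExtSheets/"
register_coco_instances("leaf_train", {}, training_path + "data.json", training_path + "data/")
register_coco_instances("leaf_val", {}, training_path + "validation.json", training_path + "validation/")

cfg = get_cfg()
# in_yaml / in_weights taken from the model zoo (in_yaml_zoo / in_weights_zoo = True).
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("leaf_train",)    # training_name
cfg.DATASETS.TEST = ("leaf_val",)       # validation_name
cfg.SOLVER.IMS_PER_BATCH = 2            # ims_per_batch
cfg.SOLVER.BASE_LR = 0.00025            # base_lr
cfg.SOLVER.MAX_ITER = 5000              # max_iter
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1     # num_classes (Leaf100 only)
cfg.OUTPUT_DIR = "./model"              # out_dir

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```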

Testing the model. The following variables are used in the script below and may need to be changed for your setup. Following this, quantitative evaluation metrics were calculated using machine_accuracy_process_v2.R.

  • img_dir: The directory where the test images are located.
  • out_dir: The directory where the predictions will be saved.
  • fext: The file extension of the test images.
  • training_path: The path to the training data.
  • training_name: The name of the training directory.
  • yaml_file: The path to the YAML file that contains the model architecture.
  • weights_file: The path to the weights file that will be used to initialize the model.
  • yaml_zoo: A boolean value that indicates whether to use the model architecture from the Zoo.
  • weights_zoo: A boolean value that indicates whether to use the weights from the Zoo.
  • num_classes: The number of classes.
  • score_thresh: The score threshold for determining whether a prediction is positive.
  • s1: The first scale factor for resizing the images.
  • s2: The second scale factor for resizing the images.
  • groundtruth_name: The name of the ground truth directory.
  • model_predictions_file: The path to the file that contains the model predictions.
  • thresh_iou: The intersection-over-union threshold for determining whether a prediction matches a ground truth instance.
  • matches_out_file: The path to the file where the matches will be saved.
  • instances_out_file: The path to the file where the instance summaries will be saved.

The following functions are used in the Python script:

  • model_tools.visualize_predictions(): This function visualizes the predictions for the test images.
  • model_tools.predict_in_directory(): This function predicts the labels for the test images.
  • model_tools.match_groundtruth_prediction(): This function matches the model predictions to the ground truth instances.
  • model_tools.summarize_predictions(): This function summarizes the model predictions.

python predict_leaf.py

Alternatively, the script below can be run on images that have not been annotated. This creates the qualitative metrics only, skipping the quantitative evaluation metrics.

python predict_leaf_vis.py
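
The quantitative matching step turns on the thresh_iou parameter. Below is a minimal illustration of intersection-over-union matching on binary masks; the greedy strategy is an assumption, and model_tools.match_groundtruth_prediction() may implement matching differently.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def match_predictions(gt_masks, pred_masks, thresh_iou=0.5):
    """Greedily match each ground-truth mask to its best unmatched prediction,
    counting a match only when IoU >= thresh_iou."""
    matches, used = [], set()
    for gi, gt in enumerate(gt_masks):
        best, best_iou = None, thresh_iou
        for pi, pred in enumerate(pred_masks):
            if pi in used:
                continue
            iou = mask_iou(gt, pred)
            if iou >= best_iou:
                best, best_iou = pi, iou
        if best is not None:
            used.add(best)
            matches.append((gi, best, best_iou))
    return matches
```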

Leaf classification model

Preparing the data

First, the leaf segmentation model is used on a dataset to create the leaf masks required for the training and validation steps of this model. To do this, the predict_leaf_vis.py script is run, which creates a numpy file containing the binary masks of each leaf per image. These binary masks are then extracted, resized, padded and recoloured, and a connected-component analysis with Otsu thresholding is applied. This is done with the script below

Extracting leaf masks from the leaf segmentation model predictions. Where the structure is as such

  • _predictions.npy: Numpy file from the output of the segmentation mask
  • /testing_images: Images used on the leaf segmentation model, to generate the numpy file above
  • /classifier_training_testdata: Output file of leaf images

python /data/botml/leaf_dimension_classifier/code/extracting_leaves_cropped_iter_v2.py "/data/botml/leaf_dimension_classifier/testing_images_pred/_predictions.npy" "/data/botml/leaf_dimension_classifier/testing_images/" "/data/botml/leaf_dimension_classifier/input_classifier_data/"
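
The crop-and-pad step for an individual mask can be pictured with the simplified numpy-only sketch below; the real extracting_leaves_cropped_iter_v2.py also resizes and recolours the crops and applies Otsu thresholding with a connected-component analysis.

```python
import numpy as np

def pad_to_square(mask, size):
    """Crop a binary mask to its bounding box, then centre it on a square canvas.
    (Each mask would come from the _predictions.npy numpy file, e.g. via
    np.load(..., allow_pickle=True); the file layout is an assumption here.)"""
    ys, xs = np.nonzero(mask)
    crop = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = crop.shape
    canvas = np.zeros((size, size), dtype=mask.dtype)
    top, left = (size - h) // 2, (size - w) // 2
    canvas[top:top + h, left:left + w] = crop
    return canvas
```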

These images are then manually separated into 'valid' and 'invalid' classes in a separate directory, as /classed_training_data/N and /classed_training_data/Y. At this stage, the working directory should include the following. If any are not present, please make an empty directory, or download it from the data directory in this repository if available:

  • /code: The directory of relevant code
  • /model/classifier: Directory where the final model is stored
  • /pred_leaf: Full images used to create the training dataset
  • /input_classifier_data: The cropped unclassed images used for training/validating, from the predictions of the leaf segmentation model
  • /classed_training_data: The cropped classed images used for training/validating, from the predictions of the leaf segmentation model
  • /classifier_training_testdata: The cropped classed images used for testing, from the predictions of the leaf segmentation model
  • classifier_results_test.csv: A .csv file that will include the predictions of the model

Training, validating, and testing the model

Edit the scripts according to your setup. This includes changing the training_path and the training parameters.

Training & validating the model. The following variables are used in the script below and will likely need to be changed for your setup.

  • data_dir: The directory where the classified training data is located.
  • val_ratio: The fraction of the training data to use for validation.
  • num_epochs: The number of epochs to train the model for.
  • model_out: The path to the file where the trained model will be saved.

python classify_leaves.py

The final model is deposited into /model/classifier
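
In outline, the classifier and the val_ratio split could look like the sketch below. The network is a small stand-in; the actual architecture in classify_leaves.py (e.g. a pretrained torchvision backbone) may differ.

```python
import torch
import torch.nn as nn
from torch.utils.data import random_split

class LeafClassifier(nn.Module):
    """Tiny two-class ('Y'/'N') CNN, purely illustrative."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def split_train_val(dataset, val_ratio=0.2):
    """Hold out a val_ratio fraction of the classed training data."""
    n_val = int(len(dataset) * val_ratio)
    return random_split(dataset, [len(dataset) - n_val, n_val])
```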

The previous data-preparation steps are repeated for the test leaf images, which are placed in /classifier_training_testdata.

Testing the model. The following variables are used in the script below and will likely need to be changed for your setup.

  • train_dir: The directory where the classified training data is located.
  • data_dir: The directory where the test data is located.
  • model_file: The path to the file where the trained model will be saved.
  • out_dir: The path to the file where the results will be saved.

python predict_from_classifierv2.py

Evaluation metrics were calculated using machine_accuracy_process_v2.R after manually editing the output of the classifier.

Extracting traits

Traits were extracted using the R script leaf_dimension_calculations.R, where each leaf mask is fed in as an argument. This was integrated into the final loop we ran over the entire herbarium dataset. Please refer to the section below for running the leaf trait extraction script.
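
For intuition, the simplest version of the per-mask calculation looks like the sketch below (pixel-count area and bounding-box length/width). leaf_dimension_calculations.R computes the published traits and may differ in detail.

```python
import numpy as np

def leaf_traits(mask, mm_per_px=1.0):
    """Approximate leaf area, length and width from one binary leaf mask.
    Bounding-box length/width is a simplification used here for illustration."""
    ys, xs = np.nonzero(mask)
    area = mask.sum() * mm_per_px ** 2
    length = (ys.max() - ys.min() + 1) * mm_per_px
    width = (xs.max() - xs.min() + 1) * mm_per_px
    return {"area": float(area), "length": float(length), "width": float(width)}
```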

Using the final models on the entire dataset

The final model outputs were then moved to the directories /model/d2 and /model/classifier, nested in the main working directory of the final run.

The final run was then executed in a working directory using the bash script running_code.sh, which calls tailored scripts found in this repository under /code/final_run. This script performs the following operations:

  1. Copies the images to a temporary directory.
  2. Predicts the dimensions of the leaves in the images.
  3. Crops the leaves from the images.
  4. Tracks duplicate leaves.
  5. Predicts the class of the leaves in the images.
  6. Removes invalid files.
  7. Executes R code to generate traits for the leaves.
  8. Deletes the temporary files.
  9. Repeats the loop until all leaves on all images are extracted.
  10. Merges the outcomes with the metadata and displays a message indicating that the script has finished executing.
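
The loop structure can be sketched as below. Script names follow this repository, but the arguments each one takes are omitted here; running_code.sh is the authoritative sequence.

```python
import shutil, subprocess, tempfile
from pathlib import Path

def batches(items, batch_size):
    """Yield fixed-size chunks; the outer loop repeats until every image is done."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def run_batch(image_paths, workdir):
    """One iteration of the loop (arguments to each script omitted; illustrative)."""
    tmp = Path(tempfile.mkdtemp(dir=workdir))
    for p in image_paths:
        shutil.copy(p, tmp)                                                      # 1. copy images
    subprocess.run(["python", "predict_leaf_vis.py"], check=True)                # 2. segment leaves
    subprocess.run(["python", "extracting_leaves_cropped_iter_v2.py"], check=True)  # 3-4. crop, track duplicates
    subprocess.run(["python", "predict_from_classifierv2.py"], check=True)       # 5-6. classify, drop invalid
    subprocess.run(["Rscript", "leaf_dimension_calculations.R"], check=True)     # 7. extract traits
    shutil.rmtree(tmp)                                                           # 8. delete temporary files
```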
