In October 2017, the National Institutes of Health open-sourced 112,000+ chest x-ray images. Now known as ChestXray14, this dataset was released to help clinicians make better diagnostic decisions for patients with various lung diseases.
- Objective
- Dataset
- Exploratory Data Analysis
- Pipeline
- Preprocessing
- Model (Structured Data)
- Model (Convolutional Neural Network)
- Explanations
- References
- Train a convolutional neural network to detect and classify diagnoses of patients.
- Couple structured and unstructured datasets together into a dual classifier.
The ChestXray14 dataset consists of both images and structured data.
The image dataset consists of 112,000+ images from 30,000 patients. Some patients have multiple scans, which is taken into consideration. All images are originally 1024 x 1024 pixels.
Due to data sourcing & corruption issues, my image dataset consists of 10,000 of the original 112,000 images. All data is used for the structured model.
Structured data is also provided for each image. It includes features such as patient age, number of follow-up visits, AP vs. PA scan, and patient gender.
The original labels contain 709 unique categories. On further examination, the labels are hierarchical: some contain a single finding, such as "Emphysema", while others combine several, such as "Emphysema | Cardiac Issues".
The average age is 58 years. However, about 400 patients have their age recorded in months, and one has it recorded in days.
Two pipelines were created, one for each dataset. Each script is labeled as either "Structured" or "CNN" to indicate which pipeline it belongs to.
Description | Script | Model |
---|---|---|
EDA | eda.py | Structured |
Resize Images | resize_images.py | CNN |
Reconcile Labels | reconcile_labels.py | CNN |
Convert Images to Arrays | image_to_array.py | CNN |
CNN Model | cnn.py | CNN |
Structured Data Model | model.py | Structured |
First, the labels were changed to reflect single categories rather than the hierarchical labels in the original dataset. This reduces the number of categories from 709 to 15. The label reduction takes its cue from the Stanford data scientists, who reduced the labels in the same way.
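The idea behind the label reduction can be sketched in a few lines of plain Python. This is only an illustration, not the contents of `reconcile_labels.py`: the helper name `split_findings`, the sample labels, and the assumption that findings are separated by a `|` character are all hypothetical.

```python
from collections import Counter


def split_findings(label):
    """Split a combined label such as 'Emphysema | Cardiac Issues' into single findings."""
    return [part.strip() for part in label.split("|") if part.strip()]


# Hypothetical sample of raw labels (three distinct combined labels)
raw_labels = [
    "Emphysema",
    "Emphysema | Cardiac Issues",
    "Cardiac Issues | Emphysema",
]

# Unique single findings across all combined labels
single_findings = sorted({f for label in raw_labels for f in split_findings(label)})

# How often each single finding appears
finding_counts = Counter(f for label in raw_labels for f in split_findings(label))

print(len(set(raw_labels)), "combined labels ->", len(single_findings), "single findings")
```

Applied to the full label column, the same split is what collapses hundreds of label combinations into a short list of individual findings.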
Irrelevant columns were also removed. These columns either had zero variance, or provided minimal information on the patient diagnosis.
Finally, patients whose age was given in months (M) or days (D) were removed. The amount of data removed is minimal and does not affect the analysis.
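A minimal sketch of this age filter is shown below. It assumes ages are stored as strings with a unit suffix, such as `058Y`, `010M`, or `003D`; that exact format, the field names, and the helper `age_in_years` are assumptions made for illustration.

```python
import re

# Assumed format: digits followed by a unit letter (Y = years, M = months, D = days)
AGE_PATTERN = re.compile(r"^(\d+)([YMD])$")


def age_in_years(raw_age):
    """Return the age as an int if it is given in years, otherwise None."""
    match = AGE_PATTERN.match(raw_age.strip().upper())
    if match is None or match.group(2) != "Y":
        return None
    return int(match.group(1))


# Hypothetical rows from the structured data
rows = [
    {"patient_id": 1, "age": "058Y"},
    {"patient_id": 2, "age": "010M"},
    {"patient_id": 3, "age": "003D"},
    {"patient_id": 4, "age": "071Y"},
]

# Keep only rows whose age is expressed in years, storing it as an integer
kept_rows = []
for row in rows:
    years = age_in_years(row["age"])
    if years is not None:
        kept_rows.append({**row, "age": years})

print(kept_rows)
```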
The structured data model is a gradient boosted classifier. A random forest classifier was also tried, and the two produced nearly equal results. The GBM classifier was chosen because it trained faster than the random forest and produced equal or better results.
Measurement | Score |
---|---|
Model | H2O Gradient Boosting Estimator |
Log Loss | 1.670 |
MSE | 0.510 |
RMSE | 0.714 |
R^2 | 0.967 |
Mean Per-Class Error | 0.933 |
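Two of the figures in the table above can be sanity-checked with simple arithmetic. RMSE is the square root of MSE, and a mean per-class error of 0.933 matches 14/15 to three decimals, which is the score a classifier would get on 15 classes if it always predicted the same class. The sketch below only demonstrates that arithmetic on hypothetical toy data; whether the actual model collapsed to a single class is not something the reported numbers confirm.

```python
import math


def mean_per_class_error(y_true, y_pred):
    """Average, over the classes in y_true, of the fraction of samples misclassified."""
    errors = []
    for cls in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == cls]
        wrong = sum(1 for i in indices if y_pred[i] != cls)
        errors.append(wrong / len(indices))
    return sum(errors) / len(errors)


# Hypothetical toy data: 15 classes with 4 samples each,
# and a model that always predicts class 0
y_true = [cls for cls in range(15) for _ in range(4)]
y_pred = [0] * len(y_true)

constant_error = mean_per_class_error(y_true, y_pred)  # 14 of 15 classes are always wrong
rmse_from_mse = math.sqrt(0.510)                        # MSE taken from the table

print(round(constant_error, 3), round(rmse_from_mse, 3))
```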
The CNN was trained using Keras, with the TensorFlow backend.
The model is similar to the VGG architectures; 2 to 3 convolution layers are used in each set of layers, followed by a pooling layer.
Dropout is used in the fully connected layers only, which slightly improved the results.
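The layer pattern described above can be sketched in Keras as follows. This is not the model in `cnn.py`: the input size, filter counts, dense layer width, dropout rate, and the function name `build_vgg_style_cnn` are all assumptions chosen to illustrate the pattern of 2 to 3 convolutions per block, a pooling layer after each block, and dropout only in the fully connected part.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_vgg_style_cnn(input_shape=(128, 128, 1), num_classes=15):
    """Small VGG-style network: stacked 3x3 convolutions, then pooling, per block."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        # Block 1: two convolutions, then pooling
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        # Block 2: two convolutions, then pooling
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        # Block 3: three convolutions, then pooling
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        # Fully connected head: dropout is applied here only
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


model = build_vgg_style_cnn()
model.summary()
```

Keeping dropout out of the convolutional blocks leaves the feature extractor unregularized, so the regularization acts only on the dense layers, where most of the parameters sit in a network of this shape.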
Measurement | Score |
---|---|
Accuracy | 0.5456 |
Precision | 0.306 |
Recall | 0.553 |
F1 | 0.394 |
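As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall; the function name `f1_from_precision_recall` is just an illustrative helper.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the CNN
f1 = f1_from_precision_recall(0.306, 0.553)
print(round(f1, 3))  # 0.394, matching the reported F1
```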
Per the blog post from Luke Oakden-Rayner, there are multiple problems with this dataset. The most notable is that the images (and structured data) are labeled incorrectly. He also notes that the annotators did not look at the images.
This became evident when training both models. Despite regularization and correction of the class imbalances, both models learned to return meaningless predictions. Given the problems described above, this can be attributed to the incorrect labeling of the images.
Based on these findings and my own analysis, I agree with Mr. Oakden-Rayner: "I believe the ChestXray14 dataset, as it exists now, is not fit for training medical AI systems to do diagnostic work."
This doesn't discount convolutional neural networks from being able to predict diseases, but that ability depends on the labels being correct and accurate. Once the images are correctly labeled, further analysis against the ChestXray14 dataset can resume.
Algorithm better at diagnosing pneumonia than radiologists