This repository contains a data analysis for the Data Science Bowl 2018 Kaggle competition. The main goal is the identification (segmentation) of nuclei in images acquired under varied conditions.
First, an exploratory data analysis was conducted to get to know the data. Results can be found here.
It contains observations on:
- Files (filename encoding, directory structure, duplicated files and data format)
- Train and test data (distribution)
- Dimensions (width and height)
- Channels visualisation
- Colour models (division into colour and black-and-white images based on channels)
- Masks (number of masks and how many pixels are labelled as nucleus)
- Outliers
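
Two of the checks listed above, the colour model split and the mask pixel count, can be sketched roughly as follows. This is only an illustration and not the code used in the analysis: the function names `is_grayscale` and `nucleus_pixel_fraction` are hypothetical, and the sketch assumes that images are already loaded as NumPy arrays and that each nucleus comes as a separate 2D mask.

```python
import numpy as np


def is_grayscale(image: np.ndarray) -> bool:
    """Return True if the image has one channel or all colour channels are equal."""
    if image.ndim == 2 or image.shape[-1] == 1:
        return True
    rgb = image[..., :3]  # ignore an alpha channel if there is one
    same_rg = np.array_equal(rgb[..., 0], rgb[..., 1])
    same_gb = np.array_equal(rgb[..., 1], rgb[..., 2])
    return same_rg and same_gb


def nucleus_pixel_fraction(masks: list) -> float:
    """Return the share of pixels covered by at least one nucleus mask."""
    combined = np.zeros(masks[0].shape, dtype=bool)
    for mask in masks:
        combined |= mask.astype(bool)  # union of the per-nucleus masks
    return float(combined.mean())


# Tiny synthetic example (made-up data, not competition images)
gray_image = np.stack([np.arange(12).reshape(3, 4)] * 3, axis=-1)
mask_a = np.zeros((4, 4), dtype=np.uint8)
mask_a[0, :2] = 255
print(is_grayscale(gray_image), nucleus_pixel_fraction([mask_a]))
```

An image stored with three channels can still be black-and-white in practice, which is why the check compares channel values instead of only counting channels.
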
I have also used adversarial validation to show that the train and test data come from the same distribution.
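
For readers unfamiliar with adversarial validation, the idea is to train a classifier to tell train samples from test samples: if its cross-validated ROC AUC stays close to 0.5, the two sets are hard to distinguish. Below is a minimal sketch using scikit-learn. It is not the code from this repository; the function name `adversarial_auc`, the choice of a random forest, and the use of per-image feature vectors (for example width, height, mean intensity) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def adversarial_auc(train_features, test_features, seed=0):
    """Cross-validated ROC AUC of a classifier separating train from test rows."""
    X = np.vstack([train_features, test_features])
    # Label 0 = row comes from the train set, 1 = row comes from the test set
    y = np.concatenate([
        np.zeros(len(train_features), dtype=int),
        np.ones(len(test_features), dtype=int),
    ])
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    # Out-of-fold probabilities, so every row is scored by a model that never saw it
    proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)


# Synthetic stand-in features (hypothetical, not the competition data)
rng = np.random.default_rng(0)
train = rng.normal(size=(300, 4))
test_same = rng.normal(size=(100, 4))
test_shifted = rng.normal(loc=3.0, size=(100, 4))
print("same distribution:", adversarial_auc(train, test_same))
print("shifted distribution:", adversarial_auc(train, test_shifted))
```

An AUC near 0.5 suggests the classifier cannot separate the two sets, while a value close to 1.0 points to a distribution shift between train and test.
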