Repo cleanup (#40)
* main readme

* idr streams

* format

* normalize data
roshankern authored Mar 5, 2024
1 parent b7abbdf commit 613acbb
Showing 4 changed files with 10 additions and 10 deletions.
6 changes: 3 additions & 3 deletions 1.idr_streams/README.md
@@ -2,12 +2,12 @@

In this module, we use [idrstream](https://github.com/WayScience/IDR_stream) to extract features from the training and control mitocheck data.\

-`idrstream` uses various tools to download, preprocess, segment, and extract features from the frames, wells, plates (metadata) curated in [0.locate_data](../0.locate_data).
+`idrstream` uses various tools to download, preprocess, segment, extract, and compile features from the frames/wells/plates metadata curated in [0.locate_data](../0.locate_data).
The tool used to extract features, [DeepProfiler](https://github.com/cytomining/DeepProfiler), requires the desired frame along with intermediate files to understand where cells are located and how to extract features.
These files can reach TB of size for feature extraction on larger datasets.
`idrstream` processes IDR data in batches to avoid the need for storing many intermediate files at once.
However, the intermediate files for each batch still need to be stored locally.
-The intermediate files for the training and control datasets will be stored in `tmp/`.
+The intermediate files for the training and control dataset processing will be stored in `tmp/`.

In [streams/](streams/) we initialize and run `idrstream` for the training, negative control, and positive control data.
The `batch_size` parameter tells `idrstream` how many frames to process in one batch.
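
A minimal sketch of the batching idea described above, not `idrstream`'s actual implementation: frames are split into groups of at most `batch_size`, and each group gets its own scratch directory that is deleted once the batch finishes, so only one batch of intermediate files exists at a time. The names `iter_batches`, `process_in_batches`, and `process_batch` are hypothetical and used only for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def iter_batches(frames, batch_size):
    """Yield consecutive slices of at most `batch_size` frames."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(frames), batch_size):
        yield frames[start:start + batch_size]


def process_in_batches(frames, batch_size, process_batch, tmp_root="tmp"):
    """Run `process_batch(batch, batch_dir)` on each batch in turn.

    `process_batch` is a hypothetical callable that writes its intermediate
    files into `batch_dir` and returns a list of per-frame results.
    """
    Path(tmp_root).mkdir(parents=True, exist_ok=True)
    results = []
    for batch in iter_batches(frames, batch_size):
        batch_dir = Path(tempfile.mkdtemp(dir=str(tmp_root)))
        try:
            results.extend(process_batch(batch, batch_dir))
        finally:
            # Intermediate files are only needed while the batch is running.
            shutil.rmtree(batch_dir, ignore_errors=True)
    return results
```

A larger `batch_size` means fewer passes but more intermediate files on disk at once, which is the trade-off the parameter controls in this sketch.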
@@ -18,7 +18,7 @@ Because the impact of irregular illumination and PyBasic illumination correction

## Step 1: Set up `idrstream`

-`idrstream` is currently in development and needs to be installed via github.
+`idrstream` is currently in development and needs to be installed via GitHub.
Clone `idrstream` with the following commands:

```sh
2 changes: 1 addition & 1 deletion 2.format_training_data/README.md
@@ -3,7 +3,7 @@
In this module, we associate features extracted from labeled frames with their Mitocheck-assigned phenotypic class label.
After extracting the features from these labeled frames with `idrstream`, we associate the center coordinates of cells from [features.samples.txt](../mitocheck_metadata/features.samples.txt) with their `idrstream`-derived outlines to assign a phenotypic class (as assigned by Mitocheck) to cell features.

-**Note:** `Shape1` and `Shape3` are replaced with binuclear and polylobed respectively, as these their corresponding classes.
+**Note:** We replace `Shape1` and `Shape3` with binuclear and polylobed respectively (their corresponding classes).
See [#16](https://github.com/WayScience/mitocheck_data/issues/16) for more details.

## Step 1: Format Training Data
4 changes: 2 additions & 2 deletions 3.normalize_data/README.md
@@ -8,7 +8,7 @@ As shown in the `Metadata_Gene` UMAPS of [raw_data_umaps.ipynb](raw_data_umaps.i
This demonstrates that the biological changes induced by gene perturbations have manifested in the `CellProfiler` and `DeepProfiler` features extracted with `idrstream`.
The other UMAPs in [raw_data_umaps.ipynb](raw_data_umaps.ipynb) suggest that batch effects from plate, well, and frame are not the dominant signal in the feature data.

-**Note:** UMAPs were generated with 10% random subsample (without replacement) of data from positive and negative controls.
+**Note:** We generate UMAPs with a 10% random subsample (without replacement) of data from positive and negative controls.

Next, we derive a normalization scaler with [sklearn.preprocessing.StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from the negative control features and apply this scaler to all mitosis movie features ([normalize_data.py](normalize_data.py)).
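
As a rough illustration of this step (not the code in `normalize_data.py`), the sketch below fits per-feature statistics on negative control rows only and then applies them to another dataset. It is a dependency-free stand-in for `StandardScaler`'s `fit`/`transform` pair: it centers each feature on the reference mean and divides by the population standard deviation, leaving constant features unscaled. The function names and the toy feature values are assumptions made for the example.

```python
from statistics import fmean, pstdev


def fit_scaler(reference_rows):
    """Return per-feature (means, stds) computed from the reference rows only."""
    columns = list(zip(*reference_rows))
    means = [fmean(col) for col in columns]
    # Population standard deviation; a constant feature gets a scale of 1.0
    # so it is centered but never divided by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_scaler(rows, means, stds):
    """Standardize every row with statistics fitted elsewhere."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Toy data: two features per cell (values are made up for illustration).
negative_controls = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
training = [[3.0, 12.0], [7.0, 9.0]]

# Fit once on the negative controls, then reuse the same scaler everywhere.
means, stds = fit_scaler(negative_controls)
normalized_controls = apply_scaler(negative_controls, means, stds)
normalized_training = apply_scaler(training, means, stds)
```

The point of the design is that the scaler is fitted once on the negative controls and reused unchanged, so every dataset is expressed relative to the same reference population.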
[Caicedo et al, 2017](https://www.nature.com/articles/nmeth.4397) explain why the negative control features are a good normalization population for our use case:
@@ -21,7 +21,7 @@ In other words, we create one normalization scaler from all negative control fea

Use the commands below to normalize all data.
All normalized data will be saved to [normalized_data/](normalized_data/).
-Only the normalized training data has been uploaded to github as the positive and negative control datasets are very large.
+Only the normalized training data has been uploaded to GitHub as the positive and negative control datasets are very large.

```sh
# Make sure you are located in 3.normalize_data
8 changes: 4 additions & 4 deletions README.md
@@ -21,7 +21,7 @@ This repository is structured as follows:
| [0.locate_data](0.locate_data/) | Locate mitosis movies | Find locations (plate, well, frame) for training and control movies |
| [1.idr_streams](1.idr_streams/) | Extract features from mitosis movies | Use `idrstream` to extract features from training and control movies |
| [2.format_training_data](2.format_training_data/) | Format training data | Compile metadata, phenotypic class, and feature data for Mitocheck-labeled movies |
-| [3.normalize_data](3.normalize_data/) | Normalize data | Use UMAP to suggest batch effects are not dominant signal and normalize with data with negative controls as normalization population |
+| [3.normalize_data](3.normalize_data/) | Normalize data | Use UMAP to suggest batch effects are not the dominant signal and normalize data using negative controls as the normalization population |
| [4.analyze_data](4.analyze_data/) | Analyze data | Analyze normalized data |

Other necessary folders/files:
@@ -42,15 +42,15 @@ This dataset contains the following files:
- [features/](mitocheck_metadata/features) : Mitocheck-assigned object IDs and bounding boxes for cells from a specified frame, well, and plate.

We use `trainingset.dat` to locate the frame, well, and plate of labeled cells in [0.locate_data](0.locate_data/).
-After extracting the features from these labeled frames with `idrstream`, we associate the bounding boxes of cells from `features/` with their `idrstream`-derived coordinates to assign cells their phenotypic class (as assigned by Mitocheck).
+After extracting the features from these labeled frames with `idrstream`, we associate the bounding boxes of cells from `features/` with their `idrstream`-derived coordinates to assign cells their phenotypic class (as labeled by Mitocheck).

## Control Data

We extract single-cell features from positive and negative controls, which are useful for normalizing all Mitocheck data and suggesting that batch effects are not a dominant signal.

We use [IDR-curated mitocheck metadata](mitocheck_metadata/idr0013-screenA-annotation.csv.gz) to locate the well and plate of each control movie.
-Because `idrstream` can only extract features from a single frame, we choose and random frame from the middle 33% of the movie.
-Mitocheck mitosis movies are 93 frames long, so a random frame between frames 31 and 62 are chosen to extract features from.
+Because `idrstream` can only extract features from a single frame, we choose a random frame from the middle 33% of the movie.
+Mitocheck mitosis movies are about 93 frames long, so a random frame between frames 31 and 62 is chosen to extract features from.
Because we cannot exactly align the movies in time, we opt to randomly sample from the middle of the movies.
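
A minimal sketch of the frame sampling described above, assuming both ends of the range are inclusive (the text does not say). `sample_middle_frame` is a hypothetical helper, not a function from this repository; the bounds are derived from the movie length, so a 93-frame movie gives frames 31 through 62.

```python
import random


def sample_middle_frame(n_frames=93, rng=None):
    """Pick one frame uniformly at random from the middle third of a movie."""
    if rng is None:
        rng = random.Random()
    first = n_frames // 3        # 31 for a 93-frame movie
    last = (2 * n_frames) // 3   # 62 for a 93-frame movie
    # random.Random.randint includes both endpoints.
    return rng.randint(first, last)


# Passing a seeded generator makes the choice reproducible across runs.
frame = sample_middle_frame(93, rng=random.Random(0))
```

Seeding the generator is optional; it only matters if the same control frames need to be selected again later.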

## Dataset Types