Repo cleanup (#40)

* main readme
* idr streams
* format
* normalize data
WayScience · Mar 5, 2024 · 613acbb
1 parent b7abbdf
commit 613acbb

Showing 4 changed files with 10 additions and 10 deletions.
diff --git a/1.idr_streams/README.md b/1.idr_streams/README.md
@@ -2,12 +2,12 @@
 
 In this module, we use [idrstream](https://github.com/WayScience/IDR_stream) to extract features from the training and control mitocheck data.\
 
-`idrstream` uses various tools to download, preprocess, segment, and extract features from the frames, wells, plates (metadata) curated in [0.locate_data](../0.locate_data).
+`idrstream` uses various tools to download, preprocess, segment, extract, and compile features from the frames/wells/plates metadata curated in [0.locate_data](../0.locate_data).
 The tool used to extract features, [DeepProfiler](https://github.com/cytomining/DeepProfiler), requires the desired frame along with intermediate files to understand where cells are located and how to extract features.
 These files can reach terabytes in size for feature extraction on larger datasets.
 `idrstream` processes IDR data in batches to avoid the need for storing many intermediate files at once.
 However, the intermediate files for each batch still need to be stored locally.
-The intermediate files for the training and control datasets will be stored in `tmp/`.
+The intermediate files for the training and control dataset processing will be stored in `tmp/`.
 
 In [streams/](streams/) we initialize and run `idrstream` for the training, negative control, and positive control data.
 The `batch_size` parameter tells `idrstream` how many frames to process in one batch.
@@ -18,7 +18,7 @@ Because the impact of irregular illumination and PyBasic illumination correction
 
 ## Step 1: Set up `idrstream`
 
-`idrstream` is currently in development and needs to be installed via github.
+`idrstream` is currently in development and needs to be installed via GitHub.
 Clone `idrstream` with the following commands:
 
 ```sh

diff --git a/2.format_training_data/README.md b/2.format_training_data/README.md
@@ -3,7 +3,7 @@
 In this module, we associate features extracted from labeled frames with their Mitocheck-assigned phenotypic class label.
 After extracting the features from these labeled frames with `idrstream`, we associate the center coordinates of cells from [features.samples.txt](../mitocheck_metadata/features.samples.txt) with their `idrstream`-derived outlines to assign a phenotypic class (as assigned by Mitocheck) to cell features.
 
-**Note:** `Shape1` and `Shape3` are replaced with binuclear and polylobed respectively, as these their corresponding classes.
+**Note:** We replace `Shape1` and `Shape3` with binuclear and polylobed respectively (their corresponding classes).
 See [#16](https://github.com/WayScience/mitocheck_data/issues/16) for more details.
 
 ## Step 1: Format Training Data

diff --git a/3.normalize_data/README.md b/3.normalize_data/README.md
@@ -8,7 +8,7 @@ As shown in the `Metadata_Gene` UMAPS of [raw_data_umaps.ipynb](raw_data_umaps.i
 This demonstrates that the biological changes induced by gene perturbations have manifested in the `CellProfiler` and `DeepProfiler` features extracted with `idrstream`.
 The other UMAPs in [raw_data_umaps.ipynb](raw_data_umaps.ipynb) suggest that batch effects from plate, well, and frame are not the dominant signal in the feature data.
 
-**Note:** UMAPs were generated with 10% random subsample (without replacement) of data from positive and negative controls.
+**Note:** We generate UMAPs with a 10% random subsample (without replacement) of data from positive and negative controls.
 
 Next, we derive a normalization scaler with [sklearn.preprocessing.StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from the negative control features and apply this scaler to all mitosis movie features ([normalize_data.py](normalize_data.py)).
 [Caicedo et al, 2017](https://www.nature.com/articles/nmeth.4397) explain why the negative control features are a good normalization population for our use case:
@@ -21,7 +21,7 @@ In other words, we create one normalization scaler from all negative control fea
 
 Use the commands below to normalize all data.
 All normalized data will be saved to [normalized_data/](normalized_data/).
-Only the normalized training data has been uploaded to github as the positive and negative control datasets are very large.
+Only the normalized training data has been uploaded to GitHub as the positive and negative control datasets are very large.
 
 ```sh
 # Make sure you are located in 3.normalize_data

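The normalization step described in the `3.normalize_data` README above (fit one scaler on negative control features, then apply it to all features) can be sketched as follows. This is a minimal standard-library illustration rather than the repository's `normalize_data.py`, which uses `sklearn.preprocessing.StandardScaler`; the helper names `fit_scaler` and `apply_scaler` and the toy feature values are assumptions made for the example.

```python
from statistics import fmean, pstdev


def fit_scaler(control_rows):
    """Compute per-feature mean and population std from negative-control rows."""
    columns = list(zip(*control_rows))
    means = [fmean(col) for col in columns]
    # Like StandardScaler, use the population standard deviation and fall back
    # to 1.0 for constant features so we never divide by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_scaler(rows, means, stds):
    """Standardize rows with statistics fitted on a different population."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical toy data: two features per cell.
negative_controls = [[1.0, 10.0], [3.0, 10.0]]
training_cells = [[5.0, 12.0]]

means, stds = fit_scaler(negative_controls)
normalized_training = apply_scaler(training_cells, means, stds)
```

The point of the design is that the scaler is fitted once, on negative controls only, and then reused unchanged for training and control data, so every dataset is expressed relative to the same reference population.
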
diff --git a/README.md b/README.md
@@ -21,7 +21,7 @@ This repository is structured as follows:
 | [0.locate_data](0.locate_data/) | Locate mitosis movies | Find locations (plate, well, frame) for training and control movies |
 | [1.idr_streams](1.idr_streams/) | Extract features from mitosis movies | Use `idrstream` to extract features from training and control movies |
 | [2.format_training_data](2.format_training_data/) | Format training data | Compile metadata, phenotypic class, and feature data for Mitocheck-labeled movies |
-| [3.normalize_data](3.normalize_data/) | Normalize data | Use UMAP to suggest batch effects are not dominant signal and normalize with data with negative controls as normalization population |
+| [3.normalize_data](3.normalize_data/) | Normalize data | Use UMAP to suggest batch effects are not the dominant signal and normalize data using negative controls as the normalization population |
 | [4.analyze_data](4.analyze_data/) | Analyze data | Analyze normalized data |
 
 Other necessary folders/files:
@@ -42,15 +42,15 @@ This dataset contains the following files:
 - [features/](mitocheck_metadata/features) : Mitocheck-assigned object IDs and bounding boxes for cells from a specified frame, well, and plate.
 
 We use `trainingset.dat` to locate the frame, well, and plate of labeled cells in [0.locate_data](0.locate_data/).
-After extracting the features from these labeled frames with `idrstream`, we associate the bounding boxes of cells from `features/` with their `idrstream`-derived coordinates to assign cells their phenotypic class (as assigned by Mitocheck).
+After extracting the features from these labeled frames with `idrstream`, we associate the bounding boxes of cells from `features/` with their `idrstream`-derived coordinates to assign cells their phenotypic class (as labeled by Mitocheck).
 
 ## Control Data
 
 We extract single-cell features from positive and negative controls, which are useful for normalizing all Mitocheck data and suggesting that batch effects are not a dominant signal.
 
 We use [IDR-curated mitocheck metadata](mitocheck_metadata/idr0013-screenA-annotation.csv.gz) to locate the well and plate of each control movie.
-Because `idrstream` can only extract features from a single frame, we choose and random frame from the middle 33% of the movie.
-Mitocheck mitosis movies are 93 frames long, so a random frame between frames 31 and 62 are chosen to extract features from.
+Because `idrstream` can only extract features from a single frame, we choose a random frame from the middle 33% of the movie.
+Mitocheck mitosis movies are about 93 frames long, so a random frame between frames 31 and 62 is chosen to extract features from.
 Because we cannot exactly align the movies in time, we opt to randomly sample from the middle of the movies.
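
The frame choice described above can be sketched as follows. This is a minimal illustration, not the repository's code: the function name `sample_middle_frame`, the integer-division bounds, and the inclusive endpoints are assumptions chosen to reproduce the 31 to 62 range quoted for a 93-frame movie.

```python
import random


def sample_middle_frame(n_frames=93, rng=None):
    """Pick a random frame index from the middle third of a movie."""
    rng = rng or random.Random()
    start = n_frames // 3        # 31 for a 93-frame movie
    end = (2 * n_frames) // 3    # 62 for a 93-frame movie
    # randint is inclusive on both ends (an assumption about the intended range).
    return rng.randint(start, end)


# Seeded generator so the sketch is reproducible.
rng = random.Random(0)
frames = [sample_middle_frame(93, rng) for _ in range(1000)]
```

Passing a seeded `random.Random` instance keeps the selection reproducible across runs, which matters if the same control frames need to be located again later.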
 
 ## Dataset Types