[ENH] Update Nipoppy docs and add page about trackers (#164)
* update paths to scripts

* add note to clarify terms "session" and "session ID"

* attempt to fix nested list

* add page about trackers

* remove old tracking section from MRIQC page

* minor changes

* try to fix nested list rendering (again)

* add notes about `run_dicom_org.py` optional parameters

* fix/update MRIQC page sample command

* make link to digest a clickable link

* address Nikhil comments

* add links to manifest and doughnut descriptions

* add glossary

* reorder glossary and update "`session_id` vs `visit_id`"

* add recommendation that visits should be a timeline

* fix French spelling...

* add updates after speaking with Nikhil

* add mention of "subject ID" in `participant_id` entry

---------

Co-authored-by: Nikhil Bhagwat <[email protected]>
michellewang and nikhil153 authored Feb 9, 2024
1 parent 6a46e21 commit dfffd7b
Showing 16 changed files with 171 additions and 74 deletions.
Binary file modified docs/imgs/code_org.png
Binary file added docs/imgs/data_org.jpg
Binary file removed docs/imgs/data_org.png
Binary file added docs/imgs/digest.png
2 changes: 1 addition & 1 deletion docs/nipoppy/code_org.md
@@ -8,7 +8,7 @@ The Nipoppy codebase is divided into data processing `workflows` and data availa

**`workflow`**

- MRI data organization (`dicom_org` and `bids_conv`)
- MRI data organization ([`dicom_org`](./workflow/dicom_org.md) and [`bids_conv`](./workflow/bids_conv.md))
- Custom script to organize raw DICOMs (i.e. scanner output) into a flat participant-level directory.
- Convert DICOMs into BIDS using [Heudiconv](https://heudiconv.readthedocs.io/en/latest/)
- MRI data processing (`proc_pipe`)
10 changes: 8 additions & 2 deletions docs/nipoppy/configs.md
@@ -22,6 +22,10 @@ Nipoppy requires two global files for specifying local data/container paths and
- Information about tabular data (`TABULAR`)
- Version and path to the data dictionary (`data_dictionary`)
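
As a rough illustration of how these fields might nest (all key names other than `TABULAR` and `data_dictionary` are hypothetical, not the actual Nipoppy schema):

```json
{
  "DATASET_ROOT": "/path/to/my_dataset",
  "TABULAR": {
    "data_dictionary": {
      "version": "1.0",
      "path": "tabular/data_dictionary.csv"
    }
  }
}
```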

!!! Note

Nipoppy uses the term "session" to refer to a session ID string with the "ses-" prefix. For example, `ses-01` is a session, and `01` is the session ID associated with this session.

!!! Suggestion

Although not mandatory, for consistency the preferred location would be: `<DATASET_ROOT>/proc/global_configs.json`.
@@ -73,9 +77,11 @@ Nipoppy requires two global files for specifying local data/container paths and

### Participant manifest: `manifest.csv`
- This list serves as the **ground truth** for subject and visit (i.e. session) availability
- Create the `manifest.csv` in `<DATASET_ROOT>/tabular/` comprising following columns
- Create the `manifest.csv` in `<DATASET_ROOT>/tabular/` comprising the following columns:
- `participant_id`: ID assigned during recruitment (at times used interchangeably with `subject_id`)
- `visit`: label to denote participant visit for data acquisition (e.g. `"baseline"`, `"m12"`, `"m24"` or `"V01"`, `"V02"` etc.)
- `visit`: label to denote participant visit for data acquisition
- ***Note***: we recommend that visits describe a timeline if possible, for example `BL`, `M12`, `M24` (for Baseline, Month 12, and Month 24 respectively).
- Alternatively, visit labels should at least be ordinal, ideally with a `V` prefix (e.g., `V01`, `V02`)
- `session`: alternative naming for visit - typically used for imaging data to comply with [BIDS standard](https://bids-specification.readthedocs.io/en/stable/02-common-principles.html)
- `datatype`: a list of acquired imaging datatypes as defined by the [BIDS standard](https://bids-specification.readthedocs.io/en/stable/02-common-principles.html)
- New participants are appended as new rows upon recruitment (see the hypothetical example rows below)
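
For illustration only, hypothetical manifest rows might look like this (the exact formatting of the `datatype` list is an assumption, not a prescribed format):

| participant_id | visit | session | datatype |
| -------------- | ----- | ------- | -------- |
| MNI001 | BL | ses-BL | ['anat','dwi'] |
| MNI001 | M12 | ses-M12 | ['anat'] |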
2 changes: 1 addition & 1 deletion docs/nipoppy/data_org.md
@@ -21,4 +21,4 @@ Directories:
- `backups`: data backup space (tars)
- `releases`: data releases (symlinks)

![data_org](../imgs/data_org.png)
![data_org](../imgs/data_org.jpg)
52 changes: 52 additions & 0 deletions docs/nipoppy/glossary.md
@@ -0,0 +1,52 @@
## Glossary

This page lists some definitions for important/recurring terms used in the Nipoppy framework.

### `participant_id`

**Appears in**: `manifest.csv`, `doughnut.csv`

: Unique identifier for the participant (i.e., subject ID), as provided by the study.

### `datatype`

**Appears in**: `manifest.csv`

: A BIDS-compliant "data type" value (see the [BIDS specification website](https://bids-specification.readthedocs.io/en/stable/common-principles.html#definitions) for a comprehensive list). The most common data types for magnetic resonance imaging (MRI) data are `"anat"`, `"func"`, and `"dwi"`.

### `visit`

**Appears in**: `manifest.csv`

: An identifier for a data collection event, not restricted to imaging data.

See also: [`session` vs `visit`](#session-vs-visit)

### `session`

**Appears in**: `manifest.csv`, `doughnut.csv`

: A BIDS-compliant session identifier. Consists of the `"ses-"` prefix followed by the [`session_id`](#session_id).

#### [`session`](#session) vs [`visit`](#visit)

Nipoppy uses `session` for imaging data, following the convention established by BIDS. The term `visit`, on the other hand, is used to refer to any data collection event (not necessarily imaging-related). In most cases, `session` and `visit` will be identical (or `session`s will be a subset of `visit`s). However, having two descriptors becomes particularly useful when imaging and non-imaging assessments do not use the same naming conventions.
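
For example, in a hypothetical study where imaging is acquired at baseline and month 12 but the month 6 visit is questionnaire-only:

| visit | session |
| ----- | ------- |
| BL | ses-BL |
| M06 | (no imaging session) |
| M12 | ses-M12 |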

### `participant_dicom_dir`

**Appears in**: `doughnut.csv`

: The name of the directory in which the raw DICOM data (before the DICOM organization step) are found. Usually, this is the same as [`participant_id`](#participant_id), but depending on the study it could be different.

### `dicom_id`

**Appears in**: `doughnut.csv`

: The [`participant_id`](#participant_id), stripped of any non-alphanumeric characters. For studies that do not use non-alphanumeric characters in their participant IDs, this is identical to [`participant_id`](#participant_id).

### `bids_id`

**Appears in**: `doughnut.csv`

: A BIDS-compliant participant identifier. Obtained by adding the `"sub-"` prefix to the [`dicom_id`](#dicom_id), which itself is derived from the [`participant_id`](#participant_id). A participant's raw BIDS data and derived imaging data are stored in directories named after their `bids_id`.
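
The `participant_id` → `dicom_id` → `bids_id` derivation can be sketched in a few lines of Python (a minimal illustration of the rules above, not the actual Nipoppy implementation):

```python
import re

def participant_id_to_bids_id(participant_id: str) -> str:
    """Illustrate the participant_id -> dicom_id -> bids_id derivation."""
    # dicom_id: strip all non-alphanumeric characters
    dicom_id = re.sub(r"[^a-zA-Z0-9]", "", participant_id)
    # bids_id: add the BIDS "sub-" prefix
    return f"sub-{dicom_id}"

print(participant_id_to_bids_id("MNI_001"))  # prints "sub-MNI001"
```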

17 changes: 9 additions & 8 deletions docs/nipoppy/installation.md
@@ -9,20 +9,21 @@ The Nipoppy workflow comprises a Nipoppy codebase that operates on a Nipoppy dat
### Nipoppy code+env installation
1. Change directory to where you want to clone this repo, e.g.: `cd /home/<user>/projects/<my_project>/code/`
2. Create a new [venv](https://realpython.com/python-virtual-environments-a-primer/): `python3 -m venv nipoppy_env`
* Alternatively (if using [Anaconda/Miniconda](https://www.anaconda.com/)), create a `conda` environment: `conda create --name nipoppy_env python=3.9`
3. Activate your env: `source nipoppy_env/bin/activate`
* If using Anaconda/Miniconda: `conda activate nipoppy_env`
4. Clone this repo: `git clone https://github.com/neurodatascience/nipoppy.git`
5. Change directory to `nipoppy`
6. Install python dependencies: `pip install -e .`

### Nipoppy dataset directory setup

Run `tree.py` to create the Nipoppy dataset directory tree:
Run [`nipoppy/tree.py`](https://github.com/neurodatascience/nipoppy/blob/main/nipoppy/tree.py) to create the Nipoppy dataset directory tree:
```bash
python tree.py --nipoppy_root <DATASET_ROOT>
python nipoppy/tree.py --nipoppy_root <DATASET_ROOT>
```
Where
Where:

- `DATASET_ROOT`: root (starting point) of the Nipoppy structured dataset
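
For orientation, the resulting layout looks roughly like this (a sketch assembled from directories mentioned across these docs; exact contents may differ):

```
<DATASET_ROOT>/
├── bids/          # BIDS raw data
├── derivatives/   # processing pipeline outputs
├── dicom/         # organized, participant-level DICOM dirs
├── raw_dicom/     # raw scanner DICOM dumps
├── tabular/       # manifest.csv and other tabular data
├── proc/          # global configs and logs
├── backups/       # data backup space (tars)
└── releases/      # data releases (symlinks)
```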

!!! Suggestion
4 changes: 1 addition & 3 deletions docs/nipoppy/overview.md
@@ -1,6 +1,4 @@
## What is Nipoppy (formerly mr_proc)?

[*Process long and prosper*](https://en.wikipedia.org/wiki/Vulcan_salute)
## What is Nipoppy?

[Nipoppy](https://github.com/neurodatascience/nipoppy) is a lightweight framework for analyzing (neuro)imaging and clinical data. It is designed to help users do the following:

66 changes: 66 additions & 0 deletions docs/nipoppy/trackers.md
@@ -0,0 +1,66 @@
## Track data availability status

---

Trackers check the availability of files created during the dataset processing workflow (specifically the BIDS raw data and imaging pipeline derivatives) and assign an availability status (`SUCCESS`, `FAIL`, `INCOMPLETE` or `UNAVAILABLE`).

---

### Key directories and files

- `<DATASET_ROOT>/bids`
- `<DATASET_ROOT>/derivatives`
- `<DATASET_ROOT>/derivatives/bagel.csv`

### Running the tracker script

The tracker uses the [`manifest.csv`](./configs.md#participant-manifest-manifestcsv) and [`doughnut.csv`](./workflow/dicom_org.md#procedure) files to determine the participant-session pairs to check. Each available tracker has an associated configuration file (typically called `<pipeline>_tracker.py`), where lists of expected paths for files produced by the pipeline are defined.

For each participant-session pair being tracked, the tracker outputs a `"pipeline_complete"` status. Depending on the configuration for that particular pipeline, the tracker might also output phase and/or stage statuses (e.g., `"PHASE__func"`), which typically refer to sub-pipelines within the full pipeline that may or may not have been run during processing, depending on the input data and/or processing parameters.

The tracker script updates the tabular `<DATASET_ROOT>/derivatives/bagel.csv` file (see [Understanding the `bagel.csv` output](#understanding-the-bagelcsv-output) for more information).

> Sample command:
```bash
python nipoppy/trackers/run_tracker.py \
--global_config <global_config_file> \
--dash_schema nipoppy/trackers/bagel_schema.json \
--pipelines fmriprep mriqc tractoflow heudiconv
```

Notes:
- Currently available image processing pipelines are: `fmriprep`, `mriqc`, and `tractoflow`. See [Adding a tracker](#adding-a-tracker) for the steps to add a new tracker.
- Use `--pipelines heudiconv` to track BIDS data availability.
- An optional `--session_id` parameter can be specified to only track a specific session. By default, the trackers are run for all sessions.
- Other optional arguments include `--run_id` and `--acq_label`, to help generate expected file paths for BIDS Apps.

### Understanding the `bagel.csv` output

A JSON schema for the `bagel.csv` file produced by the tracker script is available [here](https://github.com/neurobagel/digest/blob/main/schemas/bagel_schema.json).

Here is an example of a `bagel.csv` file:

| bids_id | participant_id | session | has_mri_data | pipeline_name | pipeline_version | pipeline_starttime | pipeline_complete |
| ------- | -------------- | ------- | ------------ | ------------- | ---------------- | ------------------ | ----------------- |
| sub-MNI001 | MNI001 | 1 | TRUE | freesurfer | 6.0.1 | 2022-05-24 13:43 | SUCCESS |
| sub-MNI001 | MNI001 | 2 | TRUE | freesurfer | 6.0.1 | 2022-05-24 13:46 | SUCCESS |
| sub-MNI001 | MNI001 | 3 | TRUE | freesurfer | 6.0.1 | UNAVAILABLE | INCOMPLETE |

The imaging derivatives bagel has one row for each participant-session-pipeline combination. The pipeline status columns are `"pipeline_complete"` and any column whose name begins with `"PHASE__"` or `"STAGE__"`. The possible values for these columns are:
- `"SUCCESS"`: All expected pipeline output files (as configured by the pipeline tracker) are present.
- `"FAIL"`: At least one expected pipeline output is missing.
- `"INCOMPLETE"`: Pipeline has not been run for the subject session (output directory missing).
- `"UNAVAILABLE"`: Relevant MRI modality for pipeline not available for subject session (determined by the `datatype` column in the dataset's manifest file).

### Adding a tracker

1. Create a new file in `nipoppy/trackers` called `<new_pipeline>_tracker.py`.
2. Define a config dictionary `tracker_configs`, with a mandatory key `"pipeline_complete"` whose value is a function that takes as input the path to the subject result directory, as well as the session and run IDs, and outputs one of `"SUCCESS"`, `"FAIL"`, `"INCOMPLETE"`, or `"UNAVAILABLE"`. See the built-in [fMRIPrep tracker](https://github.com/neurodatascience/nipoppy/blob/main/nipoppy/trackers/fmriprep_tracker.py) for an example.
3. Optionally add additional stages and phases to track. Again, refer to the [fMRIPrep tracker](https://github.com/neurodatascience/nipoppy/blob/main/nipoppy/trackers/fmriprep_tracker.py) or to any other pre-defined tracker configuration for an example; a minimal sketch is also given after this list.
4. Modify `nipoppy/trackers/run_tracker.py` to add the new tracker as an option.
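
A minimal sketch of such a tracker file (the expected output paths and helper names are hypothetical; see the fMRIPrep tracker for the real pattern):

```python
from pathlib import Path

# Status values expected by the tracker runner.
SUCCESS, FAIL, INCOMPLETE = "SUCCESS", "FAIL", "INCOMPLETE"

def check_pipeline_complete(subject_dir, session_id, run_id):
    """Check expected outputs for one participant-session pair."""
    subject_dir = Path(subject_dir)
    if not subject_dir.exists():
        # Pipeline was never run for this participant-session.
        return INCOMPLETE
    # Hypothetical expected outputs for the new pipeline.
    expected = [
        subject_dir / f"ses-{session_id}" / "anat" / "output.nii.gz",
        subject_dir / f"ses-{session_id}" / "report.html",
    ]
    return SUCCESS if all(p.exists() for p in expected) else FAIL

tracker_configs = {
    "pipeline_complete": check_pipeline_complete,
    # Optional phase/stage checks would be added here, e.g.:
    # "PHASE__func": check_func_outputs,
}
```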

### Visualizing availability status with the Neurobagel [`digest`](https://digest.neurobagel.org/)

The `bagel.csv` file written by the tracker can be uploaded to [https://digest.neurobagel.org/](https://digest.neurobagel.org/) (as an "imaging CSV file") for interactive visualizations of processing status.

![digest](../imgs/digest.png)
10 changes: 5 additions & 5 deletions docs/nipoppy/workflow/bids_conv.md
@@ -17,17 +17,17 @@ Convert DICOMs to BIDS using [Heudiconv](https://heudiconv.readthedocs.io/en/lat
### Procedure

1. Ensure you have the appropriate HeuDiConv container listed in your `global_configs.json`
2. Use [run_bids_conv.py](https://github.com/neurodatascience/nipoppy/blob/main/nipoppy/workflow/bids_conv/run_bids_conv.py) to run HeuDiConv `stage_1` and `stage_2`.
2. Use [nipoppy/workflow/bids_conv/run_bids_conv.py](https://github.com/neurodatascience/nipoppy/blob/main/nipoppy/workflow/bids_conv/run_bids_conv.py) to run HeuDiConv `stage_1` and `stage_2`.
- Run `stage_1` to generate a list of available protocols from the DICOM header. These protocols are listed in `<DATASET_ROOT>/bids/.heudiconv/<participant_id>/info/dicominfo_ses-<session_id>.tsv`

> Sample cmd:
```bash
python run_bids_conv.py \
python nipoppy/workflow/bids_conv/run_bids_conv.py \
--global_config <global_config_file> \
--session_id <session_id> \
--stage 1
```

!!! note

If participants have multiple sessions (or visits), these need to be converted separately and combined post-hoc to avoid Heudiconv errors.
@@ -43,7 +43,7 @@

> Sample cmd:
```bash
python run_bids_conv.py \
python nipoppy/workflow/bids_conv/run_bids_conv.py \
--global_config <global_config_file> \
--session_id <session_id> \
--stage 2
```

@@ -52,4 +52,4 @@

!!! note

Once `heuristic.py` is finalized, only `stage_2` needs to be run peridodically unless new scan protocol is added.
Once `heuristic.py` is finalized, only `stage_2` needs to be run periodically, unless a new scan protocol is added.
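
For reference, HeuDiConv heuristics generally follow this shape (a minimal sketch; the protocol-name match and output template are hypothetical):

```python
# Minimal illustrative HeuDiConv heuristic (heuristic.py).
def create_key(template, outtype=("nii.gz",), annotation_classes=None):
    if not template:
        raise ValueError("Template must be a valid format string")
    return template, outtype, annotation_classes

def infotodict(seqinfo):
    """Map DICOM series to BIDS output paths."""
    t1w = create_key("sub-{subject}/{session}/anat/sub-{subject}_{session}_T1w")
    info = {t1w: []}
    for s in seqinfo:
        # Match on the protocol name from the DICOM header.
        if "T1" in s.protocol_name:
            info[t1w].append(s.series_id)
    return info
```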
14 changes: 7 additions & 7 deletions docs/nipoppy/workflow/dicom_org.md
@@ -2,7 +2,7 @@

---

This is a dataset specific process and needs to be customized based on local scanner DICOM dumps and file naming. This organization should produce, for a given session, participant specific dicom dirs. Each of these participant-dir contains a flat list of dicoms for the participant for all available imaging modalities and scan protocols. The manifest is used to determine which new subject-session pairs need to be processed, and a `doughnut.csv` file is used to track the status for the DICOM reorganization and BIDS conversion steps.
This is a dataset-specific process and needs to be customized based on local scanner DICOM dumps and file naming. This organization should produce, for a given session, participant-specific DICOM directories. Each of these participant directories contains a flat list of DICOMs for the participant, across all available imaging modalities and scan protocols. The manifest is used to determine which new subject-session pairs need to be processed, and a `doughnut.csv` file is used to track the status of the DICOM reorganization and BIDS conversion steps.

---
### Key directories and files
@@ -15,7 +15,7 @@ This is a dataset specific process and needs to be customized based on local sca

### Procedure

1. Run [`workflow/make_doughnut.py`](https://github.com/neurodatascience/nipoppy/blob/main/nipoppy/workflow/make_doughnut.py) to update `doughnut.csv` based on the manifest. It will add new rows for any subject-session pair not already in the file.
1. Run [`nipoppy/workflow/make_doughnut.py`](https://github.com/neurodatascience/nipoppy/blob/main/nipoppy/workflow/make_doughnut.py) to update `doughnut.csv` based on the manifest. It will add new rows for any subject-session pair not already in the file.
- To create the `doughnut.csv` for the first time, use the `--empty` argument. If processing has been done without updating `doughnut.csv`, use `--regenerate` to update it based on new files in the dataset.

!!! note
@@ -42,18 +42,18 @@

!!! note

It is **okay** for the participant directory to have messy internal subdir tree with DICOMs from multiple modalities. (See [data org schematic](data_org.md) for details). The run script will search and validate all available DICOM files automatically.
It is **okay** for the participant directory to have a messy internal subdirectory tree with DICOMs from multiple modalities (see the [data org schematic](../../imgs/data_org.jpg) for details). The run script will search and validate all available DICOM files automatically.


4. Run [`run_dicom_org.py`](https://github.com/neurodatascience/nipoppy/blob/main/nipoppy/workflow/dicom_org/run_dicom_org.py) to:
4. Run [`nipoppy/workflow/dicom_org/run_dicom_org.py`](https://github.com/neurodatascience/nipoppy/blob/main/nipoppy/workflow/dicom_org/run_dicom_org.py) to:
- Search: Find all the DICOMs inside the participant directory.
- Validate: Excludes certain individual dicom files that are invalid or contain scanner-derived data not compatible with BIDS conversion.
- Symlink (default) or copy: Creates symlinks from `raw_dicom/` to the `<DATASET_ROOT>/dicom`, where all participant specific dicoms are in a flat list. The symlinks are relative so that they are preserved in containers.
- Validate: Exclude certain individual DICOM files that are invalid or contain scanner-derived data not compatible with BIDS conversion. Enabled by default; disable by passing `--skip_dcm_check`.
- Symlink (default) or copy: Create symlinks from `raw_dicom/` to `<DATASET_ROOT>/dicom`, where all participant-specific DICOMs are in a flat list. The symlinks are relative so that they are preserved in containers (see the sketch after the sample command). Disable by passing `--no_symlink`.
- Update status: if successful, set the `organized` column to `True` in `doughnut.csv`.

> Sample cmd:
```bash
python run_dicom_org.py \
python nipoppy/workflow/dicom_org/run_dicom_org.py \
--global_config <global_config_file> \
--session_id <session_id>
```
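
The relative-symlink behaviour mentioned above can be illustrated with a short sketch (hypothetical paths; not the actual script internals):

```python
import os

# Hypothetical source file and flat destination for one participant.
src = "raw_dicom/MNI001/scan_A/file001.dcm"
dst = "dicom/ses-01/MNI001/file001.dcm"

os.makedirs(os.path.dirname(dst), exist_ok=True)
# A relative link target keeps the symlink valid when the dataset root
# is bind-mounted at a different absolute path inside a container.
target = os.path.relpath(src, start=os.path.dirname(dst))
os.symlink(target, dst)
```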

