[ENH] move definition of raw and derivatives datasets to common principles #1815

Draft: wants to merge 5 commits into base: master
27 changes: 11 additions & 16 deletions src/common-principles.md
@@ -234,15 +234,13 @@ In some cases, this principle is enforced in the BIDS validator.
## Source vs. raw vs. derived data

BIDS was originally designed to describe and apply consistent naming conventions
-to raw (unprocessed or minimally processed due to file format conversion) data.
+to [raw datasets](./glossary.md#raw-common_principles) (unprocessed or minimally processed due to file format conversion).
During analysis such data will be transformed and partial as well as final results
will be saved.
-Derivatives of the raw data (other than products of DICOM to NIfTI conversion)
-MUST be kept separate from the raw data. This way one can protect the raw data
-from accidental changes by file permissions. In addition it is easy to
-distinguish partial results from the raw data and share the latter.
-See [Storage of derived datasets](#storage-of-derived-datasets) for more on
-organizing derivatives.
+[Derivatives](./glossary.md#derivative-common_principles) of the raw data MUST be kept separate from the raw data.
+This way one can protect the raw data from accidental changes by file permissions.
+In addition it is easy to distinguish partial results from the raw data and share the latter.
+See [Storage of derived datasets](#storage-of-derived-datasets) for more on organizing derivatives.

Similar rules apply to source data, which is defined as data
before harmonization, reconstruction, and/or file format conversion
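
For tooling, the raw/derivative distinction in this hunk is usually read from metadata rather than guessed from the directory layout. The following is a minimal sketch, assuming the `DatasetType` field of `dataset_description.json` (defined in the BIDS specification, not in this diff) and its documented default of `raw`; the helper name `dataset_type` is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def dataset_type(dataset_root):
    """Return the DatasetType of the BIDS dataset at dataset_root.

    Assumption: the value lives in dataset_description.json and a
    missing field is interpreted as "raw" (the backwards-compatible
    default described in the BIDS specification).
    """
    description_file = Path(dataset_root) / "dataset_description.json"
    with description_file.open(encoding="utf-8") as f:
        description = json.load(f)
    return description.get("DatasetType", "raw")


if __name__ == "__main__":
    # Build a throwaway dataset to show both outcomes.
    with tempfile.TemporaryDirectory() as tmp:
        desc = Path(tmp) / "dataset_description.json"
        desc.write_text(json.dumps({"Name": "demo", "BIDSVersion": "1.8.0"}))
        print(dataset_type(tmp))
        desc.write_text(json.dumps({"Name": "demo", "BIDSVersion": "1.8.0",
                                    "DatasetType": "derivative"}))
        print(dataset_type(tmp))
```

A check like this does not verify that the derivatives are actually stored separately from the raw data; it only reports what the dataset declares about itself.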
@@ -340,12 +338,10 @@ field in `dataset_description.json` of each subdirectory of `derivatives` to:
Derivatives can be stored/distributed in two ways:

1. Under a `derivatives/` subdirectory in the root of the source BIDS dataset
-directory to make a clear distinction between raw data and results of data
-processing.
+directory to make a clear distinction between raw data and results of data processing.
A data processing pipeline will typically have a dedicated directory
under which it stores all of its outputs.
-Different components of a pipeline can, however, also be stored under different
-subdirectories.
+Different components of a pipeline can, however, also be stored under different subdirectories.
There are few restrictions on the directory names;
it is RECOMMENDED to use the format `<pipeline>-<variant>` in cases where
it is anticipated that the same pipeline will output more than one variant
@@ -377,11 +373,10 @@ Derivatives can be stored/distributed in two ways:
<dataset>/derivatives/spm-preproc/derivatives/spm-stats/sub-0001
```

-1. As a standalone dataset independent of the source (raw or derived) BIDS
-dataset.
-This way of specifying derivatives is particularly useful when the source
-dataset is provided with read-only access, for publishing derivatives as
-independent bodies of work, or for describing derivatives that were created
+1. As a standalone dataset independent of the source (raw or derived) BIDS dataset.
+This way of specifying derivatives is particularly useful when the source dataset
+is provided with read-only access, for publishing derivatives as independent bodies of work,
+or for describing derivatives that were created
from more than one source dataset.
The `sourcedata/` subdirectory MAY be used to include the source dataset(s)
that were used to generate the derivatives.
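
The first storage option above (a `derivatives/` subdirectory named `<pipeline>-<variant>`) can be sketched as a small path helper. This is an illustration under stated assumptions, not part of the PR: the function name `derivatives_dir` and the dataset name `ds001` are hypothetical, and POSIX-style paths are used so the result is the same on every platform.

```python
from pathlib import PurePosixPath


def derivatives_dir(dataset_root, pipeline, variant=None):
    """Build <dataset>/derivatives/<pipeline>[-<variant>].

    The <pipeline>-<variant> naming is RECOMMENDED by the text above,
    not required, so the variant part is optional here.
    """
    if not pipeline:
        raise ValueError("pipeline name must not be empty")
    name = pipeline if variant is None else f"{pipeline}-{variant}"
    return PurePosixPath(dataset_root) / "derivatives" / name


# Nested derivatives, mirroring the spm-preproc / spm-stats example in the hunk.
preproc = derivatives_dir("ds001", "spm", "preproc")
stats = derivatives_dir(preproc, "spm", "stats")
print(stats / "sub-0001")
```

Because the helper accepts any root, the same call also covers the second option: pass the root of a standalone derivative dataset instead of a path inside the source dataset.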
36 changes: 19 additions & 17 deletions src/derivatives/introduction.md
@@ -1,30 +1,32 @@
# BIDS Derivatives

-Derivatives are outputs of common processing pipelines, capturing data and
-meta-data sufficient for a researcher to understand and (critically) reuse those
-outputs in subsequent processing.
+[Derivatives datasets](../glossary.md#derivative-common_principles) are outputs of common processing pipelines,
+capturing data and meta-data sufficient for a researcher
+to understand and (critically) reuse those outputs in subsequent processing.
Standardizing derivatives is motivated by use cases where formalized
machine-readable access to processed data enables higher-level processing.

-The following sections cover additions to and divergences from "raw" BIDS.
-Raw data are data that have been curated into BIDS from a non-BIDS source.
-If a dataset is derived from at least one other valid BIDS dataset, then it is a derivative dataset.
+The following sections cover additions to and divergences from [raw BIDS datasets](../glossary.md#raw-common_principles).

-Examples:
+[Raw BIDS datasets](../glossary.md#raw-common_principles) are data that have been curated into BIDS from one or more non-BIDS sources.
+If a dataset is derived from at least one other valid BIDS dataset,
+then it is a [derivatives datasets](../glossary.md#derivative-common_principles).

-A defaced T1w image would typically be made during the curation process and is thus under raw
+!!! example

-```Text
-sourcedata/private/sub-01/anat/sub-01_T1w.nii.gz
-sub-01/anat/sub-01_T1w.nii.gz
-```
+    A defaced T1w image would typically be made during the curation process and is thus under raw

-A defaced T1w image could also, in theory, be derived from a BIDS dataset and would thus be under derivatives
+    ```Text
+    sourcedata/private/sub-01/anat/sub-01_T1w.nii.gz
+    sub-01/anat/sub-01_T1w.nii.gz
+    ```

-```Text
-sub-01/anat/sub-01_T1w.nii.gz
-derivatives/sub-01/anat/sub-01_desc-defaced_T1w.nii.gz
-```
+    A defaced T1w image could also, in theory, be derived from a BIDS dataset and would thus be under derivatives

+    ```Text
+    sub-01/anat/sub-01_T1w.nii.gz
+    derivatives/sub-01/anat/sub-01_desc-defaced_T1w.nii.gz
+    ```
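
The second example in this hunk (a defaced image stored as a derivative) amounts to a filename transformation. A rough sketch follows, assuming the `desc-<label>` entity goes immediately before the suffix and that the input name carries no `desc` entity yet; the helper names `add_desc` and `derivative_path` are hypothetical, and the output layout simply mirrors the example paths above.

```python
from pathlib import PurePosixPath


def add_desc(filename, label):
    """Insert a desc-<label> entity just before the suffix of a BIDS filename."""
    entities, sep, suffix_and_ext = filename.rpartition("_")
    if not sep:
        raise ValueError(f"no '_' separating entities from suffix in {filename!r}")
    return f"{entities}_desc-{label}_{suffix_and_ext}"


def derivative_path(raw_path, label):
    """Map a raw file path to its counterpart under derivatives/."""
    raw = PurePosixPath(raw_path)
    return PurePosixPath("derivatives") / raw.parent / add_desc(raw.name, label)


print(derivative_path("sub-01/anat/sub-01_T1w.nii.gz", "defaced"))
# derivatives/sub-01/anat/sub-01_desc-defaced_T1w.nii.gz
```

The first example needs no such renaming: when defacing happens during curation, the defaced file keeps the plain raw name and the original goes under `sourcedata/`.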

## Derivatives storage and directory structure

6 changes: 6 additions & 0 deletions src/schema/objects/common_principles.yaml
@@ -46,6 +46,9 @@ dataset:
   description: |
     A set of neuroimaging and behavioral data acquired for a purpose of a particular study.
     A dataset consists of data acquired from one or more subjects, possibly from multiple sessions.
+derivative:
+  display_name: derivative dataset
Comment on lines +49 to +50
Collaborator

IMO, derivative and derivative dataset are not identical. If we want to define the generic term "derivative", then it should apply to files and datasets. Similarly with "raw".

+  description: If a dataset is derived from at least one other valid BIDS dataset, then it is a derivative dataset.
Collaborator

Why emphasize "valid"? Isn't that already assumed, since otherwise it would not be a BIDS dataset per se?
Also, we are talking about a BIDS derivative dataset, right?
And I do not see the need for a conditional here.

Hence I would suggest:

Suggested change
-  description: If a dataset is derived from at least one other valid BIDS dataset, then it is a derivative dataset.
+  description: A BIDS dataset derived from at least one other BIDS dataset.

 deprecated:
   display_name: DEPRECATED
   description: |
@@ -97,6 +100,9 @@ modality:
     the technique is sufficiently uniform to define the modalities `eeg`, `meg` and `ieeg`.
     When applicable, the modality is indicated in the **suffix**.
     The modality may overlap with, but should not be confused with the **data type**.
+raw:
+  display_name: raw dataset
+  description: A raw BIDS dataset is data that have been curated into BIDS from a non-BIDS source.
Collaborator

Looking at the other definitions, they read as a continuation of the phrase "{display_name} is ...", so I would adjust this one to match:

Suggested change
-  description: A raw BIDS dataset is data that have been curated into BIDS from a non-BIDS source.
+  description: A BIDS dataset that have been curated into BIDS from a non-BIDS source(s), for example data from acquisition hardware.

 run:
   display_name: Run
   description: |
2 changes: 2 additions & 0 deletions src/schema/rules/common_principles.yaml
@@ -16,3 +16,5 @@
 - suffix
 - extension
 - deprecated
+- raw
+- derivative
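
The two schema files changed here have to stay in sync: a term added to the rules list is only useful if the objects file defines it. Below is a minimal consistency check, assuming that each rules entry names a key in the objects file and that every definition needs a `display_name` and a `description`. The function `check_terms` is hypothetical, and plain Python structures stand in for the parsed YAML.

```python
REQUIRED_KEYS = ("display_name", "description")


def check_terms(rule_terms, object_defs):
    """Return a list of problems; an empty list means the two files agree."""
    problems = []
    for term in rule_terms:
        definition = object_defs.get(term)
        if definition is None:
            problems.append(f"{term}: listed in rules but not defined in objects")
            continue
        for key in REQUIRED_KEYS:
            if not definition.get(key):
                problems.append(f"{term}: missing {key}")
    return problems


# Illustrative data only: these entries mirror the additions in this PR.
objects = {
    "raw": {
        "display_name": "raw dataset",
        "description": "A raw BIDS dataset is data that have been curated "
                       "into BIDS from a non-BIDS source.",
    },
    "derivative": {
        "display_name": "derivative dataset",
        "description": "If a dataset is derived from at least one other "
                       "valid BIDS dataset, then it is a derivative dataset.",
    },
}
rules = ["raw", "derivative"]

print(check_terms(rules, objects))  # prints []
```

In a real setup the two structures would come from loading `src/schema/objects/common_principles.yaml` and `src/schema/rules/common_principles.yaml` with a YAML parser.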