Add sample entity and samples.tsv file #779
Comments
thanks @mariehbourget for this! we clearly need something of this type for BEP032 but haven't looked at it closely yet... we will take care of this in the next few weeks with @JuliaSprenger and others...
How does pathology/diagnosis overlap with phenotype? |
regarding pathology, this should be annotated at the sample level, in that one may have a tumor sample in one case versus a non-tumor location in the same participant. thus pathology goes with sample rather than participant. indeed for human and potentially other species there should be a Dx column for diagnosis. note that this could also vary by date and could therefore be in the sessions.tsv rather than participants.tsv.
more generally there should be a conversation about inheritance of properties from participants to sample, when all samples share those properties. also the same samples could be used in multiple sessions, so a mechanism to consolidate that would be necessary, and samples are similar to participants in that sense. in many cases samples, rather than participants, are the primary entity in studies. and in keeping with bids it was discussed in the subgroups that a dataset should connect a sample to a participant even if the participant details are unknown.
btw, samples could also be used in human MR scans (e.g., left/right hemi ex vivo, brainstem, etc.), and hence samples should be considered as a generic concept in bids, rather than specialized just for microscopy/ephys.
@dorahermes - one example use case could be something like a diagnosis column that says Major Depressive Disorder (or ICD10 code), but the diagnosis itself could have been attached to a phenotype file(s) (e.g., KSADS, HAMD, etc.) or simply a clinical evaluation which may not have a phenotypic assessment in many cases.
The intention is not to require "unique"
So the "unique" identifier is the combination of |
this sounds ok to me! we should just give an example that is a bit more telling than just "sample-1", "sample-2" to be immediately understandable just by looking at it... question: what do experimenters use as user-friendly ids for their samples?
we had discussion in our last BEP32 meeting about the possibility of adding several entities ('sample', but also 'slice' and 'tissue')... I don't want to deviate the goal of this thread, but maybe we should have this discussion globally here? I mean, asking ourselves how many entities should be added and which ones? or whether adding just the 'sample' entity and dealing with everything else through the 'sample_type' can cover all the targeted usecases? with this latter solution, indeed, the quoted question (i.e "how do we encode the fact that a slice is derived from a block of tissue") should be addressed! |
small detail: although it is just an example / suggestion, the current specification mentions "group" as one of the column in |
If sample labels can be reused across subjects, I think we can do the following:
If the sample labels are the same across subjects, a global
I think @satra's suggestion here is good, and that making
Yes, I think diagnosis as a session-level variable makes sense. As an aside, I don't think we have a principle that says how to do session-level variables for single-session studies that omit the |
We had similar discussions in BEP031 for other additional entities. The way we handled this so far is based on what entities are needed to distinguish between 2 different files of the same subject. For example, metadata like “sample_type” (primary cell, tissue, etc) is a unique attribute of the sample itself and would not change for the same subject_sample. In those cases, we think the information would be best encoded in metadata and not in the filename.
I would suggest adding a
I’m not sure I understand you on this.
@mariehbourget in the SPARC Dataset Structure we also include a "derived_from" (i.e. wasDerivedFromSample) in the samples metadata file: https://docs.google.com/presentation/d/1EQPn1FmANpPsFt3CguU-JOQVMMlJsNXluQAK_gb2qVg/edit#slide=id.p9 |
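To make the "derived_from" idea concrete, here is a minimal Python sketch. It assumes a hypothetical derived_from column in samples.tsv that holds the sample_id of the parent sample and is empty when the sample comes directly from the participant. The column name follows the SPARC example above; the rows and the helper names (load_parents, derivation_chain) are made up for illustration and are not part of BIDS or the SDS.

```python
import csv
import io

# Hypothetical samples.tsv content; "derived_from" is an assumed column name.
SAMPLES_TSV = (
    "sample_id\tparticipant_id\tsample_type\tderived_from\n"
    "sample-block1\tsub-01\ttissue\t\n"
    "sample-slice1\tsub-01\ttissue\tsample-block1\n"
    "sample-slice2\tsub-01\ttissue\tsample-block1\n"
)


def load_parents(tsv_text):
    """Map each sample_id to its parent sample_id (None if it has no parent)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["sample_id"]: (row["derived_from"] or None) for row in reader}


def derivation_chain(sample_id, parents):
    """Return [sample, parent, grandparent, ...], guarding against cycles."""
    chain = []
    seen = set()
    current = sample_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in derived_from at {current}")
        if current not in parents:
            raise KeyError(f"unknown sample {current}")
        seen.add(current)
        chain.append(current)
        current = parents[current]
    return chain


parents = load_parents(SAMPLES_TSV)
print(derivation_chain("sample-slice1", parents))
```

Note that keying on sample_id alone only works if sample labels are unique across the dataset; if labels can be reused across subjects, the key would have to be the (participant_id, sample_id) pair.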
If the metadata for |
I think it'd be great to hear from @tgbugs here... if we manage to handle all this consistently across BEP31, BEP32 and SPARC, that'd be fantastic to facilitate future inter-operability... (as was just said in the BEP31 meeting ;) ) |
From discussions at the meeting, I think the global |
Here is my write-up with an overview of the problem space, a potential model, and a review of the trade-offs that I see for BIDS based on my experience implementing and maintaining the SDS and its validation pipelines. I'm also dropping this in INCF/neuroscience-data-structure#9. https://github.com/SciCrunch/sparc-curation/blob/master/docs/participants.org If you have targeted questions or comments you can leave them on this commit. |
@effigies your concerns about forcing the reconstruction of the global table are well founded and I discuss the trade-offs in detail. |
Thank you everyone for your comments, suggestions and feedback! @tgbugs thank you very much for your insightful comment in https://github.com/SciCrunch/sparc-curation/blob/master/docs/participants.org. I am responding here so that the discussion stays centralized within a single issue thread (otherwise it is difficult to keep a clear track history of the discussion). A few considerations following previous discussions and comments:
In short, we suggest that:
Hi @jcohenadad, thanks for taking a look. Here are my thoughts.
With regard to the suggestions.
Also re: INCF/neuroscience-data-structure#9 |
Hi @tgbugs, thank you for the clarifications. This discussion is touching on some of the core decisions made by the BIDS community. It would be great if some of the BIDS maintainers/steering could chip in as well @effigies @robertoostenveld.
We were advised by the BIDS steering group (@robertoostenveld) to not extend the definition of the We’ve tried different configurations in the early development of the microscopy BEP and we agree that adding many different entities to describe different use cases adds undue complexity to the model. Therefore, we proposed the The advantages are that it retains the definition of
As far as we know, the current BIDS specification does not cover explicitly “collective” participants, hence the suggestion to name the
We understand your concerns in cases where the From an experimental point of view, it also makes sense for people to name their samples the way they want for the same subject without having to take into account
This was addressed earlier in the thread where we suggested to add a
As mentioned earlier, the addition of the |
Hi @jcohenadad, I'll leave some thoughts while waiting for responses from others. Since the discussion has strayed into cross BEP and core BIDS territory, this is understandable.
Absolutely. However, I wonder if that suggestion was made in a context where identifier type and conceptual type were conflated. Retaining the definition of subject while extending the scope of
However, from a data sharing point of view, they probably should take that into account. There are countless Relevant to a later point, the generalization of this reasoning is that
But according to the proposal this is in fact already required for samples derived from other samples.
There are many cases where samples and not subjects are shipped from one lab to another and then from the shipped samples further samples are derived. That is to say, there are labs for which someone else's sample is their subject. If we were to apply the logic articulated above for subjects, the experimentalists should likewise not have to care about the fact they derived one sample from another, so long as they keep track of which sample they derived it from and thus that Requiring different practices for identifier generation due to an arbitrary distinction between subject and sample (is a cadaver a sample?) seems like a design flaw. The restriction that only sample ids must be unique and enforcing that only on derived samples but not on samples derived directly from subjects
there are presently 4 explicit generic levels over which the acquisition of "data" can be iterated. I won't summarize the definitions here, but they can be found on https://bids-specification.readthedocs.io/en/stable/02-common-principles.html
There are also multiple domain specific levels over which the acquisition of "data" can be iterated. For example over multiple voxels in fMRI, or multiple channels in EEG, or multiple timepoints (in either type of data). For MRI there can also be multiple echoes, or multiple contrast enhancing agents, or tracers. The idea from @tgbugs to "extend the sub- identifier type to be used to name anything in a BIDS dataset that has data about it" leads to the question: why would you not extend the meaning of session, or scan, or run instead? Or should one be allowed to do
it might be worth splitting this issue into two (or three)
The last two relate also to "stimuli BEP" wannabe issue (see e.g. #751 (comment) I also generalize "similarly") and IMHO orthogonal issues to the first one ("samples" entity) and interrelated within since with reordering you would get top level ".tsv/.json". As for the last one -- we could gain ".tsv/.json" even without any reordering: at large we already have it somehow implied by the inheritance principle and hence could have
@robertoostenveld thank you very much. I think that BIDS 2.X is probably the right venue for my suggestions, given the constraints on 1.X. In that context I only have one suggestion for this thread, which is to require that sample identifiers be unique per dataset not per subject.
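The difference between the two uniqueness rules (per subject vs. per dataset) can be sketched in Python. This is only an illustration with made-up rows; the function names are hypothetical and not part of any BIDS tool.

```python
from collections import Counter

# Hypothetical (participant_id, sample_id) rows from a samples.tsv file.
ROWS = [
    ("sub-01", "sample-A"),
    ("sub-01", "sample-B"),
    ("sub-02", "sample-A"),  # label reused under a different subject
]


def duplicates_per_subject(rows):
    """(participant, sample) pairs that occur more than once."""
    counts = Counter(rows)
    return sorted(pair for pair, n in counts.items() if n > 1)


def duplicates_per_dataset(rows):
    """Sample labels that occur more than once anywhere in the dataset."""
    counts = Counter(sample for _, sample in rows)
    return sorted(label for label, n in counts.items() if n > 1)


print(duplicates_per_subject(ROWS))  # no pair is repeated
print(duplicates_per_dataset(ROWS))  # sample-A appears under two subjects
```

With these rows the per-subject rule is satisfied while the per-dataset rule is not, which is exactly the case the two positions in this thread disagree about.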
The only reason would be if there was a required metadata structure that was associated with some experimental process that could not be captured at one of those levels, or if there were more levels that were required. Otherwise the only reason would be because someone doesn't like the naming of the three levels. In SPARC we have called the abstraction of those three into a single term For the most part these don't need to be extended because they are distinct only in how they are named and in that they support 3 levels of repeated structure. There might be some experimental designs that need slightly more expressivity, or that might need/want to associate slightly different metadata with a particular repeated process, in which case the abstracted solution might help. @yarikoptic I think the 3 can be broken up as you suggest, with a note that there is an interaction between
I'm also in favor of addressing these issues step by step, this would fit the needs of development of BEP32 (which are strongly overlapping with the ones of BEP31, if not strictly identical)! and the first step (addition of the sample entity) will already allow us to move forward! what's the next step? |
I would think a PR for "addition of _sample- entity" with the opening of this PR. Point to this issue for further info on discussion etc. I would also file a separate issue (or better even a PR) suggesting additional (RECOMMEND) columns to participants.tsv/.json . |
Thank you everyone for your feedback! |
Hi everyone! The first PR (#812) for the addition of the sample entity is now open. |
Closing this since #816 is now merged |
Context and motivation
Hi BIDS community!
As part of the development of the Microscopy BEP (BEP031), we want to add a new sample entity to BIDS. This sample entity was introduced in order to distinguish different tissue samples from the same subject.
The sample entity may also be used by the Animal Ephys BEP (BEP032 @SylvainTakerkart) and could benefit other modalities as well.
This issue aims to start a discussion about the details of the sample entity between the 2 BEP groups and with the BIDS community. It will also facilitate the breaking down of BEPs in smaller modules by adding the sample entity as a separate PR.
Definition of the sample entity
To ensure compatibility with other BIDS modalities, the subject entity should correspond to the participant (e.g. a human, a mouse, etc). To identify multiple tissue samples from the same subject, we define the sample entity in BEP031 as:
It is positioned after the optional session entity in the filename:
samples.tsv file
In BEP031, a samples.tsv file was added at the root of the dataset along with participants.tsv.
The samples.tsv file would have 2 required columns:
sample_id: corresponding to sample-<label> of the filename
participant_id: corresponding to sub-<label> of the filename
Another column sample_type was also suggested as required:
sample_type: kind of sample from ENCODE BiosampleType
We should also discuss if (and how) we want to encode an additional identifier when a sample is derived from another sample (e.g., a slice is derived from a block of tissue).
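As a rough illustration of the columns proposed above, here is a small Python sketch that checks a samples.tsv file for the three columns (sample_id, participant_id, sample_type) and for the sample-<label> and sub-<label> prefixes. The file content and the function name check_samples_tsv are hypothetical; this is not a BIDS validator, just one way the rule could be checked.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample_id", "participant_id", "sample_type")

# Hypothetical file content, for illustration only.
EXAMPLE_TSV = (
    "sample_id\tparticipant_id\tsample_type\n"
    "sample-01\tsub-01\ttissue\n"
    "sample-02\tsub-01\ttissue\n"
)


def check_samples_tsv(tsv_text):
    """Return a list of human-readable problems found in samples.tsv text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in header]
    if problems:
        return problems
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        if not (row.get("sample_id") or "").startswith("sample-"):
            problems.append(f"line {line_no}: sample_id must look like sample-<label>")
        if not (row.get("participant_id") or "").startswith("sub-"):
            problems.append(f"line {line_no}: participant_id must look like sub-<label>")
        if not (row.get("sample_type") or ""):
            problems.append(f"line {line_no}: sample_type is empty")
    return problems


print(check_samples_tsv(EXAMPLE_TSV))
```

Checking sample_type against the actual ENCODE BiosampleType vocabulary is left out on purpose, since the allowed values are still under discussion here.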
participants.tsv file
As part of the subject vs. sample definitions, we would also like to add 2 columns to the participants.tsv file:
species: string corresponding to the Binomial species name from NCBI Taxonomy, required when different from “Homo sapiens”
We think species should be in participants.tsv and not samples.tsv as it is an attribute of the subject and not the sample.
pathology: required when different from “Healthy”
In that case, pathology could be in either participants.tsv or samples.tsv as appropriate (e.g. healthy and non-healthy biopsy samples from the same subject).
Examples
File hierarchy and naming:
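As a rough illustration of the naming rule (sub-<label>, then the optional ses-<label>, then sample-<label>), the following Python sketch parses a file name into these entities. The file names, the "_img.tif" suffix and extension, and the function name parse_name are hypothetical, and the regular expression only covers these three entities, not the full BIDS grammar.

```python
import re

# Covers only sub, optional ses, and sample. Anything after the sample entity
# (further entities, suffix, extension) is matched loosely as "rest".
PATTERN = re.compile(
    r"^sub-(?P<sub>[a-zA-Z0-9]+)"
    r"(?:_ses-(?P<ses>[a-zA-Z0-9]+))?"
    r"_sample-(?P<sample>[a-zA-Z0-9]+)"
    r"(?!.*_ses-)"  # reject names where ses appears after sample
    r"(?P<rest>_.+)$"
)


def parse_name(filename):
    """Return the sub/ses/sample labels, or None if the name does not match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    return {key: match.group(key) for key in ("sub", "ses", "sample")}


# Hypothetical file names; "_img.tif" is a placeholder suffix and extension.
print(parse_name("sub-01_sample-A_img.tif"))
print(parse_name("sub-01_ses-02_sample-A_img.tif"))
print(parse_name("sub-01_sample-A_ses-02_img.tif"))  # ses after sample
```

The last name is rejected because the session entity comes after the sample entity, which is the ordering this proposal rules out.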
participants.tsv:
samples.tsv:
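To show how the two tables relate, here is a Python sketch with made-up content for participants.tsv and samples.tsv. All ids and values are hypothetical (including placing pathology in samples.tsv and the value "Tumor"); the sketch only illustrates linking each sample to its participant through participant_id so that a participant-level attribute such as species can be looked up per sample.

```python
import csv
import io

# Hypothetical table contents; all ids and values are made up.
PARTICIPANTS_TSV = (
    "participant_id\tspecies\n"
    "sub-01\tMus musculus\n"
    "sub-02\tMus musculus\n"
)
SAMPLES_TSV = (
    "sample_id\tparticipant_id\tsample_type\tpathology\n"
    "sample-A\tsub-01\ttissue\tHealthy\n"
    "sample-B\tsub-01\ttissue\tTumor\n"
    "sample-A\tsub-02\ttissue\tHealthy\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def samples_with_species(participants_text, samples_text):
    """Attach the participant-level species to every sample row."""
    species = {p["participant_id"]: p["species"] for p in read_tsv(participants_text)}
    merged = []
    for row in read_tsv(samples_text):
        if row["participant_id"] not in species:
            raise KeyError(f"{row['sample_id']} refers to unknown {row['participant_id']}")
        merged.append({**row, "species": species[row["participant_id"]]})
    return merged


for row in samples_with_species(PARTICIPANTS_TSV, SAMPLES_TSV):
    print(row["participant_id"], row["sample_id"], row["pathology"], row["species"])
```

Here sample-A is reused under sub-01 and sub-02, so a sample is identified by the combination of participant_id and sample_id rather than by sample_id alone.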