Remove the suffix and exclusively use entities in the filename. #58

Open
robertoostenveld opened this issue Sep 8, 2023 · 10 comments
Labels
common-principles Proposals to change common principles. entities Changes to entities. impact: medium Estimated medium impact change suffixes Changes to suffixes.

Comments

@robertoostenveld

The discussion bids-standard/bids-specification#1602 shows that there is no universal agreement on when some information is to be coded as a suffix (at the end of the filename just prior to the extension, e.g., _bold) or as an entity (like <key>-<value>).

I propose to remove that source of conflict in BIDS 2 by removing the suffix altogether. To me the suffix serves the same purpose as the value in an entity, except that the name has been left out. That is, I propose that _bold.nii.gz become _suffix-bold.nii.gz. Instead of suffix, another name (or names) could be given to these entities.
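
A minimal sketch of the renaming this would imply, assuming the current convention that the suffix is the last underscore-separated part before the first period. The function name `to_suffix_entity` and the default entity name `suffix` are placeholders for illustration, not part of any BIDS tooling.

```python
def to_suffix_entity(filename: str, key: str = "suffix") -> str:
    """Rewrite a current-style filename so the suffix becomes a key-value entity."""
    # Everything from the first period onward is the extension (e.g. "nii.gz").
    stem, dot, extension = filename.partition(".")
    parts = stem.split("_")
    # Under the current convention the last part is the suffix and contains no dash.
    if "-" in parts[-1]:
        raise ValueError(f"no suffix found in {filename!r}")
    parts[-1] = f"{key}-{parts[-1]}"
    return "_".join(parts) + dot + extension


print(to_suffix_entity("sub-01_task-rest_bold.nii.gz"))
# sub-01_task-rest_suffix-bold.nii.gz
```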

The consequence would be that the whole filename up to the first period . (which indicates the start of the file extension, see *) can be parsed on the underscore to separate entities, and each entity can be parsed on the dash to split its name and value.

*) The file extension (e.g., .tsv, .h5, .nii.gz) would remain as it is and provide information about how the file is technically to be parsed as an ASCII and/or binary stream.
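
The parsing rule described above can be sketched as follows. This assumes the proposed naming, in which every underscore-separated part is a key-value pair; `parse_proposed_filename` is a hypothetical helper name, not an existing API.

```python
def parse_proposed_filename(filename: str) -> tuple[dict[str, str], str]:
    """Split a proposed-style filename into its entities and its extension."""
    # The first period marks the start of the extension.
    stem, _, extension = filename.partition(".")
    entities = {}
    for part in stem.split("_"):
        # Each entity splits on the dash into a name and a value.
        key, dash, value = part.partition("-")
        if not dash or not key or not value:
            raise ValueError(f"not a key-value entity: {part!r}")
        entities[key] = value
    return entities, extension


print(parse_proposed_filename("sub-01_task-rest_suffix-bold.nii.gz"))
# ({'sub': '01', 'task': 'rest', 'suffix': 'bold'}, 'nii.gz')
```

A current-style name such as sub-01_bold.nii.gz raises a ValueError here, because the bare suffix is not a key-value pair.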

@oesteban

oesteban commented Sep 8, 2023

As I mentioned in the other thread, this inevitably leads to discussing whether the new entity (suffix-, if you will) is given priority to appear last, and with that, a strict definition of the ordering of entities.

In my opinion, removing the suffix damages human readability with a meager return.

@robertoostenveld
Author

The BIDS standard already specifies a strict ordering of the entities, and I am not proposing to change that.

The ordering of entities, and whether each is OPTIONAL, REQUIRED, or MUST NOT be specified for a given file type, is specified in the Entity Table.
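
As a rough illustration of what a strict ordering means in practice, the check below compares the entities in a filename against an ordered list. The `ENTITY_ORDER` list is a small illustrative subset chosen for this sketch; the authoritative order is the one in the Entity Table, and `entities_in_order` is a hypothetical helper name.

```python
# Illustrative subset only; the authoritative order is in the BIDS Entity Table.
ENTITY_ORDER = ["sub", "ses", "task", "acq", "run"]


def entities_in_order(filename: str) -> bool:
    """Return True if the key-value entities appear in the listed order."""
    stem = filename.partition(".")[0]
    # Keep only key-value parts; a bare suffix (no dash) is ignored.
    keys = [part.partition("-")[0] for part in stem.split("_") if "-" in part]
    # list.index raises ValueError for keys missing from the subset above.
    positions = [ENTITY_ORDER.index(key) for key in keys]
    # Strictly increasing positions also reject repeated entities.
    return all(a < b for a, b in zip(positions, positions[1:]))


print(entities_in_order("sub-01_ses-02_task-rest_run-1_bold.nii.gz"))  # True
print(entities_in_order("sub-01_run-1_task-rest_bold.nii.gz"))  # False
```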

@oesteban

oesteban commented Sep 8, 2023

> The BIDS standard already specifies a strict ordering of the entities, and I am not proposing to change that.
>
> The ordering of entities, and whether each is OPTIONAL, REQUIRED, or MUST NOT be specified for a given file type, is specified in the Entity Table.

Sure, that's manageable for BIDS "raw". But the problem scales with the number of entities, and BIDS Derivatives is set out to define a fair bunch.

> The discussion bids-standard/bids-specification#1602 shows that there is no universal agreement on when some information is to be coded as a suffix (at the end of the filename just prior to the extension, e.g., _bold) or as an entity (like <key>-<value>).

That proposal does not point at such a problem, only the discussion after it could be interpreted in that way. This proposal (i.e., removing the suffix) does not describe what it is solving. It just opens some flexibility with two goals:

  1. to add a name to the suffix entity so that strong opinions can be tempered and say, "if you don't like modality-, we create one more suffix-like entity (i.e., last required) that is of your liking, what about sampling-"; and
  2. the suffix is not under a controlled vocabulary anymore, so the user has total flexibility over what goes last.

I think (1) is just a countermeasure to open up space for agreement on a problem we currently don't have, and (2) leads to total flexibility that will require additional metadata to describe the dataset. (2) is not a bad idea in theory, but I would honestly move to other alternatives (I have said NIDM a bunch of times) with more programmatic and reliable foundations to describe the data. BIDS should offer something easy to use and highly readable for humans.

@yarikoptic
Contributor

In general I agree with the motivation for the change. I would only vote against adding yet another semantically meaningless _suffix- entity, and would instead see which entities the current values would need to be mapped to, starting from the current ones and providing such a mapping for at least a good portion of them. But it would require some thought about semantically meaningful entities. FTR -- ATM we seem to have 103 suffixes within suffixes.yaml. _mod- could have absorbed T1w and inplaneT1, since that is where we currently specify that those suffixes be placed when creating a derived (e.g. _defacemask) image. But something like _defacemask and _mask would then not fit _mod. What would that be?

@oesteban

oesteban commented Sep 8, 2023

> In general I agree with the motivation for the change.

And what is that motivation? I truly don't know what it is.

@yarikoptic
Contributor

yarikoptic commented Sep 8, 2023

ATM the suffix has no clear semantic meaning. It aims to be a "human accessible term best describing what is in the file", the values of which are a mix of:

  • contrast within neuroimaging modalities (all the _T1w, ...)
  • actual modality (e.g., _eeg but not some "bands" within eeg modality),
  • frequently used/needed derived measure (e.g, _defacemask)
  • generic type of some spatially specific derived measures such as quantization (e.g., _dseg),
  • (soon, variety of BEPs) various mixing _???maps per now formalized guidelines
  • etc.

I think it is a result of this lack of semantic clarity that, when contemplating new "suffixes", it becomes unclear what should go into the suffix vs. some other entity: should a new suffix be created, or an entity, or a mix of the two, etc. And that is what I think prompted @robertoostenveld to file this issue.
I remember us stumbling on how to formalize the naming of derived files, and that is how, IIRC, _mod for _mod-T1w_defacemask was born, since we had to place the existing suffix somewhere.

@TheChymera

@oesteban

> And with that, a strict definition of the ordering of entities.

Don't we already have this? I've always seen subject and session first, or is this slated to be removed in BIDS2?

@yarikoptic
Contributor

yarikoptic commented Feb 20, 2024

Yes, we have a clear ordering and, AFAIK, always have had so far. What is to be done for BIDS 2.0, or whether there would be an effect from

is yet to be decided. Not sure what @oesteban had in mind while talking about derivatives since, as @robertoostenveld pointed out, the order is universal across modalities and specified in https://github.com/bids-standard/bids-specification/blob/master/src/schema/rules/entities.yaml . Note that _mod, which is the closest to possibly absorbing the suffix, is in the middle of the ordering.

@yarikoptic
Contributor

Above I think I forgot to mention another potential suffix purpose (or maybe the only one to leave, in the scope of #54 ): {entity_plural}.{tsv,json}, to generalize the already present

  • participants.{tsv,json} -- we just do not have (yet) subjects under another level of entity, but in the scope of Multi-site/center studies #11 , it could easily be center-DBIC/center-DBIC_participants.tsv
  • *_electrodes.{tsv,json}
  • ... any other similar kind, should we define an entity for it (we already have some, e.g. for sessions and descriptions!)
❯ grep '^[^ ]*s:' ./src/schema/objects/suffixes.yaml | grep -v -E 'nirs'
channels:
descriptions:
electrodes:
events:
markers:
optodes:
scans:
sessions:
svs:
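
For readers without a checkout of the specification, the same filter can be sketched in Python. The `sample_schema` text below is a made-up excerpt used only to show the matching logic; the real keys live in src/schema/objects/suffixes.yaml, and `plural_like_suffixes` is a hypothetical helper name.

```python
import re

# Hypothetical excerpt for illustration; not the real content of suffixes.yaml.
sample_schema = """\
bold:
  display_name: BOLD
channels:
  display_name: Channels
events:
  display_name: Events
nirs:
  display_name: NIRS
T1w:
  display_name: T1w
"""


def plural_like_suffixes(schema_text: str) -> list[str]:
    """Mimic: grep '^[^ ]*s:' ... | grep -v -E 'nirs'."""
    names = []
    for line in schema_text.splitlines():
        # Unindented keys ending in "s", i.e. candidates for plural names.
        match = re.match(r"([^ ]*s):", line)
        if match and "nirs" not in line:
            names.append(match.group(1))
    return names


print(plural_like_suffixes(sample_schema))
# ['channels', 'events']
```

Like the grep, this only tests for a trailing "s", so it also keeps non-plural names such as svs.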

@robertoostenveld
Author

robertoostenveld commented Dec 19, 2024

Let me revisit this. The reason to raise this issue in the first place was the discussions that ensued in the drafting stage of multiple BEPs that I followed closely, on whether something new should become a new suffix or be coded as a (new or existing) entity. This lack of clarity across the different and quite diverse BEP teams shows that people in general don't understand or agree on the difference between a suffix and an entity. Removing the suffix, as I proposed, would be a way to avoid this, albeit a crude way that certain people appear to dislike.

Yarik raises the plurality in {entity_plural}.{tsv,json} as a rationale that is already used in tabular files across modalities that contain lists of multiple things (hence plural). His pointing to participants.tsv triggered another thought, namely that it is interesting to consider consistent file naming rules for all files. Whereas most files use <entities>_<suffix>.<extension>, participants.tsv stands out, as does samples.tsv for microscopy, and possibly some others. Although I never thought about these files in this way, you could say that in that case participants or samples is the suffix, just as we would say that channels and electrodes are the suffixes for the respective files in EEG.

Imagine (note this is a thought experiment, not a real proposal) that we were to drop all entities from every file name in a simple BIDS dataset, as is already the case for participants.tsv and samples.tsv. If the files did not have a suffix to start with, there would be no file name left, only the file extension. Not having a file name is undesirable, if you ask me, and has unintended consequences on Unix file systems, where the files would become hidden. So in that case I would say that mri.nii or pet.nii or eeg.edf or meg.fif would be the minimal full filenames that convey meaning about the file content (the name) and the file format (the extension). This minimal "content+format" rule might help to clarify what goes in a suffix, whereas additional information/metadata to document and distinguish files goes in the key-val entities.

Note that dataset_description.json is an odd one out where my rule does not apply, since if description were the suffix, the preceding dataset is not a list of key-val entities. Note also that the agnostic README, CHANGES and LICENSE files are inconsistent under the "content+format" rule, but CITATION.cff is consistent again, and README is allowed (but not required) to have one of three extensions (txt, rst, md).
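
The "content+format" reading can be tried out with a small sketch. It assumes the current <entities>_<suffix>.<extension> pattern, treats the last underscore-separated part as the suffix, and uses a hypothetical helper name, `split_bids_name`.

```python
def split_bids_name(filename: str) -> tuple[dict[str, str], str, str]:
    """Split a current-style name into (entities, suffix, extension)."""
    stem, _, extension = filename.partition(".")
    # The last underscore-separated part is taken to be the suffix.
    *entity_parts, suffix = stem.split("_")
    entities = {}
    for part in entity_parts:
        key, dash, value = part.partition("-")
        if not dash:
            raise ValueError(f"{part!r} is not a key-value entity")
        entities[key] = value
    return entities, suffix, extension


print(split_bids_name("sub-01_task-rest_eeg.edf"))
# ({'sub': '01', 'task': 'rest'}, 'eeg', 'edf')
print(split_bids_name("participants.tsv"))
# ({}, 'participants', 'tsv')
print(split_bids_name("README"))
# ({}, 'README', '')
```

Under this sketch dataset_description.json raises a ValueError, because dataset is not a key-value entity, and README comes back with an empty extension. Those are the same inconsistencies noted above.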
