Allow for .tsv (in addition to .tsv.gz) for _physio files #472
Comments
Would this be more like our usual tabular data, with column names as headers in the TSV file, or like |
there is no header in those files, so I guess ideally we should provide those in sidecar .json indeed (which we could just commit to git, no secrets there I hope ;)) |
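A minimal sketch of what reading such a headerless file could look like, assuming the column names live in a sidecar JSON under a "Columns" key. The file names and sample values below are made up for illustration, and `read_physio` is a hypothetical helper, not part of any BIDS tool.

```python
import csv
import gzip
import json
import tempfile
from pathlib import Path


def read_physio(tsv_gz_path, json_path):
    """Read a headerless .tsv.gz, taking column names from the sidecar JSON."""
    with open(json_path, encoding="utf-8") as f:
        columns = json.load(f)["Columns"]
    with gzip.open(tsv_gz_path, "rt", encoding="utf-8", newline="") as f:
        # One dict per sample, keyed by the names declared in the sidecar.
        rows = [dict(zip(columns, map(float, row)))
                for row in csv.reader(f, delimiter="\t") if row]
    return columns, rows


# Hypothetical file names and values, written to a temporary directory.
tmp = Path(tempfile.mkdtemp())
sidecar = tmp / "sub-01_task-rest_physio.json"
data = tmp / "sub-01_task-rest_physio.tsv.gz"
sidecar.write_text(json.dumps({
    "SamplingFrequency": 100.0,
    "StartTime": 0.0,
    "Columns": ["cardiac", "respiratory"],
}), encoding="utf-8")
with gzip.open(data, "wt", encoding="utf-8", newline="") as f:
    f.write("0.1\t1.5\n0.2\t1.6\n")  # no header row in the compressed file

columns, rows = read_physio(data, sidecar)
print(columns, rows)
```

Since the sidecar is small plain-text JSON, it can be committed to git directly while the data file stays under annex control.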
I am not aware why physio TSVs are required to be gzipped. Rather from the opposite angle, I remember a discussion where we stopped supporting
so from a naive viewpoint I don't see what's wrong with allowing TSV for physio files. But please enlighten me about the drawbacks. |
This is one of the weirder corners of BIDS, and I think it was related to the conventions of some specific (set of) tool(s) that made conforming to the usual TSV conventions problematic, but I wasn't there for these conversations. I'm not sure whether the Anyway, as long as there isn't a change in expectation of what's in the file whether it's gzipped or not, I don't see any reason not to make the gzipping optional. |
One thing y'all should consider is that if there are apps assuming that tsv.gz is the only format this data works with those apps will break if you add another format. |
IIRC the desire for compression came from eyetracking (think: 3 or 4 timeseries, densely sampled at 1 or 2 kHz, covering more than an hour per subject). The difference can be substantial. Overhead for decompression is minimal. Tools like |
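To get a feel for the size difference, here is a rough sketch that builds a synthetic recording in memory and compares the plain and gzipped sizes. The signal (three noisy sine waves), the 1 kHz rate, the one-minute duration, and the four-decimal formatting are all assumptions chosen for illustration; real recordings will compress differently.

```python
import gzip
import math
import random

rng = random.Random(0)   # fixed seed so the synthetic data is reproducible
sampling_hz = 1000       # assumed sampling rate
duration_s = 60          # assumed duration, kept short so the sketch runs fast
n_channels = 3           # assumed number of timeseries

lines = []
for i in range(sampling_hz * duration_s):
    t = i / sampling_hz
    sample = [math.sin(2 * math.pi * (ch + 1) * t) + rng.gauss(0, 0.01)
              for ch in range(n_channels)]
    lines.append("\t".join(f"{v:.4f}" for v in sample))

raw = ("\n".join(lines) + "\n").encode("utf-8")  # what a plain .tsv would hold
packed = gzip.compress(raw)                      # what a .tsv.gz would hold

print(f"rows: {len(lines)}")
print(f"plain .tsv bytes: {len(raw)}")
print(f".tsv.gz bytes: {len(packed)}")
print(f"ratio: {len(raw) / len(packed):.1f}x")
```

The ratio depends heavily on how noisy the signal is and how many digits are written, so treat the number printed here as an order-of-magnitude indication only.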
@robertoostenveld has the following opinion about this:
see https://github.com/bids-standard/bids-validator/issues/990#issuecomment-649619436 |
I was unaware of this. Under BEP020 (#1128) we are proposing a new suffix I'm not strong on it being a .tsv.gz file always, but for now we are writing it that way. @yarikoptic please have a look if you can. |
Agree with @robertoostenveld Forcing ".gz" compression of the .tsv file makes little sense to me. Some physio files (eye-tracking files, for example) have few channels and a low sampling frequency, so raw .tsv files should be allowed because they are usually not that big anyway. Also, ".gz" compression is problematic. This is not a standard compression scheme on Windows. BIDS is supposed to be easy to use. The ".gz" compression may force some users (including naive users who only want to open the file in Excel) to install specific software. Preparing a physio file on Windows will be nearly impossible for naive users who have not mastered the DOS command line or installed Cygwin or another similar tool. Forcing people to use a Unix compression scheme on Windows decreases adoption. Let's allow standard .tsv in addition to compressed .tsv files for increased ease of use. @CPernet |
I also agree with @robertoostenveld as it should be optional and allow for either with or without I do appreciate some added complexity to read data from One other "issue" here though that at least
that is what @effigies made me aware about but I so far (on a quick look) failed to find description of such peculiarity in the bids-specification... yet to see more on that and worth fixing for bids 2.0 |
BIDS is an exchange format, so I would lean towards compression (and gz doesn't seem such a bad choice in that regard) over plain text as soon as the number of rows (and potentially columns or channels, to keep the nomenclature) exceeds what a human being can comfortably read on a text terminal. Even at 500 Hz you get 60k rows in two minutes of experiment. In that scenario, I prefer that (naive) user to find it hard to open the file over knocking their computer out by trying to open a large plain-text tsv file with Excel. I'd be fine if BIDS said something like: use tsv.gz for any tabular data surpassing, say, 1,000 lines. However, unless the threshold is chosen differently and much higher, we can find ourselves compressing events files that are thought to be somewhat readable as plain text.
What do you mean by preparing? Are you referring to encoding a BIDS dataset or processing it? Either way, the kind of user you are referring to will probably run some existing tool available to them for the task, and compression will be the least of their concerns. Also, most of the BIDS datasets will have NIfTIs compressed for the reasons above—the idea is to share the data, and it takes less time to transfer a file when it is compressed. I am skeptical that the naive user confused by a tsv.gz file will be able to do anything with it, even if a friend decompresses it for them. Therefore, compression is a concern just for the software application that digests or produces BIDS. Implementing (de)compression does not seem that hard for developers. For BIDS-consuming applications, the developer will likely open these files and change them to a more suitable format anyway.
As far as I understand, GZip is just an open file format widely used in Internet applications that Windows users also operate. If you mean decreasing the adoption of Windows, I'm not very concerned ;) Finally, I believe @yarikoptic's angle in his initial proposal is different. One 'weirdness' of BIDS is that, at the moment, the recordings in a tsv.gz file MUST be continuous (i.e., equispaced and univocal samples), which is incompatible with data types of different nature (what in BEP020 we are calling "physioevents"). If I interpret the goal of his proposal, it tries to cover these "events" that are not task events. The header issue does not seem that bad to me - we can keep the same convention (tsv have header, tsv.gz do not) and everything keeps working (it's quirky but okay). |
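If both encodings were allowed under the convention mentioned above (plain .tsv carries a header row, .tsv.gz does not and relies on the sidecar "Columns"), a consuming tool could dispatch on the extension. This is only a sketch of that hypothetical arrangement; `load_table` and the file names are invented for illustration.

```python
import csv
import gzip
import json
import tempfile
from pathlib import Path


def load_table(path, sidecar=None):
    """Return (columns, rows) for a .tsv (header row) or .tsv.gz (no header)."""
    path = Path(path)
    if path.name.endswith(".tsv.gz"):
        # Headerless: column names must come from the sidecar JSON.
        if sidecar is None:
            raise ValueError("headerless .tsv.gz needs a sidecar with 'Columns'")
        columns = json.loads(Path(sidecar).read_text(encoding="utf-8"))["Columns"]
        with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
            rows = [r for r in csv.reader(f, delimiter="\t") if r]
    elif path.name.endswith(".tsv"):
        # Plain text: the first row is the header.
        with open(path, encoding="utf-8", newline="") as f:
            reader = csv.reader(f, delimiter="\t")
            columns = next(reader)
            rows = [r for r in reader if r]
    else:
        raise ValueError(f"unsupported extension: {path.name}")
    return columns, rows


# Write the same (made-up) recording in both encodings and compare.
tmp = Path(tempfile.mkdtemp())
plain = tmp / "sub-01_physio.tsv"
packed = tmp / "sub-01_physio.tsv.gz"
sidecar = tmp / "sub-01_physio.json"
plain.write_text("cardiac\trespiratory\n0.1\t1.5\n", encoding="utf-8")
with gzip.open(packed, "wt", encoding="utf-8", newline="") as f:
    f.write("0.1\t1.5\n")
sidecar.write_text(json.dumps({"Columns": ["cardiac", "respiratory"]}),
                   encoding="utf-8")

from_plain = load_table(plain)
from_packed = load_table(packed, sidecar)
print(from_plain == from_packed)
```

The point of the sketch is that the two encodings need two code paths, which is the "two encodings" cost raised in the thread.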
BTW - the upcoming release of BIDS is more clear regarding tsv.gz (https://bids-specification.readthedocs.io/en/latest/common-principles.html#compressed-tabular-files) |
my two cents are (1) I do not have an opinion on if it's good or not to use .gz (2) since we do not enforce .nii.gz we should not enforce .tsv.gz to be consistent across 'modalities' |
also +1 @oesteban clearly defining compressed vs non-compressed definitions solves use cases |
Just a couple of notes:
|
@oesteban what is the rationale for the header rule? |
I have no idea. The problem, to me, is more about having two encodings rather than what the encoding specifically requires (header/headerless) |
@robertoostenveld @CPernet @effigies @yarikoptic @Remi-Gau @sappelhoff do you know the rationale for the .tsv headerless rule (when .tsv files are compressed as .tsv.gz, the header is removed and stored in another file)? |
That is the only rationale I know about: https://bids-specification.readthedocs.io/en/latest/common-principles.html#compressed-tabular-files I was not around when this decision was made. |
Thanks, @Remi-Gau. Seems to me like the decision was based on software formats more than user-friendliness. Also, why impose it across all data files in BIDS if it is just two software tools? |
Have been wondering about this for a while too. |
I think the overall verdict would be that this issue is a "no go" for BIDS 1.0, primarily due to the internal difference (header/no-header) between .tsv and .tsv.gz and the original focus on having support by the tools. I would keep discussion for BIDS 2.0 in please 👍 or 👎 at least to establish the direction ;-) PS feel welcome to reopen if you think there is more to do here |
Since TSV files have a header (except for motion), the JSON object does not define a "Columns" entry because the user can peek at the TSV header. However, because the user cannot peek at the TSV.GZ header without decompressing, BIDS moves the definition to the sidecar JSON, which is human readable. Forbidding the header in TSV.GZ is logical to preempt inconsistency (a header in the TSV.GZ file that disagrees with the "Columns" metadata in the JSON), the same way the "Columns" field should IMHO be forbidden for plain-text TSV. Therefore, this actually looks like a user-centered decision, as software tools can easily drop the header if it is there. |
Have you checked our contributing guide? yes (awhile back ;-))
Is your idea backwards compatible? yes
Is there already a group working on your idea? didn't find any issue*
Will your idea potentially require a large effort? not at all, I will submit a PR if there is some approval
ATM we do allow for both .nii and .nii.gz; typically we demand .tsv (without .gz) for tabular data, and then for _physio we demand it to be .gz-ed. No option is given to have them not .gz'ed.
Use case: the HCP dataset provides physio recordings as a tab separated file (with an extension .txt), e.g. as in MNINonLinear/Results/rfMRI_REST1_LR/rfMRI_REST1_LR_Physio_log.txt. There is an ongoing effort to provide ready to use datalad datasets with HCP data laid out to BIDS. As files are not publicly available, we cannot just gzip them and place under datalad control [*] - they must come "raw" ;)
So what would be your opinion on allowing .tsv in addition to .tsv.gz for _physio?
[*] theoretically speaking someone could provide a git-annex special remote opposite of datalad-archive which would compress them from uncompressed version "on the fly". Any takers?