Allow for .tsv (in addition to .tsv.gz) for _physio files #472

yarikoptic · 2020-05-15T11:57:56Z

Have you checked our contributing guide? yes (awhile back ;-))
Is your idea backwards compatible? yes
Is there already a group working on your idea? didn't find any issue*
Will your idea potentially require a large effort? not at all, I will submit a PR if there is some approval

ATM we do allow for both .nii and .nii.gz; typically we demand .tsv (without .gz) for tabular data and then for _physio we demand it to be .gz-ed. No option is given to have them not .gz'ed.

Use case: HCP dataset provides physio recordings a tab separated file (with an extension .txt), e.g. as in MNINonLinear/Results/rfMRI_REST1_LR/rfMRI_REST1_LR_Physio_log.txt. There is an ongoing effort to provide ready to use datalad datasets with HCP data layed out to BIDS. As files are not publicly available, we cannot just gzip them and place under datalad control [*] - they must come "raw" ;)

So what would be your opinion on allowing .tsv in addition to .tsv.gz for _physio?

[*] theoretically speaking someone could provide a git-annex special remote opposite of datalad-archive which would compress them from uncompressed version "on the fly". Any takers?

The text was updated successfully, but these errors were encountered:

effigies · 2020-05-15T12:34:08Z

Would this be more like our usual tabular data, with column names as headers in the TSV file, or like physio.tsv.gz, with column names as in JSON sidecars?

yarikoptic · 2020-05-15T13:07:45Z

there is no header in those files, so I guess ideally we should provide those in sidecar .json indeed (which we could just commit to git, no secrets there I hope ;))

sappelhoff · 2020-05-15T16:24:41Z

I am not aware why physio TSVs are required to be gzipped.

Rather from the opposite angle, I remember a discussion where we stopped supporting fif.gz files (now we only allow .fif files) because:

the amount of saved space is negligible (so was the argument)
the increased burden on the user outweighs the benefits

so from a naive viewpoint I don't see what's wrong with allowing TSV for physio files. But please enlighten me about the drawbacks.

effigies · 2020-05-15T16:39:22Z

This is one of the weirder corners of BIDS, and I think it was related to the conventions of some specific (set of) tool(s) that made conforming to the usual TSV conventions problematic, but I wasn't there for these conversations. I'm not sure whether the .gz was because that's what the tool expected or because ASCII-formatted time series can get large and are amenable to compression.

Anyway, as long as there isn't a change in expectation of what's in the file whether it's gzipped or not, I don't see any reason not to make the gzipping optional.

chrisgorgo · 2020-05-15T16:52:58Z

tsv.gz was an attempt to provide compressed format for dense recordings without adding heavy dependency like HDF5. @mih was involved in some of these discussions and can provide another angle.

One thing y'all should consider is that if there are apps assuming that tsv.gz is the only format this data works with those apps will break if you add another format.

mih · 2020-05-15T19:35:17Z

IIRC the desire for compression came from eyetracking (think: 3 or 4 timeseries, with dense sampling with 1k or 2k Hz covering more than hour per subject). The difference can be substantial. Overhead for decompression is minimal. Tools like zrun (from the moreutils package) enable interoperability even with software that cannot handle compressed files.

sappelhoff · 2020-07-01T19:41:08Z

@robertoostenveld has the following opinion about this:

I think it makes sense to make that consistent, i.e. make it optional for all tsv files to be gzipped (or not).

see https://github.com/bids-standard/bids-validator/issues/990#issuecomment-649619436

oesteban · 2024-04-02T08:47:59Z

I was unaware of this. Under BEP020 (#1128) we are proposing a new suffix physioevents.tsv.gz to encode the logs mentioned on top and other types of information (e.g., blinks and saccades models, calibration, and other derived parameters provided by the eyetracker).

I'm not strong on it being a .tsv.gz file always, but for now we are writing it that way.

@yarikoptic please have a look if you can.

arnodelorme · 2024-04-16T19:19:34Z

Agree with @robertoostenveld

Forcing ".gz" compression of the .tsv file makes little sense to me. Some physio files (eye-tracking files, for example) have low channels and low sampling frequency, so raw .tsv files should be allowed because they are usually not that big anyway. Also, ".gz" compression is problematic. This is not a standard compression scheme on Windows. BIDS is supposed to be easy to use. The ".gz" compression may force some users (including naive users who only want to open the file in Excel) to install specific software. Preparing a physio file on Windows will be nearly impossible for naive users who do not master the DOS command line or install Cygwin or another similar tool. Forcing people to use a Unix compression scheme on Windows decreases adoption. Let's allow standard .tsv in addition to compressed .tsv files for increased ease of use. @CPernet

yarikoptic · 2024-04-16T20:59:56Z

I also agree with @robertoostenveld as it should be optional and allow for either with or without .gz.

I do appreciate some added complexity to read data from .gz files, but IMHO it should just be a stimulant for tools to "manage to do that" finally. Easy to do in linux shell, in Python, visidata opens them just fine, ... There should be conveniences in other languages/environments for such trivial operation IMHO.

One other "issue" here though that at least _physio.tsv.gz aren't just .tsv which are gzipped -- they lack header entirely, which I think is in general assumed to be present in .tsv files, e.g.

$> zcat ds002737/sub-01/ses-01/func/sub-01_ses-01_task-Stroop_acq-cf0AP_run-03_physio.tsv.gz | head -n 2
1838	1279	0
1838	1279	0

that is what @effigies made me aware about but I so far (on a quick look) failed to find description of such peculiarity in the bids-specification... yet to see more on that and worth fixing for bids 2.0

oesteban · 2024-04-16T21:01:21Z

Some physio files (eye-tracking files, for example) have low channels and low sampling frequency, so raw .tsv files should be allowed

BIDS is an exchange format, so I would lean towards compression (and gz doesn't seem such a bad choice in that regard) over plain text as soon as the number of rows (and potentially columns or channels, to keep the nomenclature) exceeds what a human being can comfortably read on a text terminal. Even at 500 Hz you get 60k rows in two minutes of experiment. In that scenario, I prefer that (naive) user to find it hard to open the file over knocking their computer out by trying to open a large plain-text tsv file with Excel.

I'd be fine if BIDS said something like: use tsv.gz for any tabular data surpassing, say, 1,000 lines. However, unless the threshold is chosen differently and much higher, we can find ourselves compressing events files that are thought to be somewhat readable as plain text.

Preparing a physio file on Windows will be nearly impossible for naive users who do not master the DOS command line or install Cygwin or another similar tool.

What do you mean by preparing? Are you referring to encoding a BIDS dataset or processing it? Either way, the kind of user you are referring to will probably run some existing tool available to them for the task, and compression will be the least of their concerns. Also, most of the BIDS datasets will have NIfTIs compressed for the reasons above—the idea is to share the data, and it takes less time to transfer a file when it is compressed.

I am skeptical that the naive user confused by a tsv.gz file will be able to do anything with it, even if a friend decompresses it for them. Therefore, compression is a concern just for the software application that digests or produces BIDS. Implementing (de)compression does not seem that hard for developers. For BIDS-consuming applications, the developer will likely open these files and change them to a more suitable format anyway.

Forcing people to use a Unix compression scheme on Windows decreases adoption.

As far as I understand, GZip is just an open file format widely used in Internet applications that Windows users also operate. If you mean decreasing the adoption of Windows, I'm not very concerned ;)

Finally, I believe @yarikoptic's angle in his initial proposal is different. One 'weirdness' of BIDS is that, at the moment, the recordings in a tsv.gz file MUST be continuous (i.e., equispaced and univocal samples), which is incompatible with data types of different nature (what in BEP020 we are calling "physioevents"). If I interpret the goal of his proposal, it tries to cover these "events" that are not task events.

The header issue does not seem that bad to me - we can keep the same convention (tsv have header, tsv.gz do not) and everything keeps working (it's quirky but okay).

oesteban · 2024-04-16T21:05:23Z

BTW - the upcoming release of BIDS is more clear regarding tsv.gz (https://bids-specification.readthedocs.io/en/latest/common-principles.html#compressed-tabular-files)

CPernet · 2024-04-17T06:20:18Z

my two cents are (1) I do not have an opinion on if it's good or not to use .gz (2) since we do not enforce .nii.gz we should not enforce .tsv.gz to be consistent across 'modalities'

CPernet · 2024-04-17T06:22:38Z

also +1 @oesteban clearly defining compressed vs non-compressed definitions solves use cases

oesteban · 2024-04-17T10:54:58Z

(2) since we do not enforce .nii.gz we should not enforce .tsv.gz to be consistent across 'modalities'

Just a couple of notes:

a NIfTI file is never human readable, so for the user there's little difference between nii and nii.gz.

However, I do appreciate the function of tsv.gz to actually signal this point to the user. It's also very convenient with managing the data with DataLad, I typically push tsv.gz to the annex (they are big files, nonreadable by eye) and keep tsv in git.
In addition to that, while nii and nii.gz are completely consistent (nii.gz is just a compressed version of the nii file) this is not true for tsv (and that can also leak into the json sidecar) because of the header-less rule on tsv.gz.

arnodelorme · 2024-04-17T15:55:35Z

@oesteban what is the rational for the header rule?

oesteban · 2024-04-17T16:00:02Z

@oesteban what is the rational for the header rule?

I have no idea. The problem, to me, is more about having two encodings rather than what the encoding specifically requires (header/headerless)

arnodelorme · 2024-04-17T16:07:45Z

@robertoostenveld @CPernet @effigies @yarikoptic @Remi-Gau @sappelhoff do you know the rationale for the .tsv headerless rule (when .tsv are compressed as .tsv.gz, the header is removed and stored in another file).

Remi-Gau · 2024-04-17T16:35:55Z

That is the only rationale I know about:

https://bids-specification.readthedocs.io/en/latest/common-principles.html#compressed-tabular-files

I was not around when this decision was made.

arnodelorme · 2024-04-17T16:50:22Z

Thanks, @Remi-Gau. Seems to me like the decision was based on software formats more than user-friendliness. Also, why impose it across all data files in BIDS if it is just two software?

Remi-Gau · 2024-04-17T17:10:00Z

Have been wondering about this for a while too.

yarikoptic · 2024-04-17T19:11:57Z

I think overall verdict would be that this issue is a "no go" for BIDS 1.0 primarily due to internal difference (header/no-header) within .tsv/.tsv.gz and original focus on having support by the tools. I would keep discussion for BIDS 2.0 in

Uniformize .tsv and .tsv.gz: both to have header + both have "Columns" dictionary (!not list) in sidecar .json bids-2-devel#71

please 👍 or 👎 at least to establish the direction ;-)

PS feel welcome to reopen if you think there is more to do here

oesteban · 2024-04-18T11:39:30Z

@arnodelorme

Seems to me like the decision was based on software formats more than user-friendliness.

Since TSV files have header (except for motion), the JSON object does not define a "Columns" entry because the user can peek at the TSV header.

However, because the user cannot peek at the TSV.GZ header without decompressing, BIDS moves the definition to the sidecar JSON that is human readable. Forbidding the header in TSV.GZ is logical to preempt inconsistency (having some headers in the TSV.GZ file and have inconsistent "Columns" metadata in the JSON, the same way the "Columns" field should be IMHO forbidden for plain-text TSV).

Therefore, this actually looks like a user-centered decision as software tools can easily drop the header if it is there.

yarikoptic mentioned this issue May 15, 2020

add BIDS conversion script datalad-datasets/hcp-functional-connectivity#1

Open

sappelhoff mentioned this issue Jun 25, 2020

Why are "physio.tsv.gz" and its JSON currently not valid for eeg/meg/ieeg ? bids-standard/legacy-validator#990

Closed

sappelhoff added discussion ongoing discussion opinions wanted Please read and offer your opinion on this matter labels Jul 22, 2020

sappelhoff mentioned this issue Mar 26, 2024

[ENH] Add compressed TSV files to the common principles #1749

Merged

Remi-Gau mentioned this issue Apr 16, 2024

Forcing .gz compression on physio files not optimal #1788

Closed

Remi-Gau added the physio label Apr 16, 2024

yarikoptic closed this as completed Apr 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow for .tsv (in addition to .tsv.gz) for _physio files #472

Allow for .tsv (in addition to .tsv.gz) for _physio files #472

yarikoptic commented May 15, 2020

effigies commented May 15, 2020

yarikoptic commented May 15, 2020

sappelhoff commented May 15, 2020

effigies commented May 15, 2020

chrisgorgo commented May 15, 2020

mih commented May 15, 2020

sappelhoff commented Jul 1, 2020

oesteban commented Apr 2, 2024

arnodelorme commented Apr 16, 2024

yarikoptic commented Apr 16, 2024

oesteban commented Apr 16, 2024

oesteban commented Apr 16, 2024

CPernet commented Apr 17, 2024

CPernet commented Apr 17, 2024 •

edited

Loading

oesteban commented Apr 17, 2024

arnodelorme commented Apr 17, 2024

oesteban commented Apr 17, 2024

arnodelorme commented Apr 17, 2024 •

edited

Loading

Remi-Gau commented Apr 17, 2024

arnodelorme commented Apr 17, 2024

Remi-Gau commented Apr 17, 2024

yarikoptic commented Apr 17, 2024

oesteban commented Apr 18, 2024 •

edited

Loading

Allow for .tsv (in addition to .tsv.gz) for _physio files #472

Allow for .tsv (in addition to .tsv.gz) for _physio files #472

Comments

yarikoptic commented May 15, 2020

effigies commented May 15, 2020

yarikoptic commented May 15, 2020

sappelhoff commented May 15, 2020

effigies commented May 15, 2020

chrisgorgo commented May 15, 2020

mih commented May 15, 2020

sappelhoff commented Jul 1, 2020

oesteban commented Apr 2, 2024

arnodelorme commented Apr 16, 2024

yarikoptic commented Apr 16, 2024

oesteban commented Apr 16, 2024

oesteban commented Apr 16, 2024

CPernet commented Apr 17, 2024

CPernet commented Apr 17, 2024 • edited Loading

oesteban commented Apr 17, 2024

arnodelorme commented Apr 17, 2024

oesteban commented Apr 17, 2024

arnodelorme commented Apr 17, 2024 • edited Loading

Remi-Gau commented Apr 17, 2024

arnodelorme commented Apr 17, 2024

Remi-Gau commented Apr 17, 2024

yarikoptic commented Apr 17, 2024

oesteban commented Apr 18, 2024 • edited Loading

CPernet commented Apr 17, 2024 •

edited

Loading

arnodelorme commented Apr 17, 2024 •

edited

Loading

oesteban commented Apr 18, 2024 •

edited

Loading