Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for .tsv (in addition to .tsv.gz) for _physio files #472

Closed
yarikoptic opened this issue May 15, 2020 · 23 comments
Closed

Allow for .tsv (in addition to .tsv.gz) for _physio files #472

yarikoptic opened this issue May 15, 2020 · 23 comments
Labels
discussion ongoing discussion opinions wanted Please read and offer your opinion on this matter physio

Comments

@yarikoptic
Copy link
Collaborator

  • Have you checked our contributing guide? yes (awhile back ;-))

  • Is your idea backwards compatible? yes

  • Is there already a group working on your idea? didn't find any issue*

  • Will your idea potentially require a large effort? not at all, I will submit a PR if there is some approval

ATM we do allow for both .nii and .nii.gz; typically we demand .tsv (without .gz) for tabular data and then for _physio we demand it to be .gz-ed. No option is given to have them not .gz'ed.

Use case: HCP dataset provides physio recordings a tab separated file (with an extension .txt), e.g. as in MNINonLinear/Results/rfMRI_REST1_LR/rfMRI_REST1_LR_Physio_log.txt. There is an ongoing effort to provide ready to use datalad datasets with HCP data layed out to BIDS. As files are not publicly available, we cannot just gzip them and place under datalad control [*] - they must come "raw" ;)

So what would be your opinion on allowing .tsv in addition to .tsv.gz for _physio?

[*] theoretically speaking someone could provide a git-annex special remote opposite of datalad-archive which would compress them from uncompressed version "on the fly". Any takers?

@effigies
Copy link
Collaborator

Would this be more like our usual tabular data, with column names as headers in the TSV file, or like physio.tsv.gz, with column names as in JSON sidecars?

@yarikoptic
Copy link
Collaborator Author

there is no header in those files, so I guess ideally we should provide those in sidecar .json indeed (which we could just commit to git, no secrets there I hope ;))

@sappelhoff
Copy link
Member

I am not aware why physio TSVs are required to be gzipped.

Rather from the opposite angle, I remember a discussion where we stopped supporting fif.gz files (now we only allow .fif files) because:

  • the amount of saved space is negligible (so was the argument)
  • the increased burden on the user outweighs the benefits

so from a naive viewpoint I don't see what's wrong with allowing TSV for physio files. But please enlighten me about the drawbacks.

@effigies
Copy link
Collaborator

This is one of the weirder corners of BIDS, and I think it was related to the conventions of some specific (set of) tool(s) that made conforming to the usual TSV conventions problematic, but I wasn't there for these conversations. I'm not sure whether the .gz was because that's what the tool expected or because ASCII-formatted time series can get large and are amenable to compression.

Anyway, as long as there isn't a change in expectation of what's in the file whether it's gzipped or not, I don't see any reason not to make the gzipping optional.

@chrisgorgo
Copy link
Contributor

tsv.gz was an attempt to provide compressed format for dense recordings without adding heavy dependency like HDF5. @mih was involved in some of these discussions and can provide another angle.

One thing y'all should consider is that if there are apps assuming that tsv.gz is the only format this data works with those apps will break if you add another format.

@mih
Copy link

mih commented May 15, 2020

IIRC the desire for compression came from eyetracking (think: 3 or 4 timeseries, with dense sampling with 1k or 2k Hz covering more than hour per subject). The difference can be substantial. Overhead for decompression is minimal. Tools like zrun (from the moreutils package) enable interoperability even with software that cannot handle compressed files.

@sappelhoff
Copy link
Member

@robertoostenveld has the following opinion about this:

I think it makes sense to make that consistent, i.e. make it optional for all tsv files to be gzipped (or not).

see https://github.com/bids-standard/bids-validator/issues/990#issuecomment-649619436

@sappelhoff sappelhoff added discussion ongoing discussion opinions wanted Please read and offer your opinion on this matter labels Jul 22, 2020
@oesteban
Copy link
Collaborator

oesteban commented Apr 2, 2024

I was unaware of this. Under BEP020 (#1128) we are proposing a new suffix physioevents.tsv.gz to encode the logs mentioned on top and other types of information (e.g., blinks and saccades models, calibration, and other derived parameters provided by the eyetracker).

I'm not strong on it being a .tsv.gz file always, but for now we are writing it that way.

@yarikoptic please have a look if you can.

@arnodelorme
Copy link

Agree with @robertoostenveld

Forcing ".gz" compression of the .tsv file makes little sense to me. Some physio files (eye-tracking files, for example) have low channels and low sampling frequency, so raw .tsv files should be allowed because they are usually not that big anyway. Also, ".gz" compression is problematic. This is not a standard compression scheme on Windows. BIDS is supposed to be easy to use. The ".gz" compression may force some users (including naive users who only want to open the file in Excel) to install specific software. Preparing a physio file on Windows will be nearly impossible for naive users who do not master the DOS command line or install Cygwin or another similar tool. Forcing people to use a Unix compression scheme on Windows decreases adoption. Let's allow standard .tsv in addition to compressed .tsv files for increased ease of use. @CPernet

@yarikoptic
Copy link
Collaborator Author

I also agree with @robertoostenveld as it should be optional and allow for either with or without .gz.

I do appreciate some added complexity to read data from .gz files, but IMHO it should just be a stimulant for tools to "manage to do that" finally. Easy to do in linux shell, in Python, visidata opens them just fine, ... There should be conveniences in other languages/environments for such trivial operation IMHO.

One other "issue" here though that at least _physio.tsv.gz aren't just .tsv which are gzipped -- they lack header entirely, which I think is in general assumed to be present in .tsv files, e.g.

$> zcat ds002737/sub-01/ses-01/func/sub-01_ses-01_task-Stroop_acq-cf0AP_run-03_physio.tsv.gz | head -n 2
1838	1279	0
1838	1279	0

that is what @effigies made me aware about but I so far (on a quick look) failed to find description of such peculiarity in the bids-specification... yet to see more on that and worth fixing for bids 2.0

@oesteban
Copy link
Collaborator

Some physio files (eye-tracking files, for example) have low channels and low sampling frequency, so raw .tsv files should be allowed

BIDS is an exchange format, so I would lean towards compression (and gz doesn't seem such a bad choice in that regard) over plain text as soon as the number of rows (and potentially columns or channels, to keep the nomenclature) exceeds what a human being can comfortably read on a text terminal. Even at 500 Hz you get 60k rows in two minutes of experiment. In that scenario, I prefer that (naive) user to find it hard to open the file over knocking their computer out by trying to open a large plain-text tsv file with Excel.

I'd be fine if BIDS said something like: use tsv.gz for any tabular data surpassing, say, 1,000 lines. However, unless the threshold is chosen differently and much higher, we can find ourselves compressing events files that are thought to be somewhat readable as plain text.

Preparing a physio file on Windows will be nearly impossible for naive users who do not master the DOS command line or install Cygwin or another similar tool.

What do you mean by preparing? Are you referring to encoding a BIDS dataset or processing it? Either way, the kind of user you are referring to will probably run some existing tool available to them for the task, and compression will be the least of their concerns. Also, most of the BIDS datasets will have NIfTIs compressed for the reasons above—the idea is to share the data, and it takes less time to transfer a file when it is compressed.

I am skeptical that the naive user confused by a tsv.gz file will be able to do anything with it, even if a friend decompresses it for them. Therefore, compression is a concern just for the software application that digests or produces BIDS. Implementing (de)compression does not seem that hard for developers. For BIDS-consuming applications, the developer will likely open these files and change them to a more suitable format anyway.

Forcing people to use a Unix compression scheme on Windows decreases adoption.

As far as I understand, GZip is just an open file format widely used in Internet applications that Windows users also operate. If you mean decreasing the adoption of Windows, I'm not very concerned ;)


Finally, I believe @yarikoptic's angle in his initial proposal is different. One 'weirdness' of BIDS is that, at the moment, the recordings in a tsv.gz file MUST be continuous (i.e., equispaced and univocal samples), which is incompatible with data types of different nature (what in BEP020 we are calling "physioevents"). If I interpret the goal of his proposal, it tries to cover these "events" that are not task events.

The header issue does not seem that bad to me - we can keep the same convention (tsv have header, tsv.gz do not) and everything keeps working (it's quirky but okay).

@oesteban
Copy link
Collaborator

BTW - the upcoming release of BIDS is more clear regarding tsv.gz (https://bids-specification.readthedocs.io/en/latest/common-principles.html#compressed-tabular-files)

@CPernet
Copy link
Collaborator

CPernet commented Apr 17, 2024

my two cents are (1) I do not have an opinion on if it's good or not to use .gz (2) since we do not enforce .nii.gz we should not enforce .tsv.gz to be consistent across 'modalities'

@CPernet
Copy link
Collaborator

CPernet commented Apr 17, 2024

also +1 @oesteban clearly defining compressed vs non-compressed definitions solves use cases

@oesteban
Copy link
Collaborator

(2) since we do not enforce .nii.gz we should not enforce .tsv.gz to be consistent across 'modalities'

Just a couple of notes:

  • a NIfTI file is never human readable, so for the user there's little difference between nii and nii.gz.

    However, I do appreciate the function of tsv.gz to actually signal this point to the user. It's also very convenient with managing the data with DataLad, I typically push tsv.gz to the annex (they are big files, nonreadable by eye) and keep tsv in git.

  • In addition to that, while nii and nii.gz are completely consistent (nii.gz is just a compressed version of the nii file) this is not true for tsv (and that can also leak into the json sidecar) because of the header-less rule on tsv.gz.

@arnodelorme
Copy link

@oesteban what is the rational for the header rule?

@oesteban
Copy link
Collaborator

@oesteban what is the rational for the header rule?

I have no idea. The problem, to me, is more about having two encodings rather than what the encoding specifically requires (header/headerless)

@arnodelorme
Copy link

arnodelorme commented Apr 17, 2024

@robertoostenveld @CPernet @effigies @yarikoptic @Remi-Gau @sappelhoff do you know the rationale for the .tsv headerless rule (when .tsv are compressed as .tsv.gz, the header is removed and stored in another file).

@Remi-Gau
Copy link
Collaborator

That is the only rationale I know about:

https://bids-specification.readthedocs.io/en/latest/common-principles.html#compressed-tabular-files

image

I was not around when this decision was made.

@arnodelorme
Copy link

Thanks, @Remi-Gau. Seems to me like the decision was based on software formats more than user-friendliness. Also, why impose it across all data files in BIDS if it is just two software?

@Remi-Gau
Copy link
Collaborator

Have been wondering about this for a while too.

@yarikoptic
Copy link
Collaborator Author

I think overall verdict would be that this issue is a "no go" for BIDS 1.0 primarily due to internal difference (header/no-header) within .tsv/.tsv.gz and original focus on having support by the tools. I would keep discussion for BIDS 2.0 in

please 👍 or 👎 at least to establish the direction ;-)

PS feel welcome to reopen if you think there is more to do here

@oesteban
Copy link
Collaborator

oesteban commented Apr 18, 2024

@arnodelorme

Seems to me like the decision was based on software formats more than user-friendliness.

Since TSV files have header (except for motion), the JSON object does not define a "Columns" entry because the user can peek at the TSV header.

However, because the user cannot peek at the TSV.GZ header without decompressing, BIDS moves the definition to the sidecar JSON that is human readable. Forbidding the header in TSV.GZ is logical to preempt inconsistency (having some headers in the TSV.GZ file and have inconsistent "Columns" metadata in the JSON, the same way the "Columns" field should be IMHO forbidden for plain-text TSV).

Therefore, this actually looks like a user-centered decision as software tools can easily drop the header if it is there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion ongoing discussion opinions wanted Please read and offer your opinion on this matter physio
Projects
None yet
Development

No branches or pull requests

9 participants