Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Array data in .tsv cells #1446

Closed
1 task done
sappelhoff opened this issue Mar 16, 2023 · 55 comments · Fixed by #1483
Closed
1 task done

ENH: Array data in .tsv cells #1446

sappelhoff opened this issue Mar 16, 2023 · 55 comments · Fixed by #1483
Labels
enhancement New feature or request opinions wanted Please read and offer your opinion on this matter
Milestone

Comments

@sappelhoff
Copy link
Member

sappelhoff commented Mar 16, 2023

Potentially to be dealt with:

Discussion originated in this thread:

Currently, cells in BIDS .tsv files may take on numeric or string values (see tabluar data).

Here I list some examples:

  • a string n/a, reflecting a missing value or that a value is not applicable for the present cell
  • a string mystr, that is further specified in the column_name.Levels object in a .json file that accompanies the .tsv file, where column_name refers to the name of the column in the .tsv file under which the string mystr occurs
  • floating point numbers, reflecting time since acquisition for the onset column in events.tsv
    • often associated with a unit, defined in column_name.Units, analogously to the mystr example above

However, for some cases, it would be desirable to define an array of values in a particular .tsv cell. For example when a given event (in a row) in an events.tsv file is associated with multiple channels from a neural recording (e.g., EEG, iEEG, NIRS, ...):

This event at onset X and with duration Y is associated with channels A, B, and C

Currently, there is no formal way to specify such an array in BIDS. We think that it could be helpful for several data modalities (now and in the future) to have such a concept to work with.

In bids-standard/bids-examples#324 (comment), @effigies suggests that we use JSON arrays for this case (see his comment for advantages over more basic structures, like a comma-separated string).

An array is an ordered collection of values. An array begins with [ (left bracket) and ends with ] (right bracket). Values are separated by , (comma).

grafik

I think if we were to adopt JSON arrays as a valid input for .tsv cells, we would would benefit by (at least initially) restricting the kinds of values that may be listed in a given JSON array. Values are defined as such:

A value can be a string in double quotes, or a number, or true or false or null, or an object or an array. These structures can be nested.

For our present purposes, I think it would suffice to confine ourselves to:

  • string
  • number
  • boolean (true/false)

What are your opinions? Do you see other use cases apart from annotations in ephys data (in events.tsv files)? Please comment and advise.

@sappelhoff sappelhoff added enhancement New feature or request opinions wanted Please read and offer your opinion on this matter labels Mar 16, 2023
@VisLab
Copy link
Member

VisLab commented Mar 16, 2023

I always like to see what I'm in for downstream, so I did a couple of simple tests of the alternatives on Python (pandas DataFrame) and Matlab (Table), which I believe will be the two most common downstream data structures to hold tabular data for users.

After view the results (below and attached files), I would prefer option 3 of just having channels be [x , y, z]. It seems that the tabular input for standard tools sees the [ ] and immediately thinks this is a string.

Here is my code and the results. The three test files are attached. All of the Matlab results have outer { } because they all come back as a string in a cell array of size 1 x 1.

table1 = readtable('test1_events.tsv', 'Delimiter', 'tab', 'FileType', 'text');
table2 = readtable('test2_events.tsv', 'Delimiter', 'tab', 'FileType', 'text');
table3 = readtable('test3_events.tsv', 'Delimiter', 'tab', 'FileType', 'text');

Python:

table1 = pd.read_csv('test1_events.tsv', delimiter='\t')
table2 = pd.read_csv('test2_events.tsv', delimiter='\t')
table3 = pd.read_csv('test3_events.tsv', delimiter='\t')
Table Row Input Python Matlab
1 0 ["P1", "P2", "P3"] '["P1", "P2", "P3"]' {'["P1", "P2", "P3"]'}
1 1 n/a nan {'n/a'}
1 2 ["CP1"] '["CP1"]' {'["CP1"]'}
1 3 [1, 2, 3] '[1, 2, 3]' {'[1, 2, 3]'}
1 4 [1, "CP1", 2] '[1, "CP1", 2]' {'[1, "CP1", 2]'}
2 0 '[P1, P2, P3]' '\'[P1, P2, P3]\'' {''[P1, P2, P3]''}
2 1 'n/a' nan {'n/a'}
2 2 '[CP1]' '\'[CP1]\'' {''[CP1]''}
2 3 '[1, 2, 3]' '\'[1, 2, 3]\'' {''[1, 2, 3]''}
2 4 '[1, CP1, 2]' '\'[1, CP1, 2]\'' {''[1, CP1, 2]''}
3 0 [P1, P2, P3] '[P1, P2, P3]' {'[P1, P2, P3]'}
3 1 n/a nan {'n/a'}
3 2 [CP1] '[CP1]' {'[CP1]'}
3 3 [1, 2, 3] '[1, 2, 3]' {'[1, 2, 3]'}
3 4 [1, CP1, 2] '[1, CP1, 2]' {'[1, CP1, 2]'}
test_event_files.zip

@christinerogers @arnodelorme

@christinerogers
Copy link
Contributor

christinerogers commented Mar 16, 2023

Thanks @sappelhoff and @VisLab . My EEGNet colleague Tyler @Andesha Collins will weigh below, and I'll re-iterate part of our concern with embedding json in a tsv --:

It adds challenges for researchers trying to BIDS-ify their data, since it's harder to read and format in a .tsv cell, more prone to formatting errors, and can't be validated as easily (afaik) - a recipe for frustration and attrition in data sharing.

I understand the reasons why embedded json can be preferred, and at the same time it would be nice to also aim for reducing barriers in onboarding many new research groups to BIDS (as my group does with large open-source projects like LORIS and data-sharing platforms like EEGNet).

p.s. @sappelhoff per my original question in the BIDS Maintainers meeting March 9 - We can't be the only corner of the BIDS world looking at this, can we? who else should be consulted in from other specs?

@Remi-Gau
Copy link
Collaborator

Note to self: check how octave + bids-matlab will deal with this.

@effigies
Copy link
Collaborator

Some notes on Kay's comment:

I would suggest that we don't allow heterogeneous arrays, and instead say the type is "array of strings". Encoded in JSON it would be:

onset       duration    trial_type  channels
3.452       n/a         blink       ["P1", "P2", "P3"]
6.112       0           show        n/a
6.82        0.2         response    ["CP1"]
7.345       n/a         start       ["1", "2", "3"]
8.543       5.0         mixed       ["1", "CP1", "2"]
9.112       0           show        []

(I've added another row to show an empty list.)

There are two approaches with Pandas (note that the reprs don't make clear the string versus object; if "P1" shows up, the whole thing is a string; if P1 shows up it is in a list of strings):

Convert on load
>>> import pandas as pd
>>> import json
>>> import math
>>> def loader(val):
...     if val == 'n/a':
...         return None
...     return json.loads(val)
... 
>>> table = pd.read_csv('test4_events.tsv', delimiter='\t', converters={'channels': loader})
>>> table
   onset  duration trial_type      channels
0  3.452       NaN      blink  [P1, P2, P3]
1  6.112       0.0       show          None
2  6.820       0.2   response         [CP1]
3  7.345       NaN      start     [1, 2, 3]
4  8.543       5.0      mixed   [1, CP1, 2]
5  9.112       0.0      mixed            []

1, 2 and 3 are strings here.

Load then convert
>>> import pandas as pd
>>> import json
>>> def from_default(val):
...     if isinstance(val, str):
...         return json.loads(val)
...     return val
... 
>>> table = pd.read_csv('test4_events.tsv', delimiter='\t')
>>> table
   onset  duration trial_type            channels
0  3.452       NaN      blink  ["P1", "P2", "P3"]
1  6.112       0.0       show                 NaN
2  6.820       0.2   response             ["CP1"]
3  7.345       NaN      start     ["1", "2", "3"]
4  8.543       5.0      mixed   ["1", "CP1", "2"]
5  9.112       0.0      mixed                  []
>>> table.channels = table.channels.apply(from_default)
>>> table
   onset  duration trial_type      channels
0  3.452       NaN      blink  [P1, P2, P3]
1  6.112       0.0       show           NaN
2  6.820       0.2   response         [CP1]
3  7.345       NaN      start     [1, 2, 3]
4  8.543       5.0      mixed   [1, CP1, 2]
5  9.112       0.0      mixed            []

1, 2 and 3 are also strings here.


The other thing we considered was a simple comma-separated string, so it would be

onset       duration    trial_type  channels
3.452       n/a         blink       P1,P2,P3
6.112       0           show        n/a
6.82        0.2         response    CP1
7.345       n/a         start       1,2,3
8.543       5.0         mixed       1,CP1,2
9.112       0           mixed       n/a

This would be easy enough to deal with simply by using ','.split(val) in our converter. Note, that there is no distinction between the integer 1 and the string "1". If channels for some reason had numeric names, pandas would by default convert 1 to an integer, but 1,2 to a string.

When suggesting JSON-encoded strings, I assumed that these would not be hand-edited but generated from spreadsheets or annotation tools. If this will be tooling mediated, the JSON will be better structured and the validator could distinguish numbers from strings and give errors. If people are going to be writing these by hand, then I think the comma-separated strings will be the easiest.

@robertoostenveld
Copy link
Collaborator

Just some wild thoughts:

Besides it being JSON, what would be the advantage of using the [ and ] brackets around the list? And why use " quotes around each item by default? The 2nd example in the previous comment is simpler and also looks fine to me (assuming that the horizontal spaces correspond to tabs, which they technically don't in this case). I realize that I am touching here upon human- versus machine-readability.

Possible answers to myself:

The [] brackets facilitate parsing nested lists like [1, 2, 3, [4.1, 4.2]], which would be impossible without them.

The " quotes allow specifying characters that otherwise would be parsed incorrectly, like ["{", "(", "["], or strings containing a tab. This would not be specific to the lists, it would also apply if I now wanted to represent an item with a tab in it, where the tab needs to be escaped somehow to prevent it from being interpreted as column separator.

Why would we use a different separator inside the list (comma) than outside the list (tab); that could also both be a tab. Yup, I realize this makes parsing harder, both for human and machine. But would that mean that if we have a nested list, we would use yet another separator (e.g., ;) at the 2nd level?

I like the idea what the downstream consequences are when people use software to process the data. Some softwares that I suggest to consider (besides native Python, Python+Pandas, MATLAB ) are R, SPSS, Excel, ...

@Andesha
Copy link
Collaborator

Andesha commented Mar 20, 2023

I hadn't personally considered nested lists as part of my suggestion based on the simple List format...

@robertoostenveld Is that something that we would need to support? 🤔

@jadesjardins
Copy link

The comma separated list inside of brackets seems to be the most simple (and safest) representation that fits the range of use cases that I am aware of.. from what I can tell, things like adding the quotations around items inside the brackets would only allow the users to do things like have commas in channels names for example (which I think I am okay with restricting).

@effigies
Copy link
Collaborator

Talking with @sappelhoff, one of the problems with comma-separated channels is that in practice channels might have commas in their names. I can see two approaches that would allow custom delimiters that are only incrementally more complex than comma-separated strings.

Syntax Example With channels containing commas
Comma-separation P1,P2,P3 Not allowed
Wrapped delimiters ,P1,P2,P3, |P1,2|P2|P3|
Prefixed delimiter ,:P1,P2,P3 |:P1,2|P2|P3

These could be implemented:

def wrapped_delimiters(val):
    if isinstance(val, str):
        delimiter = val[0]
        if not delimiter.isalnum() and val[-1] == delim:
            return val[1:-1].split(delimiter)
    return val

def prefixed_delimiters(val):
    if isinstance(val, str) and len(val) > 2:
        if val[1] == ":":
            delimiter = val[0]
            return val[2:].split(delimiter)
    return val

Unless there's some globally safe delimiter (or we're willing to force people to mass-rename channels in their data), A dynamic delimiter or escape characters are likely the only options. One additional step of complexity is an overridable default:

Syntax Example With channels containing commas
Overridable delimiter P1,P2,P3 |P1,2|P2|P3| or |:P1,2|P2|P3

@robertoostenveld
Copy link
Collaborator

I hadn't personally considered nested lists as part of my suggestion based on the simple List format...

@robertoostenveld Is that something that we would need to support? 🤔

No, I do not see the use case for it at the moment. But then, I also did not see the use case in the past for lists within one cell of the TSV table.

Nevertheless, "Simple is better than complex", so I think that we should not plan for the future, but do what makes sense now. Personally I like the comma-separated list and if I need it I would simply avoid comma's in channel names (I do that anyway, I only use [A-Za-z0-9], - and _). And the overridable default (e.g., with |) makes that even better.

@sappelhoff
Copy link
Member Author

sappelhoff commented Mar 24, 2023

Thanks for the suggestion, Chris. I like the idea of delimited lists where the delimiter is wrapped and may be freely selected from a list of defined delimiters (like ,|;.+- or so). I am not sure I like the idea of just having comma-separated lists as the default and only optionally allowing for "overridable delimiters via wrapping (or prefixing)", because this seems like an unneeded branching to me: It's not a hassle to do ,P1,P2, instead of P1,P2.

@VisLab
Copy link
Member

VisLab commented Mar 24, 2023

I'm glad this discussion is progressing towards a resolution. I don't see a use case for nested lists at this time and don't think we should address this until there is a need for it.

Either the wrapped delimiter solution or the overridable delimiter solution is fine.

@dorahermes
Copy link
Member

This is great, thank you, either the wrapped delimiter solution or the overridable delimiter solution should work for our use cases.

@sappelhoff sappelhoff added this to the 1.9.0 milestone Mar 24, 2023
@arnodelorme
Copy link

Personally, I prefer the simple format of separated values with commas. P1,P2,P3
Otherwise, it will be confusing for non-programmers.

BIDS is supposed to be readable by users. Numerical array can be 1,2,3 and if you want string "1","2","3"

@effigies
Copy link
Collaborator

Okay, if you need to distinguish numbers and strings, then that makes sense, but you're now two characters away from a JSON string, which I thought we were trying to avoid. Adding in quoted strings makes this a parsing problem and syntax errors will be possible.

@Andesha
Copy link
Collaborator

Andesha commented Mar 27, 2023

I think instead of distinguishing between an int and a string type within a List, we can leave that up to how we describe the items inside of the List in the sidecar files. This comes later in my mind, but this way people can point the correct (or safe) parser at a List of an expected type.

@ericearl
Copy link
Collaborator

I take comfort in @robertoostenveld 's comment (and the majority here of array data CSV cells in TSV files supporters):

Nevertheless, "Simple is better than complex", so I think that we should not plan for the future, but do what makes sense now. Personally I like the comma-separated list and if I need it I would simply avoid comma's in channel names (I do that anyway, I only use [A-Za-z0-9], - and _). And the overridable default (e.g., with |) makes that even better.

I also like a comma-separated list's simplicity for "array data" inside of a TSV cell (which is trying to work around the spirit of a TSV already) and the acknowledgment that anything you might put inside a CSV array inside an otherwise TSV file should be absent of commas. Names can use a delimiter that is not a comma, such as /, |, ;, :, ., +, or -.

@effigies Do you have a sense of how many datasets or folks would be affected by the need "to mass-rename channels in their data"? Are channels curated into these TSV files in such a way that they cannot be renamed easily upon curation?

@VisLab
Copy link
Member

VisLab commented Mar 27, 2023

In EEG I have only seen [A-Za-z0-9], - and _. I don't know about eye tracking and motion capture.

@sappelhoff
Copy link
Member Author

sappelhoff commented Mar 27, 2023

@effigies Do you have a sense of how many datasets or folks would be affected by the need "to mass-rename channels in their data"? Are channels curated into these TSV files in such a way that they cannot be renamed easily upon curation?

It's not common to have non-alphanumeric characters in channel names (except maybe a space character ... not sure about the statistics here and what's "common"). However it does happen -- I have seen it raised in issues on ephys software, and for example, BrainProducts explicitly mention in their spec, how commas are supposed to be escaped:

; Commas in channel names are coded as "\1".

(taken from the BrainVision data format spec)

And given that the suggestion with wrapped delimiters is in my opinion a robust and easy-to-understand solution, I see no reason why we should force people to rename their channels.

I think instead of distinguishing between an int and a string type within a List, we can leave that up to how we describe the items inside of the List in the sidecar files.

I currently don't see a need for "mixed" arrays (str and int/float), so I agree with the suggestion to define it in the JSON, which data type is supposed to be expected.

However having that said, if the need ever does arise, then a JSON array probably would be the best solution for us.

@sappelhoff
Copy link
Member Author

sappelhoff commented Mar 30, 2023

Another proposal could be to define the delimiter of a .tsv cell in the accompanying .json file.

See this TSV file:

onset duration my_channels1 my_channels2
5 1 P1,P|2,R 23,XX12° Op,1|Op,2

and this accompanying JSON file:

{
   "my_channels1":{
      "LongName":"My Channels 1",
      "Description":"The first batch of channels associated with a given event.",
      "Delimiter":","
   },
   "my_channels2":{
      "LongName":"My Channels 2",
      "Description":"The second batch of channels associated with a given event.",
      "Delimiter":"|"
   }
}

We would basically extend the list of "defined JSON keywords to describe TSV columns" (current doc on this: https://bids-specification.readthedocs.io/en/latest/common-principles.html#tabular-files) by an additional keyword: Delimiter.

If Delimiter is defined, the respective column in the TSV file is assumed to be a cell array and must be split according to the delimiter that is defined as the string value to the Delimiter key.

In most cases, we will have {"Delimiter": ","} ... but we have a way to define other delimiters if needed.


EDIT: Or we think in this general direction and make this a key like so:

{
    "CellArray": {
        "Delimiter": ","
        "DataType": "int"
    }
}

and so on ... I propose this just as a new direction for our thoughts, and because I am worried that a simple "comma separated list" will not be a good idea to capture the heterogeneity that we have with data.

@effigies
Copy link
Collaborator

@sappelhoff I get where you're coming from with explicit delimiter metadata, but I think putting parsing instructions into JSON sidecars would be extremely error-prone. I can't think of an existing case where the data file cannot be fully parsed without reference to the sidecar, and TSV files in general have not had mandatory sidecars. I suspect you'll end up with a lot of tools just assuming comma-separated values.

Not to cut the conversation short, but does it make sense to put forward some questions for a vote? I have a sense that there's at least a camp that's pretty settled on comma-separated-unquoted-strings, and if that camp is larger than those trying to find an acceptable dynamic delimiter, we may just be wasting time coming up with possibilities.

@sappelhoff
Copy link
Member Author

Thanks for your perspective @effigies I agree. I proposed this because I am just really unhappy with the plain comma-separated list, as I the wrapped delimiter solution is so much more versatile at a minuscule increase in "complexity".

I think a vote is a good idea.

@ericearl
Copy link
Collaborator

ericearl commented Apr 3, 2023

Just to clarify what is being voted upon, it sounds like it should either be "Comma-separation" or "Wrapped delimiters" that are used to create an array of values inside any TSV cell. A brief summary based on discussion in this issue is below. I think it is important to note that any way you want to use delimiters for an array in a TSV cell, it will always be interpreted as a string which you would have to parse in your software.

Comma-separation

Use commas as the delimiter to separate values in an array.

  • A common example: P1,P2,P3 Simple, but effective (most of the time).
  • The obvious edge case: P1.2,P2,P3 Causes a need to change your channel names if any contain commas.

Wrapped delimiters

If the first and last character are non-alphanumeric and identical, then that character is identified as the delimiter and used to split the string. That would permit the user to select whatever they find most readable, given their channel names.

  • A common example: |P1,2|P2|P3| Pipes like these could be used as the delimiter which would allow characters like commas to still be in there.
  • Another possible example: ,P1,P2,P3, Commas could still be used, but just so any downstream software using this "array in TSV cell" format knows, you prefix and suffix your comma-separated lists with the comma delimiter.

@sappelhoff @effigies Did I miss anything else important?

@dorahermes
Copy link
Member

Would it perhaps be possible to clarify 2 things before voting:

  1. If this is a change not just about lists of channels, but a general change to BIDS .tsv files having any kind of list/array data, it seems like the corresponding sidecar 'should/must/?' have some information on the fact that there is a list/array rather than a string in this .tsv column. The 'delimiter' field in the json sidecar would indicate this. Would there be a possibility for a 'Wrapped delimiters' defaulting to comma separated? If the curator wants the text string to be parsable as a list, they 'should/must/?' add the delimiter field in the sidecar.

  2. In case we want to go simple now: a comma separated list, but would like to keep options for the future, to me ,P1,P2,P3, is the extremely similar to P1,P2,P3. Which of these would be the most backwards compatible solution?

@effigies
Copy link
Collaborator

effigies commented Apr 3, 2023

If this is a change not just about lists of channels [...]

At the moment this is just about lists of channels. I think this would end up as instructions that will show up in a column description table, not something to go in "Common Principles". The most this would do is create a precedent, making it more likely that the next column to have an array type (if there ever is one) will have whatever format we declare here, but it wouldn't be so strong that we couldn't make a different choice.

In case we want to go simple now: a comma separated list, but would like to keep options for the future, to me ,P1,P2,P3, is the extremely similar to P1,P2,P3. Which of these would be the most backwards compatible solution?

I think we're talking 2-3 lines of (Python) code difference if we picked one vs the other and then later wanted to add dynamic delimiting. P1,P2,P3 is less constraining on what a dynamic delimiter string would look like, while ,P1,P2,P3, means you could support dynamic delimiters today because you're pretty sure what they would look like if added.

For what it's worth, here are the questions I wrote up (but didn't post) last time:

  1. Dynamic delimiters: Acceptable (grudging acceptance counts) or unacceptable?
  2. If accepted:
    1. Should the delimiter be declared in the value (e.g., ,A,B,) or in a JSON sidecar?
    2. Should there be a default delimiter if a "declaration" is absent?
  3. If rejected:
    1. Should we declare a transformation for channel names containing commas?
    2. If so, do we need to come up with alternatives, or is \1 fine?

Would getting votes on these be sufficiently clarifying?

@christinerogers
Copy link
Contributor

@sappelhoff your original suggestion would be to circulate this to the bids community (off github?) -- Seems like we're ready? cc @effigies you have 7 upvotes so far!

Thanks also @dorahermes above for circling back to visibility and generalization across all BIDS (hence this spin-off thread) --
Given other parts of the BIDSverse will encounter this issue (as I mentioned at BIDS-maintainers last month) there's a big difference between a precedent buried in a spec vs. pointed to visibly from a central place (just as a precedent/option) vs. making it a could/should (e.g. in common principles). Can we take the middle path?

@effigies @sappelhoff let us know if you need help/ideas for increasing visibility or linkages across modalities etc.

@dorahermes
Copy link
Member

I vote 1) with @VisLab since it covers most of our use cases as well.

Only question I have left is whether the channel column always contains a list or whether this should be specified in the .json sidecar?

@ericearl
Copy link
Collaborator

ericearl commented Apr 4, 2023

I think I agree basically the same as @VisLab and @dorahermes.

  1. Dynamic delimiters are acceptable.
  2. The "array in a cell" delimiter should be declared in the JSON sidecar.
  3. The default delimiter should be a comma when the delimiter declaration in a JSON sidecar is absent.

@Andesha
Copy link
Collaborator

Andesha commented Apr 5, 2023

Agree based on comments from @VisLab and @dorahermes

@christinerogers
Copy link
Contributor

christinerogers commented Apr 5, 2023

👍 with @VisLab and @dorahermes and @Andesha 's votes and @ericearl's helpful clarification

@arnodelorme - for EEGNet this seems to be the most workable, a :+1 from you as well?
Does this break any assumptions in your view?

cc @Andesha @jeffersoncasimir @smakeig

@CPernet
Copy link
Collaborator

CPernet commented Apr 7, 2023

Using something like ,P1,P2,P3, seems the easiest for humans and machines alike, so +1 from me.
Option to specify the delimiter in JSON why not -- now of course it means no nesting, but if no one sees/has a case of nesting, let's not worry about it.

@arokem
Copy link
Collaborator

arokem commented Apr 13, 2023

Hello! The whole idea of adding arrays to a cell of a tsv file seems a bit unsavory to me, because it would takes what is otherwise a tidy table (in this sense) and make it a bit more messy, as is apparent in this discussion.

Another solution that comes to my mind is to distribute the array data across multiple rows with all other column values keeping exactly the same value. I believe that this comes with no loss of information, and retains the tidy format of the table. And IIUC, requires less obtrusive changes to the spec (if at all?).

For example, the table in this comment would become:

onset       duration    trial_type  channels
3.452       n/a         blink       P1
3.452       n/a         blink       P2
3.452       n/a         blink       P3
6.112       0           show        n/a
6.82        0.2         response    CP1
7.345       n/a         start       1
7.345       n/a         start       2
7.345       n/a         start       3
8.543       5.0         mixed       1
8.543       5.0         mixed       CP1
8.543       5.0         mixed       2
9.112       0           mixed       n/a

For disambiguation, we could add a column that explicitly highlights that these rows together form an array (a Boolean) and even the index within the array (i.e., integer values: 0, 1, 2). Does this preclude any use-cases that this proposal is designed to meet?

@Andesha
Copy link
Collaborator

Andesha commented Apr 13, 2023

In my opinion the strategy of breaking up the events across multiple lines has the problem of how to group them back together. For example, the table requires post processing to recognise the first three events are the "same" just on different channels. Further, it does not account for the situation of: "this event is an artifact that is present across these three channels" versus "there are three different artifacts separate across three channels at the same time". This makes the previously mentioned grouping challenging.

Lastly, I agree that this breaks the "tidy" assumptions. I feel it is necessary though. Pandas does have functions for manipulating data like this through explode.

@smakeig
Copy link

smakeig commented Apr 13, 2023 via email

@arokem
Copy link
Collaborator

arokem commented Apr 13, 2023

Thanks for the comments @Andesha. Regarding this one:

Pandas does have functions for manipulating data like this through explode.

I think that this could work both ways. On the one hand, it means that a vector cell could be interpreted provided software that knows how to interpret it (but the complexity of that is apparent in the above discussion). On the other hand, it also suggests that (almost?) anything that can be represented using vectors could be represented in a tidy table, provided informative columns. For example, event identifier columns, as suggested by @smakeig in his comment.

@smakeig
Copy link

smakeig commented Apr 14, 2023 via email

@robertoostenveld
Copy link
Collaborator

Another solution that comes to my mind is to distribute the array data across multiple rows with all other column values keeping exactly the same value

For me that would work.

@christinerogers
Copy link
Contributor

per @Andesha's response to @arokem - This is going to get a little messy no matter which way we slice it.

For data platforms (like LORIS and EEGNet.org), distributing an event over multiple rows is problematic (normalization etc) and also leads to more user typos, and zeroing out repeated values is tricky too.
@robertoostenveld for this reason it seems less workable from our perspective.

-- @effigies @sappelhoff
Is there any precedent in BIDS for either following tidyverse or nesting a list within a column ?

or is it time for a wider recap/survey?

p.s. @smakeig agreed this might grow further, and to support HED tagging at a basic level, I think we'd shoot for the simplest approach needed currently here.

@effigies
Copy link
Collaborator

Is there any precedent in BIDS for either following tidyverse or nesting a list within a column ?

I'm not aware of any precedent.

@sappelhoff
Copy link
Member Author

Is there any precedent in BIDS for either following tidyverse or nesting a list within a column ?

I'm not aware of any precedent in BIDS either.

The tidy suggestion (maybe together with an index column that "binds rows together" (is this what you mean by "event identifier columns" above?)) sounds like a reasonable alternative to the wrapped delimiter suggestion to me.

(A combination of both would even easily allow us a one-level nesting 😉, but I'm not advocating for that)

I'm not entirely sure (yet) how this constitutes a problem for loris, eegnet, or HED. And would love to see an example of the potential problems.

@robertoostenveld
Copy link
Collaborator

Is there any precedent in BIDS for either following tidyverse or nesting a list within a column ?

As HED was brought up, is that itself not a precedent? See https://github.com/hed-standard/hed-examples/blob/main/datasets/eeg_ds003645s_hed_column/sub-002/eeg/sub-002_task-FacePerception_run-1_events.tsv where the HED column contains a comma-separated list.

@VisLab
Copy link
Member

VisLab commented Apr 15, 2023 via email

@christinerogers
Copy link
Contributor

Agreed @robertoostenveld @VisLab - consistency with what has already been accepted by BIDS makes sense vs adding tidyverse compliance into the mix.

@arokem thoughts?
Cc @Andesha for eegnet

@dorahermes
Copy link
Member

While it would be nice to keep the .tsv files tidy, I am not sure whether the one line version is always the easiest to human read. Putting each channel on a separate line in the event file is extremely cumbersome as in the iEEG case seizures can start on a few channels and quickly spread to many channels, which would add many lines for 1 event in the _events.tsv. This is the particular use-case that we have in mind now.

HED columns containing a comma-separated list may be a precedent?

@christinerogers
Copy link
Contributor

Thanks @dorahermes -

@effigies @sappelhoff there was discussion in the BIDS maintainers meeting that we had enough consensus to move forward and note this elsewhere in the BIDS ecosystem. Any thoughts?

@CPernet
Copy link
Collaborator

CPernet commented Apr 25, 2023

note that 'we' (the steering group) have nothing against the proposed solution - just when we discussed it, @arokem came up with repeated line solution, and we just wanted to see if that would work - avoiding introducing new things. Since there is a precedent, and @dorahermes makes a compelling argument that it will be cumbersome -- I vote (again) for the simplest option as in e.g. ,P1,P2,P3,

@arokem
Copy link
Collaborator

arokem commented Apr 25, 2023

Agreed. Based on the responses here, I think that the original line of thought seems to be prevailing. I would much rather have folks who are actually using this kind of data weigh in on what is practical for them.

@christinerogers
Copy link
Contributor

christinerogers commented Apr 25, 2023

Thanks @CPernet and @arokem - good to know this was discussed at steering even

I vote (again) for the simplest option as in e.g. ,P1,P2,P3,

ok i.e. HED's P1, P2, P3 sans leading delimiter

@sappelhoff
Copy link
Member Author

sappelhoff commented Apr 26, 2023

Okay I also agree on dropping the "tidy solution" from the options for BIDS then.

Following @effigies summary from here: #1446 (comment)

I am still in favor of comma separated lists with:

  • "dynamic" delimiters, where the delimiter is declared in the value: For example for ,P1,P2,P3,, the delimiter is , (see wrapping), and for |P1|P2|P3|, the delimiter is |

However I would also be fine with this (although I find it less nice):

  • dynamic delimiters, where the delimiter is declared in the JSON sidecar
  • if no delimiter is declared in JSON, then a comma is the default delimiter: P1,P2,P3 (note: no "wrapping" of delimiters needed)

@VisLab
Copy link
Member

VisLab commented Apr 26, 2023 via email

@smakeig
Copy link

smakeig commented Apr 26, 2023 via email

@Andesha
Copy link
Collaborator

Andesha commented Apr 26, 2023

Happy with the following from @sappelhoff 's breakdown

dynamic delimiters, where the delimiter is declared in the JSON sidecar
if no delimiter is declared in JSON, then a comma is the default delimiter: P1,P2,P3 (note: no "wrapping" of delimiters needed)

@effigies
Copy link
Collaborator

is it then also possible to declare a 2D matrix using p1,p2,p3;
p4,p5,p6 (commas/semi-colons) notation?

Please bear in mind that any rule that goes in BIDS needs to be validatable. I think adding matrices or object references into the mix is asking for pain and, in any case, it is beyond the scope of what people have discussed and formed some kind of agreement around.


At this point, it seems that there is a majority for simple comma-separated strings, and if people want/need to use alternative delimiters, they can add a "Delimiter" property in the column description of the JSON.

It's time to make a concrete PR to verify that this can be specified and validated to everyone's satisfaction. The following is my assessment:

The "Delimiter" is optional, so we just need to make sure that its presence doesn't violate JSON schema; it will be very difficult to permit this only for one column, so I think it would need to be added to the generic column description table, but written in a way that makes its scope clear:

{{ MACROS___make_metadata_table(
{
"LongName": "OPTIONAL",
"Description": (
"RECOMMENDED",
"The description of the column.",
),
"Levels": "RECOMMENDED",
"Units": "RECOMMENDED",
"TermURL": "RECOMMENDED",
"HED": "OPTIONAL",
}
) }}

I suppose we add a channel definition to objects.columns and add it to the rules.tabular_data.task.TaskEvents:

TaskEvents:
selectors:
- '"task" in entities'
- suffix == "events"
columns:
onset: required
duration: required
trial_type: optional
response_time: optional
HED: optional
stim_file: optional
additional_columns: allowed
initial_columns:
- onset
- duration

This could be narrowed to only apply to MEG/EEG/iEEG/NIRS.

sappelhoff pushed a commit to tpatpa/bids-examples that referenced this issue Oct 27, 2023
Array data in .tsv cells discussion is converging towards a comma-separated list (possibly wrapped delimiters list), so the example was edited accordingly - combined rows representing the same event with a list of channels in the channel column.

See full discussion here: bids-standard/bids-specification#1446
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request opinions wanted Please read and offer your opinion on this matter
Projects
None yet