ENH: Array data in `.tsv` cells #1446

sappelhoff · 2023-03-16T12:09:27Z

Potentially to be dealt with:

[FIX] Do not require notch frequencies to be parsed as numbers, accommodating multiples #1605

Discussion originated in this thread:

Example of the HED schema library for SCORE implementation bids-examples#324 (comment)

Currently, cells in BIDS .tsv files may take on numeric or string values (see tabluar data).

Here I list some examples:

a string n/a, reflecting a missing value or that a value is not applicable for the present cell
a string mystr, that is further specified in the column_name.Levels object in a .json file that accompanies the .tsv file, where column_name refers to the name of the column in the .tsv file under which the string mystr occurs
floating point numbers, reflecting time since acquisition for the onset column in events.tsv
- often associated with a unit, defined in column_name.Units, analogously to the mystr example above

However, for some cases, it would be desirable to define an array of values in a particular .tsv cell. For example when a given event (in a row) in an events.tsv file is associated with multiple channels from a neural recording (e.g., EEG, iEEG, NIRS, ...):

This event at onset X and with duration Y is associated with channels A, B, and C

Currently, there is no formal way to specify such an array in BIDS. We think that it could be helpful for several data modalities (now and in the future) to have such a concept to work with.

In bids-standard/bids-examples#324 (comment), @effigies suggests that we use JSON arrays for this case (see his comment for advantages over more basic structures, like a comma-separated string).

An array is an ordered collection of values. An array begins with [ (left bracket) and ends with ] (right bracket). Values are separated by , (comma).

I think if we were to adopt JSON arrays as a valid input for .tsv cells, we would would benefit by (at least initially) restricting the kinds of values that may be listed in a given JSON array. Values are defined as such:

A value can be a string in double quotes, or a number, or true or false or null, or an object or an array. These structures can be nested.

For our present purposes, I think it would suffice to confine ourselves to:

string
number
boolean (true/false)

What are your opinions? Do you see other use cases apart from annotations in ephys data (in events.tsv files)? Please comment and advise.

The text was updated successfully, but these errors were encountered:

VisLab · 2023-03-16T15:54:45Z

I always like to see what I'm in for downstream, so I did a couple of simple tests of the alternatives on Python (pandas DataFrame) and Matlab (Table), which I believe will be the two most common downstream data structures to hold tabular data for users.

After view the results (below and attached files), I would prefer option 3 of just having channels be [x , y, z]. It seems that the tabular input for standard tools sees the [ ] and immediately thinks this is a string.

Here is my code and the results. The three test files are attached. All of the Matlab results have outer { } because they all come back as a string in a cell array of size 1 x 1.

table1 = readtable('test1_events.tsv', 'Delimiter', 'tab', 'FileType', 'text');
table2 = readtable('test2_events.tsv', 'Delimiter', 'tab', 'FileType', 'text');
table3 = readtable('test3_events.tsv', 'Delimiter', 'tab', 'FileType', 'text');

Python:

table1 = pd.read_csv('test1_events.tsv', delimiter='\t')
table2 = pd.read_csv('test2_events.tsv', delimiter='\t')
table3 = pd.read_csv('test3_events.tsv', delimiter='\t')

Table	Row	Input	Python	Matlab
1	0	["P1", "P2", "P3"]	'["P1", "P2", "P3"]'	{'["P1", "P2", "P3"]'}
1	1	n/a	nan	{'n/a'}
1	2	["CP1"]	'["CP1"]'	{'["CP1"]'}
1	3	[1, 2, 3]	'[1, 2, 3]'	{'[1, 2, 3]'}
1	4	[1, "CP1", 2]	'[1, "CP1", 2]'	{'[1, "CP1", 2]'}
2	0	'[P1, P2, P3]'	'\'[P1, P2, P3]\''	{''[P1, P2, P3]''}
2	1	'n/a'	nan	{'n/a'}
2	2	'[CP1]'	'\'[CP1]\''	{''[CP1]''}
2	3	'[1, 2, 3]'	'\'[1, 2, 3]\''	{''[1, 2, 3]''}
2	4	'[1, CP1, 2]'	'\'[1, CP1, 2]\''	{''[1, CP1, 2]''}
3	0	[P1, P2, P3]	'[P1, P2, P3]'	{'[P1, P2, P3]'}
3	1	n/a	nan	{'n/a'}
3	2	[CP1]	'[CP1]'	{'[CP1]'}
3	3	[1, 2, 3]	'[1, 2, 3]'	{'[1, 2, 3]'}
3	4	[1, CP1, 2]	'[1, CP1, 2]'	{'[1, CP1, 2]'}
test_event_files.zip

@christinerogers @arnodelorme

christinerogers · 2023-03-16T16:15:05Z

Thanks @sappelhoff and @VisLab . My EEGNet colleague Tyler @Andesha Collins will weigh below, and I'll re-iterate part of our concern with embedding json in a tsv --:

It adds challenges for researchers trying to BIDS-ify their data, since it's harder to read and format in a .tsv cell, more prone to formatting errors, and can't be validated as easily (afaik) - a recipe for frustration and attrition in data sharing.

I understand the reasons why embedded json can be preferred, and at the same time it would be nice to also aim for reducing barriers in onboarding many new research groups to BIDS (as my group does with large open-source projects like LORIS and data-sharing platforms like EEGNet).

p.s. @sappelhoff per my original question in the BIDS Maintainers meeting March 9 - We can't be the only corner of the BIDS world looking at this, can we? who else should be consulted in from other specs?

Remi-Gau · 2023-03-16T16:20:59Z

Note to self: check how octave + bids-matlab will deal with this.

effigies · 2023-03-16T16:49:15Z

Some notes on Kay's comment:

I would suggest that we don't allow heterogeneous arrays, and instead say the type is "array of strings". Encoded in JSON it would be:

onset       duration    trial_type  channels
3.452       n/a         blink       ["P1", "P2", "P3"]
6.112       0           show        n/a
6.82        0.2         response    ["CP1"]
7.345       n/a         start       ["1", "2", "3"]
8.543       5.0         mixed       ["1", "CP1", "2"]
9.112       0           show        []

(I've added another row to show an empty list.)

There are two approaches with Pandas (note that the reprs don't make clear the string versus object; if "P1" shows up, the whole thing is a string; if P1 shows up it is in a list of strings):

Convert on load

>>> import pandas as pd
>>> import json
>>> import math
>>> def loader(val):
...     if val == 'n/a':
...         return None
...     return json.loads(val)
... 
>>> table = pd.read_csv('test4_events.tsv', delimiter='\t', converters={'channels': loader})
>>> table
   onset  duration trial_type      channels
0  3.452       NaN      blink  [P1, P2, P3]
1  6.112       0.0       show          None
2  6.820       0.2   response         [CP1]
3  7.345       NaN      start     [1, 2, 3]
4  8.543       5.0      mixed   [1, CP1, 2]
5  9.112       0.0      mixed            []

1, 2 and 3 are strings here.

Load then convert

>>> import pandas as pd
>>> import json
>>> def from_default(val):
...     if isinstance(val, str):
...         return json.loads(val)
...     return val
... 
>>> table = pd.read_csv('test4_events.tsv', delimiter='\t')
>>> table
   onset  duration trial_type            channels
0  3.452       NaN      blink  ["P1", "P2", "P3"]
1  6.112       0.0       show                 NaN
2  6.820       0.2   response             ["CP1"]
3  7.345       NaN      start     ["1", "2", "3"]
4  8.543       5.0      mixed   ["1", "CP1", "2"]
5  9.112       0.0      mixed                  []
>>> table.channels = table.channels.apply(from_default)
>>> table
   onset  duration trial_type      channels
0  3.452       NaN      blink  [P1, P2, P3]
1  6.112       0.0       show           NaN
2  6.820       0.2   response         [CP1]
3  7.345       NaN      start     [1, 2, 3]
4  8.543       5.0      mixed   [1, CP1, 2]
5  9.112       0.0      mixed            []

1, 2 and 3 are also strings here.

The other thing we considered was a simple comma-separated string, so it would be

onset       duration    trial_type  channels
3.452       n/a         blink       P1,P2,P3
6.112       0           show        n/a
6.82        0.2         response    CP1
7.345       n/a         start       1,2,3
8.543       5.0         mixed       1,CP1,2
9.112       0           mixed       n/a

This would be easy enough to deal with simply by using ','.split(val) in our converter. Note, that there is no distinction between the integer 1 and the string "1". If channels for some reason had numeric names, pandas would by default convert 1 to an integer, but 1,2 to a string.

When suggesting JSON-encoded strings, I assumed that these would not be hand-edited but generated from spreadsheets or annotation tools. If this will be tooling mediated, the JSON will be better structured and the validator could distinguish numbers from strings and give errors. If people are going to be writing these by hand, then I think the comma-separated strings will be the easiest.

robertoostenveld · 2023-03-17T13:51:15Z

Just some wild thoughts:

Besides it being JSON, what would be the advantage of using the [ and ] brackets around the list? And why use " quotes around each item by default? The 2nd example in the previous comment is simpler and also looks fine to me (assuming that the horizontal spaces correspond to tabs, which they technically don't in this case). I realize that I am touching here upon human- versus machine-readability.

Possible answers to myself:

The [] brackets facilitate parsing nested lists like [1, 2, 3, [4.1, 4.2]], which would be impossible without them.

The " quotes allow specifying characters that otherwise would be parsed incorrectly, like ["{", "(", "["], or strings containing a tab. This would not be specific to the lists, it would also apply if I now wanted to represent an item with a tab in it, where the tab needs to be escaped somehow to prevent it from being interpreted as column separator.

Why would we use a different separator inside the list (comma) than outside the list (tab); that could also both be a tab. Yup, I realize this makes parsing harder, both for human and machine. But would that mean that if we have a nested list, we would use yet another separator (e.g., ;) at the 2nd level?

I like the idea what the downstream consequences are when people use software to process the data. Some softwares that I suggest to consider (besides native Python, Python+Pandas, MATLAB ) are R, SPSS, Excel, ...

Andesha · 2023-03-20T15:32:04Z

I hadn't personally considered nested lists as part of my suggestion based on the simple List format...

@robertoostenveld Is that something that we would need to support? 🤔

jadesjardins · 2023-03-20T17:52:20Z

The comma separated list inside of brackets seems to be the most simple (and safest) representation that fits the range of use cases that I am aware of.. from what I can tell, things like adding the quotations around items inside the brackets would only allow the users to do things like have commas in channels names for example (which I think I am okay with restricting).

effigies · 2023-03-23T19:34:05Z

Talking with @sappelhoff, one of the problems with comma-separated channels is that in practice channels might have commas in their names. I can see two approaches that would allow custom delimiters that are only incrementally more complex than comma-separated strings.

Syntax	Example	With channels containing commas
Comma-separation	`P1,P2,P3`	Not allowed
Wrapped delimiters	`,P1,P2,P3,`	`\|P1,2\|P2\|P3\|`
Prefixed delimiter	`,:P1,P2,P3`	`\|:P1,2\|P2\|P3`

These could be implemented:

def wrapped_delimiters(val):
    if isinstance(val, str):
        delimiter = val[0]
        if not delimiter.isalnum() and val[-1] == delim:
            return val[1:-1].split(delimiter)
    return val

def prefixed_delimiters(val):
    if isinstance(val, str) and len(val) > 2:
        if val[1] == ":":
            delimiter = val[0]
            return val[2:].split(delimiter)
    return val

Unless there's some globally safe delimiter (or we're willing to force people to mass-rename channels in their data), A dynamic delimiter or escape characters are likely the only options. One additional step of complexity is an overridable default:

Syntax	Example	With channels containing commas
Overridable delimiter	`P1,P2,P3`	`\|P1,2\|P2\|P3\|` or `\|:P1,2\|P2\|P3`

robertoostenveld · 2023-03-24T08:32:08Z

I hadn't personally considered nested lists as part of my suggestion based on the simple List format...

@robertoostenveld Is that something that we would need to support? 🤔

No, I do not see the use case for it at the moment. But then, I also did not see the use case in the past for lists within one cell of the TSV table.

Nevertheless, "Simple is better than complex", so I think that we should not plan for the future, but do what makes sense now. Personally I like the comma-separated list and if I need it I would simply avoid comma's in channel names (I do that anyway, I only use [A-Za-z0-9], - and _). And the overridable default (e.g., with |) makes that even better.

sappelhoff · 2023-03-24T10:36:36Z

Thanks for the suggestion, Chris. I like the idea of delimited lists where the delimiter is wrapped and may be freely selected from a list of defined delimiters (like ,|;.+- or so). I am not sure I like the idea of just having comma-separated lists as the default and only optionally allowing for "overridable delimiters via wrapping (or prefixing)", because this seems like an unneeded branching to me: It's not a hassle to do ,P1,P2, instead of P1,P2.

VisLab · 2023-03-24T11:01:28Z

I'm glad this discussion is progressing towards a resolution. I don't see a use case for nested lists at this time and don't think we should address this until there is a need for it.

Either the wrapped delimiter solution or the overridable delimiter solution is fine.

dorahermes · 2023-03-24T13:29:16Z

This is great, thank you, either the wrapped delimiter solution or the overridable delimiter solution should work for our use cases.

arnodelorme · 2023-03-24T21:12:14Z

Personally, I prefer the simple format of separated values with commas. P1,P2,P3
Otherwise, it will be confusing for non-programmers.

BIDS is supposed to be readable by users. Numerical array can be 1,2,3 and if you want string "1","2","3"

effigies · 2023-03-27T13:56:44Z

Okay, if you need to distinguish numbers and strings, then that makes sense, but you're now two characters away from a JSON string, which I thought we were trying to avoid. Adding in quoted strings makes this a parsing problem and syntax errors will be possible.

Andesha · 2023-03-27T14:39:39Z

I think instead of distinguishing between an int and a string type within a List, we can leave that up to how we describe the items inside of the List in the sidecar files. This comes later in my mind, but this way people can point the correct (or safe) parser at a List of an expected type.

ericearl · 2023-03-27T14:48:13Z

I take comfort in @robertoostenveld 's comment (and the majority here of array data CSV cells in TSV files supporters):

Nevertheless, "Simple is better than complex", so I think that we should not plan for the future, but do what makes sense now. Personally I like the comma-separated list and if I need it I would simply avoid comma's in channel names (I do that anyway, I only use [A-Za-z0-9], - and _). And the overridable default (e.g., with |) makes that even better.

I also like a comma-separated list's simplicity for "array data" inside of a TSV cell (which is trying to work around the spirit of a TSV already) and the acknowledgment that anything you might put inside a CSV array inside an otherwise TSV file should be absent of commas. Names can use a delimiter that is not a comma, such as /, |, ;, :, ., +, or -.

@effigies Do you have a sense of how many datasets or folks would be affected by the need "to mass-rename channels in their data"? Are channels curated into these TSV files in such a way that they cannot be renamed easily upon curation?

VisLab · 2023-03-27T14:54:54Z

In EEG I have only seen [A-Za-z0-9], - and _. I don't know about eye tracking and motion capture.

sappelhoff · 2023-03-27T14:57:05Z

@effigies Do you have a sense of how many datasets or folks would be affected by the need "to mass-rename channels in their data"? Are channels curated into these TSV files in such a way that they cannot be renamed easily upon curation?

It's not common to have non-alphanumeric characters in channel names (except maybe a space character ... not sure about the statistics here and what's "common"). However it does happen -- I have seen it raised in issues on ephys software, and for example, BrainProducts explicitly mention in their spec, how commas are supposed to be escaped:

; Commas in channel names are coded as "\1".

(taken from the BrainVision data format spec)

And given that the suggestion with wrapped delimiters is in my opinion a robust and easy-to-understand solution, I see no reason why we should force people to rename their channels.

I think instead of distinguishing between an int and a string type within a List, we can leave that up to how we describe the items inside of the List in the sidecar files.

I currently don't see a need for "mixed" arrays (str and int/float), so I agree with the suggestion to define it in the JSON, which data type is supposed to be expected.

However having that said, if the need ever does arise, then a JSON array probably would be the best solution for us.

sappelhoff · 2023-03-30T14:49:33Z

Another proposal could be to define the delimiter of a .tsv cell in the accompanying .json file.

See this TSV file:

onset	duration	my_channels1	my_channels2
5	1	P1,P\|2,R 23,XX12°	Op,1\|Op,2

and this accompanying JSON file:

{
   "my_channels1":{
      "LongName":"My Channels 1",
      "Description":"The first batch of channels associated with a given event.",
      "Delimiter":","
   },
   "my_channels2":{
      "LongName":"My Channels 2",
      "Description":"The second batch of channels associated with a given event.",
      "Delimiter":"|"
   }
}

We would basically extend the list of "defined JSON keywords to describe TSV columns" (current doc on this: https://bids-specification.readthedocs.io/en/latest/common-principles.html#tabular-files) by an additional keyword: Delimiter.

If Delimiter is defined, the respective column in the TSV file is assumed to be a cell array and must be split according to the delimiter that is defined as the string value to the Delimiter key.

In most cases, we will have {"Delimiter": ","} ... but we have a way to define other delimiters if needed.

EDIT: Or we think in this general direction and make this a key like so:

{
    "CellArray": {
        "Delimiter": ","
        "DataType": "int"
    }
}

and so on ... I propose this just as a new direction for our thoughts, and because I am worried that a simple "comma separated list" will not be a good idea to capture the heterogeneity that we have with data.

effigies · 2023-03-31T18:32:03Z

@sappelhoff I get where you're coming from with explicit delimiter metadata, but I think putting parsing instructions into JSON sidecars would be extremely error-prone. I can't think of an existing case where the data file cannot be fully parsed without reference to the sidecar, and TSV files in general have not had mandatory sidecars. I suspect you'll end up with a lot of tools just assuming comma-separated values.

Not to cut the conversation short, but does it make sense to put forward some questions for a vote? I have a sense that there's at least a camp that's pretty settled on comma-separated-unquoted-strings, and if that camp is larger than those trying to find an acceptable dynamic delimiter, we may just be wasting time coming up with possibilities.

sappelhoff · 2023-04-01T15:29:23Z

Thanks for your perspective @effigies I agree. I proposed this because I am just really unhappy with the plain comma-separated list, as I the wrapped delimiter solution is so much more versatile at a minuscule increase in "complexity".

I think a vote is a good idea.

ericearl · 2023-04-03T16:19:53Z

Just to clarify what is being voted upon, it sounds like it should either be "Comma-separation" or "Wrapped delimiters" that are used to create an array of values inside any TSV cell. A brief summary based on discussion in this issue is below. I think it is important to note that any way you want to use delimiters for an array in a TSV cell, it will always be interpreted as a string which you would have to parse in your software.

Comma-separation

Use commas as the delimiter to separate values in an array.

A common example: P1,P2,P3 Simple, but effective (most of the time).
The obvious edge case: P1.2,P2,P3 Causes a need to change your channel names if any contain commas.

Wrapped delimiters

If the first and last character are non-alphanumeric and identical, then that character is identified as the delimiter and used to split the string. That would permit the user to select whatever they find most readable, given their channel names.

A common example: |P1,2|P2|P3| Pipes like these could be used as the delimiter which would allow characters like commas to still be in there.
Another possible example: ,P1,P2,P3, Commas could still be used, but just so any downstream software using this "array in TSV cell" format knows, you prefix and suffix your comma-separated lists with the comma delimiter.

@sappelhoff @effigies Did I miss anything else important?

dorahermes · 2023-04-03T16:47:47Z

Would it perhaps be possible to clarify 2 things before voting:

If this is a change not just about lists of channels, but a general change to BIDS .tsv files having any kind of list/array data, it seems like the corresponding sidecar 'should/must/?' have some information on the fact that there is a list/array rather than a string in this .tsv column. The 'delimiter' field in the json sidecar would indicate this. Would there be a possibility for a 'Wrapped delimiters' defaulting to comma separated? If the curator wants the text string to be parsable as a list, they 'should/must/?' add the delimiter field in the sidecar.
In case we want to go simple now: a comma separated list, but would like to keep options for the future, to me ,P1,P2,P3, is the extremely similar to P1,P2,P3. Which of these would be the most backwards compatible solution?

effigies · 2023-04-03T17:10:15Z

If this is a change not just about lists of channels [...]

At the moment this is just about lists of channels. I think this would end up as instructions that will show up in a column description table, not something to go in "Common Principles". The most this would do is create a precedent, making it more likely that the next column to have an array type (if there ever is one) will have whatever format we declare here, but it wouldn't be so strong that we couldn't make a different choice.

In case we want to go simple now: a comma separated list, but would like to keep options for the future, to me ,P1,P2,P3, is the extremely similar to P1,P2,P3. Which of these would be the most backwards compatible solution?

I think we're talking 2-3 lines of (Python) code difference if we picked one vs the other and then later wanted to add dynamic delimiting. P1,P2,P3 is less constraining on what a dynamic delimiter string would look like, while ,P1,P2,P3, means you could support dynamic delimiters today because you're pretty sure what they would look like if added.

For what it's worth, here are the questions I wrote up (but didn't post) last time:

Dynamic delimiters: Acceptable (grudging acceptance counts) or unacceptable?

If accepted:

Should the delimiter be declared in the value (e.g., ,A,B,) or in a JSON sidecar?

Should there be a default delimiter if a "declaration" is absent?

If rejected:

Should we declare a transformation for channel names containing commas?

If so, do we need to come up with alternatives, or is \1 fine?

Would getting votes on these be sufficiently clarifying?

christinerogers · 2023-04-03T21:06:27Z

@sappelhoff your original suggestion would be to circulate this to the bids community (off github?) -- Seems like we're ready? cc @effigies you have 7 upvotes so far!

Thanks also @dorahermes above for circling back to visibility and generalization across all BIDS (hence this spin-off thread) --
Given other parts of the BIDSverse will encounter this issue (as I mentioned at BIDS-maintainers last month) there's a big difference between a precedent buried in a spec vs. pointed to visibly from a central place (just as a precedent/option) vs. making it a could/should (e.g. in common principles). Can we take the middle path?

@effigies @sappelhoff let us know if you need help/ideas for increasing visibility or linkages across modalities etc.

dorahermes · 2023-04-04T17:47:18Z

I vote 1) with @VisLab since it covers most of our use cases as well.

Only question I have left is whether the channel column always contains a list or whether this should be specified in the .json sidecar?

ericearl · 2023-04-04T20:28:27Z

I think I agree basically the same as @VisLab and @dorahermes.

Dynamic delimiters are acceptable.
The "array in a cell" delimiter should be declared in the JSON sidecar.
The default delimiter should be a comma when the delimiter declaration in a JSON sidecar is absent.

Andesha · 2023-04-05T17:16:46Z

Agree based on comments from @VisLab and @dorahermes

christinerogers · 2023-04-05T18:29:28Z

👍 with @VisLab and @dorahermes and @Andesha 's votes and @ericearl's helpful clarification

@arnodelorme - for EEGNet this seems to be the most workable, a :+1 from you as well?
Does this break any assumptions in your view?

cc @Andesha @jeffersoncasimir @smakeig

CPernet · 2023-04-07T09:05:16Z

Using something like ,P1,P2,P3, seems the easiest for humans and machines alike, so +1 from me.
Option to specify the delimiter in JSON why not -- now of course it means no nesting, but if no one sees/has a case of nesting, let's not worry about it.

arokem · 2023-04-13T14:02:05Z

Hello! The whole idea of adding arrays to a cell of a tsv file seems a bit unsavory to me, because it would takes what is otherwise a tidy table (in this sense) and make it a bit more messy, as is apparent in this discussion.

Another solution that comes to my mind is to distribute the array data across multiple rows with all other column values keeping exactly the same value. I believe that this comes with no loss of information, and retains the tidy format of the table. And IIUC, requires less obtrusive changes to the spec (if at all?).

For example, the table in this comment would become:

onset       duration    trial_type  channels
3.452       n/a         blink       P1
3.452       n/a         blink       P2
3.452       n/a         blink       P3
6.112       0           show        n/a
6.82        0.2         response    CP1
7.345       n/a         start       1
7.345       n/a         start       2
7.345       n/a         start       3
8.543       5.0         mixed       1
8.543       5.0         mixed       CP1
8.543       5.0         mixed       2
9.112       0           mixed       n/a

For disambiguation, we could add a column that explicitly highlights that these rows together form an array (a Boolean) and even the index within the array (i.e., integer values: 0, 1, 2). Does this preclude any use-cases that this proposal is designed to meet?

Andesha · 2023-04-13T14:22:40Z

In my opinion the strategy of breaking up the events across multiple lines has the problem of how to group them back together. For example, the table requires post processing to recognise the first three events are the "same" just on different channels. Further, it does not account for the situation of: "this event is an artifact that is present across these three channels" versus "there are three different artifacts separate across three channels at the same time". This makes the previously mentioned grouping challenging.

Lastly, I agree that this breaks the "tidy" assumptions. I feel it is necessary though. Pandas does have functions for manipulating data like this through explode.

smakeig · 2023-04-13T18:18:41Z

In the HED Working Group, we are considering how to explicitly enter lines in the *events.tsv* files for *non*-onset markers into event processes - for example, markers of the onsets of syllables in a spoken word event. HED will refer to these as 'Inset' event markers. Explicit 'Offset' event markers (also occupying a line in the .tsv file) are also being supported. Here, a new (optional) column in the .tsv file, 'Marker_type' can be used - in this case now-standard event onset entries in the .tsv file should be referred to as being an event 'Onset' marker_type. This advance in event annotation raises the same issue of how to group event markers to refer to the same event (process). This requires another (optional) column in the .tsv file titled 'Event_index'. Currently, we are thinking to allow the user to use their own event indexing scheme (including the obvious possibility, monotonically increasing event number). I have no opinion about the proposal to allow interpretation of vector entries. For small (short) vectors, thsi doesn't seem a problem - for long vectors, however, ... Scott Makeig

…

On Thu, Apr 13, 2023 at 10:22 AM Tyler Collins ***@***.***> wrote: In my opinion the strategy of breaking up the events across multiple lines has the problem of how to group them back together. For example, the table requires post processing to recognise the first three events are the "same" just on different channels. Further, it does not account for the situation of: "this event is an artifact that is present across these three channels" versus "there are three different artifacts separate across three channels at the same time". This makes the previously mentioned grouping challenging. Lastly, I agree that this breaks the "tidy" assumptions. I feel it is necessary though. Pandas does have functions for manipulating data like this through explode <https://urldefense.com/v3/__https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html__;!!Mih3wA!F0_1OIYog4xEqCW4lRtXmU6IhPp8s5eLK5fhi7KAbj-UOCQlrfa1aRuEm-6GNvBgdAqF1cJps1zKaXuU9FJqgOae$> . — Reply to this email directly, view it on GitHub <https://urldefense.com/v3/__https://github.com/bids-standard/bids-specification/issues/1446*issuecomment-1507061592__;Iw!!Mih3wA!F0_1OIYog4xEqCW4lRtXmU6IhPp8s5eLK5fhi7KAbj-UOCQlrfa1aRuEm-6GNvBgdAqF1cJps1zKaXuU9JDtJUZc$>, or unsubscribe <https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AKN2SFTPPM37ERMQWIXV2DLXBAD3XANCNFSM6AAAAAAV5D2DSA__;!!Mih3wA!F0_1OIYog4xEqCW4lRtXmU6IhPp8s5eLK5fhi7KAbj-UOCQlrfa1aRuEm-6GNvBgdAqF1cJps1zKaXuU9DEGcUK1$> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- Scott Makeig, Research Scientist and Director, Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California San Diego, La Jolla CA 92093-0559, http://sccn.ucsd.edu/~scott

arokem · 2023-04-13T23:22:25Z

Thanks for the comments @Andesha. Regarding this one:

Pandas does have functions for manipulating data like this through explode.

I think that this could work both ways. On the one hand, it means that a vector cell could be interpreted provided software that knows how to interpret it (but the complexity of that is apparent in the above discussion). On the other hand, it also suggests that (almost?) anything that can be represented using vectors could be represented in a tidy table, provided informative columns. For example, event identifier columns, as suggested by @smakeig in his comment.

smakeig · 2023-04-14T00:05:45Z

Yes, Kay Robbins has demonstrated using events.tsv columns to reveal the statistical design of the experiment. For example in a simple 2X2 design ({loud, soft}X{beep,boop}) specifying the nature of the event (onset) using two columns (Loudness, Pitch), abetted by HED definitions of the (2x2) levels of these two variables in an events.json file (typically at the top level of the BIDS file hierarchy) allows HED tools to compile a full HED annotation for the events in the dataset automatically. I suppose it might be possible to allow vector column entries along with (already supported) multicolumn entries? (the key there being that the definitions of the columns and their values are given in the .json file). -Scott

…

On Thu, Apr 13, 2023 at 7:22 PM Ariel Rokem ***@***.***> wrote: Thanks for the comments @Andesha <https://urldefense.com/v3/__https://github.com/Andesha__;!!Mih3wA!EqHIYCHvmOSoL6IEwSLEgt0DGEx8Ka3bAXamy51zHOBMWI8C67cf75l69XbvnBDpkNdaj0pChhEcB5ReEUStLqMy$>. Regarding this one: Pandas does have functions for manipulating data like this through explode <https://urldefense.com/v3/__https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html__;!!Mih3wA!EqHIYCHvmOSoL6IEwSLEgt0DGEx8Ka3bAXamy51zHOBMWI8C67cf75l69XbvnBDpkNdaj0pChhEcB5ReEfzLhmoN$> . I think that this could work both ways. On the one hand, it means that a vector cell could be interpreted provided software that knows how to interpret it (but the complexity of that is apparent in the above discussion). On the other hand, it also suggests that (almost?) anything that can be represented using vectors could be represented in a tidy table, provided informative columns. For example, event identifier columns, as suggested by @smakeig <https://urldefense.com/v3/__https://github.com/smakeig__;!!Mih3wA!EqHIYCHvmOSoL6IEwSLEgt0DGEx8Ka3bAXamy51zHOBMWI8C67cf75l69XbvnBDpkNdaj0pChhEcB5ReEZ3qR8V8$> in his comment. — Reply to this email directly, view it on GitHub <https://urldefense.com/v3/__https://github.com/bids-standard/bids-specification/issues/1446*issuecomment-1507723742__;Iw!!Mih3wA!EqHIYCHvmOSoL6IEwSLEgt0DGEx8Ka3bAXamy51zHOBMWI8C67cf75l69XbvnBDpkNdaj0pChhEcB5ReERO9A-0W$>, or unsubscribe <https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AKN2SFQQMI5PGKJG3YYRJNDXBCDD5ANCNFSM6AAAAAAV5D2DSA__;!!Mih3wA!EqHIYCHvmOSoL6IEwSLEgt0DGEx8Ka3bAXamy51zHOBMWI8C67cf75l69XbvnBDpkNdaj0pChhEcB5ReEW8_z4bp$> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- Scott Makeig, Research Scientist and Director, Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California San Diego, La Jolla CA 92093-0559, http://sccn.ucsd.edu/~scott

robertoostenveld · 2023-04-14T07:27:18Z

Another solution that comes to my mind is to distribute the array data across multiple rows with all other column values keeping exactly the same value

For me that would work.

christinerogers · 2023-04-14T16:41:41Z

per @Andesha's response to @arokem - This is going to get a little messy no matter which way we slice it.

For data platforms (like LORIS and EEGNet.org), distributing an event over multiple rows is problematic (normalization etc) and also leads to more user typos, and zeroing out repeated values is tricky too.
@robertoostenveld for this reason it seems less workable from our perspective.

-- @effigies @sappelhoff
Is there any precedent in BIDS for either following tidyverse or nesting a list within a column ?

or is it time for a wider recap/survey?

p.s. @smakeig agreed this might grow further, and to support HED tagging at a basic level, I think we'd shoot for the simplest approach needed currently here.

effigies · 2023-04-14T17:04:03Z

Is there any precedent in BIDS for either following tidyverse or nesting a list within a column ?

I'm not aware of any precedent.

sappelhoff · 2023-04-15T06:09:09Z

Is there any precedent in BIDS for either following tidyverse or nesting a list within a column ?

I'm not aware of any precedent in BIDS either.

The tidy suggestion (maybe together with an index column that "binds rows together" (is this what you mean by "event identifier columns" above?)) sounds like a reasonable alternative to the wrapped delimiter suggestion to me.

(A combination of both would even easily allow us a one-level nesting 😉, but I'm not advocating for that)

I'm not entirely sure (yet) how this constitutes a problem for loris, eegnet, or HED. And would love to see an example of the potential problems.

robertoostenveld · 2023-04-15T08:52:38Z

Is there any precedent in BIDS for either following tidyverse or nesting a list within a column ?

As HED was brought up, is that itself not a precedent? See https://github.com/hed-standard/hed-examples/blob/main/datasets/eeg_ds003645s_hed_column/sub-002/eeg/sub-002_task-FacePerception_run-1_events.tsv where the HED column contains a comma-separated list.

VisLab · 2023-04-15T12:40:20Z

Yes, there is the precedent of the HED column in the event file. Each entry in the HED column is treated as a string as far as BIDS is concerned. The hed-javascript validator verifies the actual contents in detail as part of bids-validation. From an annotator's perspective, putting each channel on a separate line in the event file is a nightmare, especially when you want to annotate different features that start at the same time, many channels are included, and the feature needs to be identified with markers a starting time and an ending time (which will be the predominant use case for which this question arose) and you don't know what the ending point is when you identify the starting point. One possibility (not necessarily ideal) is to pass this off to the HED Working Group to deal with. From the BIDS perspective, the list column entry would be treated as a string. Users would need to provide HED annotation in an events.json file (or as part of the HED column in the events.tsv file) using the Collection tag. The string would need to conform to the syntax HWG imposes on the values for the Collection tag (which haven't been specified yet). @scott Makeig ***@***.***> @christinerogers @dorahermes.

…

On Sat, Apr 15, 2023 at 3:52 AM Robert Oostenveld ***@***.***> wrote: Is there any precedent in BIDS for either following tidyverse or nesting a list within a column ? As HED <https://bids-specification.readthedocs.io/en/stable/appendices/hed.html> was brought up, is that itself not a precedent? See https://github.com/hed-standard/hed-examples/blob/main/datasets/eeg_ds003645s_hed_column/sub-002/eeg/sub-002_task-FacePerception_run-1_events.tsv where the HED column contains a comma-separated list. — Reply to this email directly, view it on GitHub <#1446 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAJCJOU3KBOTSBJ2X4JYY6LXBJOWDANCNFSM6AAAAAAV5D2DSA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

christinerogers · 2023-04-15T17:44:06Z

Agreed @robertoostenveld @VisLab - consistency with what has already been accepted by BIDS makes sense vs adding tidyverse compliance into the mix.

@arokem thoughts?
Cc @Andesha for eegnet

dorahermes · 2023-04-25T15:50:23Z

While it would be nice to keep the .tsv files tidy, I am not sure whether the one line version is always the easiest to human read. Putting each channel on a separate line in the event file is extremely cumbersome as in the iEEG case seizures can start on a few channels and quickly spread to many channels, which would add many lines for 1 event in the _events.tsv. This is the particular use-case that we have in mind now.

HED columns containing a comma-separated list may be a precedent?

christinerogers · 2023-04-25T15:59:40Z

Thanks @dorahermes -

@effigies @sappelhoff there was discussion in the BIDS maintainers meeting that we had enough consensus to move forward and note this elsewhere in the BIDS ecosystem. Any thoughts?

CPernet · 2023-04-25T16:01:18Z

note that 'we' (the steering group) have nothing against the proposed solution - just when we discussed it, @arokem came up with repeated line solution, and we just wanted to see if that would work - avoiding introducing new things. Since there is a precedent, and @dorahermes makes a compelling argument that it will be cumbersome -- I vote (again) for the simplest option as in e.g. ,P1,P2,P3,

arokem · 2023-04-25T16:05:27Z

Agreed. Based on the responses here, I think that the original line of thought seems to be prevailing. I would much rather have folks who are actually using this kind of data weigh in on what is practical for them.

christinerogers · 2023-04-25T17:58:56Z

Thanks @CPernet and @arokem - good to know this was discussed at steering even

I vote (again) for the simplest option as in e.g. ,P1,P2,P3,

ok i.e. HED's P1, P2, P3 sans leading delimiter

sappelhoff · 2023-04-26T08:06:15Z

Okay I also agree on dropping the "tidy solution" from the options for BIDS then.

Following @effigies summary from here: #1446 (comment)

I am still in favor of comma separated lists with:

"dynamic" delimiters, where the delimiter is declared in the value: For example for ,P1,P2,P3,, the delimiter is , (see wrapping), and for |P1|P2|P3|, the delimiter is |

However I would also be fine with this (although I find it less nice):

dynamic delimiters, where the delimiter is declared in the JSON sidecar
if no delimiter is declared in JSON, then a comma is the default delimiter: P1,P2,P3 (note: no "wrapping" of delimiters needed)

VisLab · 2023-04-26T10:10:41Z

I'm fine with either of these solutions. Thx

…

On Wed, Apr 26, 2023 at 3:06 AM Stefan Appelhoff ***@***.***> wrote: Okay I also agree on dropping the "tidy solution" from the options for BIDS then. Following @effigies <https://github.com/effigies> summary from here: #1446 (comment) <#1446 (comment)> I am still in favor of comma separated lists with: - "dynamic" delimiters, where the delimiter is declared in the value: For example for ,P1,P2,P3,, the delimiter is , (see wrapping), and for |P1|P2|P3|, the delimiter is | However I would *also* be fine with this (although I find it less nice): - dynamic delimiters, where the delimiter is declared in the JSON sidecar - if no delimiter is declared in JSON, then a comma is the default delimiter — Reply to this email directly, view it on GitHub <#1446 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAJCJOXHICC3NGY4O6DAOO3XDDJQHANCNFSM6AAAAAAV5D2DSA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

smakeig · 2023-04-26T12:20:40Z

Or, is it then also possible to declare a 2D matrix using p1,p2,p3; p4,p5,p6 (commas/semi-colons) notation? For both vectors and matrices, there should be some guideline advice not to make them too big... IF the tsv file cell vectors (or matrices) need to be quite large (e.g., say 97 sEEG channels showing some event), might these vectors or matrices go into a 'objects.json' file as named objects? Then, in the 'events.tsv' file the object defined in the 'objects.json' file can be referenced in the 'events.tsv' cell as, e.g., 'obj:thing27 where the object names (here, 'thing27') are the user's choice. This would preserve the (human) readability of the events.tsv files...

…

On Wed, Apr 26, 2023 at 6:10 AM VisLab ***@***.***> wrote: I'm fine with either of these solutions. Thx On Wed, Apr 26, 2023 at 3:06 AM Stefan Appelhoff ***@***.***> wrote: > Okay I also agree on dropping the "tidy solution" from the options for > BIDS then. > > Following @effigies <https://github.com/effigies> <https://urldefense.com/v3/__https://github.com/effigies*3E__;JQ!!Mih3wA!FvB8Vz_q5q8WjLHSP8-lRrgYz7l3unw1z_fwPg2qDs1sgrhOi9s8yAZWkLl_-G6MYeJW2KOVLD8m6AOe95VplLHI$> summary from here: #1446 > (comment) > < #1446 (comment)> <https://urldefense.com/v3/__https://github.com/bids-standard/bids-specification/issues/1446*issuecomment-1494687731*3E__;IyU!!Mih3wA!FvB8Vz_q5q8WjLHSP8-lRrgYz7l3unw1z_fwPg2qDs1sgrhOi9s8yAZWkLl_-G6MYeJW2KOVLD8m6AOe9-PA52hO$> > > I am still in favor of comma separated lists with: > > - "dynamic" delimiters, where the delimiter is declared in the value: > For example for ,P1,P2,P3,, the delimiter is , (see wrapping), and for > |P1|P2|P3|, the delimiter is | > > However I would *also* be fine with this (although I find it less nice): > > - dynamic delimiters, where the delimiter is declared in the JSON > sidecar > - if no delimiter is declared in JSON, then a comma is the default > delimiter > > — > Reply to this email directly, view it on GitHub > < #1446 (comment)> <https://urldefense.com/v3/__https://github.com/bids-standard/bids-specification/issues/1446*issuecomment-1522967227*3E__;IyU!!Mih3wA!FvB8Vz_q5q8WjLHSP8-lRrgYz7l3unw1z_fwPg2qDs1sgrhOi9s8yAZWkLl_-G6MYeJW2KOVLD8m6AOe9zlQekmX$> , > or unsubscribe > < https://github.com/notifications/unsubscribe-auth/AAJCJOXHICC3NGY4O6DAOO3XDDJQHANCNFSM6AAAAAAV5D2DSA> <https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AAJCJOXHICC3NGY4O6DAOO3XDDJQHANCNFSM6AAAAAAV5D2DSA*3E__;JQ!!Mih3wA!FvB8Vz_q5q8WjLHSP8-lRrgYz7l3unw1z_fwPg2qDs1sgrhOi9s8yAZWkLl_-G6MYeJW2KOVLD8m6AOe96GLZFCE$> > . > You are receiving this because you were mentioned.Message ID: > ***@***.***> > — Reply to this email directly, view it on GitHub <https://urldefense.com/v3/__https://github.com/bids-standard/bids-specification/issues/1446*issuecomment-1523161407__;Iw!!Mih3wA!FvB8Vz_q5q8WjLHSP8-lRrgYz7l3unw1z_fwPg2qDs1sgrhOi9s8yAZWkLl_-G6MYeJW2KOVLD8m6AOe9waD44aj$>, or unsubscribe <https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AKN2SFWZ53A654LK4NOVTO3XDDYC5ANCNFSM6AAAAAAV5D2DSA__;!!Mih3wA!FvB8Vz_q5q8WjLHSP8-lRrgYz7l3unw1z_fwPg2qDs1sgrhOi9s8yAZWkLl_-G6MYeJW2KOVLD8m6AOe98rBRAxS$> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- Scott Makeig, Research Scientist and Director, Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California San Diego, La Jolla CA 92093-0559, http://sccn.ucsd.edu/~scott

Andesha · 2023-04-26T14:01:07Z

Happy with the following from @sappelhoff 's breakdown

dynamic delimiters, where the delimiter is declared in the JSON sidecar
if no delimiter is declared in JSON, then a comma is the default delimiter: P1,P2,P3 (note: no "wrapping" of delimiters needed)

effigies · 2023-04-26T14:19:31Z

is it then also possible to declare a 2D matrix using p1,p2,p3;
p4,p5,p6 (commas/semi-colons) notation?

Please bear in mind that any rule that goes in BIDS needs to be validatable. I think adding matrices or object references into the mix is asking for pain and, in any case, it is beyond the scope of what people have discussed and formed some kind of agreement around.

At this point, it seems that there is a majority for simple comma-separated strings, and if people want/need to use alternative delimiters, they can add a "Delimiter" property in the column description of the JSON.

It's time to make a concrete PR to verify that this can be specified and validated to everyone's satisfaction. The following is my assessment:

The "Delimiter" is optional, so we just need to make sure that its presence doesn't violate JSON schema; it will be very difficult to permit this only for one column, so I think it would need to be added to the generic column description table, but written in a way that makes its scope clear:

bids-specification/src/common-principles.md

Lines 460 to 472 in 77066b3

    
           {{ MACROS___make_metadata_table( 
        
              { 
        
                   "LongName": "OPTIONAL", 
        
                   "Description": ( 
        
                       "RECOMMENDED", 
        
                       "The description of the column.", 
        
                   ), 
        
                   "Levels": "RECOMMENDED", 
        
                   "Units": "RECOMMENDED", 
        
                   "TermURL": "RECOMMENDED", 
        
                   "HED": "OPTIONAL", 
        
              } 
        
           ) }}

I suppose we add a channel definition to objects.columns and add it to the rules.tabular_data.task.TaskEvents:

bids-specification/src/schema/rules/tabular_data/task.yaml

Lines 2 to 16 in 77066b3

    
           TaskEvents: 
        
             selectors: 
        
               - '"task" in entities' 
        
               - suffix == "events" 
        
             columns: 
        
               onset: required 
        
               duration: required 
        
               trial_type: optional 
        
               response_time: optional 
        
               HED: optional 
        
               stim_file: optional 
        
             additional_columns: allowed 
        
             initial_columns: 
        
               - onset 
        
               - duration

This could be narrowed to only apply to MEG/EEG/iEEG/NIRS.

Array data in .tsv cells discussion is converging towards a comma-separated list (possibly wrapped delimiters list), so the example was edited accordingly - combined rows representing the same event with a list of channels in the channel column. See full discussion here: bids-standard/bids-specification#1446

sappelhoff added enhancement New feature or request opinions wanted Please read and offer your opinion on this matter labels Mar 16, 2023

sappelhoff mentioned this issue Mar 16, 2023

Example of the HED schema library for SCORE implementation bids-standard/bids-examples#324

Open

Andesha mentioned this issue Mar 20, 2023

BIDS: Channel tagging planning CONP-PCNO/EEGNet-roadmap#83

Open

sappelhoff added this to the 1.9.0 milestone Mar 24, 2023

effigies mentioned this issue May 2, 2023

[ENH] Define channel column for events and Delimiter field for column descriptions #1483

Merged

sappelhoff closed this as completed in #1483 Nov 13, 2023

ENH: Array data in .tsv cells #1446

ENH: Array data in .tsv cells #1446

Comments

sappelhoff commented Mar 16, 2023 • edited Loading

VisLab commented Mar 16, 2023 • edited Loading

christinerogers commented Mar 16, 2023 • edited Loading

Remi-Gau commented Mar 16, 2023

effigies commented Mar 16, 2023

robertoostenveld commented Mar 17, 2023

Andesha commented Mar 20, 2023

jadesjardins commented Mar 20, 2023

effigies commented Mar 23, 2023

robertoostenveld commented Mar 24, 2023

sappelhoff commented Mar 24, 2023 • edited Loading

VisLab commented Mar 24, 2023

dorahermes commented Mar 24, 2023

arnodelorme commented Mar 24, 2023

effigies commented Mar 27, 2023

Andesha commented Mar 27, 2023

ericearl commented Mar 27, 2023

VisLab commented Mar 27, 2023

sappelhoff commented Mar 27, 2023 • edited Loading

sappelhoff commented Mar 30, 2023 • edited Loading

effigies commented Mar 31, 2023

sappelhoff commented Apr 1, 2023

ericearl commented Apr 3, 2023

Comma-separation

Wrapped delimiters

dorahermes commented Apr 3, 2023

effigies commented Apr 3, 2023 • edited Loading

christinerogers commented Apr 3, 2023

dorahermes commented Apr 4, 2023

ericearl commented Apr 4, 2023

Andesha commented Apr 5, 2023

christinerogers commented Apr 5, 2023 • edited Loading

CPernet commented Apr 7, 2023

arokem commented Apr 13, 2023

Andesha commented Apr 13, 2023

smakeig commented Apr 13, 2023 via email

arokem commented Apr 13, 2023

smakeig commented Apr 14, 2023 via email

robertoostenveld commented Apr 14, 2023

christinerogers commented Apr 14, 2023

effigies commented Apr 14, 2023

sappelhoff commented Apr 15, 2023

robertoostenveld commented Apr 15, 2023

VisLab commented Apr 15, 2023 via email

christinerogers commented Apr 15, 2023

dorahermes commented Apr 25, 2023

christinerogers commented Apr 25, 2023

CPernet commented Apr 25, 2023

arokem commented Apr 25, 2023

christinerogers commented Apr 25, 2023 • edited by sappelhoff Loading

sappelhoff commented Apr 26, 2023 • edited Loading

VisLab commented Apr 26, 2023 via email

smakeig commented Apr 26, 2023 via email

Andesha commented Apr 26, 2023

effigies commented Apr 26, 2023

ENH: Array data in `.tsv` cells #1446

ENH: Array data in `.tsv` cells #1446

sappelhoff commented Mar 16, 2023 •

edited

Loading

VisLab commented Mar 16, 2023 •

edited

Loading

christinerogers commented Mar 16, 2023 •

edited

Loading

sappelhoff commented Mar 24, 2023 •

edited

Loading

sappelhoff commented Mar 27, 2023 •

edited

Loading

sappelhoff commented Mar 30, 2023 •

edited

Loading

effigies commented Apr 3, 2023 •

edited

Loading

christinerogers commented Apr 5, 2023 •

edited

Loading

christinerogers commented Apr 25, 2023 •

edited by sappelhoff

Loading

sappelhoff commented Apr 26, 2023 •

edited

Loading