Use a coherent schema definition syntax #884

erdalkaraca · 2021-09-19T09:11:18Z

Currently, the schema consists of a mix of ad-hoc definitions and syntax.
This makes it hard to implement a tool to generically validate a given BIDS dataset.

In the ancpbids project (see below for link) we experimented with various schema definition languages, such as:

XSD (XML Schema Definition): industry standard with mature support in many programming languages
YAMALE: a community effort to formalize data structures using YAML format

Since the BIDS community is more comfortable with JSON/YAML formats, I would like to propose using a more formal schema definition language such as YAMALE to map the BIDS specification onto a machine readable schema.

Examples to follow as additional comments...

Links:
ancpbids: https://github.com/ANCPLabOldenburg/ancp-bids
XSD: https://www.w3.org/TR/xmlschema11-1/
Yamale: https://github.com/23andMe/Yamale

edits:

https://linkml.io (added by @yarikoptic)

erdalkaraca · 2021-09-19T09:12:39Z

An example of what a coherent schema might look like using Yamale:

Dataset:
  .extends: Folder
  subjects: list(include('Subject'), required=False)
  dataset_description: include('DatasetDescriptionFile')
  README: include('File', required=False)
  CHANGES: include('File', required=False)
  LICENSE: include('File', required=False)
  genetic_info: include('JsonFile', required=False)
  samples: include('JsonFile', required=False)
  participants_tsv: include('ParticipantsTsvFile', required=False)
  participants_json: include('TsvSidecarFile', required=False)
  code: include('Folder', required=False)
  derivatives: include('DerivativeFolder', required=False)
  sourcedata: include('Folder', required=False)
  stimuli: include('Folder', required=False)

---
JSONSchemaType:
  map()

MetadataFieldDefinition:
  name: str()
  description: str()
  type: include('JSONSchemaType')

SuffixDefinition:
  name: str()
  description: str()
  type: include('JSONSchemaType')

EntitiyDefinition:
  key: str()
  name: str()
  entity: str()
  description: str()
  type: include('JSONSchemaType')

DatasetDescriptionFile:
  .extends: JsonFile
  Name: str()
  BIDSVersion: str()
  HEDVersion: str(required=False, recommended=True)
  DatasetType: enum('raw', 'derivative', required=False, recommended=True)
  License: str(required=False, recommended=True)
  Acknowledgements: str(required=False)
  HowToAcknowledge: str(required=False)
  DatasetDOI: str(required=False)
  Authors: list(str(), required=False)
  Funding: list(str(), required=False)
  EthicsApprovals: list(str(), required=False)
  ReferencesAndLinks: list(str(), required=False)

DerivativeDatasetDescriptionFile:
  .extends: DatasetDescriptionFile
  GeneratedBy: list(include('GeneratedBy'))
  SourceDatasets: list(include('SourceDatasets'), required=False, recommended=True)

GeneratedBy:
  Name: str()
  Version: str(required=False, recommended=True)
  Description: str(required=False)
  CodeURL: str(required=False)
  Container: list(include('GeneratedByContainer'), required=False)

Artifact:
  .doc: >-
    An artifact is a file whose name conforms to the BIDS file naming convention.
  .extends: File
  suffix: str()
  entities: list(include('EntityRef'))

erdalkaraca · 2021-09-19T09:14:27Z

Example definitions of entities:

- key: subject
  name: Subject
  entity: sub
  description: |
    A person or animal participating in the study.
  type: label
- key: run
  name: Run
  entity: run
  description: |
    If several scans with [...]
  type: index
- key: mtransfer
  name: Magnetization Transfer
  entity: mt
  description: |
    If files belonging [...]
  type:
    enum:
      - "on"
      - "off"
- key: part
  name: Part
  entity: part
  description: |
    This entity is used to [...]
  type:
    enum:
      - mag
      - phase
      - real
      - imag
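
To illustrate how a tool could consume definitions like these, here is a minimal Python sketch that parses a file name into entities and checks each value against its `type`. The `ENTITIES` list is a trimmed-down copy of the definitions above; the regular expressions for `label` and `index`, and the helper names `value_is_valid` and `parse_entities`, are assumptions made for this example and not part of any existing tool.

```python
import re

# Trimmed-down copies of the entity definitions above (descriptions omitted).
ENTITIES = [
    {"key": "subject", "entity": "sub", "type": "label"},
    {"key": "run", "entity": "run", "type": "index"},
    {"key": "mtransfer", "entity": "mt", "type": {"enum": ["on", "off"]}},
    {"key": "part", "entity": "part",
     "type": {"enum": ["mag", "phase", "real", "imag"]}},
]

# Assumption for this sketch: labels are alphanumeric, indices are digits only.
TYPE_PATTERNS = {"label": r"[0-9a-zA-Z]+", "index": r"[0-9]+"}


def value_is_valid(definition, value):
    """Check one entity value against the 'type' of its definition."""
    type_ = definition["type"]
    if isinstance(type_, dict):  # inline enum, as used for mt and part
        return value in type_["enum"]
    return re.fullmatch(TYPE_PATTERNS[type_], value) is not None


def parse_entities(filename):
    """Split e.g. 'sub-01_run-2_bold.nii.gz' into ({key: value}, suffix)."""
    by_short_name = {d["entity"]: d for d in ENTITIES}
    stem = filename.split(".", 1)[0]   # drop the extension(s)
    *pairs, suffix = stem.split("_")   # the last chunk is the suffix
    parsed = {}
    for pair in pairs:
        short, sep, value = pair.partition("-")
        definition = by_short_name.get(short)
        if not sep or definition is None:
            raise ValueError(f"unknown entity in {pair!r}")
        if not value_is_valid(definition, value):
            raise ValueError(f"invalid value for {short!r}: {value!r}")
        parsed[definition["key"]] = value
    return parsed, suffix
```

With these definitions, `parse_entities("sub-01_run-2_part-mag_bold.nii.gz")` yields the three entities plus the suffix `bold`, while a value outside the enum (for example `part-abs`) raises a `ValueError`.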

erdalkaraca · 2021-09-19T09:15:54Z

Example metadata field definitions:

- name: AcquisitionDuration
  description: |
    Duration (in seconds) of [...]
  type:
    min: 0
    unit: s

- name: AnatomicalLandmarkCoordinates
  description: |
    Key:value pairs of any [...]
    example: `{"AC": [127,119,149], "PC": [128,93,141],
    "IH": [131,114,206]}`, or `{"NAS": [127,213,139], "LPA": [52,113,96],
    "RPA": [202,113,91]}`
  type:
    patternProperties:
      "^[A-Z]{2,3}$":
        type: array
        items:
          type: number
        minItems: 3
        maxItems: 3

Here, we allow the type of the AnatomicalLandmarkCoordinates field to be defined as a JSON schema, using patternProperties to match any landmark name (2-3 uppercase letters as the key) and a coordinate set of 3 numbers as the value.
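
As a rough illustration of what that embedded JSON schema expresses, here is a hand-written Python check using only the standard library. It is a sketch, not a JSON Schema implementation: the function names and the `allow_other_keys` switch are assumptions for this example. By default, keys that do not match the pattern are left unconstrained, which is how `patternProperties` behaves on its own.

```python
import re

# Same pattern as in the definition above; fullmatch() supplies the anchoring.
LANDMARK_KEY = re.compile(r"[A-Z]{2,3}")


def is_number(x):
    # JSON Schema "number" excludes booleans, but Python's bool subclasses int.
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def validate_landmarks(value, allow_other_keys=True):
    """Return a list of error messages (empty when the value is valid)."""
    if not isinstance(value, dict):
        return ["value must be an object"]
    errors = []
    for key, coords in value.items():
        if not LANDMARK_KEY.fullmatch(key):
            # patternProperties alone does not constrain non-matching keys;
            # rejecting them is an optional, stricter reading.
            if not allow_other_keys:
                errors.append(f"{key!r}: not a 2-3 letter uppercase name")
            continue
        if not isinstance(coords, list) or len(coords) != 3:
            errors.append(f"{key!r}: expected an array of exactly 3 items")
        elif not all(is_number(c) for c in coords):
            errors.append(f"{key!r}: all items must be numbers")
    return errors
```

For instance, `validate_landmarks({"AC": [127, 119, 149], "PC": [128, 93, 141]})` returns an empty list, while a landmark with only two coordinates produces one error.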

Tokazama · 2021-09-25T00:42:29Z

This is something I've found a bit odd about BIDS. Personally, I would've gone with JSON Schema because it can actually be formalized into an already developed system for schemas. There may be some larger vision here that I don't understand though, so I'll go along with whatever working system is developed.

erdalkaraca · 2021-09-25T09:39:04Z

@Tokazama At https://github.com/ANCPLabOldenburg/ancp-bids we have experimented with several "systems for schemas":

XSD, JSON Schema, YAMALE/Yaml

XSD is very verbose and "noisy", meaning it is not convenient for humans to read in a plain text editor, but it is widely used in industry and has matured over decades.
XSD requires good (visual) tool support, which is not freely available. Eclipse XML editors are open source, but the installation/usage seems too complex for future contributors who just want to make simple modifications/proposals.

JSON is much more compact and is widely used in machine-to-machine communication to transfer data objects (for example, by web services and REST endpoints), and it is still reasonably readable for humans. Since JSON was meant for pure data transfer, it has no support for "decorating" specific lines. For example, it lacks a built-in comment syntax, which is why a common workaround is to "hack" in a synthetic property "_comment" that parsers/tools are expected to ignore when processing.
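
A minimal sketch of that "_comment" workaround, assuming the consumer simply drops the property before using the data. The `strip_comments` helper and the sample document are made up for illustration; they are not taken from any BIDS tool.

```python
import json


def strip_comments(node, comment_key="_comment"):
    """Recursively drop every `comment_key` property from parsed JSON."""
    if isinstance(node, dict):
        return {k: strip_comments(v, comment_key)
                for k, v in node.items() if k != comment_key}
    if isinstance(node, list):
        return [strip_comments(item, comment_key) for item in node]
    return node


# Hypothetical document that smuggles comments in as ordinary properties.
raw = """
{
  "_comment": "Entities are listed in file name order.",
  "run": {"_comment": "numeric index", "type": "index"},
  "part": {"type": {"enum": ["mag", "phase"]}}
}
"""
schema = strip_comments(json.loads(raw))
```

The drawback is visible here: every consumer has to know about the convention, because to a JSON parser `_comment` is just another property.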

Yaml can be very compact, yet it still has a comprehensive syntax.
The bad news about Yaml (at the moment) is that there is no standard "system for schemas" to use. To avoid coming up with yet another schema definition format, we experimented with https://github.com/23andMe/Yamale, which is what you see in the above comments.

Because of the way the metadata fields are defined, we need a schema that allows embedding another schema. For example, see the above definition for AnatomicalLandmarkCoordinates, where we have concrete data such as name and description, but we also need to describe (the structure of) the value object. It seems just fine to use a subset of JSON Schema for this.

In summary, nothing forces the choice of one schema language over another, as each can be transformed into a representation of the others, but from our experiments, the mix of Yamale and JSON Schema seems a good fit.


Tokazama · 2021-09-25T10:26:55Z

So is it correct to say that the biggest reason JSON isn't being used is because there's not a good way to add comments?

erdalkaraca · 2021-09-25T10:46:05Z

Well, as also mentioned, Yaml is more compact and has some features that JSON does not provide, such as mixing documents from different namespaces, which is called TAGs in Yaml: https://yaml.org/spec/1.2/spec.html#id2782090
But I am not sure whether there is a use case for that feature in the BIDS spec.

Tokazama · 2021-09-25T11:38:36Z

My understanding of TAGs in YAML (which I admit is very limited) is that they are most useful for telling specific applications parsing the file how to do something, which seems like it would be language-specific and counterproductive to the end goal of a cross-language schema. I'm not trying to say the current direction (using YAML) is bad. It's just that schemas have been around a while and we aren't looking for high-performance I/O on these, so I assume we just want a solid set of tools to build on rather than building up something new.

If you already have it figured out and I'm just eating up valuable development time then feel free to ignore this. It may be that I'm just completely missing the point.

tsalo · 2021-10-14T16:14:07Z

Sorry for the delay in responding. We use YAML as the file type because YAML files (1) are more readable and (2) allow comments, but I don't think we will be using any other features that are specific to YAML, so converting to valid JSON files should be trivial. For that reason, we're still looking into using JSON schema, but having a YAML-based schema definition language would definitely be nice!

@erdalkaraca your YAMALE-based approach looks really good, although we'll need to adjust based on changes from #883 and other recent PRs. I'm still not sure how many of the rules that we need to implement are possible with YAMALE (or JSON schema), but, as you said, adopting something like XSD is just not feasible within the community.

For now, I like the idea of adopting either YAMALE or JSON schema for the object validation (files in schema/objects/) and holding off on the files in schema/rules/, which don't cover enough of the actual rules in the specification for us to commit to any tools yet. At least not until we've made more progress with #620.

@effigies @rwblair do you have any thoughts on the proposed schema syntax?

erdalkaraca · 2021-10-14T17:13:54Z

In BIDS we have the RECOMMENDED type fields, which are covered neither by YAMALE nor by XSD, and I guess not by JSON Schema either. Unfortunately, the author of YAMALE declined to adopt that as an enhancement and instead suggested implementing it in my own fork - it would not have been that much overhead to extend YAMALE, but anyways...
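
To make the three requirement levels concrete, here is a small standalone Python sketch of how a validator might treat them: a missing required field is an error, a missing recommended field is only a warning, and a missing optional field is ignored. The `Field` class and `check_levels` function are hypothetical names for this illustration; this is neither Yamale's API nor the ancpbids implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    required: bool = True
    recommended: bool = False


# Subset of DatasetDescriptionFile from the schema earlier in this thread.
DATASET_DESCRIPTION = [
    Field("Name"),
    Field("BIDSVersion"),
    Field("HEDVersion", required=False, recommended=True),
    Field("DatasetType", required=False, recommended=True),
    Field("License", required=False, recommended=True),
    Field("Authors", required=False),
]


def check_levels(fields, document):
    """Return (errors, warnings): missing REQUIRED and RECOMMENDED field names."""
    errors, warnings = [], []
    for field in fields:
        if field.name in document:
            continue
        if field.required:
            errors.append(field.name)
        elif field.recommended:
            warnings.append(field.name)
        # Purely optional fields are silently skipped.
    return errors, warnings
```

For a document containing only `Name` and `License`, `check_levels` reports `BIDSVersion` as an error and `HEDVersion` and `DatasetType` as warnings.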

The current approach is to mix the features of those schema languages. You can have a look at it at:

https://github.com/ANCPLabOldenburg/ancp-bids/blob/main/ancpbids/data/bids_v1_7_0.yaml

Furthermore, that schema definition is used to generate Python code which allows type safe programming against the BIDS schema. For example, all entities (directly read from the new object definition files) are collected in an Enum class, so you have full IDE support at compile time.
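
The idea can be sketched with the standard library alone. The snippet below is only an illustration, not the actual ancpbids generator: `ENTITY_DEFINITIONS` is a trimmed-down stand-in for the object definition files, and `render_enum_source` is a hypothetical helper showing what emitted source code could look like.

```python
from enum import Enum

# Trimmed-down stand-in for the entity definitions (key -> short name).
ENTITY_DEFINITIONS = [
    {"key": "subject", "entity": "sub"},
    {"key": "run", "entity": "run"},
    {"key": "mtransfer", "entity": "mt"},
    {"key": "part", "entity": "part"},
]

# Functional Enum API: member name = key, member value = short name.
EntityEnum = Enum(
    "EntityEnum", {d["key"]: d["entity"] for d in ENTITY_DEFINITIONS}
)


def render_enum_source(class_name, definitions):
    """Render equivalent Python source, as a generator might write to a file."""
    lines = ["from enum import Enum", "", "", f"class {class_name}(Enum):"]
    lines += [f"    {d['key']} = {d['entity']!r}" for d in definitions]
    return "\n".join(lines) + "\n"
```

Writing the rendered source to a module, rather than building the class at run time, is what lets an IDE offer completion on members such as `EntityEnum.subject`.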

Tokazama · 2021-10-14T17:20:01Z

> Furthermore, that schema definition is used to generate Python code which allows type safe programming against the BIDS schema. For example, all entities (directly read from the new object definition files) are collected in an Enum class, so you have full IDE support at compile time.

But is the ability to do this dependent on YAML? Can't you accomplish the same thing with JSON?

erdalkaraca · 2021-10-14T17:22:56Z

> > Furthermore, that schema definition is used to generate Python code which allows type safe programming against the BIDS schema. For example, all entities (directly read from the new object definition files) are collected in an Enum class, so you have full IDE support at compile time.
>
> But is the ability to do this dependent on YAML? Can't you accomplish the same thing with JSON?

Sure, JSON would also be possible, but as you know, decision making is not always an objective process, i.e. I really like the flexibility and compactness of YAML more than JSON :-)

erdalkaraca · 2021-10-14T17:31:18Z

BTW, you can find the generated code here:

https://github.com/ANCPLabOldenburg/ancp-bids/blob/main/ancpbids/model_v1_7_0.py

I am not yet sure, but I think this approach will also allow future schemas to coexist as separate modules.
Model evolution (a software engineering term for schema changes over time) would become a model-to-model transformation.

Tokazama · 2021-10-14T17:43:02Z

> Sure, JSON would also be possible, but as you know, decision making is not always an objective process, i.e. I really like the flexibility and compactness of YAML more than JSON :-)

Perhaps a way of making that a bit more objective is the cost to switch to JSON. I've assumed that conforming with JSON schemas that are already established would save time in the long run, but if there is no actual benefit then we're just making breaking changes for the sake of opinion.

erdalkaraca · 2021-10-14T17:51:03Z

> Perhaps a way of making that a bit more objective is the cost to switch to JSON. I've assumed that conforming with JSON schemas that are already established would save time in the long run, but if there is no actual benefit then we're just making breaking changes for the sake of opinion.

+1, but the schema is rather for technical people to implement software tools. One idea we had at our lab was to have a graphical (and interactive) representation of the schema, so non-technical people could navigate through that graph instead of the textual BIDS spec. At that level, it does not even matter what format/language is used behind the scenes...

Tokazama · 2021-10-14T18:07:14Z

JSON schemas can be reconstructed as graphs, so that should be relatively straightforward if we go that route.

rwblair · 2021-10-14T20:50:47Z

In the validator, we already use our own made-up keywords in JSON schema files. They have no meaning in the JSON Schema specification, and we interpret them as we see fit, for example "here is a list of TSV fields that should exist if this JSON field is used". (Side note: we could probably do a lot of TSV validation with JSON schemas if, behind the scenes, we converted TSVs to JSON files and designed a schema around what that might look like.)

The plan with the validator, as far as I know, was to regenerate the JSON schema files used for JSON file validation from the YAML files in the specification. As the maintainer of a JavaScript project, JSON and JSON schemas are the hammer with which I prefer to hit every problem, but this YAML-to-JSON transformation shouldn't be problematic as far as I can see.
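
As a rough sketch of that side note, the Python snippet below converts TSV text into JSON-serializable records and then applies a made-up "these columns must exist if this JSON field is used" rule. Everything here is an assumption for illustration: the function names, the mapping of "n/a" cells to `null`, and the shape of the `requires` mapping are not how the validator actually works.

```python
import csv
import io
import json


def tsv_to_records(text):
    """Parse TSV text into a list of {column: value} dicts.

    All values stay strings, except that "n/a" cells become None;
    that mapping is a choice made for this sketch.
    """
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter="\t")
    return [
        {col: (None if cell == "n/a" else cell) for col, cell in row.items()}
        for row in reader
    ]


def missing_columns(records, sidecar, requires):
    """Columns demanded by JSON fields present in `sidecar` but absent in the TSV.

    `requires` maps a JSON field name to the TSV columns it depends on
    (a hypothetical custom keyword, not part of JSON Schema).
    """
    present = set(records[0]) if records else set()
    needed = {
        col
        for field, cols in requires.items()
        if field in sidecar
        for col in cols
    }
    return sorted(needed - present)


tsv_text = "participant_id\tage\nsub-01\t34\nsub-02\tn/a\n"
records = tsv_to_records(tsv_text)
as_json = json.dumps(records)  # now usable by any JSON-based tooling
```

Once the table is a list of plain objects, a schema for "an array of row objects" can describe column names and value types the same way it would for any other JSON document.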

yarikoptic · 2024-02-14T17:29:55Z

The most recent effort for a language to describe schemas is https://linkml.io which is getting adopted by NWB, DANDI and exported to by OpenMINDS. Added to the listing above.

Tokazama · 2024-02-15T16:16:08Z

Linked JSON is a good way to manage mainly schema related stuff. JSON isn't going anywhere and each time a new linked data standard comes out a corollary interface is adopted into JSON. If we are talking about table data and large arrays there is a lot more to consider.

yarikoptic · 2024-02-15T19:55:48Z

What do you mean exactly by "Linked JSON" @Tokazama? json-ld?

Tokazama · 2024-02-15T20:52:08Z

Most JSON schemas are derivatives of JSON-LD. The link you shared probably refers to one such derivative. It's just a JSON file with additional semantics, so you can run something through a JSON reader and then an additional parser for the output of that.

erdalkaraca mentioned this issue Sep 19, 2021

Consolidate schema files into a single file for each term type #877

Closed

tsalo added the schema label Sep 19, 2021

tsalo mentioned this issue Oct 5, 2021

[SCHEMA] Consolidate schema files by term type #883

Merged

tsalo added this to schema-based specification Mar 24, 2022

tsalo moved this to Todo in schema-based specification Mar 24, 2022

tsalo mentioned this issue Mar 30, 2022

[SCHEMA] Define the types present in the schema in a new YAML file #1048

Closed

rwblair mentioned this issue May 26, 2022

[ENH] schema - specify what is valid at the root of a dataset #1108

Open
