Use a coherent schema definition syntax #884

Open
erdalkaraca opened this issue Sep 19, 2021 · 21 comments
Labels
schema Issues related to the YAML schema representation of the specification. Patch version release.

Comments

@erdalkaraca
Collaborator

erdalkaraca commented Sep 19, 2021

Currently, the schema consists of a mix of ad-hoc definitions and syntax.
This makes it hard to implement a tool to generically validate a given BIDS dataset.

In the ancpbids project (see below for link) we experimented with various schema definition languages, such as:

  • XSD (XML Schema Definition): industry standard with mature support in many programming languages
  • YAMALE: a community effort to formalize data structures using YAML format

Since the BIDS community is more comfortable with JSON/YAML formats, I would like to propose using a more formal schema definition language such as YAMALE to map the BIDS specification onto a machine readable schema.

Examples to follow as additional comments...

Links:
ancpbids: https://github.com/ANCPLabOldenburg/ancp-bids
XSD: https://www.w3.org/TR/xmlschema11-1/
Yamale: https://github.com/23andMe/Yamale

edits:

https://linkml.io (added by @yarikoptic)

@erdalkaraca
Collaborator Author

erdalkaraca commented Sep 19, 2021

An example of what a coherent schema might look like using Yamale:

Dataset:
  .extends: Folder
  subjects: list(include('Subject'), required=False)
  dataset_description: include('DatasetDescriptionFile')
  README: include('File', required=False)
  CHANGES: include('File', required=False)
  LICENSE: include('File', required=False)
  genetic_info: include('JsonFile', required=False)
  samples: include('JsonFile', required=False)
  participants_tsv: include('ParticipantsTsvFile', required=False)
  participants_json: include('TsvSidecarFile', required=False)
  code: include('Folder', required=False)
  derivatives: include('DerivativeFolder', required=False)
  sourcedata: include('Folder', required=False)
  stimuli: include('Folder', required=False)

---
JSONSchemaType:
  map()

MetadataFieldDefinition:
  name: str()
  description: str()
  type: include('JSONSchemaType')

SuffixDefinition:
  name: str()
  description: str()
  type: include('JSONSchemaType')

EntityDefinition:
  key: str()
  name: str()
  entity: str()
  description: str()
  type: include('JSONSchemaType')

DatasetDescriptionFile:
  .extends: JsonFile
  Name: str()
  BIDSVersion: str()
  HEDVersion: str(required=False, recommended=True)
  DatasetType: enum('raw', 'derivative', required=False, recommended=True)
  License: str(required=False, recommended=True)
  Acknowledgements: str(required=False)
  HowToAcknowledge: str(required=False)
  DatasetDOI: str(required=False)
  Authors: list(str(), required=False)
  Funding: list(str(), required=False)
  EthicsApprovals: list(str(), required=False)
  ReferencesAndLinks: list(str(), required=False)

DerivativeDatasetDescriptionFile:
  .extends: DatasetDescriptionFile
  GeneratedBy: list(include('GeneratedBy'))
  SourceDatasets: list(include('SourceDatasets'), required=False, recommended=True)

GeneratedBy:
  Name: str()
  Version: str(required=False, recommended=True)
  Description: str(required=False)
  CodeURL: str(required=False)
  Container: list(include('GeneratedByContainer'), required=False)

Artifact:
  .doc: >-
    An artifact is a file whose name conforms to the BIDS file naming convention.
  .extends: File
  suffix: str()
  entities: list(include('EntityRef'))

@erdalkaraca
Collaborator Author

Example definitions of entities:

- key: subject
  name: Subject
  entity: sub
  description: |
    A person or animal participating in the study.
  type: label
- key: run
  name: Run
  entity: run
  description: |
    If several scans with [...]
  type: index
- key: mtransfer
  name: Magnetization Transfer
  entity: mt
  description: |
    If files belonging [...]
  type:
    enum:
      - "on"
      - "off"
- key: part
  name: Part
  entity: part
  description: |
    This entity is used to [...]
  type:
    enum:
      - mag
      - phase
      - real
      - imag
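
A minimal sketch of how entity definitions like the ones above could drive a check of `<entity>-<value>` filename tokens. This is not code from ancpbids or the BIDS validator. The dictionaries mirror the YAML with descriptions omitted, and the mapping of `label` to alphanumeric strings and `index` to digits is an assumption made for illustration, as are the names `value_pattern` and `is_valid_entity`.

```python
import re

# Entity definitions mirroring the YAML above (descriptions omitted).
ENTITIES = [
    {"key": "subject", "entity": "sub", "type": "label"},
    {"key": "run", "entity": "run", "type": "index"},
    {"key": "mtransfer", "entity": "mt", "type": {"enum": ["on", "off"]}},
    {"key": "part", "entity": "part",
     "type": {"enum": ["mag", "phase", "real", "imag"]}},
]


def value_pattern(type_spec):
    """Return a regex fragment for an entity value (assumed mapping)."""
    if type_spec == "label":
        return r"[0-9a-zA-Z]+"  # assumption: labels are alphanumeric
    if type_spec == "index":
        return r"[0-9]+"  # assumption: indices are non-negative integers
    if isinstance(type_spec, dict) and "enum" in type_spec:
        return "|".join(re.escape(v) for v in type_spec["enum"])
    raise ValueError(f"unsupported type: {type_spec!r}")


def is_valid_entity(token, entities=ENTITIES):
    """Check a single '<entity>-<value>' token such as 'sub-01'."""
    for definition in entities:
        prefix = re.escape(definition["entity"])
        values = value_pattern(definition["type"])
        if re.fullmatch(f"{prefix}-(?:{values})", token):
            return True
    return False
```

With these definitions, `is_valid_entity("part-phase")` passes while `is_valid_entity("part-foo")` does not, because `foo` is not in the enum for `part`.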

@erdalkaraca
Collaborator Author

erdalkaraca commented Sep 19, 2021

Example metadata field definitions:

- name: AcquisitionDuration
  description: |
    Duration (in seconds) of [...]
  type:
    min: 0
    unit: s

- name: AnatomicalLandmarkCoordinates
  description: |
    Key:value pairs of any [...]
    example: `{"AC": [127,119,149], "PC": [128,93,141],
    "IH": [131,114,206]}`, or `{"NAS": [127,213,139], "LPA": [52,113,96],
    "RPA": [202,113,91]}`
  type:
    patternProperties:
      "^[A-Z]{2,3}$":
        type: array
        items:
          type: number
        minItems: 3
        maxItems: 3

Here, we allow the type of the AnatomicalLandmarkCoordinates field to be defined as a JSON Schema, using patternProperties to match any landmark name: the key is 2 to 3 uppercase letters, and the value is a coordinate set of 3 numbers.
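
To make the semantics of that embedded schema concrete, here is a small hand-written check in plain Python. It is a sketch only: it handles just the keywords used above and is not a general JSON Schema validator. The function names are hypothetical, and rejecting non-object instances is an extra assumption that the schema fragment itself does not state.

```python
import re

# The embedded schema from the AnatomicalLandmarkCoordinates example.
SCHEMA = {
    "patternProperties": {
        "^[A-Z]{2,3}$": {
            "type": "array",
            "items": {"type": "number"},
            "minItems": 3,
            "maxItems": 3,
        }
    }
}


def is_number(value):
    # JSON Schema "number" covers ints and floats; Python bools are excluded.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_landmarks(instance, schema=SCHEMA):
    """Return a list of error strings; an empty list means no violations."""
    if not isinstance(instance, dict):
        return ["instance is not an object"]  # assumption, see lead-in
    errors = []
    for pattern, rule in schema["patternProperties"].items():
        for key, value in instance.items():
            if not re.search(pattern, key):
                continue  # patternProperties does not constrain other keys
            if not isinstance(value, list):
                errors.append(f"{key}: expected an array")
            elif not rule["minItems"] <= len(value) <= rule["maxItems"]:
                errors.append(f"{key}: wrong number of items ({len(value)})")
            elif not all(is_number(v) for v in value):
                errors.append(f"{key}: all items must be numbers")
    return errors
```

Note that a key such as `nasion` would pass unchecked here, because patternProperties alone only constrains keys that match the pattern; rejecting other keys would need something like `additionalProperties: false`.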

@tsalo added the schema label Sep 19, 2021
@Tokazama
Member

This is something I've found a bit odd about BIDS. Personally, I would've gone with JSON Schema because it can actually be formalized into an already developed system for schemas. There may be some larger vision here that I don't understand though, so I'll go along with whatever working system is developed.

@erdalkaraca
Collaborator Author

@Tokazama At https://github.com/ANCPLabOldenburg/ancp-bids we have experimented with several "systems for schemas":

  • XSD, JSON Schema, YAMALE/Yaml

XSD is very verbose and "noisy", meaning it is not convenient for humans to read in a plain text editor, but it is widely used in industry and has matured over decades.
XSD requires good (visual) tool support, which is not freely available. The Eclipse XML editors are open source, but their installation and usage seem too complex for future contributors who just want to make simple modifications or proposals.

JSON is much more compact and is used in machine-to-machine communication to transfer data objects (for example, web services and REST endpoints use it a lot), and it is still fairly readable for humans. Since JSON was meant for pure data transfer, it has no support for "decorating" specific lines. For example, it lacks a built-in comment syntax, which is why people nowadays "hack" in a synthetic property "_comment" that parsers and tools are expected to ignore when processing.
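
As an illustration of the "_comment" workaround, the sketch below strips that synthetic key while parsing, using the `object_hook` parameter of Python's standard `json.loads`. The key name follows the comment above; the sample document and the function name `drop_comments` are made up for the example.

```python
import json


def drop_comments(obj):
    """object_hook that removes the synthetic "_comment" key from an object."""
    obj.pop("_comment", None)
    return obj


TEXT = """
{
  "_comment": "Top-level note for human readers",
  "Name": "Example dataset",
  "GeneratedBy": [{"_comment": "nested note", "Name": "pipeline"}]
}
"""

# object_hook is called for every decoded JSON object, including nested ones.
data = json.loads(TEXT, object_hook=drop_comments)
```

After loading, `data` contains only the real fields, so downstream tools never see the comment entries.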

YAML can be very compact yet still has an extensive syntax.
The bad news about YAML (at the moment) is that there is no standard "system for schemas" to use. To avoid coming up with yet another schema definition format, we experimented with https://github.com/23andMe/Yamale, which is what you see in the above comments.

Given the way the metadata fields are defined, we need a schema that allows embedding another schema. For example, see the above definition for AnatomicalLandmarkCoordinates, where we have concrete data such as name and description, but we also need to describe (the structure of) the value object. It seems just fine to use a subset of JSON Schema for this.

In summary, no limiting factor forces the choice of one schema language over another, since each can be transformed into a representation of the others, but from our experiments, the mix of Yamale and JSON Schema seems a good fit.


@Tokazama
Member

So is it correct to say that the biggest reason JSON isn't being used is because there's not a good way to add comments?

@erdalkaraca
Collaborator Author

Well, as also mentioned, YAML is more compact and has some features that JSON does not provide, such as mixing documents from different namespaces, which YAML calls tags: https://yaml.org/spec/1.2/spec.html#id2782090
But I am not sure whether there is a use case for that feature in the BIDS spec.

@Tokazama
Member

My understanding of tags in YAML (which I admit is very limited) is that they are most useful for telling specific applications parsing the file how to do something, which seems like it would be language-specific and counterproductive to the end goal of a cross-language schema. I'm not trying to say the current direction (using YAML) is bad. It's just that schemas have been around a while and we aren't looking for high-performance I/O on these, so I assume we just want a solid set of tools to build on rather than building up something new.

If you already have it figured out and I'm just eating up valuable development time then feel free to ignore this. It may be that I'm just completely missing the point.

@tsalo
Member

tsalo commented Oct 14, 2021

Sorry for the delay in responding. I don't think we will be using any YAML-specific features. We use YAML as the file type because YAML files are (1) more readable and (2) allow comments, so converting to valid JSON files should be trivial. For that reason, we're still looking into using JSON Schema, but having a YAML-based schema definition language would definitely be nice!

@erdalkaraca your YAMALE-based approach looks really good, although we'll need to adjust based on changes from #883 and other recent PRs. I'm still not sure how many of the rules that we need to implement are possible with YAMALE (or JSON schema), but, as you said, adopting something like XSD is just not feasible within the community.

For now, I like the idea of adopting either YAMALE or JSON schema for the object validation (files in schema/objects/) and holding off on the files in schema/rules/, which don't cover enough of the actual rules in the specification for us to commit to any tools yet. At least not until we've made more progress with #620.

@effigies @rwblair do you have any thoughts on the proposed schema syntax?

@erdalkaraca
Collaborator Author

In BIDS we have RECOMMENDED fields, which are covered by neither YAMALE nor XSD and, I guess, not by JSON Schema either. Unfortunately, the author of YAMALE declined to adopt that as an enhancement and instead suggested implementing it in my own fork. It would not have been that much overhead to extend YAMALE, but anyways...
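
To illustrate what a three-level check could look like outside of any schema language, here is a hedged sketch in plain Python. The levels are read off the `required=False, recommended=True` annotations in the DatasetDescriptionFile example earlier in this thread; the function `check_levels` and the error/warning split are assumptions for illustration, not how ancpbids or YAMALE work.

```python
# Requirement levels for a few DatasetDescriptionFile fields, taken from the
# Yamale-style example earlier in this thread.
LEVELS = {
    "Name": "required",
    "BIDSVersion": "required",
    "HEDVersion": "recommended",
    "DatasetType": "recommended",
    "License": "recommended",
    "Authors": "optional",
}


def check_levels(document, levels=LEVELS):
    """Return (errors, warnings): missing required and recommended fields."""
    errors = [name for name, level in levels.items()
              if level == "required" and name not in document]
    warnings = [name for name, level in levels.items()
                if level == "recommended" and name not in document]
    return errors, warnings


errors, warnings = check_levels(
    {"Name": "Example", "BIDSVersion": "1.7.0", "License": "CC0"}
)
```

Here a missing RECOMMENDED field produces a warning instead of a failure, which is the distinction a plain required/optional schema cannot express.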

The current approach is to mix the features of those schema languages. You can have a look at it at:

https://github.com/ANCPLabOldenburg/ancp-bids/blob/main/ancpbids/data/bids_v1_7_0.yaml

Furthermore, that schema definition is used to generate Python code, which allows type-safe programming against the BIDS schema. For example, all entities (directly read from the new object definition files) are collected in an Enum class, so you have full IDE support at compile time.
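
A rough sketch of the idea, not the actual generated ancpbids code: the entity definitions shown earlier in this thread can be turned into an Enum with the functional API of Python's standard `enum` module. The names `EntityEnum` and `entity_token` are hypothetical.

```python
from enum import Enum

# Keys and filename prefixes from the entity definitions earlier in the thread.
ENTITY_DEFINITIONS = [
    {"key": "subject", "entity": "sub"},
    {"key": "run", "entity": "run"},
    {"key": "mtransfer", "entity": "mt"},
    {"key": "part", "entity": "part"},
]

# Functional Enum API: member name = schema key, member value = prefix.
EntityEnum = Enum(
    "EntityEnum", {d["key"]: d["entity"] for d in ENTITY_DEFINITIONS}
)


def entity_token(entity, value):
    """Build an '<entity>-<value>' filename token from an Enum member."""
    return f"{entity.value}-{value}"
```

Code written against `EntityEnum.subject` instead of the bare string `"sub"` gets autocompletion, and a misspelled member name fails immediately with an AttributeError.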

@Tokazama
Member

> Furthermore, that schema definition is used to generate Python code, which allows type-safe programming against the BIDS schema. For example, all entities (directly read from the new object definition files) are collected in an Enum class, so you have full IDE support at compile time.

But is the ability to do this dependent on YAML? Can't you accomplish the same thing with JSON?

@erdalkaraca
Collaborator Author

> > Furthermore, that schema definition is used to generate Python code, which allows type-safe programming against the BIDS schema. For example, all entities (directly read from the new object definition files) are collected in an Enum class, so you have full IDE support at compile time.
>
> But is the ability to do this dependent on YAML? Can't you accomplish the same thing with JSON?

Sure, JSON would also be possible, but as you know, decision making is not always an objective process, i.e. I really like the flexibility and compactness of YAML more than JSON :-)

@erdalkaraca
Collaborator Author

BTW, you can find the generated code here:

https://github.com/ANCPLabOldenburg/ancp-bids/blob/main/ancpbids/model_v1_7_0.py

I am not yet sure, but I think this approach will also allow future schemas to coexist as separate modules.
Model evolution (a software engineering term for schema changes over time) would become a model-to-model transformation.

@Tokazama
Member

> Sure, JSON would also be possible, but as you know, decision making is not always an objective process, i.e. I really like the flexibility and compactness of YAML more than JSON :-)

Perhaps a way of making that a bit more objective is to consider the cost of switching to JSON. I've assumed that conforming with already established JSON schemas would save time in the long run, but if there is no actual benefit then we're just making breaking changes for the sake of opinion.

@erdalkaraca
Collaborator Author

> Perhaps a way of making that a bit more objective is to consider the cost of switching to JSON. I've assumed that conforming with already established JSON schemas would save time in the long run, but if there is no actual benefit then we're just making breaking changes for the sake of opinion.

+1, but the schema is mainly intended for technical people who implement software tools. One idea we had at our lab was to have a graphical (and interactive) representation of the schema, so non-technical people could navigate through that graph instead of the textual BIDS spec. At that level, it does not even matter what format/language is used behind the scenes...

@Tokazama
Member

JSON schemas can be reconstructed as graphs, so that should be relatively straightforward if we go that route.
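
As a rough illustration of that point, the sketch below walks the nested `properties` of a JSON Schema dictionary and yields parent/child edges that a graph viewer could consume. The sample schema is a made-up fragment loosely based on the dataset description fields discussed earlier, and `schema_edges` is a hypothetical helper that only follows `properties` and array `items`.

```python
# Hypothetical schema fragment, loosely based on fields shown in this thread.
SCHEMA = {
    "type": "object",
    "properties": {
        "Name": {"type": "string"},
        "GeneratedBy": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "Name": {"type": "string"},
                    "Version": {"type": "string"},
                },
            },
        },
    },
}


def schema_edges(schema, parent="root"):
    """Yield (parent, child) edges; node ids are slash-separated paths."""
    for name, sub in schema.get("properties", {}).items():
        child = f"{parent}/{name}"
        yield (parent, child)
        # For arrays, descend into the schema of the items.
        target = sub.get("items", sub) if sub.get("type") == "array" else sub
        yield from schema_edges(target, child)


edges = list(schema_edges(SCHEMA))
```

Path-based node ids keep the two `Name` fields apart, which matters once the edges are rendered as a graph.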

@rwblair
Member

rwblair commented Oct 14, 2021

In the validator we already use our own made-up keywords in JSON Schema files. These keywords have no meaning in the JSON Schema specification, and we interpret them as we see fit, for example "here is a list of TSV fields that should exist if this JSON field is used". (Side note: we could probably do a lot of TSV validation with JSON schemas if, behind the scenes, we converted TSVs to JSON files and designed a schema around what that might look like.)
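
A minimal sketch of the side note, under the assumption that a TSV is converted to a list of row objects before checking. It uses only Python's standard `csv` and `io` modules and is not the validator's implementation. The column names and the helper names `tsv_to_records` and `missing_columns` are illustrative.

```python
import csv
import io


def tsv_to_records(text):
    """Convert TSV text to a list of dicts (one JSON-like object per row)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


def missing_columns(records, required):
    """Return the required column names that the records do not provide."""
    if not records:
        return list(required)
    return [column for column in required if column not in records[0]]


# Illustrative table; the column names are assumptions for this example.
TSV = "participant_id\tage\nsub-01\t34\nsub-02\t27\n"
records = tsv_to_records(TSV)
```

Once the rows are plain objects like this, they could be passed to a JSON Schema validator like any other JSON data. All values arrive as strings, so numeric checks would need a conversion step first.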

The plan with the validator, as far as I know, was to regenerate the JSON Schema files used for JSON file validation from the YAML files in the specification. As the maintainer of a JavaScript project, JSON and JSON schemas are the hammer with which I prefer to hit every problem, but this YAML-to-JSON transformation shouldn't be problematic as far as I can see.

@yarikoptic
Collaborator

The most recent effort toward a language for describing schemas is https://linkml.io, which is being adopted by NWB and DANDI, and which OpenMINDS exports to. Added to the listing above.

@Tokazama
Member

Linked JSON is a good way to manage mainly schema related stuff. JSON isn't going anywhere and each time a new linked data standard comes out a corollary interface is adopted into JSON. If we are talking about table data and large arrays there is a lot more to consider.

@yarikoptic
Collaborator

What do you mean exactly by "Linked JSON" @Tokazama? json-ld?

@Tokazama
Member

Tokazama commented Feb 15, 2024

Most JSON schemas are derivatives of JSON-LD. The link you shared probably refers to one such derivative. It's just a JSON file with additional semantics, so you can run something through a JSON reader and then through an additional parser for the output of that.

5 participants