Use a coherent schema definition syntax #884
An example of what a coherent schema might look like using Yamale:

```yaml
Dataset:
  .extends: Folder
  subjects: list(include('Subject'), required=False)
  dataset_description: include('DatasetDescriptionFile')
  README: include('File', required=False)
  CHANGES: include('File', required=False)
  LICENSE: include('File', required=False)
  genetic_info: include('JsonFile', required=False)
  samples: include('JsonFile', required=False)
  participants_tsv: include('ParticipantsTsvFile', required=False)
  participants_json: include('TsvSidecarFile', required=False)
  code: include('Folder', required=False)
  derivatives: include('DerivativeFolder', required=False)
  sourcedata: include('Folder', required=False)
  stimuli: include('Folder', required=False)
---
JSONSchemaType: map()
MetadataFieldDefinition:
  name: str()
  description: str()
  type: include('JSONSchemaType')
SuffixDefinition:
  name: str()
  description: str()
  type: include('JSONSchemaType')
EntityDefinition:
  key: str()
  name: str()
  entity: str()
  description: str()
  type: include('JSONSchemaType')
DatasetDescriptionFile:
  .extends: JsonFile
  Name: str()
  BIDSVersion: str()
  HEDVersion: str(required=False, recommended=True)
  DatasetType: enum('raw', 'derivative', required=False, recommended=True)
  License: str(required=False, recommended=True)
  Acknowledgements: str(required=False)
  HowToAcknowledge: str(required=False)
  DatasetDOI: str(required=False)
  Authors: list(str(), required=False)
  Funding: list(str(), required=False)
  EthicsApprovals: list(str(), required=False)
  ReferencesAndLinks: list(str(), required=False)
DerivativeDatasetDescriptionFile:
  .extends: DatasetDescriptionFile
  GeneratedBy: list(include('GeneratedBy'))
  SourceDatasets: list(include('SourceDatasets'), required=False, recommended=True)
GeneratedBy:
  Name: str()
  Version: str(required=False, recommended=True)
  Description: str(required=False)
  CodeURL: str(required=False)
  Container: list(include('GeneratedByContainer'), required=False)
Artifact:
  .doc: >-
    An artifact is a file whose name conforms to the BIDS file naming convention.
  .extends: File
  suffix: str()
  entities: list(include('EntityRef'))
```
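As a rough, hand-rolled sketch of what the `required=False` semantics in the `Dataset` rule above would mean for a validator (illustration only, standard library only, not the Yamale library; key names are taken from the example schema):

```python
# Minimal illustration of the required/optional key semantics from the
# example Yamale "Dataset" rule above. A real validator would also check
# the value types via the include(...) rules.
REQUIRED = {"dataset_description"}
OPTIONAL = {"subjects", "README", "CHANGES", "LICENSE", "genetic_info",
            "samples", "participants_tsv", "participants_json",
            "code", "derivatives", "sourcedata", "stimuli"}

def check_dataset(dataset: dict) -> list:
    """Return a list of validation errors for a dataset-level mapping."""
    errors = [f"missing required key: {k}" for k in REQUIRED - dataset.keys()]
    errors += [f"unknown key: {k}"
               for k in dataset.keys() - REQUIRED - OPTIONAL]
    return errors

assert check_dataset({"dataset_description": {}, "README": "..."}) == []
assert check_dataset({}) == ["missing required key: dataset_description"]
```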
Example definitions of entities:

```yaml
- key: subject
  name: Subject
  entity: sub
  description: |
    A person or animal participating in the study.
  type: label
- key: run
  name: Run
  entity: run
  description: |
    If several scans with [...]
  type: index
- key: mtransfer
  name: Magnetization Transfer
  entity: mt
  description: |
    If files belonging [...]
  type:
    enum:
      - "on"
      - "off"
- key: part
  name: Part
  entity: part
  description: |
    This entity is used to [...]
  type:
    enum:
      - mag
      - phase
      - real
      - imag
```
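The entity short names above (`sub`, `run`, `mt`, ...) are the key halves of the key-value pairs in BIDS filenames. As a rough illustration of that convention (the suffix here is hypothetical and not part of the definitions above):

```python
# Composing a BIDS-style filename from entity key-value pairs, using the
# entity short names from the example definitions above. The "MTR" suffix
# is a made-up placeholder for illustration.
entities = {"sub": "01", "run": "2", "mt": "on"}
suffix = "MTR"

filename = "_".join(f"{key}-{value}" for key, value in entities.items())
filename += f"_{suffix}.nii.gz"

assert filename == "sub-01_run-2_mt-on_MTR.nii.gz"
```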
Example metadata field definitions:

```yaml
- name: AcquisitionDuration
  description: |
    Duration (in seconds) of [...]
  type:
    min: 0
    unit: s
- name: AnatomicalLandmarkCoordinates
  description: |
    Key:value pairs of any [...]
  example: >-
    `{"AC": [127,119,149], "PC": [128,93,141], "IH": [131,114,206]}`, or
    `{"NAS": [127,213,139], "LPA": [52,113,96], "RPA": [202,113,91]}`
  type:
    patternProperties:
      "^[A-Z]{2,3}$":
        type: array
        items:
          type: number
        minItems: 3
        maxItems: 3
```

Here, we allow the type of the AnatomicalLandmarkCoordinates field to be defined as a JSON schema, using patternProperties to match any landmark name (2 to 3 uppercase letters as the key) with a coordinate set of 3 numbers as the value.
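The patternProperties constraint above can be checked even without a JSON Schema library; a minimal stdlib sketch of the same rule:

```python
import re

def valid_landmarks(obj) -> bool:
    """Check the AnatomicalLandmarkCoordinates shape: keys of 2-3 uppercase
    letters, values being arrays of exactly 3 numbers."""
    if not isinstance(obj, dict):
        return False
    for key, coords in obj.items():
        if not re.fullmatch(r"[A-Z]{2,3}", key):
            return False
        if not (isinstance(coords, list) and len(coords) == 3
                and all(isinstance(c, (int, float)) for c in coords)):
            return False
    return True

assert valid_landmarks({"AC": [127, 119, 149], "PC": [128, 93, 141]})
assert not valid_landmarks({"toolong": [1, 2, 3]})   # key fails the pattern
assert not valid_landmarks({"AC": [127, 119]})       # too few coordinates
```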
This is something I've found a bit odd about BIDS. Personally, I would have gone with JSON Schema because it can be formalized within an already-developed system for schemas. There may be some larger vision here that I don't understand, though, so I'll go along with whatever working system is developed.
@Tokazama At https://github.com/ANCPLabOldenburg/ancp-bids we have experimented with several "systems for schemas":
XSD is very verbose and "noisy", meaning it is not very convenient for humans to read in a plain text editor, but it is used heavily in industry and has matured over decades.

JSON is much more compact and widely used in machine-to-machine communication to transfer data objects (for example, web services and REST endpoints use it a lot), and it is still reasonably readable for humans. Since JSON was meant for pure data transfer, it has no support for "decorating" specific lines; for example, it lacks a built-in comment syntax, which is why nowadays people "hack" in a synthetic property such as "_comment" that parsers/tools are expected to ignore when processing.

YAML can be very compact while still having an expressive syntax.

The way the metadata fields are defined, we need a schema language that allows embedding another schema. For example, see the above definition of AnatomicalLandmarkCoordinates, where we have concrete data such as name and description but also need to describe the structure of the value object. It seems fine to use a subset of JSON Schema for this.

In summary, there is no limiting factor that forces one schema language over another, as each can be transformed into a representation of the others; but from our experiments, the mix of Yamale and JSON Schema seems a good fit.
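The "_comment" workaround mentioned above can be sketched in a few lines (stdlib only; the key name "_comment" and the example payload are just illustrative conventions):

```python
import json

# JSON has no comment syntax, so a synthetic "_comment" property is sometimes
# embedded; consumers strip any such keys before processing the real data.
raw = '{"_comment": "ignored by tooling", "Name": "demo", "BIDSVersion": "1.7.0"}'

data = {k: v for k, v in json.loads(raw).items() if not k.startswith("_")}

assert data == {"Name": "demo", "BIDSVersion": "1.7.0"}
```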
So is it correct to say that the biggest reason JSON isn't being used is that there's no good way to add comments?
Well, as also mentioned, YAML is more compact and has some features that JSON does not provide, such as mixing documents from different namespaces, which YAML calls tags: https://yaml.org/spec/1.2/spec.html#id2782090
My understanding of tags in YAML (which I admit is very limited) is that they are mostly useful for telling a specific application parsing the file how to handle something, which seems language-specific and counterproductive to the end goal of a cross-language schema. I'm not trying to say the current direction (using YAML) is bad. It's just that schemas have been around a while, and we aren't looking for high-performance I/O here, so I assume we want a solid set of tools to build on rather than building up something new. If you already have it figured out and I'm just eating up valuable development time, feel free to ignore this. It may be that I'm completely missing the point.
Sorry for the delay in responding. I don't think we will be using any features that are specific to YAML beyond the file type, since YAML files are (1) more readable and (2) allow comments, so converting to valid JSON files should be trivial. For that reason, we're still looking into using JSON Schema, but having a YAML-based schema definition language would definitely be nice!

@erdalkaraca your YAMALE-based approach looks really good, although we'll need to adjust it based on changes from #883 and other recent PRs. I'm still not sure how many of the rules we need to implement are possible with YAMALE (or JSON Schema), but, as you said, adopting something like XSD is just not feasible within the community. For now, I like the idea of adopting either YAMALE or JSON Schema for the object validation (files in

@effigies @rwblair do you have any thoughts on the proposed schema syntax?
In BIDS we have the RECOMMENDED-type fields, which are covered neither by YAMALE nor XSD, and I guess also not by JSON Schema. Unfortunately, the author of YAMALE refused to adopt that as an enhancement and instead suggested implementing it in my own fork; it would not have been much overhead to extend YAMALE, but anyway... The current approach is to mix the features of those schema languages. You can have a look at it here: https://github.com/ANCPLabOldenburg/ancp-bids/blob/main/ancpbids/data/bids_v1_7_0.yaml

Furthermore, that schema definition is used to generate Python code, which allows type-safe programming against the BIDS schema. For example, all entities (read directly from the new object definition files) are collected in an Enum class, so you have full IDE support at compile time.
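The Enum-generation step described above might look roughly like this (a sketch under assumed inputs, not the actual ancp-bids code generator; the entity definitions are a subset of the earlier example):

```python
from enum import Enum

# Entity definitions as they might be read from the schema file.
entity_defs = [
    {"key": "subject", "entity": "sub"},
    {"key": "run", "entity": "run"},
    {"key": "mtransfer", "entity": "mt"},
]

# Generate an Enum from the definitions, so tools get IDE completion and
# type-safe entity names instead of bare strings.
EntityKey = Enum("EntityKey", {d["key"].upper(): d["entity"] for d in entity_defs})

assert EntityKey.SUBJECT.value == "sub"
assert EntityKey.MTRANSFER.value == "mt"
```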
But is the ability to do this dependent on YAML? Can't you accomplish the same thing with JSON?
Sure, JSON would also be possible, but as you know, decision making is not always an objective process; i.e., I really like the flexibility and compactness of YAML more than JSON :-)
BTW, you can find the generated code here: https://github.com/ANCPLabOldenburg/ancp-bids/blob/main/ancpbids/model_v1_7_0.py

I am not yet sure, but I think this approach will also allow future schemas to co-exist next to each other as separate modules.
Perhaps a way of making that a bit more objective is to look at the cost of switching to JSON. I've assumed that conforming to already-established JSON schemas would save time in the long run, but if there is no actual benefit, then we're just making breaking changes for the sake of opinion.
+1, but the schema is rather for technical people implementing software tools. One idea we had at our lab was to have a graphical (and interactive) representation of the schema, so non-technical people could navigate through that graph instead of the textual BIDS spec. At that level, it does not even matter what format/language is used behind the scenes...
JSON schemas can be reconstructed as graphs, so that should be relatively straightforward if we go that route.
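As a small sketch of that reconstruction (assumed, simplified input: a nested mapping standing in for a real JSON schema; `edges` and the node names are illustrative):

```python
# Flatten a nested, JSON-schema-like mapping into (parent, child) graph edges
# that a graph/visualization library could consume.
schema = {"Dataset": {"dataset_description": {"Name": {}, "BIDSVersion": {}}}}

def edges(node: dict, name: str = "root"):
    """Yield (parent, child) pairs for every nesting level."""
    for child, sub in node.items():
        yield (name, child)
        yield from edges(sub, child)

graph = set(edges(schema))
assert ("root", "Dataset") in graph
assert ("Dataset", "dataset_description") in graph
assert ("dataset_description", "BIDSVersion") in graph
```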
In the validator we already use our own made-up keywords in JSON schema files that have no meaning in the JSON Schema specification, and we interpret them how we see fit: things like "here's a list of TSV fields that should exist if this JSON field is used". (Side note: we could probably do a lot of TSV validation with JSON schemas if, behind the scenes, we converted TSVs to JSON files and designed a schema around what that might look like.)

The plan with the validator, as far as I know, was to regenerate the JSON schema files used for JSON file validation from the YAML files in the specification. As the maintainer of a JavaScript project, JSON and JSON schemas are the hammer with which I prefer to hit every problem, but this YAML-to-JSON transformation shouldn't be problematic as far as I can see.
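The TSV-to-JSON idea in the side note above can be sketched with the standard library (the column names and values here are illustrative, not from a real participants.tsv):

```python
import csv
import io
import json

# Convert a TSV table into a list of JSON objects (one per row), which could
# then be validated against a JSON schema designed for that shape.
tsv_text = "participant_id\tage\nsub-01\t34\nsub-02\t28\n"

rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
as_json = json.dumps(rows)

assert rows[0] == {"participant_id": "sub-01", "age": "34"}
assert json.loads(as_json)[1]["participant_id"] == "sub-02"
```

Note that csv leaves every cell as a string; a real converter would also coerce types (e.g. age to a number) before schema validation.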
The most recent effort for a language to describe schemas is https://linkml.io, which is being adopted by NWB and DANDI and is exported to by OpenMINDS. Added to the listing above.
Linked JSON is a good way to manage mainly schema-related stuff. JSON isn't going anywhere, and each time a new linked-data standard comes out, a corresponding interface is adopted into JSON. If we are talking about table data and large arrays, there is a lot more to consider.
What do you mean exactly by "Linked JSON", @Tokazama? json-ld?
Most JSON schemas are derivatives of JSON-LD. The link you shared probably refers to one such derivative. It's just a JSON file with additional semantics, so you can run something through a JSON reader and then an additional parser on the output of that.
Currently, the schema consists of a mix of ad-hoc definitions and syntax.
This makes it hard to implement a tool to generically validate a given BIDS dataset.
In the ancpbids project (see link below) we experimented with various schema definition languages, such as XSD and Yamale (links below).
Since the BIDS community is more comfortable with JSON/YAML formats, I would like to propose using a more formal schema definition language such as YAMALE to map the BIDS specification onto a machine-readable schema.
Examples to follow as additional comments...
Links:
ancpbids: https://github.com/ANCPLabOldenburg/ancp-bids
XSD: https://www.w3.org/TR/xmlschema11-1/
Yamale: https://github.com/23andMe/Yamale
edits:
https://linkml.io (added by @yarikoptic)