How to structure a data dump as a DataFeed? #579

AlasdairGray · 2022-05-24T14:27:17Z

AlasdairGray
May 24, 2022
Maintainer

The Schema.org community have created a proposal for exchanging markup data as a DataFeed. This feed can be made up of one or more files, which should be stored at a well-known location.

What the proposal does not specify is what should be in the file(s).

AlasdairGray · 2022-05-24T16:20:25Z

AlasdairGray
May 24, 2022
Maintainer Author

@ivanmicetic for your IDP resources, I would expect that the JSON-LD file would contain an array of entries that correspond to your pages. This may look something like the following (I've taken a snippet of two of your entries – note that I've changed the dataset to an object rather than a string):

[
  {
    "@context": "https://schema.org",
    "includedInDataset": {
      "@id": "https://disprot.org/#2021-08",
      "@type": "Dataset",
      "name": "DisProt (August 2021)"
    },
    "@type": "Protein",
    "@id": "https://disprot.org/DP00004",
    "http://purl.org/dc/terms/conformsTo": {
      "@id": "https://bioschemas.org/profiles/Protein/0.11-RELEASE",
      "@type": "CreativeWork"
    },
    "identifier": "https://identifiers.org/disprot:DP00004",
    "sameAs": "http://purl.uniprot.org/uniprot/P49913",
    "name": "Cathelicidin antimicrobial peptide"
  },
  {
    "@context": "https://schema.org",
    "includedInDataset": {
      "@id": "https://disprot.org/#2021-08",
      "@type": "Dataset",
      "name": "DisProt (August 2021)"
    },
    "@type": "Protein",
    "@id": "https://disprot.org/DP00072",
    "http://purl.org/dc/terms/conformsTo": {
      "@id": "https://bioschemas.org/profiles/Protein/0.11-RELEASE",
      "@type": "CreativeWork"
    },
    "identifier": "https://identifiers.org/disprot:DP00072",
    "sameAs": "http://purl.uniprot.org/uniprot/Q8WZ42",
    "name": "Titin"
  }
]
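As a rough illustration of how such an array could be generated, here is a minimal Python sketch using only the standard library. The function name `protein_entry`, the `records` tuples and the `DATASET` constant are hypothetical names chosen for this example; the property values are copied from the snippet above.

```python
import json

SCHEMA_CONTEXT = "https://schema.org"
PROFILE_IRI = "https://bioschemas.org/profiles/Protein/0.11-RELEASE"

# Dataset node repeated in every entry (values taken from the example above).
DATASET = {
    "@id": "https://disprot.org/#2021-08",
    "@type": "Dataset",
    "name": "DisProt (August 2021)",
}


def protein_entry(disprot_id, name, uniprot_acc, dataset):
    """Build one JSON-LD Protein entry shaped like the entries above."""
    return {
        "@context": SCHEMA_CONTEXT,
        "includedInDataset": dict(dataset),  # copy so entries stay independent
        "@type": "Protein",
        "@id": f"https://disprot.org/{disprot_id}",
        "http://purl.org/dc/terms/conformsTo": {
            "@id": PROFILE_IRI,
            "@type": "CreativeWork",
        },
        "identifier": f"https://identifiers.org/disprot:{disprot_id}",
        "sameAs": f"http://purl.uniprot.org/uniprot/{uniprot_acc}",
        "name": name,
    }


# Hypothetical input rows: (DisProt ID, protein name, UniProt accession).
records = [
    ("DP00004", "Cathelicidin antimicrobial peptide", "P49913"),
    ("DP00072", "Titin", "Q8WZ42"),
]

feed = [protein_entry(i, n, u, DATASET) for i, n, u in records]
feed_json = json.dumps(feed, indent=2)
```

In a real backend the `records` would presumably come from the database that already drives the per-page markup, and `feed_json` would be written to the feed file.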

To reduce both the size of the file and the data load on consumers, we could apply the JSON-LD flattening function, which would produce the following. Note that there is only one node each for the profile and the dataset, rather than that information being repeated in every entry.

{
  "@graph": [
    {
      "@id": "https://bioschemas.org/profiles/Protein/0.11-RELEASE",
      "@type": "http://schema.org/CreativeWork"
    },
    {
      "@id": "https://disprot.org/#2021-08",
      "@type": "http://schema.org/Dataset",
      "http://schema.org/name": "DisProt (August 2021)"
    },
    {
      "@id": "https://disprot.org/DP00004",
      "@type": "http://schema.org/Protein",
      "http://purl.org/dc/terms/conformsTo": {
        "@id": "https://bioschemas.org/profiles/Protein/0.11-RELEASE"
      },
      "http://schema.org/identifier": "https://identifiers.org/disprot:DP00004",
      "http://schema.org/includedInDataset": {
        "@id": "https://disprot.org/#2021-08"
      },
      "http://schema.org/name": "Cathelicidin antimicrobial peptide",
      "http://schema.org/sameAs": {
        "@id": "http://purl.uniprot.org/uniprot/P49913"
      }
    },
    {
      "@id": "https://disprot.org/DP00072",
      "@type": "http://schema.org/Protein",
      "http://purl.org/dc/terms/conformsTo": {
        "@id": "https://bioschemas.org/profiles/Protein/0.11-RELEASE"
      },
      "http://schema.org/identifier": "https://identifiers.org/disprot:DP00072",
      "http://schema.org/includedInDataset": {
        "@id": "https://disprot.org/#2021-08"
      },
      "http://schema.org/name": "Titin",
      "http://schema.org/sameAs": {
        "@id": "http://purl.uniprot.org/uniprot/Q8WZ42"
      }
    }
  ]
}

So in essence you need to create a JSON-LD entry for each protein, put them together in an array, and then use the JSON-LD flatten function.
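To make the effect of flattening concrete, here is a deliberately simplified Python sketch of the idea: every node with an `@id` is hoisted into a single top-level `@graph`, and nested occurrences are replaced by references. This is not the W3C flattening algorithm. It does no context expansion (so keys stay as written and string values such as `sameAs` are not turned into `@id` references), it ignores blank nodes, and it keeps only the first value seen for a repeated property. The function name `flatten_nodes` is made up for this example; in practice a JSON-LD library should do the work (PyLD, for instance, exposes `jsonld.flatten`).

```python
def flatten_nodes(entries):
    """Hoist every node that has an @id into one top-level @graph.

    Simplified illustration only, see the caveats in the text above.
    """
    graph = {}  # maps @id -> merged node

    def visit(value):
        if isinstance(value, list):
            return [visit(v) for v in value]
        if isinstance(value, dict) and "@id" in value:
            node = graph.setdefault(value["@id"], {"@id": value["@id"]})
            for key, val in value.items():
                if key in ("@id", "@context"):
                    continue
                # Keep the first value seen; nested nodes become references.
                node.setdefault(key, visit(val))
            return {"@id": value["@id"]}
        return value  # literals and dicts without @id are left untouched

    for entry in entries:
        visit(entry)
    # Sort by @id so the output is deterministic.
    return {"@graph": [graph[node_id] for node_id in sorted(graph)]}


dataset = {"@id": "https://disprot.org/#2021-08", "@type": "Dataset",
           "name": "DisProt (August 2021)"}
profile = {"@id": "https://bioschemas.org/profiles/Protein/0.11-RELEASE",
           "@type": "CreativeWork"}
entries = [
    {"@context": "https://schema.org", "@type": "Protein",
     "@id": "https://disprot.org/DP00004", "name": "Cathelicidin antimicrobial peptide",
     "includedInDataset": dict(dataset),
     "http://purl.org/dc/terms/conformsTo": dict(profile)},
    {"@context": "https://schema.org", "@type": "Protein",
     "@id": "https://disprot.org/DP00072", "name": "Titin",
     "includedInDataset": dict(dataset),
     "http://purl.org/dc/terms/conformsTo": dict(profile)},
]

flat = flatten_nodes(entries)
```

With the two sample entries, `flat["@graph"]` holds four nodes (profile, dataset and the two proteins), and each protein points at the shared dataset and profile by `@id` only.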

2 replies

ivanmicetic May 25, 2022
Maintainer

Hmmm, interesting. I need to test how to implement the flattening in our backend (Node.js or Python) and how it scales with large JSON arrays (2-3k entries).

AlasdairGray May 26, 2022
Maintainer Author

Flattening is a function available in JSON-LD libraries: https://w3c.github.io/json-ld-api/#flattening

In fact we may want to consider flattening and compacting.
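As a hedged illustration of what compacting adds on top of flattening, the Python sketch below shortens the full `http://schema.org/` IRIs in a flattened graph against an `@vocab` context. It is a toy stand-in for the real compaction algorithm, which also handles term definitions, prefixes, value objects and more; the helper names (`shorten`, `compact_node`, `compact_graph`) are invented for this example.

```python
VOCAB = "http://schema.org/"


def shorten(iri):
    """Strip the vocabulary prefix from an IRI, if it has one."""
    if isinstance(iri, str) and iri.startswith(VOCAB):
        return iri[len(VOCAB):]
    return iri


def compact_value(value):
    """Recurse into lists and nested objects; leave literals alone."""
    if isinstance(value, list):
        return [compact_value(v) for v in value]
    if isinstance(value, dict):
        return compact_node(value)
    return value


def compact_node(node):
    """Shorten property keys and @type values; keep @id values as they are."""
    out = {}
    for key, value in node.items():
        if key == "@type":
            if isinstance(value, list):
                out[key] = [shorten(t) for t in value]
            else:
                out[key] = shorten(value)
        elif key.startswith("@"):
            out[key] = value
        else:
            out[shorten(key)] = compact_value(value)
    return out


def compact_graph(doc):
    """Return a copy of a flattened document compacted against @vocab."""
    return {
        "@context": {"@vocab": VOCAB},
        "@graph": [compact_node(n) for n in doc["@graph"]],
    }


flattened = {
    "@graph": [
        {
            "@id": "https://disprot.org/DP00072",
            "@type": "http://schema.org/Protein",
            "http://purl.org/dc/terms/conformsTo": {
                "@id": "https://bioschemas.org/profiles/Protein/0.11-RELEASE"
            },
            "http://schema.org/name": "Titin",
            "http://schema.org/sameAs": {
                "@id": "http://purl.uniprot.org/uniprot/Q8WZ42"
            },
        }
    ]
}

compacted = compact_graph(flattened)
```

Keys outside the vocabulary, such as the Dublin Core `conformsTo` IRI, are left in full, which matches how they appear in the original per-page markup.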

AlasdairGray · 2022-08-18T14:50:29Z

AlasdairGray
Aug 18, 2022
Maintainer Author

Using the DisProt sample data that we collected for developing the IDP-KG, I have generated this sample JSON-LD file.

Hopefully that shows what should be in a DataFeed file.


stain · 2022-10-11T12:57:46Z

stain
Oct 11, 2022

One suggestion is to make the dump an RO-Crate, which also uses flattened JSON-LD. Basically, the only change from the example you pasted would be to declare a root dataset that mentions the BioChemEntity and hasPart any files.
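A minimal Python sketch of that wrapping step, assuming the RO-Crate 1.1 conventions of a `ro-crate-metadata.json` descriptor and a root dataset with `@id` `./`. The function name `wrap_as_ro_crate` and its parameters are hypothetical, and the sketch leaves out further root properties that RO-Crate 1.1 asks for (such as `description`, `datePublished` and `license`) as well as the `File` entities that `hasPart` would normally point to.

```python
def wrap_as_ro_crate(graph_nodes, name, entity_ids, file_ids):
    """Prepend an RO-Crate descriptor and root dataset to a flattened graph."""
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        # Entities described by the dump, referenced by @id only.
        "mentions": [{"@id": entity_id} for entity_id in entity_ids],
        # Payload files shipped alongside the metadata, if any.
        "hasPart": [{"@id": file_id} for file_id in file_ids],
    }
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root, *graph_nodes],
    }


# Example node, already compacted so its keys match the RO-Crate context.
proteins = [
    {"@id": "https://disprot.org/DP00004", "@type": "Protein",
     "name": "Cathelicidin antimicrobial peptide"},
]

crate = wrap_as_ro_crate(
    proteins,
    name="DisProt data dump (example)",  # illustrative title only
    entity_ids=[p["@id"] for p in proteins],
    file_ids=[],
)
```

The existing nodes are passed through unchanged, so the only additions to the flattened graph are the descriptor and the root dataset.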

It would also be good to use the (otherwise deprecated) http://schema.org prefix in JSON-LD so that the JSON-LD is compatible with @context: https://schema.org/ (which still maps to http), but I am not sure whether Bioschemas has its own take on that.

See https://gist.github.com/stain/a7143d5276b927571a68f493cd388836 based on @AlasdairGray's sample data above. Changes on top.

It may seem strange that in the RO-Crate the Dataset is the root while the DataCatalog is on top. Alternatively, the RO-Crate root could be a second Dataset that has the DataCatalog as its own part, if the crate were to capture multiple such datasets.

One thing you may notice as I reflattened with https://w3id.org/ro/crate/1.1/context is that our JSON-LD context is based on schema.org 10.0, so unmapped keys like schema:rangeEnd get a prefix as they are not registered with schema.org.

(If you flatten with {"@context": "https://schema.org"} these keys get shortened again with the @vocab base; for archival purposes we don't have that in the regular RO-Crate JSON-LD context.)

