Json harvester #5942

Merged
merged 9 commits into from
Sep 22, 2021

Member

@fgravin fgravin commented Sep 6, 2021

Continues the work started by @fxprunayre in #4034.
Brought up to date with the latest main branch.

The goal is to harvest the native API endpoints of open data catalogs (CKAN, Opendatasoft, Esri).
The harvester loops over the JSON datasets and maps each object to a metadata record using a dedicated XSL transformation.

Example configurations

  • OPENDATASOFT
    URL https://metropole-europeenne-de-lille.opendatasoft.com/api/datasets/1.0/search
    loopElement /datasets
    recordIdPath datasetid
    toISOConversion OPENDATASOFT-to-DCAT2

  • ESRI
    URL https://data-atmo-hdf.opendata.arcgis.com/data.json
    loopElement /dataset
    numberOfRecordPath /result/count
    recordIdPath identifier
    pageFromParam start
    pageSizeParam rows
    toISOConversion ESRIDCAT-to-DCAT2
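
The pageFromParam and pageSizeParam settings in the ESRI example suggest offset-based paging, with the total presumably read from numberOfRecordPath. As a rough sketch only, assuming the offset is zero-based and both values are sent as query parameters, the request URLs could be built as below. The class `PagedUrlSketch`, the method `pageUrls`, and the example.org URL are hypothetical and not taken from the PR.

```java
import java.util.ArrayList;
import java.util.List;

final class PagedUrlSketch {
    /** Builds one request URL per page; parameter names come from the harvester settings. */
    static List<String> pageUrls(String baseUrl, String pageFromParam, String pageSizeParam,
                                 int pageSize, int totalRecords) {
        List<String> urls = new ArrayList<>();
        // Use '&' if the base URL already carries a query string, '?' otherwise.
        String separator = baseUrl.contains("?") ? "&" : "?";
        // Assumption: the "from" parameter is a zero-based record offset.
        for (int from = 0; from < totalRecords; from += pageSize) {
            urls.add(baseUrl + separator + pageFromParam + "=" + from
                    + "&" + pageSizeParam + "=" + pageSize);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical endpoint with 25 records fetched 10 at a time.
        pageUrls("https://example.org/api/search", "start", "rows", 10, 25)
            .forEach(System.out::println);
    }
}
```

An API that expects a page number rather than a record offset would need a different loop variable, so the real harvester may well differ from this.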

fxprunayre and others added 6 commits September 17, 2019 09:35
A simple harvester which takes a URL, expecting a JSON document for now,
loops over the documents identified by a JSON Pointer, and applies an XSL
transformation to convert each one to ISO format.

This should allow GeoNetwork to harvest some of the open data portals,
which usually provide a search API returning JSON responses.
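
To illustrate what a loopElement such as /datasets selects, here is a minimal sketch of JSON Pointer (RFC 6901) resolution. It assumes the response has already been parsed into nested `Map` and `List` objects so that it runs on the JDK alone; the harvester itself appears to work on JSON nodes from a parser library. `JsonPointerSketch`, `resolve`, and the sample data are hypothetical.

```java
import java.util.List;
import java.util.Map;

final class JsonPointerSketch {
    /** Resolves an RFC 6901 JSON Pointer such as "/datasets" against nested Maps and Lists. */
    static Object resolve(Object document, String pointer) {
        if (pointer.isEmpty()) {
            return document; // the empty pointer designates the whole document
        }
        if (!pointer.startsWith("/")) {
            throw new IllegalArgumentException("A JSON Pointer must be empty or start with '/'");
        }
        Object current = document;
        // limit -1 keeps trailing empty tokens, e.g. "/a/" -> ["a", ""]
        for (String rawToken : pointer.substring(1).split("/", -1)) {
            // Unescape "~1" before "~0", as RFC 6901 requires.
            String token = rawToken.replace("~1", "/").replace("~0", "~");
            if (current instanceof Map) {
                current = ((Map<?, ?>) current).get(token);
            } else if (current instanceof List) {
                // Throws if the token is not a valid index for this list.
                current = ((List<?>) current).get(Integer.parseInt(token));
            } else {
                return null; // missing key, or the pointer goes deeper than the document
            }
        }
        return current;
    }

    public static void main(String[] args) {
        // Made-up response shaped like a search result with a "datasets" array.
        Map<String, Object> response = Map.of(
            "datasets", List.of(
                Map.of("datasetid", "sample-a"),
                Map.of("datasetid", "sample-b")));
        for (Object record : (List<?>) resolve(response, "/datasets")) {
            System.out.println(((Map<?, ?>) record).get("datasetid"));
        }
    }
}
```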

nodes.forEach(record -> {
    Element xml = convertRecordToXml(record);
    uuids.put(record.get(params.recordIdPath).asText(), xml);
Member Author

@fxprunayre, @josegar74 if the identifier is a URI (as in ESRI DCAT, e.g. "identifier": "https://data-atmo-hdf.opendata.arcgis.com/datasets/bac17d7d05a34242a8b22c535ecdb13d"), the URI is set as the uuid, and it does not work when I open the metadata page.
I used a hack for my test:

uuids.put(record.get(params.recordIdPath).asText().split("/datasets/")[1], xml);

but I don't know how we could handle that properly in every situation. Would you have any suggestions?
One option is a regexp in the harvester settings to extract the uuid, but that seems a bit tricky for admins.
Thanks

Member

@josegar74 josegar74 Sep 7, 2021

@fxprunayre did PR #5736, which probably helps, but it requires enabling some specific configuration.

If the identifiers have the format http(s)://URL/UUID, one option when converting the JSON to ISO19139 is to set gmd:fileIdentifier to the UUID part of the identifier element in the JSON and to store the full identifier in the gmd:identifier element. That should not require any hack in the UI code, but I'm not sure it's really "correct" from the metadata content point of view.

Member Author

Thanks @josegar74, yes, I think that's the way to go.
I would just keep the uuid for the uuid and keep the URI for the resourceIdentifier.
But it means that while harvesting, I have to know where the uuid is in the URI in order to extract it.
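
One way to avoid hard-coding "/datasets/" is to take the last path segment whenever the identifier is an http(s) URI and to leave any other identifier untouched. The sketch below assumes the UUID is always that last segment, which holds for the ESRI example in this thread but may not hold for every portal. `RecordIdSketch` and `toUuid` are hypothetical names, not the code merged in this PR.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class RecordIdSketch {
    /** Returns the last path segment of an http(s) URI, or the identifier unchanged. */
    static String toUuid(String identifier) {
        if (!identifier.startsWith("http://") && !identifier.startsWith("https://")) {
            return identifier; // already a plain identifier, e.g. an Opendatasoft datasetid
        }
        try {
            String path = new URI(identifier).getPath();
            if (path == null || path.isEmpty()) {
                return identifier;
            }
            // Drop a trailing slash so ".../datasets/abc/" still yields "abc".
            if (path.endsWith("/")) {
                path = path.substring(0, path.length() - 1);
            }
            String last = path.substring(path.lastIndexOf('/') + 1);
            return last.isEmpty() ? identifier : last;
        } catch (URISyntaxException e) {
            return identifier; // not a parsable URI: keep the value as it is
        }
    }

    public static void main(String[] args) {
        System.out.println(toUuid(
            "https://data-atmo-hdf.opendata.arcgis.com/datasets/bac17d7d05a34242a8b22c535ecdb13d"));
    }
}
```

The full URI would still need to be passed on to the XSL conversion so it can be stored as the resource identifier, as suggested above.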

jahow pushed a commit to georchestra/geonetwork that referenced this pull request Sep 9, 2021
This commit is a squash of geonetwork/core-geonetwork#5942

A simple harvester which takes a URL, expecting a JSON document for now,
loops over the documents identified by a JSON Pointer, and applies an XSL
transformation to convert each one to ISO format.

This should allow GeoNetwork to harvest some of the open data portals,
which usually provide a search API returning JSON responses.

Harvester / Simple URL / Paging and basic opendatasoft support.

Json harvester: fix merge conflicts

jsonHarvester: handle JSONLD format with @ in tag names

jsonHarvester: add ESRI JSONLD DCAT transformation

hack: to remove, extract uuid from URIs
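
The commit "handle JSONLD format with @ in tag names" above addresses the fact that JSON-LD keys such as @type are not valid XML element names, which matters when each record is converted to XML before the XSL step. How the PR handles this is not shown here; the following is only one possible approach, using a hypothetical helper `toElementName` that restricts names to a conservative ASCII subset.

```java
final class XmlNameSketch {
    /** Maps a JSON key to a string usable as an XML element name (ASCII subset only). */
    static String toElementName(String jsonKey) {
        // Replace anything outside a conservative set of XML name characters, including '@' and ':'.
        String name = jsonKey.replaceAll("[^A-Za-z0-9_.-]", "_");
        // An XML name must not start with a digit, '.' or '-'.
        if (name.isEmpty() || !name.matches("[A-Za-z_].*")) {
            name = "_" + name;
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(toElementName("@type"));
        System.out.println(toElementName("dcat:Dataset"));
    }
}
```

A mapping like this is lossy (for example "@id" and "_id" collide), so a transformation written against it has to use the rewritten names consistently.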
jahow pushed a commit to georchestra/geonetwork that referenced this pull request Sep 9, 2021
This commit is a squash of geonetwork/core-geonetwork#5942


jsonHarvester: extract uuid from identifier

https://data-atmo-hdf.opendata.arcgis.com/datasets/bac17d7d05a34242a8b22c535ecdb13d
will extract bac17d7d05a34242a8b22c535ecdb13d
@fgravin fgravin marked this pull request as ready for review September 15, 2021 10:31
@fgravin fgravin requested a review from fxprunayre September 15, 2021 10:33
Member Author

fgravin commented Sep 20, 2021

@fxprunayre does this follow what you initiated? Do you approve?
Can we move forward with this one, please?
Thanks

Contributor

@jahow jahow left a comment

I'm not really an expert on harvesters, but I think this makes sense as a first iteration. Ideally this harvester should not be used directly, as it is quite low-level; the user should instead be able to choose between ESRI DCAT, OpenDataSoft, CKAN, etc.

@fgravin fgravin merged commit de14f1c into geonetwork:main Sep 22, 2021
@fgravin fgravin deleted the json-harvester branch September 22, 2021 08:15
Member Author

fgravin commented Sep 22, 2021

Thanks @jahow

pmauduit pushed a commit to georchestra/geonetwork that referenced this pull request Feb 3, 2022
This commit is a squash of geonetwork/core-geonetwork#5942

<xsl:strip-space elements="*"/>

<xsl:template match="/record">
<xsl:variable name="cataloglang" select="'fr'"></xsl:variable>
Member

@fxprunayre fxprunayre May 12, 2023

I'm not sure the hard-coded French language value here is representative of the variety of ESRI users; not everyone using this wants to create metadata records in French. You can check how other harvesters handle a source that does not provide language information (e.g. OGCWxS).

landryb pushed a commit to landryb/geonetwork that referenced this pull request Jun 2, 2023
This commit is a squash of geonetwork/core-geonetwork#5942
