Harvester / URL / Add RDF DCAT harvester. #6771
Conversation
@@ -1128,6 +1129,8 @@
        Pattern pattern;
        Matcher matcher;

        inXMLStr = inXMLStr.replaceFirst(XML_VERSION_HEADER, "");
Check failure (Code scanning / CodeQL): Polynomial regular expression used on uncontrolled data
public static boolean isRDFLike(String inXMLStr) {
    boolean retBool = false;
    if (isXMLLike(inXMLStr)) {
        String xml = inXMLStr.replaceFirst(XML_VERSION_HEADER, ""),
Check failure (Code scanning / CodeQL): Polynomial regular expression used on uncontrolled data
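Both CodeQL warnings point at `replaceFirst(XML_VERSION_HEADER, ...)`, which applies a regex to untrusted harvested content. One way to avoid a backtracking-prone pattern, sketched here under the assumption that the header being stripped is a standard XML declaration (the helper name is hypothetical, not the project's code):

```java
public class XmlHeaderStripper {

    // Strip an optional leading XML declaration with plain string scanning
    // instead of a regex, so no pathological backtracking is possible.
    static String stripXmlDeclaration(String in) {
        String s = in.stripLeading();
        if (s.startsWith("<?xml")) {
            int end = s.indexOf("?>");
            if (end >= 0) {
                return s.substring(end + 2).stripLeading();
            }
        }
        return s;
    }

    public static void main(String[] args) {
        // prints: <rdf:RDF/>
        System.out.println(stripXmlDeclaration("<?xml version=\"1.0\" encoding=\"UTF-8\"?><rdf:RDF/>"));
    }
}
```

Scanning runs in a single left-to-right pass, so the CodeQL polynomial-regex alert does not apply.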
Great news! 👍
Harvesting RDF feeds

The simpleurl harvester can already point to a JSON or XML feed. It can also point to an RDF DCAT feed, which will be loaded using Jena. SPARQL queries are applied to extract the necessary information from the RDF graph.

This work was initially made by the GIM team for Metadata Vlaanderen in a DCAT-AP dedicated harvester (see https://github.com/metadata101/dcat-ap1.1/tree/master/src/main/java/org/fao/geonet/kernel/harvest/harvester/dcatap), but we considered that the simpleurl harvester is a good candidate for simplification and can provide DCAT feed support directly.

The results can be converted using an XSL conversion. A conversion to ISO19115-3 is provided, and custom plugins may provide other conversions. The provided ISO19115-3 conversion supports only Dataset and covers most of the mapping done in OGC API Records (see https://github.com/geonetwork/geonetwork-microservices/blob/main/modules/library/common-index-model/src/main/java/org/fao/geonet/index/converter/DcatConverter.java#L188).

Tested with
* http://mow-dataroom.s3-eu-west-1.amazonaws.com/dr_dcat.rdf
* https://apps.titellus.net/geonetwork/api/collections/main/items?q=AlpenKonvention&f=dcat
* https://apps.titellus.net/geonetwork/api/collections/main/items/7bb33d95-7950-499a-9bd8-6f31d58b0b35?f=dcat

Other actions:
- [ ] Add possibility to hash or not the URI used for the UUID (depends on #5736)
- [ ] UI / Based on the type of harvesting, hide unneeded options, e.g. for a DCAT feed only the URL is really necessary
- [ ] Paging support for RDF feeds?
- [ ] Conversion / We could move the conversions to the schemas so we do not have to copy them into the webapp/xsl/conversion folder. They would be grouped by schema, which could also make the choice easier for end users

Co-authored-by: Mathieu Chaussier <[email protected]>
Co-authored-by: Gustaaf Van de Boel <[email protected]>
Co-authored-by: Stijn Goedertier <[email protected]>
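The load-with-Jena-then-query-with-SPARQL approach described above can be sketched as follows. This is a minimal illustration, not the harvester's actual queries; the sample graph and query are made up for the example:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;

public class DcatSparqlSketch {
    public static void main(String[] args) {
        // Hypothetical DCAT fragment standing in for a harvested feed.
        String rdf = "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n"
                + "@prefix dct: <http://purl.org/dc/terms/> .\n"
                + "<http://example.org/d1> a dcat:Dataset ; dct:title \"Roads\" .";

        // Load the feed into a Jena model.
        Model model = ModelFactory.createDefaultModel();
        RDFDataMgr.read(model, new ByteArrayInputStream(rdf.getBytes(StandardCharsets.UTF_8)), Lang.TTL);

        // Extract dataset URIs and titles with a SPARQL query.
        String query = "PREFIX dcat: <http://www.w3.org/ns/dcat#>\n"
                + "PREFIX dct: <http://purl.org/dc/terms/>\n"
                + "SELECT ?dataset ?title WHERE { ?dataset a dcat:Dataset ; dct:title ?title }";
        try (QueryExecution exec = QueryExecutionFactory.create(query, model)) {
            ResultSet results = exec.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.getResource("dataset").getURI()
                        + " -> " + row.getLiteral("title").getString());
            }
        }
    }
}
```

Running this prints `http://example.org/d1 -> Roads`; the harvester applies the same pattern with its own queries and then hands the extracted records to the XSL conversion.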
…dd a quicker check for XML header. Reset properly harvester params.
…on harvester selection.
Force-pushed from 0b3af6e to e9da062.
Nice! Tested and it works fine.

I also quickly tested the SPARQL to DCAT conversion implemented in the DCAT plugin (this one) as is and, aside from a few issues that will need to be fixed in the conversion itself, I managed to get an almost fully valid DCAT record fit for the plugin.

Adding support for Turtle, JSON-LD and more shouldn't require too many changes: as long as we can provide the right lang to the RDFDataMgr.read method called in RDFUtils::getAllUuids, and provide the Turtle/JSON-LD/... content as a string to it, the rest of the logic remains unchanged.
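The point about passing the right lang can be shown with Jena directly. A minimal sketch (the `parse` helper is hypothetical, not the `RDFUtils` code): the same model-loading call handles any serialization once the `Lang` is known.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;

public class RdfFormatSketch {

    // Hypothetical helper: parse RDF content with an explicit serialization.
    static Model parse(String content, Lang lang) {
        Model model = ModelFactory.createDefaultModel();
        RDFDataMgr.read(model,
                new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8)), lang);
        return model;
    }

    public static void main(String[] args) {
        String turtle = "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n"
                + "<http://example.org/d1> a dcat:Dataset .";
        // Lang.TTL for Turtle; Lang.JSONLD, Lang.RDFXML, etc. work the same way.
        System.out.println(parse(turtle, Lang.TTL).size()); // 1 triple
    }
}
```

Swapping `Lang.TTL` for `Lang.JSONLD` (with JSON-LD input) is the only change needed per format, which is why the rest of the harvesting logic can stay untouched.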
.replace("%recordID%", recordUUID)
.replace("%recordUUID%", recordUUID)
.replace("%resourceId%", resourceId)
// TODO: Should we set modified of catalog record to the date of publication?
I agree we should find a fallback for the date to avoid updating the local records on every harvest.
However, whether this is the publication date or the modified date of the dataset/service, it might not be provided or up to date.
This is why we initially decided to always update local records when the date is not provided.
To be discussed.
Ideally, we want to move forward and expect harvested DCAT feeds to provide the dcat:CatalogRecord directly. We might want to stop generating it when missing at some point in the future.
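One possible shape for the date fallback discussed above, sketched here as a suggestion rather than the project's decided approach (the helper and its arguments are hypothetical): prefer a modified date, then a publication date, and when neither is provided compare a content hash instead, so unchanged records are not rewritten on every harvest.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Optional;

public class ChangeMarkerSketch {

    // Hypothetical helper: pick the best available change marker for a
    // harvested record. Falling back to a content hash means the local copy
    // is only updated when the harvested record actually changes.
    static String changeMarker(Optional<String> modified, Optional<String> issued, String recordXml)
            throws NoSuchAlgorithmException {
        if (modified.isPresent()) return modified.get();
        if (issued.isPresent()) return issued.get();
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(recordXml.getBytes(StandardCharsets.UTF_8));
        return "sha256:" + HexFormat.of().formatHex(digest);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // prints: 2023-05-01
        System.out.println(changeMarker(Optional.of("2023-05-01"), Optional.empty(), "<record/>"));
    }
}
```

A hash-based marker sidesteps the "date missing or stale" problem entirely, at the cost of reading the full record on each harvest.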
Future work to be discussed:
- Use rdf:about as the UUID for the record (but this requires supporting URL characters in UUIDs), or create a hash of the rdf:about.
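The hashing option can be illustrated with a name-based (type 3) UUID, which deterministically derives a UUID from the rdf:about URI. A sketch, not the harvester's implementation:

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class RecordUuidSketch {

    // Hypothetical helper: derive a stable record UUID by hashing rdf:about.
    // The same URI always yields the same UUID across harvest runs, and no
    // URL characters need to be supported in the UUID itself.
    static UUID uuidFromAbout(String rdfAbout) {
        return UUID.nameUUIDFromBytes(rdfAbout.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(uuidFromAbout("http://example.org/dataset/d1"));
    }
}
```

Because the mapping is deterministic, re-harvesting the same feed matches existing local records without storing the original URI in the UUID field.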