
Harvester / URL / Add RDF DCAT harvester. #6771

Merged
merged 6 commits into main from 423-rdfharvester on Feb 6, 2023

Conversation

@fxprunayre (Member) commented Jan 20, 2023

Harvesting RDF feeds

The simpleurl harvester can already point to a JSON or XML feed. It can also point to an RDF DCAT feed, which will be loaded using Jena. SPARQL queries are applied to extract the necessary information from the RDF graph.
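
As a rough illustration of the mechanism (a minimal sketch, not the harvester's actual code; the feed URL and the SPARQL query are placeholders), loading a feed with Jena and querying the resulting graph looks like this:

```java
// Minimal sketch, assuming Apache Jena on the classpath. The feed URL and the
// SPARQL query are illustrative, not the ones used by the harvester.
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;

public class DcatFeedSketch {
    public static void main(String[] args) {
        // Load the remote RDF DCAT feed into an in-memory graph.
        Model model = RDFDataMgr.loadModel("https://example.org/catalog.rdf");
        // Extract the dataset URIs from the graph with a SPARQL query.
        String sparql = "PREFIX dcat: <http://www.w3.org/ns/dcat#> "
            + "SELECT ?dataset WHERE { ?dataset a dcat:Dataset }";
        try (QueryExecution qe = QueryExecutionFactory.create(QueryFactory.create(sparql), model)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                System.out.println(results.next().getResource("dataset").getURI());
            }
        }
    }
}
```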

This work was initially made by the GIM team for Metadata Vlaanderen in a DCAT-AP dedicated harvester (see https://github.com/metadata101/dcat-ap1.1/tree/master/src/main/java/org/fao/geonet/kernel/harvest/harvester/dcatap), but we considered the simpleurl harvester a good candidate to simplify this and provide DCAT feed support directly.


The results can be converted using an XSL conversion. A conversion to ISO19115-3 is provided, and custom plugins may provide other conversions (see #6772). The provided ISO19115-3 conversion supports only Dataset and covers most of the mapping done in OGC API Records (see https://github.com/geonetwork/geonetwork-microservices/blob/main/modules/library/common-index-model/src/main/java/org/fao/geonet/index/converter/DcatConverter.java#L188).
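
For illustration, an XSL conversion step can be sketched with the JDK's built-in XSLT support (a sketch only; file names are placeholders and GeoNetwork itself runs stylesheets through its own Saxon-based pipeline):

```java
// Minimal sketch of applying an XSL conversion to a harvested record.
// File names are illustrative; the actual DCAT-to-ISO19115-3 stylesheet
// ships with the harvester/plugin.
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.File;

public class XslConversionSketch {
    public static void main(String[] args) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new File("dcat-to-iso19115-3.xsl")));
        transformer.transform(
            new StreamSource(new File("harvested-dcat.xml")),
            new StreamResult(new File("record-iso19115-3.xml")));
    }
}
```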

Tested with:

  • http://mow-dataroom.s3-eu-west-1.amazonaws.com/dr_dcat.rdf
  • https://apps.titellus.net/geonetwork/api/collections/main/items?q=AlpenKonvention&f=dcat
  • https://apps.titellus.net/geonetwork/api/collections/main/items/7bb33d95-7950-499a-9bd8-6f31d58b0b35?f=dcat

Configuration improvements

  • When creating the harvester, add a helper menu to more easily set up the configuration, with examples for loading:
    • DCAT feed converted to ISO
    • JSON
    • XML file containing one metadata record
    • CSW GetRecords response


Additional notes & improvements

  • When selecting a harvester, select the settings tab by default (rather than staying on the history or results tab)


  • Note that the simple URL harvester is a bit faster for harvesting CSW (but it requires manually writing the CSW GET query), e.g. for 159 records
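
For reference, a CSW GetRecords request expressed as a GET (KVP) query looks roughly like this (host and paging values are illustrative):

```
https://example.org/geonetwork/srv/eng/csw?service=CSW&version=2.0.2&request=GetRecords&typeNames=csw:Record&resultType=results&elementSetName=full&outputSchema=http://www.isotc211.org/2005/gmd&startPosition=1&maxRecords=100
```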


  • Add support for batch editing.

Future work to be discussed

  • Helper configuration can probably be improved (e.g. hide unneeded fields depending on the harvesting target, more detailed configuration for specific implementations)
  • Add the possibility to hash (or not) the URI used as UUID (depends on Support UUID with URL special characters. #5736): either use the rdf:about as the record UUID (which requires supporting URL characters in UUIDs) or create a hash of the rdf:about (see the sketch after this list)
  • RDF / Paging support for RDF feeds using Hydra
  • RDF / Add support for multilingual DCAT feeds
  • RDF / Add support for Turtle and JSON-LD
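
A minimal sketch of the two UUID options mentioned above (illustrative only; neither is implemented by this PR):

```java
// Illustrative only: two possible ways to derive a record UUID from rdf:about.
// Neither is implemented by this PR; see the discussion above and #5736.
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class UuidFromUriSketch {
    public static void main(String[] args) {
        String rdfAbout = "https://example.org/dataset/42";
        // Option 1: use the URI as-is (requires UUIDs that allow URL special characters).
        String asIs = rdfAbout;
        // Option 2: hash the URI into a deterministic, name-based (version 3) UUID.
        UUID hashed = UUID.nameUUIDFromBytes(rdfAbout.getBytes(StandardCharsets.UTF_8));
        System.out.println(asIs + " -> " + hashed);
    }
}
```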

Co-authored-by: Mathieu Chaussier [email protected]
Co-authored-by: Gustaaf Van de Boel [email protected]
Co-authored-by: Stijn Goedertier [email protected]

@fxprunayre fxprunayre added this to the 4.2.3 milestone Jan 20, 2023
@fxprunayre fxprunayre marked this pull request as draft January 20, 2023 07:01
@@ -1128,6 +1129,8 @@
Pattern pattern;
Matcher matcher;

inXMLStr = inXMLStr.replaceFirst(XML_VERSION_HEADER, "");

Check failure

Code scanning / CodeQL

Polynomial regular expression used on uncontrolled data

This [regular expression](1) that depends on [user-provided values](2)(3)(4) may run slow on strings starting with '<?xml version='1a0' encoding='' and with many repetitions of '<?xml version='1a0' encoding=''.
public static boolean isRDFLike(String inXMLStr) {
boolean retBool = false;
if (isXMLLike(inXMLStr)) {
String xml = inXMLStr.replaceFirst(XML_VERSION_HEADER, ""),

Check failure

Code scanning / CodeQL

Polynomial regular expression used on uncontrolled data

This [regular expression](1) that depends on a [user-provided value](2) may run slow on strings starting with '<?xml version='1a0' encoding='' and with many repetitions of '<?xml version='1a0' encoding=''.
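
For context, the flagged pattern can be avoided with a linear-time string check instead of a regex. A sketch (not necessarily the fix this PR ends up applying; the method name is illustrative):

```java
// A linear-time alternative to stripping the XML declaration with a regex.
// Sketch only; method and class names are illustrative.
public class XmlHeaderSketch {
    static String stripXmlDeclaration(String s) {
        String trimmed = s.stripLeading();
        // Only strip a well-formed "<?xml ... ?>" prolog at the start of the string.
        if (trimmed.startsWith("<?xml")) {
            int end = trimmed.indexOf("?>");
            if (end >= 0) {
                return trimmed.substring(end + 2);
            }
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(stripXmlDeclaration("<?xml version=\"1.0\"?><rdf:RDF/>"));
    }
}
```
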
@fgravin (Member) commented Jan 20, 2023

Great news!

  • UI / Based on the type of harvesting, hide unneeded options, e.g. for a DCAT feed only the URL and conversion are really necessary

👍

  • Conversion / We could move them to the schema so we do not have to copy them into the webapp/xsl/conversion folder. They would be grouped by schema, which could also make the choice easier for end users

👍

fxprunayre and others added 5 commits January 23, 2023 11:56
The simpleurl harvester can already point to a JSON or XML feed. It can also
point to an RDF DCAT feed, which will be loaded using Jena. SPARQL queries are applied
to extract the necessary information from the RDF graph.

This work was initially made by the GIM team for Metadata Vlaanderen in a DCAT-AP dedicated harvester (see https://github.com/metadata101/dcat-ap1.1/tree/master/src/main/java/org/fao/geonet/kernel/harvest/harvester/dcatap), but we considered
the simpleurl harvester a good candidate to simplify this and provide DCAT feed support directly.

The results can be converted using an XSL conversion. A conversion to ISO19115-3
is provided, and custom plugins may provide other conversions. The provided ISO19115-3 conversion
supports only Dataset and covers most of the mapping done in OGC API Records (see https://github.com/geonetwork/geonetwork-microservices/blob/main/modules/library/common-index-model/src/main/java/org/fao/geonet/index/converter/DcatConverter.java#L188).

Tested with
* http://mow-dataroom.s3-eu-west-1.amazonaws.com/dr_dcat.rdf
* https://apps.titellus.net/geonetwork/api/collections/main/items?q=AlpenKonvention&f=dcat
* https://apps.titellus.net/geonetwork/api/collections/main/items/7bb33d95-7950-499a-9bd8-6f31d58b0b35?f=dcat

Other actions:
- [ ] Add the possibility to hash (or not) the URI used as UUID (depends on #5736)
- [ ] UI / Based on the type of harvesting, hide unneeded options, e.g. for a DCAT feed only the URL is really necessary
- [ ] Paging support for RDF feeds?
- [ ] Conversion / We could move them to the schema so we do not have to copy them into the webapp/xsl/conversion folder. They would be grouped by schema, which could also make the choice easier for end users

Co-authored-by: Mathieu Chaussier <[email protected]>
Co-authored-by: Gustaaf Van de Boel <[email protected]>
Co-authored-by: Stijn Goedertier <[email protected]>
…dd a quicker check for XML header. Reset properly harvester params.
@fxprunayre fxprunayre marked this pull request as ready for review January 23, 2023 12:25
@fxprunayre fxprunayre requested a review from CMath04 January 23, 2023 12:25
@CMath04 (Collaborator) left a comment

Nice! Tested and it works fine.
I also quickly tested the SPARQL-to-DCAT conversion implemented in the DCAT plugin (this one) as is and, aside from a few issues that will need to be fixed in the conversion itself, I managed to get an almost fully valid DCAT record fit for the plugin.

Adding support for Turtle, JSON-LD and more shouldn't require too many changes:
as long as we can provide the right lang to the RDFDataMgr.read method called in RDFUtils::getAllUuids and pass the Turtle/JSON-LD/... content to it as a string, the rest of the logic remains unchanged.
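
A sketch of that idea (illustrative input; Jena's RDFDataMgr.read takes an explicit Lang, so Turtle or JSON-LD is mostly a matter of passing the right constant):

```java
// Sketch: parsing non-RDF/XML content by giving Jena an explicit Lang.
// The Turtle snippet is illustrative.
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class RdfLangSketch {
    public static void main(String[] args) {
        // A tiny Turtle document standing in for a harvested feed.
        String turtle = "@prefix dcat: <http://www.w3.org/ns/dcat#> . "
            + "<https://example.org/d1> a dcat:Dataset .";
        Model model = ModelFactory.createDefaultModel();
        // Lang.TURTLE (or Lang.JSONLD) tells Jena how to parse the input.
        RDFDataMgr.read(model,
            new ByteArrayInputStream(turtle.getBytes(StandardCharsets.UTF_8)),
            Lang.TURTLE);
        System.out.println("Parsed triples: " + model.size());
    }
}
```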

.replace("%recordID%", recordUUID)
.replace("%recordUUID%", recordUUID)
.replace("%resourceId%", resourceId)
// TODO: Should we set modified of catalog record to the date of publication?

I agree we should find a fallback for the date to avoid updating the local records on every harvest.
However, whether this is the publication date or the modified date of the dataset/service, it might not be provided or up to date.
This is why we initially decided to always update local records when the date is not provided.

To be discussed.
Ideally, we want to move forward and expect harvested DCAT feeds to provide the dcat:CatalogRecord directly. We might want to stop generating it when missing at some point in the future.

@fxprunayre fxprunayre merged commit 8be9d40 into main Feb 6, 2023
@fxprunayre fxprunayre deleted the 423-rdfharvester branch February 6, 2023 08:24