Harvester / URL / Add RDF DCAT harvester. #6771
Conversation
@@ -1128,6 +1129,8 @@
        Pattern pattern;
        Matcher matcher;

        inXMLStr = inXMLStr.replaceFirst(XML_VERSION_HEADER, "");
Check failure (Code scanning / CodeQL): Polynomial regular expression used on uncontrolled data
public static boolean isRDFLike(String inXMLStr) {
    boolean retBool = false;
    if (isXMLLike(inXMLStr)) {
        String xml = inXMLStr.replaceFirst(XML_VERSION_HEADER, ""),
Check failure (Code scanning / CodeQL): Polynomial regular expression used on uncontrolled data
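Both CodeQL warnings point at `replaceFirst(XML_VERSION_HEADER, ...)`, which applies a regex to untrusted harvested content. One way to avoid a backtracking-prone pattern, sketched here under the assumption that the header being stripped is a standard XML declaration (the helper name is hypothetical, not the project's code):

```java
public class XmlHeaderStripper {

    // Strip an optional leading XML declaration with plain string scanning
    // instead of a regex, so no pathological backtracking is possible.
    static String stripXmlDeclaration(String in) {
        String s = in.stripLeading();
        if (s.startsWith("<?xml")) {
            int end = s.indexOf("?>");
            if (end >= 0) {
                return s.substring(end + 2).stripLeading();
            }
        }
        return s;
    }

    public static void main(String[] args) {
        // prints: <rdf:RDF/>
        System.out.println(stripXmlDeclaration("<?xml version=\"1.0\" encoding=\"UTF-8\"?><rdf:RDF/>"));
    }
}
```

Scanning runs in a single left-to-right pass, so the CodeQL polynomial-regex alert does not apply.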
Great news! 👍
Harvesting RDF feeds

The simpleurl harvester can already point to a JSON or XML feed. It can also point to an RDF DCAT feed, which will be loaded using Jena. SPARQL queries are applied to extract the necessary information from the RDF graph.

This work was initially made by the GIM team for Metadata Vlaanderen in a DCAT-AP dedicated harvester (see https://github.com/metadata101/dcat-ap1.1/tree/master/src/main/java/org/fao/geonet/kernel/harvest/harvester/dcatap), but we considered that the simpleurl harvester is a good candidate for simplification and can provide DCAT feed support directly.

The results can be converted using an XSL conversion. A conversion to ISO19115-3 is provided, and custom plugins may provide other conversions. The provided ISO19115-3 conversion supports only Dataset and covers most of the mapping done in OGC API Records (see https://github.com/geonetwork/geonetwork-microservices/blob/main/modules/library/common-index-model/src/main/java/org/fao/geonet/index/converter/DcatConverter.java#L188).

Tested with
* http://mow-dataroom.s3-eu-west-1.amazonaws.com/dr_dcat.rdf
* https://apps.titellus.net/geonetwork/api/collections/main/items?q=AlpenKonvention&f=dcat
* https://apps.titellus.net/geonetwork/api/collections/main/items/7bb33d95-7950-499a-9bd8-6f31d58b0b35?f=dcat

Other actions:
- [ ] Add possibility to hash or not the URI used for the UUID (depends on #5736)
- [ ] UI / Based on the type of harvesting, hide unneeded options, e.g. for a DCAT feed only the URL is really necessary
- [ ] Paging support for RDF feeds?
- [ ] Conversion / We could move the conversions to the schemas so we do not have to copy them into the webapp/xsl/conversion folder. They would be grouped by schema, which could also make the choice easier for end users

Co-authored-by: Mathieu Chaussier <[email protected]>
Co-authored-by: Gustaaf Van de Boel <[email protected]>
Co-authored-by: Stijn Goedertier <[email protected]>
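The load-with-Jena-then-query-with-SPARQL approach described above can be sketched as follows. This is a minimal illustration, not the harvester's actual queries; the sample graph and query are made up for the example:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;

public class DcatSparqlSketch {
    public static void main(String[] args) {
        // Hypothetical DCAT fragment standing in for a harvested feed.
        String rdf = "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n"
                + "@prefix dct: <http://purl.org/dc/terms/> .\n"
                + "<http://example.org/d1> a dcat:Dataset ; dct:title \"Roads\" .";

        // Load the feed into a Jena model.
        Model model = ModelFactory.createDefaultModel();
        RDFDataMgr.read(model, new ByteArrayInputStream(rdf.getBytes(StandardCharsets.UTF_8)), Lang.TTL);

        // Extract dataset URIs and titles with a SPARQL query.
        String query = "PREFIX dcat: <http://www.w3.org/ns/dcat#>\n"
                + "PREFIX dct: <http://purl.org/dc/terms/>\n"
                + "SELECT ?dataset ?title WHERE { ?dataset a dcat:Dataset ; dct:title ?title }";
        try (QueryExecution exec = QueryExecutionFactory.create(query, model)) {
            ResultSet results = exec.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.getResource("dataset").getURI()
                        + " -> " + row.getLiteral("title").getString());
            }
        }
    }
}
```

Running this prints `http://example.org/d1 -> Roads`; the harvester applies the same pattern with its own queries and then hands the extracted records to the XSL conversion.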
…dd a quicker check for XML header. Reset properly harvester params.
…on harvester selection.
Force-pushed from 0b3af6e to e9da062.
Nice! Tested and it works fine.

I also quickly tested the SPARQL to DCAT conversion implemented in the DCAT plugin (this one) as is and, aside from a few issues that will need to be fixed in the conversion itself, I managed to get an almost fully valid DCAT record fit for the plugin.

Adding support for Turtle, JSON-LD and more shouldn't require too many changes: as long as we can provide the right lang to the RDFDataMgr.read method called in RDFUtils::getAllUuids, and provide the Turtle/JSON-LD/... content as a string to it, the rest of the logic remains unchanged.
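The point about passing the right lang can be shown with Jena directly. A minimal sketch (the `parse` helper is hypothetical, not the `RDFUtils` code): the same model-loading call handles any serialization once the `Lang` is known.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;

public class RdfFormatSketch {

    // Hypothetical helper: parse RDF content with an explicit serialization.
    static Model parse(String content, Lang lang) {
        Model model = ModelFactory.createDefaultModel();
        RDFDataMgr.read(model,
                new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8)), lang);
        return model;
    }

    public static void main(String[] args) {
        String turtle = "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n"
                + "<http://example.org/d1> a dcat:Dataset .";
        // Lang.TTL for Turtle; Lang.JSONLD, Lang.RDFXML, etc. work the same way.
        System.out.println(parse(turtle, Lang.TTL).size()); // 1 triple
    }
}
```

Swapping `Lang.TTL` for `Lang.JSONLD` (with JSON-LD input) is the only change needed per format, which is why the rest of the harvesting logic can stay untouched.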
.replace("%recordID%", recordUUID)
.replace("%recordUUID%", recordUUID)
.replace("%resourceId%", resourceId)
// TODO: Should we set modified of catalog record to the date of publication?
I agree we should find a fallback for the date to avoid updating the local records on every harvest.
However, whether this is the publication date or the modified date of the dataset/service, it might not be provided or up to date.
This is why we initially decided to always update local records when the date is not provided.
To be discussed.
Ideally, we want to move forward and expect harvested DCAT feeds to provide the dcat:CatalogRecord directly. We might want to stop generating it when missing at some point in the future.
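One possible shape for the date fallback discussed above, sketched here as a suggestion rather than the project's decided approach (the helper and its arguments are hypothetical): prefer a modified date, then a publication date, and when neither is provided compare a content hash instead, so unchanged records are not rewritten on every harvest.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Optional;

public class ChangeMarkerSketch {

    // Hypothetical helper: pick the best available change marker for a
    // harvested record. Falling back to a content hash means the local copy
    // is only updated when the harvested record actually changes.
    static String changeMarker(Optional<String> modified, Optional<String> issued, String recordXml)
            throws NoSuchAlgorithmException {
        if (modified.isPresent()) return modified.get();
        if (issued.isPresent()) return issued.get();
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(recordXml.getBytes(StandardCharsets.UTF_8));
        return "sha256:" + HexFormat.of().formatHex(digest);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // prints: 2023-05-01
        System.out.println(changeMarker(Optional.of("2023-05-01"), Optional.empty(), "<record/>"));
    }
}
```

A hash-based marker sidesteps the "date missing or stale" problem entirely, at the cost of reading the full record on each harvest.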
Future work to be discussed:
- Use rdf:about as the UUID for the record (but this requires supporting URL characters in UUIDs), or create a hash of the rdf:about.
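The hashing option can be illustrated with a name-based (type 3) UUID, which deterministically derives a UUID from the rdf:about URI. A sketch, not the harvester's implementation:

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class RecordUuidSketch {

    // Hypothetical helper: derive a stable record UUID by hashing rdf:about.
    // The same URI always yields the same UUID across harvest runs, and no
    // URL characters need to be supported in the UUID itself.
    static UUID uuidFromAbout(String rdfAbout) {
        return UUID.nameUUIDFromBytes(rdfAbout.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(uuidFromAbout("http://example.org/dataset/d1"));
    }
}
```

Because the mapping is deterministic, re-harvesting the same feed matches existing local records without storing the original URI in the UUID field.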