Harvester / URL / Add RDF DCAT harvester.
The simpleurl harvester can already point to a JSON or XML feed. It can now also
point to an RDF DCAT feed, which is loaded using Jena. SPARQL queries are then applied
to extract the necessary information from the RDF graph.

This work was initially done by the GIM team for Metadata Vlaanderen in a dedicated DCAT-AP harvester (see https://github.com/metadata101/dcat-ap1.1/tree/master/src/main/java/org/fao/geonet/kernel/harvest/harvester/dcatap), but we considered
the simpleurl harvester a good candidate for simplification, since it can provide DCAT feed support directly.

The results can be converted using an XSL conversion. A conversion to ISO19115-3
is provided, and custom plugins may provide other conversions. The provided ISO19115-3 conversion
supports only Dataset and covers most of the mapping done in OGC API Records (see https://github.com/geonetwork/geonetwork-microservices/blob/main/modules/library/common-index-model/src/main/java/org/fao/geonet/index/converter/DcatConverter.java#L188).

Tested with
* http://mow-dataroom.s3-eu-west-1.amazonaws.com/dr_dcat.rdf
* https://apps.titellus.net/geonetwork/api/collections/main/items?q=AlpenKonvention&f=dcat
* https://apps.titellus.net/geonetwork/api/collections/main/items/7bb33d95-7950-499a-9bd8-6f31d58b0b35?f=dcat

Other actions:
- [ ] Add an option to hash or not hash the URI used for the UUID (depends on #5736)
- [ ] UI / Depending on the type of harvesting, hide unneeded options, e.g. for a DCAT feed only the URL is really necessary
- [ ] Paging support for RDF feeds?
- [ ] Conversion / We could move the conversions to the schema so they do not have to be copied into the webapp/xsl/conversion folder. They would then be grouped by schema, which could also make the choice easier for end users
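
To illustrate the first item in the list above, here is a minimal sketch of what a "hash or not" option could look like. It is not taken from this commit: the class `UriToUuid`, the method `toUuid` and the `hash` flag are hypothetical names, and the use of the JDK's name-based (MD5, version 3) UUID is an assumption about one possible hashing strategy.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper, not part of the commit: derives a record identifier
// from the resource URI found in the feed.
final class UriToUuid {

    /**
     * Returns the URI unchanged when hashing is disabled, otherwise a
     * deterministic name-based UUID computed from the URI bytes.
     */
    static String toUuid(String uri, boolean hash) {
        if (!hash) {
            return uri;
        }
        // nameUUIDFromBytes always yields the same UUID for the same input.
        return UUID.nameUUIDFromBytes(uri.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        String uri = "https://example.org/dataset/1"; // made-up URI for illustration
        System.out.println(toUuid(uri, false));
        System.out.println(toUuid(uri, true));
    }
}
```

Hashing keeps identifiers free of characters that are awkward in URLs and file names, while the unhashed form keeps the original URI visible. Which one is preferable is exactly the open question in the task above.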

Co-authored-by: Mathieu Chaussier <[email protected]>
Co-authored-by: Gustaaf Van de Boel <[email protected]>
Co-authored-by: Stijn Goedertier <[email protected]>
4 people committed Jan 23, 2023
1 parent 0763d14 commit 18a0fdf
Showing 28 changed files with 2,795 additions and 135 deletions.
20 changes: 20 additions & 0 deletions common/src/main/java/org/fao/geonet/utils/Xml.java
@@ -135,6 +135,7 @@ public final class Xml {
+ "\uE000-\uFFFD"
+ "\ud800\udc00-\udbff\udfff"
+ "]";
public static final String XML_VERSION_HEADER = "<\\?xml version='1.0' encoding='.*'\\?>\\s*";

public static SAXBuilder getSAXBuilder(boolean validate) {
SAXBuilder builder = getSAXBuilderWithPathXMLResolver(validate, null);
@@ -1128,6 +1129,8 @@ public static boolean isXMLLike(String inXMLStr) {
Pattern pattern;
Matcher matcher;

inXMLStr = inXMLStr.replaceFirst(XML_VERSION_HEADER, "");

// Regular expression to see if it starts and ends with the same element or
// it's a self-closing element.
final String XML_PATTERN_STR = "<(\\S+?)(.*?)>(.*?)</\\1>|<(\\S+?)(.*?)/>";
@@ -1146,6 +1149,23 @@ public static boolean isXMLLike(String inXMLStr) {
return retBool;
}

/**
* Check if is XML and the first tag local name
* is rdf or something like a DCAT feed.
*/
public static boolean isRDFLike(String inXMLStr) {
boolean retBool = false;
if (isXMLLike(inXMLStr)) {
String xml = inXMLStr.replaceFirst(XML_VERSION_HEADER, ""),
firstTag = xml
.substring(0, xml.indexOf(" "))
.toLowerCase();
retBool = firstTag.matches("<.*:(rdf|catalog|catalogrecord)");
}
return retBool;
}


private static class JeevesURIResolver implements URIResolver {

/**
13 changes: 12 additions & 1 deletion common/src/test/java/org/fao/geonet/utils/XmlTest.java
@@ -192,5 +192,16 @@ public void testGetXPathExprAttribute() throws Exception {
assertSame(attribute, actual.get(0));
}


@Test
public void testIsXmlLike() {
assertEquals(true,
Xml.isXMLLike("<selfclosingtag attribute=\"\"/>"));
assertEquals(true,
Xml.isXMLLike("<tag attribute=\"\"></tag>"));
assertEquals(true,
Xml.isXMLLike("<?xml version='1.0' encoding='utf-8'?>\n<tag attribute=\"\"></tag>"));
assertEquals(true,
Xml.isRDFLike("<?xml version='1.0' encoding='utf-8'?>\n<rdf:RDF \n" +
" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"/>"));
}
}
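
A note on `isRDFLike` in `Xml.java`: it takes the first tag to be everything up to the first space character. A root element with no space anywhere in the document (for example `<rdf:RDF/>`) would make `substring` throw, and a tag name followed directly by a newline may not be recognised. The following is a minimal, standalone sketch of a regex-based alternative. The class `RdfSniffer` and the method `looksLikeRdf` are hypothetical names that are not part of this commit, and accepting double-quoted XML declarations is an added assumption.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the commit.
final class RdfSniffer {

    // Optional XML declaration; accepts any attribute quoting style.
    private static final Pattern XML_DECLARATION =
        Pattern.compile("^\\s*<\\?xml[^>]*\\?>\\s*");

    // First start tag: a prefix, a colon, then the local name, which must be
    // followed by whitespace, '/' or '>'. Like the committed regex, a
    // namespace prefix is required.
    private static final Pattern FIRST_TAG =
        Pattern.compile("^<[\\w.\\-]+:([\\w.\\-]+)(?=[\\s/>])");

    static boolean looksLikeRdf(String content) {
        if (content == null) {
            return false;
        }
        String body = XML_DECLARATION.matcher(content).replaceFirst("");
        Matcher matcher = FIRST_TAG.matcher(body);
        if (!matcher.find()) {
            return false;
        }
        String localName = matcher.group(1).toLowerCase(Locale.ROOT);
        return localName.equals("rdf")
            || localName.equals("catalog")
            || localName.equals("catalogrecord");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeRdf("<rdf:RDF/>"));
        System.out.println(looksLikeRdf("<tag attribute=\"\"></tag>"));
    }
}
```

Unlike the committed method, this sketch does not call `isXMLLike` first, so it only inspects the root element name and says nothing about whether the rest of the document is well-formed.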
16 changes: 16 additions & 0 deletions core/src/main/java/org/fao/geonet/util/XslUtil.java
@@ -761,6 +761,22 @@ public static String wktGeomToBbox(Object WKT) throws Exception {
return ret;
}

public static String geoJsonGeomToBbox(Object WKT) throws Exception {
String ret = "";
try {
Geometry geometry = new GeometryJSON().read(WKT);
if (geometry != null) {
final Envelope envelope = geometry.getEnvelopeInternal();
return
String.format("%f|%f|%f|%f",
envelope.getMinX(), envelope.getMinY(),
envelope.getMaxX(), envelope.getMaxY());
}
} catch (Throwable e) {
}
return ret;
}

/**
* Get field value for metadata identified by uuid.
*
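
A note on `geoJsonGeomToBbox` in `XslUtil.java`: `String.format` without an explicit `Locale` uses the JVM default, so on a server whose default locale uses a decimal comma the `%f` values would be written as `4,000000` rather than `4.000000`. The sketch below shows the same `minX|minY|maxX|maxY` formatting with `Locale.ROOT`. It is a standalone illustration that works on plain coordinate arrays instead of a JTS `Geometry`; the class `BboxFormatter` and the method `toBbox` are hypothetical names that are not part of this commit.

```java
import java.util.Locale;

// Hypothetical helper, not part of the commit.
final class BboxFormatter {

    /**
     * Computes the envelope of a list of [x, y] positions and formats it as
     * "minX|minY|maxX|maxY". Returns an empty string when there are no
     * positions, mirroring the empty default of the committed method.
     */
    static String toBbox(double[][] positions) {
        if (positions == null || positions.length == 0) {
            return "";
        }
        double minX = Double.POSITIVE_INFINITY;
        double minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY;
        double maxY = Double.NEGATIVE_INFINITY;
        for (double[] position : positions) {
            minX = Math.min(minX, position[0]);
            minY = Math.min(minY, position[1]);
            maxX = Math.max(maxX, position[0]);
            maxY = Math.max(maxY, position[1]);
        }
        // Locale.ROOT forces '.' as the decimal separator on every JVM.
        return String.format(Locale.ROOT, "%f|%f|%f|%f", minX, minY, maxX, maxY);
    }

    public static void main(String[] args) {
        double[][] ring = {
            {4.0, 50.5}, {5.5, 50.5}, {5.5, 51.25}, {4.0, 51.25}, {4.0, 50.5}
        };
        System.out.println(toBbox(ring)); // prints 4.000000|50.500000|5.500000|51.250000
    }
}
```

The committed method also swallows every `Throwable` silently. Logging the exception before returning the empty string would make malformed geometries in a feed easier to diagnose.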
5 changes: 5 additions & 0 deletions harvesters/pom.xml
@@ -49,6 +49,11 @@
<groupId>com.github.lookfirst</groupId>
<artifactId>sardine</artifactId>
</dependency>
<dependency>
<groupId>org.apache.jena</groupId>
<artifactId>apache-jena-libs</artifactId>
<type>pom</type>
</dependency>
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
@@ -21,7 +21,7 @@
//=== Rome - Italy. email: [email protected]
//==============================================================================

package org.fao.geonet.kernel.harvest.harvester.simpleUrl;
package org.fao.geonet.kernel.harvest.harvester.simpleurl;

import jeeves.server.context.ServiceContext;
import org.fao.geonet.GeonetContext;