[Backport 4.2.x] WebDav harvester / Add support for XSLT filter process (geonetwork#8423)

* WebDav harvester / Add support for XSLT filter process

* Documentation / update documentation for harvesters and add Geonetwork 2.1-3.X harvester page

* WebDav harvester / Use current date for metadata change date if can't be retrieved from the remote metadata

* Documentation / remove Z39.50 harvester documentation.

The harvester is no longer available in GeoNetwork 4.x

* Documentation / Unify harvesters configuration

* Update web-ui/src/main/resources/catalog/templates/admin/harvest/type/webdav.html

Co-authored-by: François Prunayre <[email protected]>

* Update harvesters/src/main/java/org/fao/geonet/kernel/harvest/harvester/webdav/Harvester.java

Co-authored-by: François Prunayre <[email protected]>

* HarvesterUtil / Use conversions from schema (#106)

Cf geonetwork#6772

---------

Co-authored-by: Jose García <[email protected]>
Co-authored-by: François Prunayre <[email protected]>
3 people authored Oct 14, 2024
1 parent 0af6926 commit 5251764
Showing 37 changed files with 599 additions and 537 deletions.
46 changes: 34 additions & 12 deletions docs/manual/docs/user-guide/harvesting/harvesting-csw.md
@@ -4,16 +4,38 @@ This harvester will connect to a remote CSW server and retrieve metadata records

## Adding a CSW harvester

The figure above shows the options available:

- **Site** - Options about the remote site.
- *Name* - This is a short description of the remote site. It will be shown in the harvesting main page as the name for this instance of the CSW harvester.
- *Service URL* - The URL of the capabilities document of the CSW server to be harvested. eg. <http://geonetwork-site.com/srv/eng/csw?service=CSW&request=GetCapabilities&version=2.0.2>. This document is used to discover the location of the services to call to query and retrieve metadata.
- *Icon* - An icon to assign to harvested metadata. The icon will be used when showing harvested metadata records in the search results.
- *Use account* - Account credentials for basic HTTP authentication on the CSW server.
- **Search criteria** - Using the Add button, you can add several search criteria. You can query only the fields recognised by the CSW protocol.
- **Options** - Scheduling options.
- **Options** - Specific harvesting options for this harvester.
- *Validate* - If checked, the metadata will be validated after retrieval. If the validation does not pass, the metadata will be skipped.
To create a CSW harvester go to `Admin console` > `Harvesting` and select `Harvest from` > `CSW`:

![](img/add-csw-harvester.png)

Provide the following information:

- **Identification**
- *Node name and logo*: A unique name for the harvester and, optionally, a logo to assign to the harvester.
- *Group*: Group which owns the harvested records. Only the catalog administrator or users with the profile `UserAdmin` of this group can manage the harvester.
- *User*: User who owns the harvested records.

- **Schedule**: Scheduling options to execute the harvester. If disabled, the harvester must be run manually from the harvester page. If enabled, a scheduling expression using cron syntax should be configured ([See examples](https://www.quartz-scheduler.org/documentation/quartz-2.1.7/tutorials/crontrigger)).

- **Configure connection to OGC CSW 2.0.2**
- *Service URL*: The URL of the capabilities document of the CSW server to be harvested. eg. <http://geonetwork-site.com/srv/eng/csw?service=CSW&request=GetCapabilities&version=2.0.2>. This document is used to discover the location of the services to call to query and retrieve metadata.
- *Remote authentication*: If checked, provide the credentials for basic HTTP authentication on the CSW server.
- *Search filter*: (Optional) Define the search criteria below to restrict the records to harvest.
- *Search options*:
- *Sort by*: Defines the sort order used to retrieve the results. Sorting by 'identifier:A' means sorting by UUID in ascending alphabetical order. Any CSW queryable can be combined with A (ascending) or D (descending) to set the ordering.
- *Output Schema*: The metadata standard to request the metadata records from the CSW server.
- *Distributed search*: Enables distributed search on the remote server (if the remote server supports it). When this option is enabled, the remote catalog cascades the search to the federated CSW servers it has configured.

- **Configure response processing for CSW**
- *Action on UUID collision*: When a harvester finds the same UUID on a record collected by another method (another harvester, importer, dashboard editor, ...), should this record be skipped (default), overridden, or assigned a new UUID?
- *Validate records before import*: Defines the criteria to reject metadata that is invalid according to XML structure (XSD) and validation rules (schematron).
- Accept all metadata without validation.
- Accept metadata that are XSD valid.
- Accept metadata that are XSD and schematron valid.
- *Check for duplicate resources based on the resource identifier*: If checked, ignores metadata with a resource identifier (`gmd:identificationInfo/*/gmd:citation/gmd:CI_Citation/gmd:identifier/*/gmd:code/gco:CharacterString`) that is already assigned to another metadata record in the catalog. It only applies to records in ISO19139 or ISO profiles.
- *XPath filter*: (Optional) When a record is retrieved from the remote server, an XPath expression is evaluated to accept or discard the record.
- *XSL transformation to apply*: (Optional) The referenced XSL transform will be applied to each metadata record before it is added to GeoNetwork.
- *Batch edits*: (Optional) Allows updating harvested records using XPath syntax. It can be used to add, replace or delete elements.
- *Category*: (Optional) A GeoNetwork category to assign to each metadata record.

- **Privileges** - Assign privileges to harvested metadata.
- **Categories**
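
The duplicate check on the resource identifier described above can be sketched as follows. This is only an illustration of what the documented XPath selects, not the harvester's actual implementation: the function names, the sample record and the `gco` namespace binding are assumptions added for the example, and only Python's standard library is used.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used by the documented XPath (ISO 19139).
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# The path quoted in the documentation, relative to the record root.
RESOURCE_ID_PATH = (
    "gmd:identificationInfo/*/gmd:citation/gmd:CI_Citation/"
    "gmd:identifier/*/gmd:code/gco:CharacterString"
)


def resource_identifiers(record_xml):
    """Return the non-empty resource identifiers found in one record."""
    root = ET.fromstring(record_xml)
    return [
        el.text.strip()
        for el in root.findall(RESOURCE_ID_PATH, NS)
        if el.text and el.text.strip()
    ]


def is_duplicate(record_xml, known_identifiers):
    """True if any identifier of the record is already known locally."""
    return any(i in known_identifiers for i in resource_identifiers(record_xml))


# Hypothetical minimal record, for illustration only.
SAMPLE = """<gmd:MD_Metadata
    xmlns:gmd="http://www.isotc211.org/2005/gmd"
    xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:identificationInfo>
    <gmd:MD_DataIdentification>
      <gmd:citation>
        <gmd:CI_Citation>
          <gmd:identifier>
            <gmd:MD_Identifier>
              <gmd:code>
                <gco:CharacterString>dataset-001</gco:CharacterString>
              </gmd:code>
            </gmd:MD_Identifier>
          </gmd:identifier>
        </gmd:CI_Citation>
      </gmd:citation>
    </gmd:MD_DataIdentification>
  </gmd:identificationInfo>
</gmd:MD_Metadata>"""

if __name__ == "__main__":
    print(resource_identifiers(SAMPLE))
    print(is_duplicate(SAMPLE, {"dataset-001"}))
```

A record whose identifier set overlaps the identifiers already in the catalog would be skipped under this option; records in non-ISO schemas simply yield no identifiers with this path.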
46 changes: 30 additions & 16 deletions docs/manual/docs/user-guide/harvesting/harvesting-filesystem.md
@@ -4,21 +4,35 @@ This harvester will harvest metadata as XML files from a filesystem available on

## Adding a Local File System harvester

The figure above shows the options available:

- **Site** - Options about the remote site.
- *Name* - This is a short description of the filesystem harvester. It will be shown in the harvesting main page as the name for this instance of the Local Filesystem harvester.
- *Directory* - The path name of the directory containing the metadata (as XML files) to be harvested.
- *Recurse* - If checked and the *Directory* path contains other directories, then the harvester will traverse the entire file system tree in that directory and add all metadata files found.
- *Keep local if deleted at source* - If checked then metadata records that have already been harvested will be kept even if they have been deleted from the *Directory* specified.
- *Icon* - An icon to assign to harvested metadata. The icon will be used when showing harvested metadata records in the search results.
- **Options** - Scheduling options.
- **Harvested Content** - Options that are applied to harvested content.
- *Apply this XSLT to harvested records* - Choose an XSLT here that will convert harvested records to a different format.
- *Validate* - If checked, the metadata will be validated after retrieval. If the validation does not pass, the metadata will be skipped.
- **Privileges** - Assign privileges to harvested metadata.
- **Categories**
To create a Local File System harvester go to `Admin console` > `Harvesting` and select `Harvest from` > `Directory`:

![](img/add-filesystem-harvester.png)

Provide the following information:

!!! Notes
- **Identification**
- *Node name and logo*: A unique name for the harvester and, optionally, a logo to assign to the harvester.
- *Group*: Group which owns the harvested records. Only the catalog administrator or users with the profile `UserAdmin` of this group can manage the harvester.
- *User*: User who owns the harvested records.

- in order to be successfully harvested, metadata records retrieved from the file system must match a metadata schema in the local GeoNetwork instance
- **Schedule**: Scheduling options to execute the harvester. If disabled, the harvester must be run manually from the harvester page. If enabled, a scheduling expression using cron syntax should be configured ([See examples](https://www.quartz-scheduler.org/documentation/quartz-2.1.7/tutorials/crontrigger)).

- **Configure connection to Directory**
- *Directory*: The path name of the directory containing the metadata (as XML files) to be harvested. The directory must be accessible by GeoNetwork.
- *Also search in subfolders*: If checked and the *Directory* path contains other directories, then the harvester will traverse the entire file system tree in that directory and add all metadata files found.
- *Script to run before harvesting*
- *Type of record*

- **Configure response processing for filesystem**
- *Action on UUID collision*: When a harvester finds the same UUID on a record collected by another method (another harvester, importer, dashboard editor, ...), should this record be skipped (default), overridden, or assigned a new UUID?
- *Update catalog record only if file was updated*
- *Keep local even if deleted at source*: If checked then metadata records that have already been harvested will be kept even if they have been deleted from the *Directory* specified.
- *Validate records before import*: Defines the criteria to reject metadata that is invalid according to XML structure (XSD) and validation rules (schematron).
- Accept all metadata without validation.
- Accept metadata that are XSD valid.
- Accept metadata that are XSD and schematron valid.
- *XSL transformation to apply*: (Optional) The referenced XSL transform will be applied to each metadata record before it is added to GeoNetwork.
- *Batch edits*: (Optional) Allows updating harvested records using XPath syntax. It can be used to add, replace or delete elements.
- *Category*: (Optional) A GeoNetwork category to assign to each metadata record.

- **Privileges** - Assign privileges to harvested metadata.
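
The **Schedule** option expects a cron expression in Quartz syntax, which differs from Unix cron by having a leading seconds field and an optional trailing year field. The sketch below only splits an expression into its named fields and lists a few sample expressions; the helper name and the chosen schedules are assumptions for illustration, and it does not validate the values inside each field.

```python
# Quartz cron fields, in order; the seventh (year) is optional.
QUARTZ_FIELDS = (
    "seconds",
    "minutes",
    "hours",
    "day_of_month",
    "month",
    "day_of_week",
    "year",
)

# Illustrative schedules (not GeoNetwork defaults).
EXAMPLES = {
    "0 0 3 * * ?": "every day at 03:00",
    "0 0/30 * * * ?": "every 30 minutes",
    "0 15 10 ? * MON-FRI": "at 10:15, Monday to Friday",
}


def describe_fields(expression):
    """Map each part of a Quartz cron expression to its field name."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError(
            "a Quartz cron expression has 6 or 7 fields, got %d" % len(parts)
        )
    return dict(zip(QUARTZ_FIELDS, parts))


if __name__ == "__main__":
    for expression, meaning in EXAMPLES.items():
        print(expression, "->", meaning, describe_fields(expression))
```

A five-field Unix-style expression such as `0 3 * * *` is rejected by this check, which is a common mistake when moving schedules from crontab to Quartz.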
@@ -0,0 +1,9 @@
# GeoNetwork 2.0 Harvester {#gn2_harvester}

## Upgrading from GeoNetwork 2.0 Guidance

GeoNetwork 2.1 introduced a new powerful harvesting engine which is not compatible with GeoNetwork version 2.0 based catalogues.

* Harvesting metadata from a v2.0 server requires this harvesting type.
* Old 2.0 servers can still harvest from 2.1 servers
* Due to the fact that GeoNetwork 2.0 is no longer suitable for production use, this harvesting type is deprecated.
46 changes: 40 additions & 6 deletions docs/manual/docs/user-guide/harvesting/harvesting-geonetwork.md
@@ -1,9 +1,43 @@
# GeoNetwork 2.0 Harvester {#gn2_harvester}
# GeoNetwork 2.1-3.X Harvester

## Upgrading from GeoNetwork 2.0 Guidance
This harvester will connect to a remote GeoNetwork server that uses versions from 2.1-3.X and retrieve metadata records that match the query parameters.

GeoNetwork 2.1 introduced a new powerful harvesting engine which is not compatible with GeoNetwork version 2.0 based catalogues.
## Adding a GeoNetwork 2.1-3.X harvester

* Harvesting metadata from a v2.0 server requires this harvesting type.
* Old 2.0 servers can still harvest from 2.1 servers
* Due to the fact that GeoNetwork 2.0 is no longer suitable for production use, this harvesting type is deprecated.
To create a GeoNetwork 2.1-3.X harvester go to `Admin console` > `Harvesting` and select `Harvest from` > `GeoNetwork (from 2.1 to 3.x)`:

![](img/add-geonetwork-3-harvester.png)

Provide the following information:

- **Identification**
- *Node name and logo*: A unique name for the harvester and, optionally, a logo to assign to the harvester.
- *Group*: Group which owns the harvested records. Only the catalog administrator or users with the profile `UserAdmin` of this group can manage the harvester.
- *User*: User who owns the harvested records.

- **Schedule**: Scheduling options to execute the harvester. If disabled, the harvester must be run manually from the harvester page. If enabled, a scheduling expression using cron syntax should be configured ([See examples](https://www.quartz-scheduler.org/documentation/quartz-2.1.7/tutorials/crontrigger)).

- **Configure connection to GeoNetwork (from 2.1 to 3.x)**
- *Catalog URL*:
- The remote URL of the GeoNetwork server from which metadata will be harvested. The URL should contain the catalog name, for example: http://www.fao.org/geonetwork.
- Additionally, the node name should be configured, usually with the value `srv`.
- *Search filter*: (Optional) Define the filter to retrieve the remote metadata.
- *Catalog*: (Optional) Select the portal in the remote server to harvest.

- **Configure response processing for GeoNetwork**
- *Action on UUID collision*: When a harvester finds the same UUID on a record collected by another method (another harvester, importer, dashboard editor, ...), should this record be skipped (default), overridden, or assigned a new UUID?
- *Remote authentication*: If checked, provide the credentials for basic HTTP authentication on the remote server.
- *Use full MEF format*: If checked, uses MEF format instead of XML to retrieve the remote metadata. Recommended for metadata with attached files.
- *Use change date for comparison*: If checked, uses change date to detect changes on remote server.
- *Set category if it exists locally*: If checked, uses the category set on the metadata in the remote server also locally (assuming it exists locally). Applies only when using MEF format for the harvesting.
- *Category*: (Optional) A GeoNetwork category to assign to each metadata record.
- *XSL filter name to apply*: (Optional) The XSL filter is applied to each metadata record. The filter is a process which depends on the schema (see the `process` folder of the schemas).

It can include parameters, which are passed to the XSL transformation using the following syntax: `anonymizer?protocol=MYLOCALNETWORK:FILEPATH&[email protected]&thesaurus=MYORGONLYTHEASURUS`

- *Validate records before import*: Defines the criteria to reject metadata that is invalid according to XML structure (XSD) and validation rules (schematron).
- Accept all metadata without validation.
- Accept metadata that are XSD valid.
- Accept metadata that are XSD and schematron valid.

- **Privileges** - Assign privileges to harvested metadata.
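
The *XSL filter name to apply* value follows a `process?key=value&key=value` shape. As a rough sketch of how such a value breaks down into a process name and its parameters, the snippet below uses Python's standard `urllib.parse`; the function name and the sample parameter values are hypothetical, and this is not how GeoNetwork itself parses the setting.

```python
from urllib.parse import parse_qsl


def parse_xsl_filter(value):
    """Split 'process?key=value&...' into (process name, parameters)."""
    name, _, query = value.partition("?")
    params = dict(parse_qsl(query, keep_blank_values=True))
    return name, params


if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    example = "anonymizer?protocol=MYLOCALNETWORK:FILEPATH&thesaurus=MYTHESAURUS"
    print(parse_xsl_filter(example))
```

A value without a `?` is treated as a bare process name with no parameters.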
44 changes: 29 additions & 15 deletions docs/manual/docs/user-guide/harvesting/harvesting-geoportal.md
@@ -4,24 +4,38 @@ This harvester will connect to a remote GeoPortal version 9.3.x or 10.x server a

## Adding a GeoPortal REST harvester

The figure above shows the options available:

- **Site** - Options about the remote site.
- *Name* - This is a short description of the remote site. It will be shown in the harvesting main page as the name for this instance of the GeoPortal REST harvester.
- *Base URL* - The base URL of the GeoPortal server to be harvested. eg. <http://yourhost.com/geoportal>. The harvester will add the additional path required to access the REST services on the GeoPortal server.
- *Icon* - An icon to assign to harvested metadata. The icon will be used when showing harvested metadata records in the search results.
- **Search criteria** - Using the Add button, you can add several search criteria. You can query any field on the GeoPortal server using the Lucene query syntax described at <http://webhelp.esri.com/geoportal_extension/9.3.1/index.htm#srch_lucene.htm>.
- **Options** - Scheduling options.
- **Harvested Content** - Options that are applied to harvested content.
- *Apply this XSLT to harvested records* - Choose an XSLT here that will convert harvested records to a different format. See notes section below for typical usage.
- *Validate* - If checked, the metadata will be validated after retrieval. If the validation does not pass, the metadata will be skipped.
To create a GeoPortal REST harvester go to `Admin console` > `Harvesting` and select `Harvest from` > `GeoPortal REST`:

![](img/add-geoportalrest-harvester.png)

Provide the following information:

- **Identification**
- *Node name and logo*: A unique name for the harvester and, optionally, a logo to assign to the harvester.
- *Group*: Group which owns the harvested records. Only the catalog administrator or users with the profile `UserAdmin` of this group can manage the harvester.
- *User*: User who owns the harvested records.

- **Schedule**: Scheduling options to execute the harvester. If disabled, the harvester must be run manually from the harvester page. If enabled, a scheduling expression using cron syntax should be configured ([See examples](https://www.quartz-scheduler.org/documentation/quartz-2.1.7/tutorials/crontrigger)).

- **Configure connection to GeoPortal REST**
- *URL*: The base URL of the GeoPortal server to be harvested. eg. <http://yourhost.com/geoportal>. The harvester will add the additional path required to access the REST services on the GeoPortal server.
- *Remote authentication*: If checked, provide the credentials for basic HTTP authentication on the server.
- *Search filter*: (Optional) You can query any field on the GeoPortal server using the Lucene query syntax described at <http://webhelp.esri.com/geoportal_extension/9.3.1/index.htm#srch_lucene.htm>.

- **Configure response processing for geoPREST**
- *Validate records before import*: Defines the criteria to reject metadata that is invalid according to XML structure (XSD) and validation rules (schematron).
- Accept all metadata without validation.
- Accept metadata that are XSD valid.
- Accept metadata that are XSD and schematron valid.
- *XSL transformation to apply*: (Optional) The referenced XSL transform will be applied to each metadata record before it is added to GeoNetwork.

- **Privileges** - Assign privileges to harvested metadata.
- **Categories**


!!! Notes

- this harvester uses two REST services from the GeoPortal API:
- This harvester uses two REST services from the GeoPortal API:
- `rest/find/document` with searchText parameter to return an RSS listing of metadata records that meet the search criteria (maximum 100000)
- `rest/document` with id parameter from each result returned in the RSS listing
- this harvester has been tested with GeoPortal 9.3.x and 10.x. It can be used in preference to the CSW harvester if there are issues with the handling of the OGC standards etc.
- typically ISO19115 metadata produced by the Geoportal software will not have a 'gmd' prefix for the namespace `http://www.isotc211.org/2005/gmd`. GeoNetwork XSLTs will not have any trouble understanding this metadata but will not be able to map titles and codelists in the viewer/editor. To fix this problem, please select the ``Add-gmd-prefix`` XSLT for the *Apply this XSLT to harvested records* in the **Harvested Content** set of options described earlier
- This harvester has been tested with GeoPortal 9.3.x and 10.x. It can be used in preference to the CSW harvester if there are issues with the handling of the OGC standards etc.
- Typically, ISO19115 metadata produced by the Geoportal software will not have a 'gmd' prefix for the namespace `http://www.isotc211.org/2005/gmd`. GeoNetwork XSLTs will not have any trouble understanding this metadata but will not be able to map titles and codelists in the viewer/editor. To fix this problem, select the ``Add-gmd-prefix`` XSLT for *XSL transformation to apply* in the **Configure response processing for geoPREST** set of options described earlier.
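
The two REST services listed in the notes above could be addressed roughly as follows. This is a sketch under assumptions: it appends the service paths directly to the configured base URL, whereas the documentation only says that the harvester adds the required path itself, and the function names are made up for the example.

```python
from urllib.parse import urlencode


def find_documents_url(base_url, search_text):
    """URL of the search service that returns the RSS listing."""
    query = urlencode({"searchText": search_text})
    return "%s/rest/find/document?%s" % (base_url.rstrip("/"), query)


def document_url(base_url, record_id):
    """URL of the service that returns one record by its id."""
    query = urlencode({"id": record_id})
    return "%s/rest/document?%s" % (base_url.rstrip("/"), query)


if __name__ == "__main__":
    base = "http://yourhost.com/geoportal"
    print(find_documents_url(base, "roads"))
    print(document_url(base, "abc-123"))
```

In this sketch the listing is requested once with the search filter, and each id returned in the RSS listing is then fetched individually.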