diff --git a/docs/manual/docs/user-guide/harvesting/harvesting-csw.md b/docs/manual/docs/user-guide/harvesting/harvesting-csw.md
index 614687eb471..dc94a777d4a 100644
--- a/docs/manual/docs/user-guide/harvesting/harvesting-csw.md
+++ b/docs/manual/docs/user-guide/harvesting/harvesting-csw.md
@@ -4,16 +4,38 @@ This harvester will connect to a remote CSW server and retrieve metadata records

 ## Adding a CSW harvester

-The figure above shows the options available:
-
-- **Site** - Options about the remote site.
-    - *Name* - This is a short description of the remote site. It will be shown in the harvesting main page as the name for this instance of the CSW harvester.
-    - *Service URL* - The URL of the capabilities document of the CSW server to be harvested. eg. . This document is used to discover the location of the services to call to query and retrieve metadata.
-    - *Icon* - An icon to assign to harvested metadata. The icon will be used when showing harvested metadata records in the search results.
-    - *Use account* - Account credentials for basic HTTP authentication on the CSW server.
-- **Search criteria** - Using the Add button, you can add several search criteria. You can query only the fields recognised by the CSW protocol.
-- **Options** - Scheduling options.
-- **Options** - Specific harvesting options for this harvester.
-    - *Validate* - If checked, the metadata will be validated after retrieval. If the validation does not pass, the metadata will be skipped.
+To create a CSW harvester go to `Admin console` > `Harvesting` and select `Harvest from` > `CSW`:
+
+![](img/add-csw-harvester.png)
+
+Provide the following information:
+
+- **Identification**
+    - *Node name and logo*: A unique name for the harvester and, optionally, a logo to assign to the harvester.
+    - *Group*: Group which owns the harvested records. Only the catalog administrator or users with the profile `UserAdmin` of this group can manage the harvester.
+    - *User*: User who owns the harvested records.
+
+- **Schedule**: Scheduling options to execute the harvester. If disabled, the harvester must be run manually from the harvester page. If enabled, a scheduling expression using cron syntax should be configured ([See examples](https://www.quartz-scheduler.org/documentation/quartz-2.1.7/tutorials/crontrigger)).
+
+- **Configure connection to OGC CSW 2.0.2**
+    - *Service URL*: The URL of the capabilities document of the CSW server to be harvested. eg. . This document is used to discover the location of the services to call to query and retrieve metadata.
+    - *Remote authentication*: If checked, credentials for basic HTTP authentication on the CSW server must be provided.
+    - *Search filter*: (Optional) Define the search criteria below to restrict the records to harvest.
+    - *Search options*:
+        - *Sort by*: Defines the sort order in which results are retrieved. Sorting by `identifier:A` means sorting by UUID in ascending alphabetical order. Any CSW queryable can be used, combined with `A` (ascending) or `D` (descending), to set the ordering.
+        - *Output Schema*: The metadata standard to request the metadata records from the CSW server.
+        - *Distributed search*: Enables distributed search on the remote server (if the remote server supports it). When this option is enabled, the remote catalog cascades the search to the federated CSW servers it has configured.
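+
+The *Schedule* option above uses Quartz cron expressions: six space-separated fields (seconds, minutes, hours, day-of-month, month, day-of-week) plus an optional year. As a purely illustrative sketch (the schedules below are examples, not defaults):
+
+```
+0 0 2 * * ?     # hypothetical: run every day at 02:00
+0 30 4 ? * MON  # hypothetical: run every Monday at 04:30
+```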
+
+- **Configure response processing for CSW**
+    - *Action on UUID collision*: When a harvester finds the same uuid on a record collected by another method (another harvester, importer, dashboard editor, ...), should this record be skipped (default), overridden, or assigned a new UUID?
+    - *Validate records before import*: Defines the criteria to reject metadata that is invalid according to XML structure (XSD) and validation rules (schematron).
+        - Accept all metadata without validation.
+        - Accept metadata that are XSD valid.
+        - Accept metadata that are XSD and schematron valid.
+    - *Check for duplicate resources based on the resource identifier*: If checked, ignores metadata with a resource identifier (`gmd:identificationInfo/*/gmd:citation/gmd:CI_Citation/gmd:identifier/*/gmd:code/gco:CharacterString`) that is already assigned to another metadata record in the catalog. It only applies to records in ISO19139 or ISO profiles.
+    - *XPath filter*: (Optional) When a record is retrieved from the remote server, an XPath expression is evaluated to accept or discard the record.
+    - *XSL transformation to apply*: (Optional) The referenced XSL transform will be applied to each metadata record before it is added to GeoNetwork.
+    - *Batch edits*: (Optional) Allows updating harvested records using XPath syntax. It can be used to add, replace or delete elements.
+    - *Category*: (Optional) A GeoNetwork category to assign to each metadata record.
+
+- **Privileges** - Assign privileges to harvested metadata.
-- **Categories**
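+
+As an illustrative sketch of the *XPath filter* option above (assuming ISO19139 records; the exact expression form expected may vary across GeoNetwork versions), a filter that accepts only dataset-level records could look like:
+
+```
+/gmd:MD_Metadata[gmd:hierarchyLevel/gmd:MD_ScopeCode/@codeListValue = 'dataset']
+```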
diff --git a/docs/manual/docs/user-guide/harvesting/harvesting-filesystem.md b/docs/manual/docs/user-guide/harvesting/harvesting-filesystem.md
index 5e0b6b3ab54..900deeafc4c 100644
--- a/docs/manual/docs/user-guide/harvesting/harvesting-filesystem.md
+++ b/docs/manual/docs/user-guide/harvesting/harvesting-filesystem.md
@@ -4,21 +4,35 @@ This harvester will harvest metadata as XML files from a filesystem available on

 ## Adding a Local File System harvester

-The figure above shows the options available:
-
-- **Site** - Options about the remote site.
-    - *Name* - This is a short description of the filesystem harvester. It will be shown in the harvesting main page as the name for this instance of the Local Filesystem harvester.
-    - *Directory* - The path name of the directory containing the metadata (as XML files) to be harvested.
-    - *Recurse* - If checked and the *Directory* path contains other directories, then the harvester will traverse the entire file system tree in that directory and add all metadata files found.
-    - *Keep local if deleted at source* - If checked then metadata records that have already been harvested will be kept even if they have been deleted from the *Directory* specified.
-    - *Icon* - An icon to assign to harvested metadata. The icon will be used when showing harvested metadata records in the search results.
-- **Options** - Scheduling options.
-- **Harvested Content** - Options that are applied to harvested content.
-    - *Apply this XSLT to harvested records* - Choose an XSLT here that will convert harvested records to a different format.
-    - *Validate* - If checked, the metadata will be validated after retrieval. If the validation does not pass, the metadata will be skipped.
-- **Privileges** - Assign privileges to harvested metadata.
-- **Categories**
+To create a Local File System harvester go to `Admin console` > `Harvesting` and select `Harvest from` > `Directory`:
+
+![](img/add-filesystem-harvester.png)
+
+Provide the following information:

-!!! Notes
+- **Identification**
+    - *Node name and logo*: A unique name for the harvester and, optionally, a logo to assign to the harvester.
+    - *Group*: Group which owns the harvested records. Only the catalog administrator or users with the profile `UserAdmin` of this group can manage the harvester.
+    - *User*: User who owns the harvested records.

-    - in order to be successfully harvested, metadata records retrieved from the file system must match a metadata schema in the local GeoNetwork instance
+- **Schedule**: Scheduling options to execute the harvester. If disabled, the harvester must be run manually from the harvester page. If enabled, a scheduling expression using cron syntax should be configured ([See examples](https://www.quartz-scheduler.org/documentation/quartz-2.1.7/tutorials/crontrigger)).
+
+- **Configure connection to Directory**
+    - *Directory*: The path name of the directory containing the metadata (as XML files) to be harvested. The directory must be accessible by GeoNetwork.
+    - *Also search in subfolders*: If checked and the *Directory* path contains other directories, then the harvester will traverse the entire file system tree in that directory and add all metadata files found.
+    - *Script to run before harvesting*
+    - *Type of record*
+
+- **Configure response processing for filesystem**
+    - *Action on UUID collision*: When a harvester finds the same uuid on a record collected by another method (another harvester, importer, dashboard editor, ...), should this record be skipped (default), overridden, or assigned a new UUID?
+    - *Update catalog record only if file was updated*
+    - *Keep local even if deleted at source*: If checked, metadata records that have already been harvested will be kept even if they have been deleted from the specified *Directory*.
+    - *Validate records before import*: Defines the criteria to reject metadata that is invalid according to XML structure (XSD) and validation rules (schematron).
+        - Accept all metadata without validation.
+        - Accept metadata that are XSD valid.
+        - Accept metadata that are XSD and schematron valid.
+    - *XSL transformation to apply*: (Optional) The referenced XSL transform will be applied to each metadata record before it is added to GeoNetwork.
+    - *Batch edits*: (Optional) Allows updating harvested records using XPath syntax. It can be used to add, replace or delete elements.
+    - *Category*: (Optional) A GeoNetwork category to assign to each metadata record.
+
+- **Privileges** - Assign privileges to harvested metadata.
diff --git a/docs/manual/docs/user-guide/harvesting/harvesting-geonetwork-2.md b/docs/manual/docs/user-guide/harvesting/harvesting-geonetwork-2.md
new file mode 100644
index 00000000000..de085a9bb9b
--- /dev/null
+++ b/docs/manual/docs/user-guide/harvesting/harvesting-geonetwork-2.md
@@ -0,0 +1,9 @@
+# GeoNetwork 2.0 Harvester {#gn2_harvester}
+
+## Upgrading from GeoNetwork 2.0 Guidance
+
+GeoNetwork 2.1 introduced a new powerful harvesting engine which is not compatible with GeoNetwork version 2.0 based catalogues.
+
+* Harvesting metadata from a v2.0 server requires this harvesting type.
+* Old 2.0 servers can still harvest from 2.1 servers.
+* Because GeoNetwork 2.0 is no longer suitable for production use, this harvesting type is deprecated.
diff --git a/docs/manual/docs/user-guide/harvesting/harvesting-geonetwork.md b/docs/manual/docs/user-guide/harvesting/harvesting-geonetwork.md
index de085a9bb9b..3c692b5e3ec 100644
--- a/docs/manual/docs/user-guide/harvesting/harvesting-geonetwork.md
+++ b/docs/manual/docs/user-guide/harvesting/harvesting-geonetwork.md
@@ -1,9 +1,43 @@
-# GeoNetwork 2.0 Harvester {#gn2_harvester}
+# GeoNetwork 2.1-3.X Harvester

-## Upgrading from GeoNetwork 2.0 Guidance
+This harvester will connect to a remote GeoNetwork server that uses versions from 2.1-3.X and retrieve metadata records that match the query parameters.

-GeoNetwork 2.1 introduced a new powerful harvesting engine which is not compatible with GeoNetwork version 2.0 based catalogues.
+## Adding a GeoNetwork 2.1-3.X harvester

-* Harvesting metadata from a v2.0 server requires this harvesting type.
-* Old 2.0 servers can still harvest from 2.1 servers
-* Due to the fact that GeoNetwork 2.0 is no longer suitable for production use, this harvesting type is deprecated.
+To create a GeoNetwork 2.1-3.X harvester go to `Admin console` > `Harvesting` and select `Harvest from` > `GeoNetwork (from 2.1 to 3.x)`:
+
+![](img/add-geonetwork-3-harvester.png)
+
+Provide the following information:
+
+- **Identification**
+    - *Node name and logo*: A unique name for the harvester and, optionally, a logo to assign to the harvester.
+    - *Group*: Group which owns the harvested records. Only the catalog administrator or users with the profile `UserAdmin` of this group can manage the harvester.
+    - *User*: User who owns the harvested records.
+
+- **Schedule**: Scheduling options to execute the harvester. If disabled, the harvester must be run manually from the harvester page. If enabled, a scheduling expression using cron syntax should be configured ([See examples](https://www.quartz-scheduler.org/documentation/quartz-2.1.7/tutorials/crontrigger)).
+
+- **Configure connection to GeoNetwork (from 2.1 to 3.x)**
+    - *Catalog URL*:
+        - The remote URL of the GeoNetwork server from which metadata will be harvested. The URL should contain the catalog name, for example: http://www.fao.org/geonetwork.
+        - Additionally, the node name should be configured, usually the value `srv`.
+    - *Search filter*: (Optional) Define the filter to retrieve the remote metadata.
+    - *Catalog*: (Optional) Select the portal in the remote server to harvest.
+
+- **Configure response processing for GeoNetwork**
+    - *Action on UUID collision*: When a harvester finds the same uuid on a record collected by another method (another harvester, importer, dashboard editor, ...), should this record be skipped (default), overridden, or assigned a new UUID?
+    - *Remote authentication*: If checked, credentials for basic HTTP authentication on the remote GeoNetwork server must be provided.
+    - *Use full MEF format*: If checked, uses MEF format instead of XML to retrieve the remote metadata. Recommended for metadata with associated files.
+    - *Use change date for comparison*: If checked, uses the change date to detect changes on the remote server.
+    - *Set category if it exists locally*: If checked, the category set on the metadata in the remote server is also used locally (assuming it exists locally). Applies only when using MEF format for the harvesting.
+    - *Category*: (Optional) A GeoNetwork category to assign to each metadata record.
+    - *XSL filter name to apply*: (Optional) The XSL filter is applied to each metadata record. The filter is a process which depends on the schema (see the `process` folder of the schemas).
+
+        It can include parameters, which will be sent to the XSL transformation using the following syntax: `anonymizer?protocol=MYLOCALNETWORK:FILEPATH&email=gis@organisation.org&thesaurus=MYORGONLYTHEASURUS`
+
+    - *Validate records before import*: Defines the criteria to reject metadata that is invalid according to XML structure (XSD) and validation rules (schematron).
+        - Accept all metadata without validation.
+        - Accept metadata that are XSD valid.
+        - Accept metadata that are XSD and schematron valid.
+
+- **Privileges** - Assign privileges to harvested metadata.
diff --git a/docs/manual/docs/user-guide/harvesting/harvesting-geoportal.md b/docs/manual/docs/user-guide/harvesting/harvesting-geoportal.md
index e8887286ea3..ec16a07b9ae 100644
--- a/docs/manual/docs/user-guide/harvesting/harvesting-geoportal.md
+++ b/docs/manual/docs/user-guide/harvesting/harvesting-geoportal.md
@@ -4,24 +4,38 @@ This harvester will connect to a remote GeoPortal version 9.3.x or 10.x server a

 ## Adding a GeoPortal REST harvester

-The figure above shows the options available:
-
-- **Site** - Options about the remote site.
-    - *Name* - This is a short description of the remote site. It will be shown in the harvesting main page as the name for this instance of the GeoPortal REST harvester.
-    - *Base URL* - The base URL of the GeoPortal server to be harvested. eg. . The harvester will add the additional path required to access the REST services on the GeoPortal server.
-    - *Icon* - An icon to assign to harvested metadata. The icon will be used when showing harvested metadata records in the search results.
-- **Search criteria** - Using the Add button, you can add several search criteria. You can query any field on the GeoPortal server using the Lucene query syntax described at .
-- **Options** - Scheduling options.
-- **Harvested Content** - Options that are applied to harvested content.
-    - *Apply this XSLT to harvested records* - Choose an XSLT here that will convert harvested records to a different format. See notes section below for typical usage.
-    - *Validate* - If checked, the metadata will be validated after retrieval. If the validation does not pass, the metadata will be skipped.
+To create a GeoPortal REST harvester go to `Admin console` > `Harvesting` and select `Harvest from` > `GeoPortal REST`:
+
+![](img/add-geoportalrest-harvester.png)
+
+Provide the following information:
+
+- **Identification**
+    - *Node name and logo*: A unique name for the harvester and, optionally, a logo to assign to the harvester.
+    - *Group*: Group which owns the harvested records. Only the catalog administrator or users with the profile `UserAdmin` of this group can manage the harvester.
+    - *User*: User who owns the harvested records.
+
+- **Schedule**: Scheduling options to execute the harvester. If disabled, the harvester must be run manually from the harvester page. If enabled, a scheduling expression using cron syntax should be configured ([See examples](https://www.quartz-scheduler.org/documentation/quartz-2.1.7/tutorials/crontrigger)).
+
+- **Configure connection to GeoPortal REST**
+    - *URL*: The base URL of the GeoPortal server to be harvested. eg. . The harvester will add the additional path required to access the REST services on the GeoPortal server.
+    - *Remote authentication*: If checked, credentials for basic HTTP authentication on the server must be provided.
+    - *Search filter*: (Optional) You can query any field on the GeoPortal server using the Lucene query syntax described at .
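+
+As an illustrative sketch of that Lucene query syntax (the field names below are hypothetical; check which fields your GeoPortal instance indexes), a *Search filter* might look like:
+
+```
+title:water AND keywords:hydrology
+```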
+
+- **Configure response processing for geoPREST**
+    - *Validate records before import*: Defines the criteria to reject metadata that is invalid according to XML structure (XSD) and validation rules (schematron).
+        - Accept all metadata without validation.
+        - Accept metadata that are XSD valid.
+        - Accept metadata that are XSD and schematron valid.
+    - *XSL transformation to apply*: (Optional) The referenced XSL transform will be applied to each metadata record before it is added to GeoNetwork.
+
 - **Privileges** - Assign privileges to harvested metadata.
-- **Categories**
+
 !!! Notes

-    - this harvester uses two REST services from the GeoPortal API:
+    - This harvester uses two REST services from the GeoPortal API:
         - `rest/find/document` with searchText parameter to return an RSS listing of metadata records that meet the search criteria (maximum 100000)
         - `rest/document` with id parameter from each result returned in the RSS listing
-    - this harvester has been tested with GeoPortal 9.3.x and 10.x. It can be used in preference to the CSW harvester if there are issues with the handling of the OGC standards etc.
-    - typically ISO19115 metadata produced by the Geoportal software will not have a 'gmd' prefix for the namespace `http://www.isotc211.org/2005/gmd`. GeoNetwork XSLTs will not have any trouble understanding this metadata but will not be able to map titles and codelists in the viewer/editor. To fix this problem, please select the ``Add-gmd-prefix`` XSLT for the *Apply this XSLT to harvested records* in the **Harvested Content** set of options described earlier
+    - This harvester has been tested with GeoPortal 9.3.x and 10.x. It can be used in preference to the CSW harvester if there are issues with the handling of the OGC standards etc.
+    - Typically ISO19115 metadata produced by the Geoportal software will not have a 'gmd' prefix for the namespace `http://www.isotc211.org/2005/gmd`. GeoNetwork XSLTs will not have any trouble understanding this metadata but will not be able to map titles and codelists in the viewer/editor. To fix this problem, please select the ``Add-gmd-prefix`` XSLT for the *XSL transformation to apply* option described earlier.
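+
+    As a sketch, a search against the first of these services might look like the following request (the host and path prefix are hypothetical; `searchText` is the parameter named above):
+
+    ```
+    http://host.example.org/geoportal/rest/find/document?searchText=water
+    ```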
diff --git a/docs/manual/docs/user-guide/harvesting/harvesting-oaipmh.md b/docs/manual/docs/user-guide/harvesting/harvesting-oaipmh.md
index cf046363634..6c528feb7e2 100644
--- a/docs/manual/docs/user-guide/harvesting/harvesting-oaipmh.md
+++ b/docs/manual/docs/user-guide/harvesting/harvesting-oaipmh.md
@@ -1,36 +1,49 @@
 # OAIPMH Harvesting {#oaipmh_harvester}

-This is a harvesting protocol that is widely used among libraries. GeoNetwork implements version 2.0 of the protocol.
+This is a harvesting protocol that is widely used among libraries. GeoNetwork implements version 2.0 of the protocol. An OAI-PMH server implements this protocol, which GeoNetwork, acting as a client, can use to harvest metadata.

 ## Adding an OAI-PMH harvester

-An OAI-PMH server implements a harvesting protocol that GeoNetwork, acting as a client, can use to harvest metadata.
+To create an OAI-PMH harvester go to `Admin console` > `Harvesting` and select `Harvest from` > `OAI/PMH`:

-Configuration options:
+![](img/add-oaipmh-harvester.png)

-- **Site** - Options describing the remote site.
-    - *Name* - This is a short description of the remote site. It will be shown in the harvesting main page as the name for this instance of the OAIPMH harvester.
-    - *URL* - The URL of the OAI-PMH server from which metadata will be harvested.
-    - *Icon* - An icon to assign to harvested metadata. The icon will be used when showing search results.
-    - *Use account* - Account credentials for basic HTTP authentication on the OAIPMH server.
-- **Search criteria** - This allows you to select metadata records for harvest based on certain criteria:
-    - *From* - You can provide a start date here. Any metadata whose last change date is equal to or greater than this date will be harvested. To add or edit a value for this field you need to use the icon alongside the text box. This field is optional so if you don't provide a start date the constraint is dropped. Use the icon to clear the field.
-    - *Until* - Functions in the same way as the *From* parameter but adds an end constraint to the last change date search. Any metadata whose last change data is less than or equal to this data will be harvested.
-    - *Set* - An OAI-PMH server classifies metadata into sets (like categories in GeoNetwork). You can request all metadata records that belong to a set (and any of its subsets) by specifying the name of that set here.
-    - *Prefix* - 'Prefix' means metadata format. The oai_dc prefix must be supported by all OAI-PMH compliant servers.
-    - You can use the Add button to add more than one Search Criteria set. Search Criteria sets can be removed by clicking on the small cross at the top left of the set.
+Provide the following information:

-!!! note
+- **Identification**
+    - *Node name and logo*: A unique name for the harvester and, optionally, a logo to assign to the harvester.
+    - *Group*: Group which owns the harvested records. Only the catalog administrator or users with the profile `UserAdmin` of this group can manage the harvester.
+    - *User*: User who owns the harvested records.

-    the 'OAI provider sets' drop down next to the *Set* text box and the 'OAI provider prefixes' drop down next to the *Prefix* textbox are initially blank. After specifying the connection URL, you can press the **Retrieve Info** button, which will connect to the remote OAI-PMH server, retrieve all supported sets and prefixes and fill the drop downs with these values. Selecting a value from either of these drop downs will fill the appropriate text box with the selected value.
+- **Schedule**: Scheduling options to execute the harvester. If disabled, the harvester must be run manually from the harvester page. If enabled, a scheduling expression using cron syntax should be configured ([See examples](https://www.quartz-scheduler.org/documentation/quartz-2.1.7/tutorials/crontrigger)).

+- **Configure connection to OAI/PMH**
+    - *URL*: The URL of the OAI-PMH server from which metadata will be harvested.
+    - *Remote authentication*: If checked, credentials for basic HTTP authentication on the OAIPMH server must be provided.
+    - *Search filter*: (Optional) Define the search criteria below to restrict the records to harvest.
+        - *From*: You can provide a start date here. Any metadata whose last change date is equal to or greater than this date will be harvested. To add or edit a value for this field you need to use the icon alongside the text box. This field is optional so if you don't provide a start date the constraint is dropped. Use the icon to clear the field.
+        - *Until*: Functions in the same way as the *From* parameter but adds an end constraint to the last change date search. Any metadata whose last change date is less than or equal to this date will be harvested.
+        - *Set*: An OAI-PMH server classifies metadata into sets (like categories in GeoNetwork). You can request all metadata records that belong to a set (and any of its subsets) by specifying the name of that set here.
+        - *Prefix*: 'Prefix' means metadata format. The oai_dc prefix must be supported by all OAI-PMH compliant servers.
+
+    !!! note
+
+        The 'OAI provider sets' drop down next to the *Set* text box and the 'OAI provider prefixes' drop down next to the *Prefix* textbox are initially blank. After specifying the connection URL, you can press the **Retrieve Info** button, which will connect to the remote OAI-PMH server, retrieve all supported sets and prefixes and fill the drop downs with these values. Selecting a value from either of these drop downs will fill the appropriate text box with the selected value.
+
+- **Configure response processing for oaipmh**
+    - *Action on UUID collision*: When a harvester finds the same uuid on a record collected by another method (another harvester, importer, dashboard editor, ...), should this record be skipped (default), overridden, or assigned a new UUID?
+    - *Validate records before import*: Defines the criteria to reject metadata that is invalid according to XML structure (XSD) and validation rules (schematron).
+        - Accept all metadata without validation.
+        - Accept metadata that are XSD valid.
+        - Accept metadata that are XSD and schematron valid.
+    - *XSL transformation to apply*: (Optional) The referenced XSL transform will be applied to each metadata record before it is added to GeoNetwork.
+
+    - *Category*: (Optional) A GeoNetwork category to assign to each metadata record.
+
+- **Privileges** - Assign privileges to harvested metadata.
-- **Options** - Scheduling Options.
-- **Privileges**
-- **Categories**

 !!! Notes

-    - if you request the oai_dc output format, GeoNetwork will convert it to Dublin Core format.
-    - when you edit a previously created OAIPMH harvester instance, both the *set* and *prefix* drop down lists will be empty. You have to press the retrieve info button again to connect to the remote server and retrieve set and prefix information.
-    - the id of the remote server must be a UUID. If not, metadata can be harvested but during hierarchical propagation id clashes could corrupt harvested metadata.
+    - If you request the oai_dc output format, GeoNetwork will convert it to Dublin Core format.
+    - When you edit a previously created OAIPMH harvester instance, both the *set* and *prefix* drop down lists will be empty. You have to press the retrieve info button again to connect to the remote server and retrieve set and prefix information.
+    - The id of the remote server must be a UUID. If not, metadata can be harvested but during hierarchical propagation id clashes could corrupt harvested metadata.
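+
+For reference, the *From*, *Until*, *Set* and *Prefix* options above map onto the parameters of a standard OAI-PMH `ListRecords` request, for example (hypothetical endpoint and set name):
+
+```
+http://oai.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2023-01-01&until=2023-12-31&set=geospatial
+```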
diff --git a/docs/manual/docs/user-guide/harvesting/harvesting-ogcwxs.md b/docs/manual/docs/user-guide/harvesting/harvesting-ogcwxs.md
index 52c88c134d4..70f45cf75d6 100644
--- a/docs/manual/docs/user-guide/harvesting/harvesting-ogcwxs.md
+++ b/docs/manual/docs/user-guide/harvesting/harvesting-ogcwxs.md
@@ -11,27 +11,46 @@ An OGC service implements a GetCapabilities operation that GeoNetwork, acting as

 ## Adding an OGC Service Harvester

-Configuration options:
-
-- **Site**
-    - *Name* - The name of the catalogue and will be one of the search criteria.
-    - *Type* - The type of OGC service indicates if the harvester has to query for a specific kind of service. Supported type are WMS (1.0.0, 1.1.1, 1.3.0), WFS (1.0.0 and 1.1.0), WCS (1.0.0), WPS (0.4.0 and 1.0.0), CSW (2.0.2) and SOS (1.0.0).
-    - *Service URL* - The service URL is the URL of the service to contact (without parameters like "REQUEST=GetCapabilities", "VERSION=", \...). It has to be a valid URL like .
-    - *Metadata language* - Required field that will define the language of the metadata. It should be the language used by the OGC web service administrator.
-    - *ISO topic category* - Used to populate the topic category element in the metadata. It is recommended to choose one as the topic category is mandatory for the ISO19115/19139 standard if the hierarchical level is "datasets".
-    - *Type of import* - By default, the harvester produces one service metadata record. Check boxes in this group determine the other metadata that will be produced.
-    - *Create metadata for layer elements using GetCapabilities information*: Checking this option means that the harvester will loop over datasets served by the service as described in the GetCapabilities document.
-    - *Create metadata for layer elements using MetadataURL attributes*: Checkthis option means that the harvester will generate metadata from an XML document referenced in the MetadataUrl attribute of the dataset in the GetCapabilities document. If the document referred to by this attribute is not valid (eg. unknown schema, bad XML format), the GetCapabilities document is used as per the previous option.
-    - *Create thumbnails for WMS layers*: If harvesting from an OGC WMS, then checking this options means that thumbnails will be created during harvesting.
-    - *Target schema* - The metadata schema of the dataset metadata records that will be created by this harvester.
-    - *Icon* - The default icon displayed as attribution logo for metadata created by this harvester.
-- **Options** - Scheduling Options.
-- **Privileges**
-- **Category for service** - Metadata for the harvested service is assigned to the category selected in this option (eg. "interactive resources").
-- **Category for datasets** - Metadata for the harvested datasets is assigned to the category selected in this option (eg. "datasets").
+To create an OGC Service harvester go to `Admin console` > `Harvesting` and select `Harvest from` > `OGC Web Services`:
+
+![](img/add-ogcwebservices-harvester.png)
+
+Provide the following information:
+
+- **Identification**
+    - *Node name and logo*: A unique name for the harvester and, optionally, a logo to assign to the harvester.
+    - *Group*: Group which owns the harvested records. Only the catalog administrator or users with the profile `UserAdmin` of this group can manage the harvester.
+    - *User*: User who owns the harvested records.
+
+- **Schedule**: Scheduling options to execute the harvester. If disabled, the harvester must be run manually from the harvester page. If enabled, a scheduling expression using cron syntax should be configured ([See examples](https://www.quartz-scheduler.org/documentation/quartz-2.1.7/tutorials/crontrigger)).
+
+- **Configure connection to OGC Web Services**
+    - *Service URL*: The service URL is the URL of the service to contact (without parameters like "REQUEST=GetCapabilities", "VERSION=", \...). It has to be a valid URL like .
+    - *Service type*: The type of OGC service indicates if the harvester has to query for a specific kind of service. Supported types are WMS (1.0.0, 1.1.1, 1.3.0), WFS (1.0.0 and 1.1.0), WCS (1.0.0), WPS (0.4.0 and 1.0.0), CSW (2.0.2) and SOS (1.0.0).
+    - *Remote authentication*: If checked, credentials for basic HTTP authentication on the server must be provided.
+
+- **Configure response processing for ogcwxs**
+    - *Build service metadata record from a template*:
+        - *Category for service metadata*: (Optional) Metadata for the harvested service is assigned to the category selected in this option (eg. "interactive resources").
+        - *Create record for each layer only using GetCapabilities information*: Checking this option means that the harvester will loop over datasets served by the service as described in the GetCapabilities document.
+        - *Import record for each layer using MetadataURL attributes*: Checking this option means that the harvester will generate metadata from an XML document referenced in the MetadataUrl attribute of the dataset in the GetCapabilities document. If the document referred to by this attribute is not valid (eg. unknown schema, bad XML format), the GetCapabilities document is used as per the previous option.
+    - *Build dataset metadata records from a template*
+        - *Create thumbnail*: If checked, when harvesting from an OGC Web Map Service (WMS) that supports WGS84 projection, thumbnails for the layers metadata will be created during harvesting.
+        - *Category for datasets*: Metadata for the harvested datasets is assigned to the category selected in this option (eg. "datasets").
+
+    - *ISO category*: (Optional) Used to populate the topic category element in the metadata. It is recommended to choose one as the topic category is mandatory for the ISO19115/19139 standard if the hierarchical level is "datasets".
+    - *Metadata language*: Required field that will define the language of the metadata. It should be the language used by the OGC web service administrator.
+    - *Output schema*: The metadata schema of the dataset metadata records that will be created by this harvester. The value corresponds to an XSLT process used by the harvester to convert the GetCapabilities document into metadata records in that schema. If in doubt, use the default value `iso19139`.
+    - *Validate records before import*: Defines the criteria to reject metadata that is invalid according to XML structure (XSD) and validation rules (schematron).
+        - Accept all metadata without validation.
+        - Accept metadata that are XSD valid.
+        - Accept metadata that are XSD and schematron valid.
+    - *XSL transformation to apply*: (Optional) The referenced XSL transform will be applied to each metadata record before it is added to GeoNetwork.
+
+
+- **Privileges** - Assign privileges to harvested metadata.
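+
+As an illustration of the *Service URL* format above (hypothetical host; note the absence of request parameters):
+
+```
+https://maps.example.org/geoserver/wms
+```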
+
 !!! Notes

-    - every time the harvester runs, it will remove previously harvested records and create new records. GeoNetwork will generate the uuid for all metadata (both service and datasets). The exception to this rule is dataset metadata created using the MetadataUrl tag is in the GetCapabilities document, in that case, the uuid of the remote XML document is used instead
-    - thumbnails can only be generated when harvesting an OGC Web Map Service (WMS). The WMS should support the WGS84 projection
-    - the chosen *Target schema* must have the support XSLTs which are used by the harvester to convert the GetCapabilities statement to metadata records from that schema. If in doubt, use iso19139.
+    - Every time the harvester runs, it will remove previously harvested records and create new records. GeoNetwork will generate the uuid for all metadata (both service and datasets). The exception to this rule is dataset metadata created using the MetadataUrl tag in the GetCapabilities document; in that case, the uuid of the remote XML document is used instead.
diff --git a/docs/manual/docs/user-guide/harvesting/harvesting-sde.md b/docs/manual/docs/user-guide/harvesting/harvesting-sde.md
index 7f4f99cb913..32cdd4df780 100644
--- a/docs/manual/docs/user-guide/harvesting/harvesting-sde.md
+++ b/docs/manual/docs/user-guide/harvesting/harvesting-sde.md
@@ -1,55 +1,60 @@
 # Harvesting an ARCSDE Node {#sde_harvester}

-This is a harvesting protocol for metadata stored in an ArcSDE installation.
+This is a harvesting protocol for metadata stored in an ArcSDE installation. The harvester identifies the ESRI metadata format (ESRI ISO or ESRI FGDC) and applies the XSLTs required to transform the metadata to ISO19139.

 ## Adding an ArcSDE harvester

-The harvester identifies the ESRI metadata format: ESRI ISO, ESRI FGDC to apply the required xslts to transform metadata to ISO19139. Configuration options:
+To create an ArcSDE harvester go to `Admin console` > `Harvesting` and select `Harvest from` > `ArcSDE`:
+
+![](img/add-arcsde-harvester.png)
+
+Provide the following information:

 - **Identification**
-    - *Name* - This is a short description of the node. It will be shown in the harvesting main page.
-    - *Group* - User admin of this group and catalog administrator can manage this node.
-    - *Harvester user* - User that owns the harvested metadata.
-- **Schedule** - Schedule configuration to execute the harvester.
-- **Configuration for protocol ArcSDE**
-    - *Server* - ArcSde server IP address or name.
-    - *Port* - ArcSde service port (typically 5151) or ArcSde database port, depending on the connection type selected, see below the *Connection type* section.
-    - *Database name* - ArcSDE instance name (typically esri_sde).
-    - *ArcSde version* - ArcSde version to harvest. The data model used by ArcSde is different depending on the ArcSde version.
+    - *Node name and logo*: A unique name for the harvester and, optionally, a logo to assign to the harvester.
+    - *Group*: Group which owns the harvested records. Only the catalog administrator or users with the profile `UserAdmin` of this group can manage the harvester.
+    - *User*: User who owns the harvested records.
+
+- **Schedule**: Scheduling options to execute the harvester. If disabled, the harvester must be run manually from the harvester page. If enabled, a scheduling expression using cron syntax should be configured ([See examples](https://www.quartz-scheduler.org/documentation/quartz-2.1.7/tutorials/crontrigger)).
+
+- **Configure connection to Database**
+    - *Server*: ArcSDE server IP address or name.
+    - *Port*: ArcSDE service port (typically 5151) or ArcSDE database port, depending on the connection type selected, see below the *Connection type* section.
+    - *Database name*: ArcSDE instance name (typically esri_sde).
+    - *ArcSDE version*: ArcSDE version to harvest. The data model used by ArcSDE is different depending on the ArcSDE version.
     - *Connection type*
-        - *ArcSde service* - Uses the ArcSde service to retrieve the metadata.
+        - *ArcSDE service*: Uses the ArcSDE service to retrieve the metadata.

         !!! note

-            Additional installation steps are required to use the ArcSDE harvester because it needs proprietary ESRI Java api jars to be installed.
-
-            ArcSDE Java API libraries need to be installed by the user in GeoNetwork (folder INSTALL_DIR_GEONETWORK/WEB-INF/lib), as these are proprietary libraries not distributed with GeoNetwork.
-            The following jars are required:
-
-            - jpe_sdk.jar
-            - jsde_sdk.jar
-            - dummy-api-XXX.jar must be removed from INSTALL_DIR/web/geonetwork/WEB-INF/lib
+            Additional installation steps are required to use the ArcSDE harvester because it needs proprietary ESRI Java API jars to be installed.
+            ArcSDE Java API libraries need to be installed by the user in GeoNetwork (folder `INSTALL_DIR_GEONETWORK/WEB-INF/lib`), as these are proprietary libraries not distributed with GeoNetwork.

-        - *Database direct connection* - Uses a database connection (JDBC) to retrieve the metadata. With
+            The following jars are required:

-        !!! note
+            - jpe_sdk.jar
+            - jsde_sdk.jar

-            Database direct connection requires to copy JDBC drivers in INSTALL_DIR_GEONETWORK/WEB-INF/lib.
+            `dummy-api-XXX.jar` must be removed from `INSTALL_DIR/web/geonetwork/WEB-INF/lib`.

+        - *Database direct connection*: Uses a database connection (JDBC) to retrieve the metadata.
+
+        !!! note
+
+            Database direct connection requires copying the JDBC drivers into `INSTALL_DIR_GEONETWORK/WEB-INF/lib`.

         !!! note

             Postgres JDBC drivers are distributed with GeoNetwork, but not for Oracle or SqlServer.

-    - *Database type* - ArcSde database type: Oracle, Postgres, SqlServer. Only available if connection type is configured to *Database direct connection*.
-    - *Username* - Username to connect to ArcSDE server.
-    - *Password* - Password of the ArcSDE user.
-- **Advanced options for protocol arcsde**
-    - *Validate records before import* - Defines the criteria to reject metadata that is invalid according to XSD and schematron rules.
+    - *Database type*: ArcSDE database type (Oracle, Postgres, SqlServer). Only available if the connection type is configured to *Database direct connection*.
+    - *Remote authentication*: Credentials to connect to the ArcSDE server.
+
+- **Configure response processing for arcsde**
+    - *Validate records before import*: Defines the criteria to reject metadata that is invalid according to XML structure (XSD) and validation rules (schematron).
         - Accept all metadata without validation.
         - Accept metadata that are XSD valid.
         - Accept metadata that are XSD and schematron valid.
+
 - **Privileges** - Assign privileges to harvested metadata.
diff --git a/docs/manual/docs/user-guide/harvesting/harvesting-simpleurl.md b/docs/manual/docs/user-guide/harvesting/harvesting-simpleurl.md
index 775b4a9d1a9..e7243dc8421 100644
--- a/docs/manual/docs/user-guide/harvesting/harvesting-simpleurl.md
+++ b/docs/manual/docs/user-guide/harvesting/harvesting-simpleurl.md
@@ -4,47 +4,72 @@ This harvester connects to a remote server via a simple URL to retrieve metadata

 ## Adding a simple URL harvester

-- **Site** - Options about the remote site.
+To create a Simple URL harvester go to `Admin console` > `Harvesting` and select `Harvest from` > `Simple URL`:

-    - *Name* - This is a short description of the remote site. It will be shown in the harvesting main page as the name for this instance of the harvester.
-    - *Service URL* - The URL of the server to be harvested. This can include pagination params like `?start=0&rows=20`
-    - *loopElement* - Propery/element containing a list of the record entries. (Indicated as an absolute path from the document root.) eg. `/datasets`
-    - *numberOfRecordPath* : Property indicating the total count of record entries. (Indicated as an absolute path from the document root.) eg. `/nhits`
-    - *recordIdPath* : Property containing the record id. eg. `datasetid`
-    - *pageFromParam* : Property indicating the first record item on the current "page" eg. `start`
-    - *pageSizeParam* : Property indicating the number of records containned in the current "page" eg. `rows`
-    - *toISOConversion* : Name of the conversion schema to use, which must be available as XSL on the GN instance. eg. `OPENDATASOFT-to-ISO19115-3-2018`
+![](img/add-simpleurl-harvester.png)

-    !!! note
+Provide the following information:

-        GN looks for schemas by name in . These schemas might internally include schemas from other locations like . To indicate the `fromJsonOpenDataSoft` schema for example, from the latter location directly in the admin UI the following syntax can be used: `schema:iso19115-3.2018:convert/fromJsonOpenDataSoft`.
+- **Identification**
+    - *Node name and logo*: A unique name for the harvester and, optionally, a logo to assign to the harvester.
+    - *Group*: Group which owns the harvested records. Only the catalog administrator or users with the profile `UserAdmin` of this group can manage the harvester.
+    - *User*: User who owns the harvested records.

-    **Sample configuration for opendatasoft**
+- **Schedule**: Scheduling options to execute the harvester. If disabled, the harvester must be run manually from the harvester page. If enabled, a scheduling expression using cron syntax should be configured ([See examples](https://www.quartz-scheduler.org/documentation/quartz-2.1.7/tutorials/crontrigger)).

-    - *loopElement* - `/datasets`
-    - *numberOfRecordPath* : `/nhits`
-    - *recordIdPath* : `datasetid`
-    - *pageFromParam* : `start`
-    - *pageSizeParam* : `rows`
-    - *toISOConversion* : `OPENDATASOFT-to-ISO19115-3-2018`
+- **Configure connection to Simple URL**
+    - *URL*: The URL of the server to be harvested. This can include pagination params like `?start=0&rows=20`
+    - *Remote authentication*: If checked, credentials for basic HTTP authentication on the server must be provided.
+    - *Element to loop on*: Property/element containing a list of the record entries. (Indicated as an absolute path from the document root.) eg. `/datasets`
+    - *Element for the UUID of each record*: Property containing the record id. eg. `datasetid`
+    - *Pagination parameters*: (Optional)
+        - *Element for the number of records to collect*: Property indicating the total count of record entries. (Indicated as an absolute path from the document root.) eg. `/nhits`
+        - *From URL parameter*: Property indicating the first record item on the current "page" eg. `start`
+        - *Size URL parameter*: Property indicating the number of records contained in the current "page" eg. `rows`
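+
+As a sketch of how the *From* and *Size* URL parameters combine with the harvested *URL* (the endpoint and values here are hypothetical):
+
+```
+https://data.example.org/api/datasets/1.0/search/?start=0&rows=20
+```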
+
+- **Configure response processing for Simple URL**
-    **Sample configuration for ESRI**
+    - *XSL transformation to apply*: Name of the conversion schema to use, which must be available as XSL on the GeoNetwork instance. eg. `OPENDATASOFT-to-ISO19115-3-2018`
-
-    - *loopElement* - `/dataset`
-    - *numberOfRecordPath* : `/result/count`
-    - *recordIdPath* : `landingPage`
-    - *pageFromParam* : `start`
-    - *pageSizeParam* : `rows`
-    - *toISOConversion* : `ESRIDCAT-to-ISO19115-3-2018`
+
+        !!! note
+
+            GN looks for schemas by name in . These schemas might internally include schemas from other locations like . To indicate the `fromJsonOpenDataSoft` schema for example, from the latter location directly in the admin UI the following syntax can be used: `schema:iso19115-3.2018:convert/fromJsonOpenDataSoft`.
-    **Sample configuration for DKAN**
-
-    - *loopElement* - `/result/0`
-    - *numberOfRecordPath* : `/result/count`
-    - *recordIdPath* : `id`
-    - *pageFromParam* : `start`
-    - *pageSizeParam* : `rows`
-    - *toISOConversion* : `DKAN-to-ISO19115-3-2018`
+    - *Batch edits*: (Optional) Allows updating harvested records using XPath syntax. It can be used to add, replace or delete elements.
+    - *Category*: (Optional) A GeoNetwork category to assign to each metadata record.
+    - *Validate records before import*: Defines the criteria to reject metadata that is invalid according to XML structure (XSD) and validation rules (schematron).
+        - Accept all metadata without validation.
+        - Accept metadata that are XSD valid.
+        - Accept metadata that are XSD and schematron valid.

 - **Privileges** - Assign privileges to harvested metadata.
+
+
+## Sample configurations
+
+### Sample configuration for opendatasoft
+
+- *Element to loop on* - `/datasets`
+- *Element for the number of records to collect* : `/nhits`
+- *Element for the UUID of each record* : `datasetid`
+- *From URL parameter* : `start`
+- *Size URL parameter* : `rows`
+- *XSL transformation to apply* : `OPENDATASOFT-to-ISO19115-3-2018`
+
+### Sample configuration for ESRI
+
+- *Element to loop on* - `/dataset`
+- *Element for the number of records to collect* : `/result/count`
+- *Element for the UUID of each record* : `landingPage`
+- *From URL parameter* : `start`
+- *Size URL parameter* : `rows`
+- *XSL transformation to apply* : `ESRIDCAT-to-ISO19115-3-2018`
+
+### Sample configuration for DKAN
+
+- *Element to loop on* - `/result/0`
+- *Element for the number of records to collect* : `/result/count`
+- *Element for the UUID of each record* : `id`
+- *From URL parameter* : `start`
+- *Size URL parameter* : `rows`
+- *XSL transformation to apply* : `DKAN-to-ISO19115-3-2018`
diff --git a/docs/manual/docs/user-guide/harvesting/harvesting-thredds.md b/docs/manual/docs/user-guide/harvesting/harvesting-thredds.md
index 2c988d58e34..bb4716c7508 100644
--- a/docs/manual/docs/user-guide/harvesting/harvesting-thredds.md
+++ b/docs/manual/docs/user-guide/harvesting/harvesting-thredds.md
@@ -4,35 +4,33 @@ THREDDS catalogs describe inventories of datasets. They are organised in a hiera

 ## Adding a THREDDS Catalog Harvester

-The available options are:
-
-- **Site**
-    - *Name* - This is a short description of the THREDDS catalog. It will be shown in the harvesting main page as the name of this THREDDS harvester instance.
-    - *Catalog URL* - The remote URL of the THREDDS Catalog from which metadata will be harvested. This must be the xml version of the catalog (i.e. ending with .xml). The harvester will crawl through all datasets and services defined in this catalog creating metadata for them as specified by the options described further below.
-    - *Metadata language* - Use this option to specify the language of the metadata to be harvested.
-    - *ISO topic category* - Use this option to specify the ISO topic category of service metadata.
-    - *Create ISO19119 metadata for all services in catalog* - Select this option to generate iso19119 metadata for services defined in the THREDDS catalog (eg. OpenDAP, OGC WCS, ftp) and for the THREDDS catalog itself.
-    - *Create metadata for Collection datasets* - Select this option to generate metadata for each collection dataset (THREDDS dataset containing other datasets). Creation of metadata can be customised using options that are displayed when this option is selected as described further below.
-    - *Create metadata for Atomic datasets* - Select this option to generate metadata for each atomic dataset (THREDDS dataset not containing other datasets -- for example cataloguing a netCDF dataset). Creation of metadata can be customised using options that are displayed when this option is selected as described further below.
-    - *Ignore harvesting attribute* - Select this option to harvest metadata for selected datasets regardless of the harvest attribute for the dataset in the THREDDS catalog. If this option is not selected, metadata will only be created for datasets that have a harvest attribute set to true.
-    - *Extract DIF metadata elements and create ISO metadata* - Select this option to generate ISO metadata for datasets in the THREDDS catalog that have DIF metadata elements. When this option is selected a list of schemas is shown that have a DIFToISO.xsl stylesheet available (see for example `GEONETWORK_DATA_DIR/config/schema_plugins/iso19139/convert/DIFToISO.xsl`). Metadata is generated by reading the DIF metadata items in the THREDDS into a DIF format metadata record and then converting that DIF record to ISO using the DIFToISO stylesheet.
-    - *Extract Unidata dataset discovery metadata using fragments* - Select this option when the metadata in your THREDDS or netCDF/ncml datasets follows Unidata dataset discovery conventions (see ). You will need to write your own stylesheets to extract this metadata as fragments and define a template to combine with the fragments. When this option is selected the following additional options will be shown:
-        - *Select schema for output metadata records* - choose the ISO metadata schema or profile for the harvested metadata records. Note: only the schemas that have THREDDS fragment stylesheets will be displayed in the list (see the next option for the location of these stylesheets).
-        - *Stylesheet to create metadata fragments* - Select a stylesheet to use to convert metadata for the dataset (THREDDS metadata and netCDF ncml where applicable) into metadata fragments. These stylesheets can be found in the directory convert/ThreddsToFragments in the schema directory eg. for iso19139 this would be `GEONETWORK_DATA_DIR/config/schema_plugins/iso19139/convert/ThreddsToFragments`.
-        - *Create subtemplates for fragments and XLink them into template* - Select this option to create a subtemplate (=metadata fragment stored in GeoNetwork catalog) for each metadata fragment generated.
-        - *Template to combine with fragments* - Select a template that will be filled in with the metadata fragments generated for each dataset. The generated metadata fragments are used to replace referenced elements in the templates with an xlink to a subtemplate if the *Create subtemplates* option is checked. If *Create subtemplates* is not checked, then the fragments are simply copied into the template metadata record.
-        - For Atomic Datasets , one additional option is provided *Harvest new or modified datasets only*. If this option is checked only datasets that have been modified or didn't exist when the harvester was last run will be harvested.
-        - *Create Thumbnails* - Select this option to create thumbnails for WMS layers in referenced WMS services
-    - *Icon* - An icon to assign to harvested metadata. The icon will be used when showing search results.
-- **Options** - Scheduling Options.
-- **Privileges**
-- **Category for Service** - Select the category to assign to the ISO19119 service records for the THREDDS services.
-- **Category for Datasets** - Select the category to assign the generated metadata records (and any subtemplates) to.
-
-At the bottom of the page there are the following buttons:
-
-- **Back** - Go back to the main harvesting page. The harvesting definition is not added.
-- **Save** - Saves this harvester definition creating a new harvesting instance. After the save operation has completed, the main harvesting page will be displayed.
+To create a THREDDS Catalog harvester go to `Admin console` > `Harvesting` and select `Harvest from` > `Thredds Catalog`:
+
+![](img/add-threddscatalog-harvester.png)
+
+Provide the following information:
+
+- **Identification**
+    - *Node name and logo*: A unique name for the harvester and, optionally, a logo to assign to the harvester.
+    - *Group*: Group which owns the harvested records. Only the catalog administrator or users with the profile `UserAdmin` of this group can manage the harvester.
+    - *User*: User who owns the harvested records.
+
+- **Schedule**: Scheduling options to execute the harvester. If disabled, the harvester must be run manually from the harvester page. If enabled, a scheduling expression using cron syntax should be configured ([See examples](https://www.quartz-scheduler.org/documentation/quartz-2.1.7/tutorials/crontrigger)).
+
+- **Configure connection to Thredds catalog**
+    - *Service URL*: The remote URL of the THREDDS Catalog from which metadata will be harvested. This must be the xml version of the catalog (i.e. ending with .xml). The harvester will crawl through all datasets and services defined in this catalog creating metadata for them as specified by the options described further below.
+
+- **Configure response processing for thredds**
+    - *Language*: Use this option to specify the language of the metadata to be harvested.
+    - *ISO19115 Topic category for output metadata records*: Use this option to specify the ISO topic category of service metadata.
+    - *Create ISO19119 metadata for all services in the thredds catalog*: Select this option to generate iso19119 metadata for services defined in the THREDDS catalog (eg. OpenDAP, OGC WCS, ftp) and for the THREDDS catalog itself.
+    - *Select schema for output metadata records*: The metadata standard to create the metadata. It should be a valid metadata schema installed in GeoNetwork, by default `iso19139`.
+    - *Dataset title*: (Optional) Title for the dataset. Default is catalog url.
+    - *Dataset abstract*: (Optional) Abstract for the dataset. Default is 'Thredds Dataset'.
+    - *Geonetwork category to assign to service metadata records*: Select the category to assign to the ISO19119 service records for the THREDDS services.
+    - *Geonetwork category to assign to dataset metadata records*: Select the category to assign the generated metadata records (and any subtemplates) to.
+
+- **Privileges** - Assign privileges to harvested metadata.
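+
+As an illustration of the *Service URL* requirement above (the host and path are hypothetical; note the `.xml` ending):
+
+```
+https://thredds.example.org/thredds/catalog/catalog.xml
+```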

 ## More about harvesting THREDDS DIF metadata elements with the THREDDS Harvester
diff --git a/docs/manual/docs/user-guide/harvesting/harvesting-webdav.md b/docs/manual/docs/user-guide/harvesting/harvesting-webdav.md
index 4313483f627..cdd6b12434a 100644
--- a/docs/manual/docs/user-guide/harvesting/harvesting-webdav.md
+++ b/docs/manual/docs/user-guide/harvesting/harvesting-webdav.md
@@ -4,19 +4,35 @@ This harvesting type uses the WebDAV (Distributed Authoring and Versioning) prot

 ## Adding a WebDAV harvester

-- **Site** - Options about the remote site.
-    - *Subtype* - Select WebDAV or WAF according to the type of server being harvested.
-    - *Name* - This is a short description of the remote site. It will be shown in the harvesting main page as the name for this instance of the WebDAV harvester.
-    - *URL* - The remote URL from which metadata will be harvested. Each file found that ends with .xml is assumed to be a metadata record.
-    - *Icon* - An icon to assign to harvested metadata. The icon will be used when showing search results.
-    - *Use account* - Account credentials for basic HTTP authentication on the WebDAV/WAF server.
-- **Options** - Scheduling options.
-- **Options** - Specific harvesting options for this harvester.
-    - *Validate* - If checked, the metadata will be validated after retrieval. If the validation does not pass, the metadata will be skipped.
-    - *Recurse* - When the harvesting engine will find folders, it will recursively descend into them.
-- **Privileges** - Assign privileges to harvested metadata.
-- **Categories**
+To create a WebDAV harvester go to `Admin console` > `Harvesting` and select `Harvest from` > `WebDAV / WAF`:
+
+![](img/add-webdav-harvester.png)
+
+Provide the following information:

-!!! Notes
+- **Identification**
+    - *Node name and logo*: A unique name for the harvester and, optionally, a logo to assign to the harvester.
+    - *Group*: Group which owns the harvested records. Only the catalog administrator or users with the profile `UserAdmin` of this group can manage the harvester.
+    - *User*: User who owns the harvested records.

-    - The same metadata could be harvested several times by different instances of the WebDAV harvester. This is not good practise because copies of the same metadata record will have a different UUID.
+- **Schedule**: Scheduling options to execute the harvester. If disabled, the harvester must be run manually from the harvester page. If enabled, a scheduling expression using cron syntax should be configured ([See examples](https://www.quartz-scheduler.org/documentation/quartz-2.1.7/tutorials/crontrigger)).
+
+- **Configure connection to WebDAV / WAF**
+    - *URL*: The remote URL from which metadata will be harvested. Each file found that has the extension `.xml` is assumed to be a metadata record.
+    - *Type of protocol*: Select WebDAV or WAF according to the type of server being harvested.
+    - *Remote authentication*: If checked, credentials for basic HTTP authentication on the WebDAV/WAF server must be provided.
+    - *Also search in subfolders*: When the harvesting engine finds folders, it will recursively descend into them.
+
+- **Configure response processing for webdav**
+    - *Action on UUID collision*: When a harvester finds the same uuid on a record collected by another method (another harvester, importer, dashboard editor, ...), should this record be skipped (default), overridden, or assigned a new UUID?
+    - *XSL filter name to apply*: (Optional) The XSL filter is applied to each metadata record. The filter is a process which depends on the schema (see the `process` folder of the schemas).
+
+        It can include parameters, which will be sent to the XSL transformation using the following syntax: `anonymizer?protocol=MYLOCALNETWORK:FILEPATH&email=gis@organisation.org&thesaurus=MYORGONLYTHEASURUS`
+
+    - *Validate records before import*: Defines the criteria to reject metadata that is invalid according to XML structure (XSD) and validation rules (schematron).
+        - Accept all metadata without validation.
+        - Accept metadata that are XSD valid.
+        - Accept metadata that are XSD and schematron valid.
diff --git a/docs/manual/docs/user-guide/harvesting/harvesting-wfs-features.md b/docs/manual/docs/user-guide/harvesting/harvesting-wfs-features.md
index 16abfa13bb7..c198e5f5966 100644
--- a/docs/manual/docs/user-guide/harvesting/harvesting-wfs-features.md
+++ b/docs/manual/docs/user-guide/harvesting/harvesting-wfs-features.md
@@ -2,26 +2,43 @@
 Metadata can be present in the tables of relational databases, which are commonly used by many organisations. Putting an OGC Web Feature Service (WFS) over a relational database will allow metadata to be extracted via standard query mechanisms. This harvesting type allows the user to specify a GetFeature query and map information from the features to fragments of metadata that can be linked or copied into a template to create metadata records.
 
+An OGC web feature service (WFS) implements a GetFeature query operation that returns data in the form of features (usually rows from related tables in a relational database). GeoNetwork, acting as a client, can read the GetFeature response and apply a user-supplied XSLT stylesheet to produce metadata fragments that can be linked or copied into a user-supplied template to build metadata records.
+
 ## Adding an OGC WFS GetFeature Harvester
 
-An OGC web feature service (WFS) implements a GetFeature query operation that returns data in the form of features (usually rows from related tables in a relational database). GeoNetwork, acting as a client, can read the GetFeature response and apply a user-supplied XSLT stylesheet to produce metadata fragments that can be linked or copied into a user-supplied template to build metadata records.
+To create an OGC WFS GetFeature harvester go to `Admin console` > `Harvesting` and select `Harvest from` > `OGC WFS GetFeature`:
+
+![](img/add-wfsgetfeature-harvester.png)
 
-The available options are:
+Providing the following information:
 
-- **Site**
-    - *Name* - This is a short description of the harvester. It will be shown in the harvesting main page as the name for this WFS GetFeature harvester.
-    - *Service URL* - The bare URL of the WFS service (no OGC params required)
-    - *Metadata language* - The language that will be used in the metadata records created by the harvester
+- **Identification**
+    - *Node name and logo*: A unique name for the harvester and, optionally, a logo to assign to the harvester.
+    - *Group*: Group which owns the harvested records. Only the catalog administrator or users with the profile `UserAdmin` of this group can manage the harvester.
+    - *User*: User who owns the harvested records.
+
+- **Schedule**: Scheduling options to execute the harvester. If disabled, the harvester must be run manually from the harvester page. If enabled, a scheduling expression using cron syntax should be configured ([See examples](https://www.quartz-scheduler.org/documentation/quartz-2.1.7/tutorials/crontrigger)).
+
+- **Configure connection to OGC WFS GetFeature**
+    - *Service URL*: The bare URL of the WFS service (no OGC params required).
+    - *Remote authentication*: If checked, credentials for basic HTTP authentication on the WFS server must be provided.
     - *OGC WFS GetFeature Query* - The OGC WFS GetFeature query used to extract features from the WFS.
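+
+      For example, a minimal GetFeature query might look like the following (the feature type, namespace, and filter values are hypothetical and must match your WFS):
+
+      ```xml
+      <wfs:GetFeature service="WFS" version="1.1.0"
+          xmlns:wfs="http://www.opengis.net/wfs"
+          xmlns:ogc="http://www.opengis.net/ogc">
+        <wfs:Query typeName="myorg:metadata_table">
+          <ogc:Filter>
+            <ogc:PropertyIsEqualTo>
+              <ogc:PropertyName>myorg:status</ogc:PropertyName>
+              <ogc:Literal>published</ogc:Literal>
+            </ogc:PropertyIsEqualTo>
+          </ogc:Filter>
+        </wfs:Query>
+      </wfs:GetFeature>
+      ```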
-    - *Schema for output metadata records* - choose the metadata schema or profile for the harvested metadata records. Note: only the schemas that have WFS fragment stylesheets will be displayed in the list (see the next option for the location of these stylesheets).
-    - *Stylesheet to create fragments* - User-supplied stylesheet that transforms the GetFeature response to a metadata fragments document (see below for the format of that document). Stylesheets exist in the WFSToFragments directory which is in the convert directory of the selected output schema. eg. for the iso19139 schema, this directory is `GEONETWORK_DATA_DIR/config/schema_plugins/iso19139/convert/WFSToFragments`.
-    - *Save large response to disk* - Check this box if you expect the WFS GetFeature response to be large (eg. greater than 10MB). If checked, the GetFeature response will be saved to disk in a temporary file. Each feature will then be extracted from the temporary file and used to create the fragments and metadata records. If not checked, the response will be held in RAM.
-    - *Create subtemplates* - Check this box if you want the harvested metadata fragments to be saved as subtemplates in the metadata catalog and xlink'd into the metadata template (see next option). If not checked, the fragments will be copied into the metadata template.
-    - *Template to use to build metadata using fragments* - Choose the metadata template that will be combined with the harvested metadata fragments to create metadata records. This is a standard GeoNetwork metadata template record.
-    - *Category for records built with linked fragments* - Choose the metadata template that will be combined with the harvested metadata fragments to create metadata records. This is a standard GeoNetwork metadata template record.
-- **Options**
-- **Privileges**
-- **Category for subtemplates** - When fragments are saved to GeoNetwork as subtemplates they will be assigned to the category selected here.
+
+- **Configure response processing for wfsfeatures**
+    - *Language*: The language that will be used in the metadata records created by the harvester.
+    - *Metadata standard*: The metadata standard to create the metadata. It should be a valid metadata schema installed in GeoNetwork, by default `iso19139`.
+    - *Save large response to disk*: Check this box if you expect the WFS GetFeature response to be large (eg. greater than 10MB). If checked, the GetFeature response will be saved to disk in a temporary file. Each feature will then be extracted from the temporary file and used to create the fragments and metadata records. If not checked, the response will be held in RAM.
+    - *Stylesheet to create fragments*: User-supplied stylesheet that transforms the GetFeature response to a metadata fragments document (see below for the format of that document). Stylesheets exist in the WFSToFragments directory which is in the convert directory of the selected output schema. eg. for the iso19139 schema, this directory is `GEONETWORK_DATA_DIR/config/schema_plugins/iso19139/convert/WFSToFragments`.
+    - *Create subtemplates*: Check this box if you want the harvested metadata fragments to be saved as subtemplates in the metadata catalog and xlink'd into the metadata template (see next option). If not checked, the fragments will be copied into the metadata template.
+    - *Select template to combine with fragments*: Choose the metadata template that will be combined with the harvested metadata fragments to create metadata records. This is a standard GeoNetwork metadata template record.
+    - *Category for directory entries*: (Optional) When fragments are saved to GeoNetwork as subtemplates they will be assigned to the category selected here.
+    - *Validate records before import*: Defines the criteria to reject metadata that is invalid according to XML structure (XSD) and validation rules (schematron).
+        - Accept all metadata without validation.
+        - Accept metadata that are XSD valid.
+        - Accept metadata that are XSD and schematron valid.
+
+- **Privileges** - Assign privileges to harvested metadata.
+
 ## More about turning the GetFeature Response into metadata fragments
diff --git a/docs/manual/docs/user-guide/harvesting/harvesting-z3950.md b/docs/manual/docs/user-guide/harvesting/harvesting-z3950.md
deleted file mode 100644
index 47722c37464..00000000000
--- a/docs/manual/docs/user-guide/harvesting/harvesting-z3950.md
+++ /dev/null
@@ -1,90 +0,0 @@
-# Z3950 Harvesting {#z3950_harvester}
-
-Z3950 is a remote search and harvesting protocol that is commonly used to permit search and harvest of metadata. Although the protocol is often used for library catalogs, significant geospatial metadata catalogs can also be searched using Z3950 (eg. the metadata collections of the Australian Government agencies that participate in the Australian Spatial Data Directory - ASDD). This harvester allows the user to specify a Z3950 query and retrieve metadata records from one or more Z3950 servers.
-
-## Adding a Z3950 Harvester
-
-The available options are:
-
-- **Site**
-    - *Name* - A short description of this Z3950 harvester. It will be shown in the harvesting main page using this name.
-    - *Z3950 Server(s)* - These are the Z3950 servers that will be searched. You can select one or more of these servers.
-    - *Z3950 Query* - Specify the Z3950 query to use when searching the selected Z3950 servers. At present this field is known to support the Prefix Query Format (also known as Prefix Query Notation) which is described at this URL: . See below for more information and some simple examples.
-    - *Icon* - An icon to assign to harvested metadata. The icon will be used when showing search results.
-- **Options** - Scheduling options.
-- **Harvested Content**
-    - *Apply this XSLT to harvested records* - Choose an XSLT here that will convert harvested records to a different format.
-    - *Validate* - If checked, records that do not/cannot be validated will be rejected.
-- **Privileges**
-- **Categories**
-
-!!! note
-
-    this harvester automatically creates a new Category named after each of the Z3950 servers that return records. Records that are returned by a server are assigned to the category named after that server.
-
-
-## More about PQF Z3950 Queries
-
-PQF is a rather arcane query language. It is based around the idea of attributes and attribute sets. The most common attribute set used for geospatial metadata in Z3950 servers is the GEO attribute set (which is an extension of the BIB-1 and GILS attribute sets - see ). So all PQF queries to geospatial metadata Z3950 servers should start off with @attrset geo.
-
-The most useful attribute types in the GEO attribute set are as follows:
-
-| @attr number | Meaning    | Description                                      |
-|---------------|------------|--------------------------------------------------|
-| 1             | Use        | What field to search                             |
-| 2             | Relation   | How to compare the term specified                |
-| 4             | Structure  | What type is the term? eg. date, numeric, phrase |
-| 5             | Truncation | How to truncate eg. right                        |
-
-In GeoNetwork the numeric values that can be specified for `@attr 1` map to the lucene index field names as follows:
-
-| @attr 1=             | Lucene index field            | ISO19139 element                                                                                           |
-|----------------------|-------------------------------|------------------------------------------------------------------------------------------------------------|
-| 1016                 | any                           | All text from all metadata elements                                                                        |
-| 4                    | title, altTitle               | gmd:identificationInfo//gmd:citation//gmd:title/gco:CharacterString                                        |
-| 62                   | abstract                      | gmd:identificationInfo//gmd:abstract/gco:CharacterString                                                   |
-| 1012                 | _changeDate                   | Not a metadata element (maintained by GeoNetwork)                                                          |
-| 30                   | createDate                    | gmd:MD_Metadata/gmd:dateStamp/gco:Date                                                                     |
-| 31                   | publicationDate               | gmd:identificationInfo//gmd:citation//gmd:date/gmd:='publication'                                          |
-| 2072                 | tempExtentBegin               | gmd:identificationInfo//gmd:extent//gmd:temporalElement//gml:begin(Position)                               |
-| 2073                 | tempExtentEnd                 | gmd:identificationInfo//gmd:extent//gmd:temporalElement//gml:end(Position)                                 |
-| 2012                 | fileId                        | gmd:MD_Metadata/gmd:fileIdentifier/*                                                                       |
-| 12                   | identifier                    | gmd:identificationInfo//gmd:citation//gmd:identifier//gmd:code/*                                           |
-| 21,29,2002,3121,3122 | keyword                       | gmd:identificationInfo//gmd:keyword/*                                                                      |
-| 2060                 | northBL,eastBL,southBL,westBL | gmd:identificationInfo//gmd:extent//gmd:EX_GeographicBoundingBox/gmd:westBoundLongitude*/gco:Decimal (etc) |
-
-Note that this is not a complete set of the mappings between Z3950 GEO attribute set and the GeoNetwork lucene index field names for ISO19139. Check out INSTALL_DIR/web/geonetwork/xml/search/z3950Server.xsl and INSTALL_DIR/web/geonetwork/xml/schemas/iso19139/index-fields.xsl for more details and annexe A of the GEO attribute set for Z3950 at for more details.
-
-Common values for the relation attribute (`@attr=2`):
-
-| @attr 2= | Description              |
-|-----------|--------------------------|
-| 1         | Less than                |
-| 2         | Less than or equal to    |
-| 3         | Equals                   |
-| 4         | Greater than or equal to |
-| 5         | Greater than             |
-| 6         | Not equal to             |
-| 7         | Overlaps                 |
-| 8         | Fully enclosed within    |
-| 9         | Encloses                 |
-| 10        | Fully outside of         |
-
-So a simple query to get all metadata records that have the word 'the' in any field would be:
-
-`@attrset geo @attr 1=1016 the`
-
-- `@attr 1=1016` means that we are doing a search on any field in the metadata record
-
-A more sophisticated search on a bounding box might be formulated as:
-
-`@attrset geo @attr 1=2060 @attr 4=201 @attr 2=7 "-36.8262 142.6465 -44.3848 151.2598`
-
-- `@attr 1=2060` means that we are doing a bounding box search
-- `@attr 4=201` means that the query contains coordinate strings
-- `@attr 2=7` means that we are searching for records whose bounding box overlaps the query box specified at the end of the query
-
-!!! Notes
-
-    - Z3950 servers must be configured for GeoNetwork in `INSTALL_DIR/web/geonetwork/WEB-INF/classes/JZKitConfig.xml.tem`
-    - every time the harvester runs, it will remove previously harvested records and create new ones.
diff --git a/docs/manual/docs/user-guide/harvesting/img/add-arcsde-harvester.png b/docs/manual/docs/user-guide/harvesting/img/add-arcsde-harvester.png
new file mode 100644
index 00000000000..258c163bfda
Binary files /dev/null and b/docs/manual/docs/user-guide/harvesting/img/add-arcsde-harvester.png differ
diff --git a/docs/manual/docs/user-guide/harvesting/img/add-csw-harvester.png b/docs/manual/docs/user-guide/harvesting/img/add-csw-harvester.png
new file mode 100644
index 00000000000..e6e484359b9
Binary files /dev/null and b/docs/manual/docs/user-guide/harvesting/img/add-csw-harvester.png differ
diff --git a/docs/manual/docs/user-guide/harvesting/img/add-filesystem-harvester.png b/docs/manual/docs/user-guide/harvesting/img/add-filesystem-harvester.png
new file mode 100644
index 00000000000..0e0f0d66bfd
Binary files /dev/null and b/docs/manual/docs/user-guide/harvesting/img/add-filesystem-harvester.png differ
diff --git a/docs/manual/docs/user-guide/harvesting/img/add-geonetwork-3-harvester.png b/docs/manual/docs/user-guide/harvesting/img/add-geonetwork-3-harvester.png
new file mode 100644
index 00000000000..002459bae7d
Binary files /dev/null and b/docs/manual/docs/user-guide/harvesting/img/add-geonetwork-3-harvester.png differ
diff --git a/docs/manual/docs/user-guide/harvesting/img/add-geoportalrest-harvester.png b/docs/manual/docs/user-guide/harvesting/img/add-geoportalrest-harvester.png
new file mode 100644
index 00000000000..31d60f997e7
Binary files /dev/null and b/docs/manual/docs/user-guide/harvesting/img/add-geoportalrest-harvester.png differ
diff --git a/docs/manual/docs/user-guide/harvesting/img/add-harvester.png b/docs/manual/docs/user-guide/harvesting/img/add-harvester.png
new file mode 100644
index 00000000000..5d50e1dce3e
Binary files /dev/null and b/docs/manual/docs/user-guide/harvesting/img/add-harvester.png differ
diff --git a/docs/manual/docs/user-guide/harvesting/img/add-oaipmh-harvester.png b/docs/manual/docs/user-guide/harvesting/img/add-oaipmh-harvester.png
new file mode 100644
index 00000000000..a6ad14e6a54
Binary files /dev/null and b/docs/manual/docs/user-guide/harvesting/img/add-oaipmh-harvester.png differ
diff --git a/docs/manual/docs/user-guide/harvesting/img/add-ogcwebservices-harvester.png b/docs/manual/docs/user-guide/harvesting/img/add-ogcwebservices-harvester.png
new file mode 100644
index 00000000000..2734781c718
Binary files /dev/null and b/docs/manual/docs/user-guide/harvesting/img/add-ogcwebservices-harvester.png differ
diff --git a/docs/manual/docs/user-guide/harvesting/img/add-simpleurl-harvester.png b/docs/manual/docs/user-guide/harvesting/img/add-simpleurl-harvester.png
new file mode 100644
index 00000000000..6f7af0255a9
Binary files /dev/null and b/docs/manual/docs/user-guide/harvesting/img/add-simpleurl-harvester.png differ
diff --git a/docs/manual/docs/user-guide/harvesting/img/add-threddscatalog-harvester.png b/docs/manual/docs/user-guide/harvesting/img/add-threddscatalog-harvester.png
new file mode 100644
index 00000000000..a326a4b7c79
Binary files /dev/null and b/docs/manual/docs/user-guide/harvesting/img/add-threddscatalog-harvester.png differ
diff --git a/docs/manual/docs/user-guide/harvesting/img/add-webdav-harvester.png b/docs/manual/docs/user-guide/harvesting/img/add-webdav-harvester.png
new file mode 100644
index 00000000000..4b36e089b8d
Binary files /dev/null and b/docs/manual/docs/user-guide/harvesting/img/add-webdav-harvester.png differ
diff --git a/docs/manual/docs/user-guide/harvesting/img/add-wfsgetfeature-harvester.png b/docs/manual/docs/user-guide/harvesting/img/add-wfsgetfeature-harvester.png
new file mode 100644
index 00000000000..bd3646bc0cf
Binary files /dev/null and b/docs/manual/docs/user-guide/harvesting/img/add-wfsgetfeature-harvester.png differ
diff --git a/docs/manual/docs/user-guide/harvesting/img/harvester-history.png b/docs/manual/docs/user-guide/harvesting/img/harvester-history.png
new file mode 100644
index 00000000000..f9064c1a8f3
Binary files /dev/null and b/docs/manual/docs/user-guide/harvesting/img/harvester-history.png differ
diff --git a/docs/manual/docs/user-guide/harvesting/img/harvester-statistics.png b/docs/manual/docs/user-guide/harvesting/img/harvester-statistics.png
new file mode 100644
index 00000000000..b311bb2ec8e
Binary files /dev/null and b/docs/manual/docs/user-guide/harvesting/img/harvester-statistics.png differ
diff --git a/docs/manual/docs/user-guide/harvesting/img/harvesters.png b/docs/manual/docs/user-guide/harvesting/img/harvesters.png
new file mode 100644
index 00000000000..bd008fdef7c
Binary files /dev/null and b/docs/manual/docs/user-guide/harvesting/img/harvesters.png differ
diff --git a/docs/manual/docs/user-guide/harvesting/index.md b/docs/manual/docs/user-guide/harvesting/index.md
index 46f52f782c5..abea85ff38c 100644
--- a/docs/manual/docs/user-guide/harvesting/index.md
+++ b/docs/manual/docs/user-guide/harvesting/index.md
@@ -6,7 +6,8 @@ Harvesting is the process of ingesting metadata from remote sources and storing
 
 The following sources can be harvested:
 
-- [GeoNetwork 2.0 Harvester](harvesting-geonetwork.md)
+- [GeoNetwork 2.1-3.X Harvester](harvesting-geonetwork.md)
+- [GeoNetwork 2.0 Harvester](harvesting-geonetwork-2.md)
 - [Harvesting CSW services](harvesting-csw.md)
 - [Harvesting OGC Services](harvesting-ogcwxs.md)
 - [Simple URL harvesting (opendata)](harvesting-simpleurl.md)
@@ -17,7 +18,6 @@ The following sources can be harvested:
 - [GeoPortal REST Harvesting](harvesting-geoportal.md)
 - [THREDDS Harvesting](harvesting-thredds.md)
 - [WFS GetFeature Harvesting](harvesting-wfs-features.md)
-- [Z3950 Harvesting](harvesting-z3950.md)
 
 ## Mechanism overview
 
@@ -134,79 +134,45 @@
 The script will add the certificate to the JVM keystore, if you run it as follows:
 
     $ ./ssl_key_import.sh https_server_name 443
 
-## The main page
+## Harvesting page
 
-To access the harvesting main page you have to be logged in as an administrator. From the administration page, select the harvest shortcut. The harvesting main page will then be displayed.
+To access the harvesting main page you have to be logged in with the profile `Administrator` or `UserAdmin`. From the `Admin console` menu, select the option `Harvesting`.
 
-The page shows a list of the currently defined harvesters and a set of buttons for management functions. The meaning of each column in the list of harvesters is as follows:
+The page shows a list of the currently defined harvesters with information about their status:
 
-1. *Select* Check box to select one or more harvesters. The selected harvesters will be affected by the first row of buttons (activate, deactivate, run, remove). For example, if you select three harvesters and press the Remove button, they will all be removed.
-2. *Name* This is the harvester name provided by the administrator.
-3. *Type* The harvester type (eg. GeoNetwork, WebDAV etc\...).
-4. *Status* An icon showing current status. See [Harvesting Status and Error Icons](index.md#admin_harvesting_status) for the different icons and status descriptions.
-5. *Errors* An icon showing the result of the last harvesting run, which could have succeeded or not. See [Harvesting Status and Error Icons](index.md#admin_harvesting_status) for the different icons and error descriptions. Hovering the cursor over the icon will show detailed information about the last harvesting run.
-6. *Run at* and *Every*: Scheduling of harvester runs. Essentially the time of the day + how many hours between repeats and on which days the harvester will run.
-7. *Last run* The date, in ISO 8601 format, of the most recent harvesting run.
-8. *Operation* A list of buttons/links to operations on a harvester.
-    - Selecting *Edit* will allow you to change the parameters for a harvester.
-    - Selecting *Clone* will allow you to create a clone of this harvester and start editing the details of the clone.
-    - Selecting *History* will allow you to view/change the harvesting history for a harvester - see [Harvest History](index.md#harvest_history).
+![](img/harvesters.png)
 
-At the bottom of the list of harvesters are two rows of buttons. The first row contains buttons that can operate on a selected set of harvesters. You can select the harvesters you want to operate on using the check box in the Select column and then press one of these buttons. When the button finishes its action, the check boxes are cleared. Here is the meaning of each button:
+The following information is shown for each harvester:
 
-1. *Activate* When a new harvester is created, the status is *inactive*. Use this button to make it *active* and start the harvester(s) according to the schedule it has/they have been configured to use.
-2. *Deactivate* Stops the harvester(s). Note: this does not mean that currently running harvest(s) will be stopped. Instead, it means that the harvester(s) will not be scheduled to run again.
-3. *Run* Start the selected harvesters immediately. This is useful for testing harvester setups.
-4. *Remove* Remove all currently selected harvesters. A dialogue will ask the user to confirm the action.
+- **Last run**: Date on which the harvester was last run.
+- **Total**: Total number of metadata records found remotely. Metadata with the same id are considered as one.
+- **Updated**: Number of metadata records that are present locally but needed to be updated because their last modification date was different from the remote one.
+- **Unchanged**: Number of local metadata records that have not been modified: their remote last modification date has not changed.
 
-The second row contains general purpose buttons. Here is the meaning of each button:
+At the bottom of the harvester list there are the following buttons:
 
-1. *Back* Simply returns to the main administration page.
-2. *Add* This button creates a new harvester.
-3. *Refresh* Refreshes the current list of harvesters from the server. This can be useful to see if the harvesting list has been altered by someone else or to get the status of any running harvesters.
-4. *History* Show the harvesting history of all harvesters. See [Harvest History](index.md#harvest_history) for more details.
+1. *Harvest from*: Allows you to select the type of harvester to create.
+2. *Clone*: Creates a new harvester, using the information of an existing harvester.
+3. *Refresh*: Refreshes the list of harvesters.
 
-## Harvesting Status and Error Icons {#admin_harvesting_status}
+### Adding new harvesters
 
-## Harvesting result tips
+To add a new harvester, click on the `Harvest from` button. A drop-down list with all available harvesting protocols will appear.
-When a harvester runs and completes, a tool tip showing detailed information about the harvesting process is shown in the **Errors** column for the harvester. If the harvester succeeded then hovering the cursor over the tool tip will show a table, with some rows labelled as follows: +![](img/add-harvester.png) -- **Total** - This is the total number of metadata found remotely. Metadata with the same id are considered as one. -- **Added** - Number of metadata added to the system because they were not present locally. -- **Removed** - Number of metadata that have been removed locally because they are not present in the remote server anymore. -- **Updated** - Number of metadata that are present locally but that needed to be updated because their last change date was different from the remote one. -- **Unchanged** - Local metadata left unchanged. Their remote last change date did not change. -- **Unknown schema** - Number of skipped metadata because their format was not recognised by GeoNetwork. -- **Unretrievable** - Number of metadata that were ready to be retrieved from the remote server but for some reason there was an exception during the data transfer process. -- **Bad Format** - Number of skipped metadata because they did not have a valid XML representation. -- **Does not validate** - Number of metadata which did not validate against their schema. These metadata were harvested with success but skipped due to the validation process. Usually, there is an option to force validation: if you want to harvest these metadata anyway, simply turn/leave it off. -- **Thumbnails/Thumbnails failed** - Number of metadata thumbnail images added/that could not be added due to some failure. -- **Metadata URL attribute used** - Number of layers/featuretypes/coverages that had a metadata URL that could be used to link to a metadata record (OGC Service Harvester only). -- **Services added** - Number of ISO19119 service records created and added to the catalogue (for THREDDS catalog harvesting only). -- **Collections added** - Number of collection dataset records added to the catalogue (for THREDDS catalog harvesting only). -- **Atomics added** - Number of atomic dataset records added to the catalogue (for THREDDS catalog harvesting only). -- **Subtemplates added** - Number of subtemplates (= fragment visible in the catalog) added to the metadata catalog. -- **Subtemplates removed** - Number of subtemplates (= fragment visible in the catalog) removed from the metadata catalog. -- **Fragments w/Unknown schema** - Number of fragments which have an unknown metadata schema. -- **Fragments returned** - Number of fragments returned by the harvester. -- **Fragments matched** - Number of fragments that had identifiers that in the template used by the harvester. -- **Existing datasets** - Number of metadata records for datasets that existed when the THREDDS harvester was run. -- **Records built** - Number of records built by the harvester from the template and fragments. -- **Could not insert** - Number of records that the harvester could not insert into the catalog (usually because the record was already present eg. in the Z3950 harvester this can occur if the same record is harvested from different servers). +You can choose the type of harvesting you want to do. Supported harvesters and details on what to do next can be found in the following sections. -## Adding new harvesters +### Harvester History {#harvest_history} -The Add button in the main page allows you to add new harvesters. 
-The Add button in the main page allows you to add new harvesters. A drop down list is then shown with all the available harvester protocols.
-
-You can choose the type of harvest you intend to perform and press *Add* to begin the process of adding the harvester. The supported harvesters and details of what to do next are in the following sections:
+![](img/harvester-history.png)
 
-## Harvest History {#harvest_history}
+Once the harvester history is displayed, it is possible to download the log file of the harvester run and delete the harvester history.
 
-Each time a harvester is run, it generates a status report of what was harvested and/or what went wrong (eg. exception report). These reports are stored in a table in the database used by GeoNetwork. The entire harvesting history for all harvesters can be recalled using the History button on the Harvesting Management page. The harvest history for an individual harvester can also be recalled using the History link in the Operations for that harvester.
+### Harvester records
 
-Once the harvest history has been displayed it is possible to:
+When a harvester is executed, you can see the list of harvested metadata and some statistics about the metadata. Select a harvester in the list of harvesters and select the `Metadata records` tab on the harvester page:
 
-- expand the detail of any exceptions
-- sort the history by harvest date (or in the case of the history of all harvesters, by harvester name)
-- delete any history entry or the entire history
+![](img/harvester-statistics.png)
diff --git a/docs/manual/mkdocs.yml b/docs/manual/mkdocs.yml
index 60c763a4cc0..8f4683ec76f 100644
--- a/docs/manual/mkdocs.yml
+++ b/docs/manual/mkdocs.yml
@@ -283,6 +283,7 @@ nav:
       - user-guide/harvesting/harvesting-csw.md
       - user-guide/harvesting/harvesting-filesystem.md
       - user-guide/harvesting/harvesting-geonetwork.md
+      - user-guide/harvesting/harvesting-geonetwork-2.md
       - user-guide/harvesting/harvesting-geoportal.md
       - user-guide/harvesting/harvesting-oaipmh.md
       - user-guide/harvesting/harvesting-ogcwxs.md
@@ -291,7 +292,6 @@ nav:
       - user-guide/harvesting/harvesting-thredds.md
      - user-guide/harvesting/harvesting-webdav.md
       - user-guide/harvesting/harvesting-wfs-features.md
-      - user-guide/harvesting/harvesting-z3950.md
       - user-guide/export/index.md
   - 'Administration':
       - administrator-guide/index.md
diff --git a/harvesters/src/main/java/org/fao/geonet/kernel/harvest/harvester/HarvesterUtil.java b/harvesters/src/main/java/org/fao/geonet/kernel/harvest/harvester/HarvesterUtil.java
index cf30c71312c..ce411b33256 100644
--- a/harvesters/src/main/java/org/fao/geonet/kernel/harvest/harvester/HarvesterUtil.java
+++ b/harvesters/src/main/java/org/fao/geonet/kernel/harvest/harvester/HarvesterUtil.java
@@ -23,18 +23,19 @@
 package org.fao.geonet.kernel.harvest.harvester;
 
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.HashMap;
+import java.util.Map;
+import org.fao.geonet.ApplicationContextHolder;
 import org.fao.geonet.constants.Geonet;
 import org.fao.geonet.domain.Pair;
+import org.fao.geonet.kernel.GeonetworkDataDirectory;
 import org.fao.geonet.kernel.schema.MetadataSchema;
 import org.fao.geonet.utils.Xml;
 import org.jdom.Element;
 import org.slf4j.LoggerFactory;
 
-import java.nio.file.Files;
-import java.nio.file.Path;
-import java.util.HashMap;
-import java.util.Map;
-
 /**
  * Created by francois on 3/7/14.
  */
@@ -74,8 +75,7 @@
     public static Element processMetadata(MetadataSchema metadataSchema,
                                           Element md,
                                           String processName,
                                           Map<String, Object> processParams) {
-
-        Path filePath = metadataSchema.getSchemaDir().resolve("process").resolve(processName + ".xsl");
+        Path filePath = ApplicationContextHolder.get().getBean(GeonetworkDataDirectory.class).getXsltConversion(processName);
         if (!Files.exists(filePath)) {
             LOGGER.info(" processing instruction not found for {} schema. metadata not filtered.", metadataSchema.getName());
         } else {
diff --git a/harvesters/src/main/java/org/fao/geonet/kernel/harvest/harvester/webdav/Harvester.java b/harvesters/src/main/java/org/fao/geonet/kernel/harvest/harvester/webdav/Harvester.java
index 81dad939cad..cf8717e5213 100644
--- a/harvesters/src/main/java/org/fao/geonet/kernel/harvest/harvester/webdav/Harvester.java
+++ b/harvesters/src/main/java/org/fao/geonet/kernel/harvest/harvester/webdav/Harvester.java
@@ -1,5 +1,5 @@
 //=============================================================================
-//=== Copyright (C) 2001-2007 Food and Agriculture Organization of the
+//=== Copyright (C) 2001-2024 Food and Agriculture Organization of the
 //=== United Nations (FAO-UN), United Nations World Food Programme (WFP)
 //=== and United Nations Environment Programme (UNEP)
 //===
@@ -22,32 +22,21 @@
 //==============================================================================
 package org.fao.geonet.kernel.harvest.harvester.webdav;
 
-import java.util.LinkedList;
-import java.util.List;
-import java.util.UUID;
+import java.util.*;
 import java.util.concurrent.atomic.AtomicBoolean;
 import org.apache.commons.lang.StringUtils;
 import org.fao.geonet.GeonetContext;
 import org.fao.geonet.Logger;
 import org.fao.geonet.constants.Geonet;
-import org.fao.geonet.domain.AbstractMetadata;
-import org.fao.geonet.domain.ISODate;
-import org.fao.geonet.domain.Metadata;
-import org.fao.geonet.domain.MetadataType;
+import org.fao.geonet.domain.*;
 import org.fao.geonet.exceptions.NoSchemaMatchesException;
 import org.fao.geonet.kernel.DataManager;
 import org.fao.geonet.kernel.SchemaManager;
 import org.fao.geonet.kernel.UpdateDatestamp;
 import org.fao.geonet.kernel.datamanager.IMetadataManager;
 import org.fao.geonet.kernel.harvest.BaseAligner;
-import org.fao.geonet.kernel.harvest.harvester.CategoryMapper;
-import org.fao.geonet.kernel.harvest.harvester.GroupMapper;
-import org.fao.geonet.kernel.harvest.harvester.HarvestError;
-import org.fao.geonet.kernel.harvest.harvester.HarvestResult;
-import org.fao.geonet.kernel.harvest.harvester.IHarvester;
-import org.fao.geonet.kernel.harvest.harvester.RecordInfo;
-import org.fao.geonet.kernel.harvest.harvester.UriMapper;
+import org.fao.geonet.kernel.harvest.harvester.*;
 import org.fao.geonet.kernel.search.IndexingMode;
 import org.fao.geonet.repository.MetadataRepository;
 import org.fao.geonet.repository.OperationAllowedRepository;
@@ -94,7 +83,9 @@ class Harvester extends BaseAligner<WebDavParams> implements IHarvester<HarvestResult> {
 
-    private List<HarvestError> errors = new LinkedList<HarvestError>();
+    private List<HarvestError> errors = new LinkedList<>();
+    private String processName;
+    private Map<String, Object> processParams = new HashMap<>();
 
     public Harvester(AtomicBoolean cancelMonitor, Logger log, ServiceContext context, WebDavParams params) {
         super(cancelMonitor);
@@ -154,6 +145,10 @@ private void align(final List<RemoteFile> files) throws Exception {
         localGroups = new GroupMapper(context);
         localUris = new UriMapper(context, params.getUuid());
 
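+        // Parse the configured XSL filter specification (e.g. "anonymizer?protocol=...&email=...")
+        // once per harvester run; the process name and parameters are reused for every record.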
+        Pair<String, Map<String, Object>> filter = HarvesterUtil.parseXSLFilter(params.xslfilter);
+        processName = filter.one();
+        processParams = filter.two();
+
         //-----------------------------------------------------------------------
         //--- remove old metadata
         for (final String uri : localUris.getUris()) {
@@ -259,6 +254,7 @@ private void addMetadata(RemoteFile rf) throws Exception {
             case SKIP:
                 log.info("Skipping record with uuid " + uuid);
                 result.uuidSkipped++;
+                return;
             default:
                 return;
         }
@@ -292,6 +288,13 @@ private void addMetadata(RemoteFile rf) throws Exception {
             md = translateMetadataContent(context, md, schema);
         }
 
+        if (StringUtils.isNotEmpty(params.xslfilter)) {
+            md = HarvesterUtil.processMetadata(dataMan.getSchema(schema),
+                md, processName, processParams);
+
+            schema = dataMan.autodetectSchema(md);
+        }
+
         //
         // insert metadata
         //
@@ -310,6 +313,11 @@ private void addMetadata(RemoteFile rf) throws Exception {
                 date = rf.getChangeDate();
             }
         }
+
+        if (date == null) {
+            date = new ISODate();
+        }
+
         AbstractMetadata metadata = new Metadata();
         metadata.setUuid(uuid);
         metadata.getDataInfo().
@@ -385,11 +393,11 @@
      * harvester are applied. Also, it changes the ownership of the record so it is assigned to the
      * new harvester that last updated it.
      * @param rf
-     * @param record
+     * @param recordInfo
      * @param force
     * @throws Exception
      */
-    private void updateMetadata(RemoteFile rf, RecordInfo record, Boolean force) throws Exception {
+    private void updateMetadata(RemoteFile rf, RecordInfo recordInfo, boolean force) throws Exception {
         Element md = null;
 
         // Get the change date from the metadata content. If not possible, get it from the file change date if available
@@ -411,8 +419,8 @@
                 //Update only if different
                 String uuid = dataMan.extractUUID(schema, md);
-                if (!record.uuid.equals(uuid)) {
-                    md = dataMan.setUUID(schema, record.uuid, md);
+                if (!recordInfo.uuid.equals(uuid)) {
+                    md = dataMan.setUUID(schema, recordInfo.uuid, md);
                 }
             } catch (Exception e) {
                 log.error("  - Failed to set uuid for metadata with remote path : " + rf.getPath());
@@ -424,7 +432,7 @@
                 date = dataMan.extractDateModified(schema, md);
             } catch (Exception ex) {
                 log.error("WebDavHarvester - updateMetadata - Can't get metadata modified date for metadata id= "
-                    + record.id + ", using current date for modified date");
+                    + recordInfo.id + ", using current date for modified date");
                 // WAF harvester, rf.getChangeDate() returns null
                 if (rf.getChangeDate() != null) {
                     date = rf.getChangeDate().getDateAndTime();
@@ -434,7 +442,7 @@
             }
         }
 
-        if (!force && !rf.isMoreRecentThan(record.changeDate)) {
+        if (!force && !rf.isMoreRecentThan(recordInfo.changeDate)) {
             if (log.isDebugEnabled())
                 log.debug("  - Metadata XML not changed for path : " + rf.getPath());
             result.unchangedMetadata++;
@@ -454,8 +462,8 @@
                 //Update only if different
                 String uuid = dataMan.extractUUID(schema, md);
-                if (!record.uuid.equals(uuid)) {
-                    md = dataMan.setUUID(schema, record.uuid, md);
+                if (!recordInfo.uuid.equals(uuid)) {
+                    md = dataMan.setUUID(schema, recordInfo.uuid, md);
                 }
             } catch (Exception e) {
                 log.error("  - Failed to set uuid for metadata with remote path : " + rf.getPath());
@@ -467,7 +475,7 @@
                 date = dataMan.extractDateModified(schema, md);
             } catch (Exception ex) {
                 log.error("WebDavHarvester - updateMetadata - Can't get metadata modified date for metadata id= "
-                    + record.id + ", using current date for modified date");
+                    + recordInfo.id + ", using current date for modified date");
                 // WAF harvester, rf.getChangeDate() returns null
                 if (rf.getChangeDate() != null) {
                     date = rf.getChangeDate().getDateAndTime();
                 }
             }
 
@@ -475,12 +483,16 @@
-        // Translate metadata
         if (params.isTranslateContent()) {
             md = translateMetadataContent(context, md, schema);
         }
 
+        if (StringUtils.isNotEmpty(params.xslfilter)) {
+            md = HarvesterUtil.processMetadata(dataMan.getSchema(schema),
+                md, processName, processParams);
+        }
+
         //
         // update metadata
         //
@@ -488,7 +500,7 @@
         boolean ufo = false;
         String language = context.getLanguage();
 
-        final AbstractMetadata metadata = metadataManager.updateMetadata(context, record.id, md, validate, ufo, language,
+        final AbstractMetadata metadata = metadataManager.updateMetadata(context, recordInfo.id, md, validate, ufo, language,
             date, false, IndexingMode.none);
 
         if(force) {
@@ -502,15 +514,15 @@
         //--- the administrator could change privileges and categories using the
         //--- web interface so we have to re-set both
         OperationAllowedRepository repository = context.getBean(OperationAllowedRepository.class);
-        repository.deleteAllByMetadataId(Integer.parseInt(record.id));
-        addPrivileges(record.id, params.getPrivileges(), localGroups, context);
+        repository.deleteAllByMetadataId(Integer.parseInt(recordInfo.id));
+        addPrivileges(recordInfo.id, params.getPrivileges(), localGroups, context);
 
         metadata.getCategories().clear();
         addCategories(metadata, params.getCategories(), localCateg, context, null, true);
 
         dataMan.flush();
 
-        dataMan.indexMetadata(record.id, true);
+        dataMan.indexMetadata(recordInfo.id, true);
     }
 }
diff --git a/harvesters/src/main/java/org/fao/geonet/kernel/harvest/harvester/webdav/WebDavHarvester.java b/harvesters/src/main/java/org/fao/geonet/kernel/harvest/harvester/webdav/WebDavHarvester.java
index e6cc3af1a9d..e745a5b3311 100644
--- a/harvesters/src/main/java/org/fao/geonet/kernel/harvest/harvester/webdav/WebDavHarvester.java
+++ b/harvesters/src/main/java/org/fao/geonet/kernel/harvest/harvester/webdav/WebDavHarvester.java
@@ -1,5 +1,5 @@
 //=============================================================================
-//=== Copyright (C) 2001-2007 Food and Agriculture Organization of the
+//=== Copyright (C) 2001-2024 Food and Agriculture Organization of the
 //=== United Nations (FAO-UN), United Nations World Food Programme (WFP)
 //=== and United Nations Environment Programme (UNEP)
 //===
@@ -28,40 +28,23 @@
 import java.sql.SQLException;
 
-//=============================================================================
-
 public class WebDavHarvester extends AbstractHarvester<HarvestResult, WebDavParams> {
 
-    //---------------------------------------------------------------------------
-    //---
-    //--- Add
-    //---
-    //---------------------------------------------------------------------------
-
     @Override
     protected WebDavParams createParams() {
         return new WebDavParams(dataMan);
     }
 
     //---------------------------------------------------------------------------
 
+    @Override
     protected void storeNodeExtra(WebDavParams params, String path, String siteId, String optionsId) throws SQLException {
         harvesterSettingsManager.add("id:" + siteId, "url", params.url);
         harvesterSettingsManager.add("id:" + siteId, "icon", params.icon);
         harvesterSettingsManager.add("id:" + optionsId, "validate", params.getValidate());
         harvesterSettingsManager.add("id:" + optionsId, "recurse", params.recurse);
         harvesterSettingsManager.add("id:" + optionsId, "subtype", params.subtype);
+        harvesterSettingsManager.add("id:" + siteId, "xslfilter", params.xslfilter);
     }
 
-    //---------------------------------------------------------------------------
-    //---
-    //--- Variables
-    //---
-    //---------------------------------------------------------------------------
-
-    //---------------------------------------------------------------------------
-    //---
-    //--- Harvest
-    //---
-    //---------------------------------------------------------------------------
 
     public void doHarvest(Logger log) throws Exception {
         log.info("WebDav doHarvest start");
         Harvester h = new Harvester(cancelMonitor, log, context, params);
diff --git a/harvesters/src/main/java/org/fao/geonet/kernel/harvest/harvester/webdav/WebDavParams.java b/harvesters/src/main/java/org/fao/geonet/kernel/harvest/harvester/webdav/WebDavParams.java
index d264bb908fb..c32bfd40cda 100644
--- a/harvesters/src/main/java/org/fao/geonet/kernel/harvest/harvester/webdav/WebDavParams.java
+++ b/harvesters/src/main/java/org/fao/geonet/kernel/harvest/harvester/webdav/WebDavParams.java
@@ -1,5 +1,5 @@
 //=============================================================================
-//=== Copyright (C) 2001-2007 Food and Agriculture Organization of the
+//=== Copyright (C) 2001-2024 Food and Agriculture Organization of the
 //=== United Nations (FAO-UN), United Nations World Food Programme (WFP)
 //=== and United Nations Environment Programme (UNEP)
 //===
@@ -29,61 +29,44 @@
 import org.fao.geonet.kernel.harvest.harvester.AbstractParams;
 import org.jdom.Element;
 
-//=============================================================================
-
 public class WebDavParams extends AbstractParams {
 
-    //--------------------------------------------------------------------------
-    //---
-    //--- Constructor
-    //---
-    //--------------------------------------------------------------------------
+
     /**
     * url of webdav folder to harvest
     */
    public String url;
 
-    //---------------------------------------------------------------------------
-    //---
-    //--- Create : called when a new entry must be added. Reads values from the
-    //--- provided entry, providing default values
-    //---
-    //---------------------------------------------------------------------------
    /**
     * Icon to use for harvester
     */
    public String icon;
 
-    //---------------------------------------------------------------------------
-    //---
-    //--- Update : called when an entry has changed and variables must be updated
-    //---
-    //---------------------------------------------------------------------------
    /**
     * If true recurse into directories.
     */
    public boolean recurse;
 
-    //---------------------------------------------------------------------------
-    //---
-    //--- Other API methods
-    //---
-    //---------------------------------------------------------------------------
    /**
     * Flag indicating if WAFRetriever or WebDavRetriever should be used.
     */
    public String subtype;
 
-    //---------------------------------------------------------------------------
-    //---
-    //--- Variables
-    //---
-    //---------------------------------------------------------------------------
+    /**
+     * The filter is a process (see schema/process folder) which depends on the schema. It can be
+     * composed of parameters which will be sent to the XSL transformation using the following syntax:
+     * <pre>
+     * anonymizer?protocol=MYLOCALNETWORK:FILEPATH&email=gis@organisation.org&thesaurus=MYORGONLYTHEASURUS
+     * </pre>
+     */
+    public String xslfilter;
+
     public WebDavParams(DataManager dm) {
         super(dm);
     }
 
+    @Override
     public void create(Element node) throws BadInputEx {
         super.create(node);
@@ -92,12 +75,14 @@ public void create(Element node) throws BadInputEx {
         url = Util.getParam(site, "url", "");
         icon = Util.getParam(site, "icon", "");
+        xslfilter = Util.getParam(site, "xslfilter", "");
 
         recurse = Util.getParam(opt, "recurse", false);
         subtype = Util.getParam(opt, "subtype", "");
     }
 
+    @Override
     public void update(Element node) throws BadInputEx {
         super.update(node);
@@ -106,6 +91,7 @@ public void update(Element node) throws BadInputEx {
         url = Util.getParam(site, "url", url);
         icon = Util.getParam(site, "icon", icon);
+        xslfilter = Util.getParam(site, "xslfilter", "");
 
         recurse = Util.getParam(opt, "recurse", recurse);
         subtype = Util.getParam(opt, "subtype", subtype);
@@ -117,6 +103,7 @@ public WebDavParams copy() {
         copy.url = url;
         copy.icon = icon;
+        copy.xslfilter = xslfilter;
 
         copy.setValidate(getValidate());
         copy.recurse = recurse;
@@ -131,7 +118,3 @@ public String getIcon() {
         return icon;
     }
 }
-
-//=============================================================================
-
-
diff --git a/web-ui/src/main/resources/catalog/templates/admin/harvest/type/webdav.html b/web-ui/src/main/resources/catalog/templates/admin/harvest/type/webdav.html
index 8057edba033..77ea9d96c9c 100644
--- a/web-ui/src/main/resources/catalog/templates/admin/harvest/type/webdav.html
+++ b/web-ui/src/main/resources/catalog/templates/admin/harvest/type/webdav.html
@@ -98,6 +98,21 @@
           filteringAndProcessing
+          [HTML form markup lost in extraction: this hunk adds an XSL filter text input, with the help text key "geonetwork-xslfilterHelp", to the filteringAndProcessing fieldset of the WebDAV harvester form.]