Documentation / update documentation for harvesters and add Geonetwork 2.1-3.X harvester page
GeoCat · Jul 4, 2024 · 68f3696 · 68f3696
1 parent e8d25d4
commit 68f3696

Showing 11 changed files with 84 additions and 70 deletions.
diff --git a/docs/manual/docs/user-guide/harvesting/harvesting-geonetwork-2.md b/docs/manual/docs/user-guide/harvesting/harvesting-geonetwork-2.md
@@ -0,0 +1,9 @@
+# GeoNetwork 2.0 Harvester {#gn2_harvester}
+
+## Upgrading from GeoNetwork 2.0 Guidance
+
+GeoNetwork 2.1 introduced a new powerful harvesting engine which is not compatible with GeoNetwork version 2.0 based catalogues.
+
+* Harvesting metadata from a v2.0 server requires this harvesting type.
+* Old 2.0 servers can still harvest from 2.1 servers.
+* Because GeoNetwork 2.0 is no longer suitable for production use, this harvesting type is deprecated.
diff --git a/docs/manual/docs/user-guide/harvesting/harvesting-geonetwork.md b/docs/manual/docs/user-guide/harvesting/harvesting-geonetwork.md
@@ -1,9 +1,40 @@
-# GeoNetwork 2.0 Harvester {#gn2_harvester}
+# GeoNetwork 2.1-3.X Harvester
 
-## Upgrading from GeoNetwork 2.0 Guidance
+This harvester connects to a remote GeoNetwork server running a version from 2.1 to 3.X and retrieves the metadata records that match the query parameters.
 
-GeoNetwork 2.1 introduced a new powerful harvesting engine which is not compatible with GeoNetwork version 2.0 based catalogues.
+## Adding a GeoNetwork 2.1-3.X harvester
 
-* Harvesting metadata from a v2.0 server requires this harvesting type.
-* Old 2.0 servers can still harvest from 2.1 servers
-* Due to the fact that GeoNetwork 2.0 is no longer suitable for production use, this harvesting type is deprecated.
+To create a GeoNetwork 2.1-3.X harvester, go to `Admin console` > `Harvesting` and select `Harvest from` > `GeoNetwork (from 2.1 to 3.x)`:
+
+![](img/add-geonetwork-3-harvester.png)
+
+Provide the following information:
+
+-   **Identification**
+    -   *Node name and logo*: A unique name for the harvester and optionally a logo to assign to the harvester.
+    -   *Group*: Group which owns the harvested records. Only the catalog administrator or users with the profile `UserAdmin` of this group can manage the harvester.
+    -   *User*: User who owns the harvested records.
+
+-   **Schedule**: Scheduling options to execute the harvester. If disabled, the harvester must be run manually from the harvesters page. If enabled, configure a schedule expression using cron syntax ([see examples](https://www.quartz-scheduler.org/documentation/quartz-2.1.7/tutorials/crontrigger)).
+
+-   **Configure connection to GeoNetwork (from 2.1 to 3.x)**
+    -   *Catalog URL*:
+        - The remote URL of the GeoNetwork server from which metadata will be harvested. The URL should contain the catalog name, for example: http://www.fao.org/geonetwork.
+        - Additionally, configure the node name, usually `srv`.
+    -   *Search filter*: (Optional) Define the filter used to retrieve the remote metadata.
+    -   *Catalog*: (Optional) Select the portal on the remote server to harvest.
+    -   *Remote authentication*: If checked, provide the credentials for basic HTTP authentication on the remote GeoNetwork server.
+    -   *Use full MEF format*: If checked, uses MEF format instead of XML to retrieve the remote metadata. Recommended for metadata with files.
+
+-   **Configure response processing for GeoNetwork**
+    -   *Action on UUID collision*: When a harvester finds the same UUID on a record collected by another method (another harvester, importer, dashboard editor, ...), should the record be skipped (default), overridden, or assigned a new UUID?
+    -   *Use change date for comparison*: If checked, uses the change date to detect changes on the remote server.
+    -   *Set category if it exists locally*: If checked, the category set on the metadata on the remote server is also assigned locally, provided that the category exists locally. Applies only when harvesting in MEF format.
+    -   *Category*: (Optional) A GeoNetwork category to assign to each metadata record.
+    -   *XSL filter name to apply*: (Optional) The XSL filter is applied to each metadata record.  The filter is a process which depends on the schema (see the `process` folder of the schemas).
+
+        The filter name can include parameters, which are passed to the XSL transformation, using the following syntax: `anonymizer?protocol=MYLOCALNETWORK:FILEPATH&[email protected]&thesaurus=MYORGONLYTHEASURUS`
+
+    -   *Validate records before import*: If checked, the metadata will be validated after retrieval. If the validation does not pass, the metadata will be skipped.
+
+-   **Privileges** - Assign privileges to harvested metadata.
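
Note on the **Schedule** option: the linked Quartz tutorial describes expressions with six mandatory fields (seconds, minutes, hours, day of month, month, day of week) plus an optional year. The sketch below is not GeoNetwork code; `split_quartz_cron` is a hypothetical helper that only checks the field count and labels each field, which can help when reading an expression before pasting it into the harvester form.

```python
# Field names in the order used by Quartz cron expressions.
QUARTZ_FIELDS = (
    "seconds", "minutes", "hours",
    "day_of_month", "month", "day_of_week", "year",
)


def split_quartz_cron(expression):
    """Label the whitespace-separated fields of a Quartz cron expression.

    Hypothetical helper: it checks only the number of fields
    (6, or 7 with the optional year), not the values inside them.
    """
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError(
            f"expected 6 or 7 fields, got {len(parts)}: {expression!r}"
        )
    # zip stops at the shorter sequence, so 'year' is absent for 6 fields.
    return dict(zip(QUARTZ_FIELDS, parts))


# "0 0 12 * * ?" is listed in the Quartz tutorial as firing at noon every day.
fields = split_quartz_cron("0 0 12 * * ?")
print(fields["hours"])
```

A five-field Unix-style expression such as `0 12 * * ?` raises `ValueError` here, which is a common mistake when moving from crontab to Quartz syntax.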
diff --git a/docs/manual/docs/user-guide/harvesting/harvesting-webdav.md b/docs/manual/docs/user-guide/harvesting/harvesting-webdav.md
@@ -4,6 +4,12 @@ This harvesting type uses the WebDAV (Distributed Authoring and Versioning) prot
 
 ## Adding a WebDAV harvester
 
+To create a WebDAV harvester, go to `Admin console` > `Harvesting` and select `Harvest from` > `WebDAV / WAF`:
+
+![](img/add-webdav-harvester.png)
+
+Provide the following information:
+
 -   **Identification**
     -   *Node name and logo*: A unique name for the harvester and optionally a logo to assign to the harvester. 
     -   *Group*: Group which owns the harvested records. Only the catalog administrator or users with the profile `UserAdmin` of this group can manage the harvester.
@@ -19,11 +25,11 @@ This harvesting type uses the WebDAV (Distributed Authoring and Versioning) prot
 
 -   **Configure response processing for webdav**
     -   *Action on UUID collision*: When a harvester finds the same uuid on a record collected by another method (another harvester, importer, dashboard editor,...), should this record be skipped (default), overriden or generate a new UUID?
-    -   *XSL filter name to apply*: (Optional) The XSL filter is applied to each metadata record.
+    -   *XSL filter name to apply*: (Optional) The XSL filter is applied to each metadata record.  The filter is a process which depends on the schema (see the `process` folder of the schemas). 
+
+        The filter name can include parameters, which are passed to the XSL transformation, using the following syntax: `anonymizer?protocol=MYLOCALNETWORK:FILEPATH&[email protected]&thesaurus=MYORGONLYTHEASURUS`
+
     -   *Validate records before import*: If checked, the metadata will be validated after retrieval. If the validation does not pass, the metadata will be skipped.
     -   *Category*: (Optional) A GeoNetwork category to assign to each metadata record.
+
 -   **Privileges** - Assign privileges to harvested metadata.
-
-!!! Notes
-
-    -   The same metadata could be harvested several times by different instances of the WebDAV harvester. This is not good practise because copies of the same metadata record will have a different UUID.
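
Note on the *XSL filter name to apply* syntax: the value has the shape `process?name=value&name=value`, like a URL query string. The sketch below shows one way to read such a value; it is an illustration only and not how GeoNetwork parses it. `split_xsl_filter` is a hypothetical helper, and the sample value keeps only the `protocol` and `thesaurus` parameters from the documented example.

```python
from urllib.parse import parse_qsl


def split_xsl_filter(value):
    """Split 'process?key=value&key=value' into (process, parameters).

    Hypothetical helper for illustration; the parameters are returned
    as a dict of strings.
    """
    process, _, query = value.partition("?")
    return process, dict(parse_qsl(query))


process, params = split_xsl_filter(
    "anonymizer?protocol=MYLOCALNETWORK:FILEPATH&thesaurus=MYORGONLYTHEASURUS"
)
print(process)
print(params)
```

A filter name without parameters, such as `anonymizer`, yields an empty parameter dictionary.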
diff --git a/docs/manual/docs/user-guide/harvesting/img/add-geonetwork-3-harvester.png b/docs/manual/docs/user-guide/harvesting/img/add-geonetwork-3-harvester.png
diff --git a/docs/manual/docs/user-guide/harvesting/img/add-harvester.png b/docs/manual/docs/user-guide/harvesting/img/add-harvester.png
diff --git a/docs/manual/docs/user-guide/harvesting/img/add-webdav-harvester.png b/docs/manual/docs/user-guide/harvesting/img/add-webdav-harvester.png
diff --git a/docs/manual/docs/user-guide/harvesting/img/harvester-history.png b/docs/manual/docs/user-guide/harvesting/img/harvester-history.png
diff --git a/docs/manual/docs/user-guide/harvesting/img/harvester-statistics.png b/docs/manual/docs/user-guide/harvesting/img/harvester-statistics.png
diff --git a/docs/manual/docs/user-guide/harvesting/img/harvesters.png b/docs/manual/docs/user-guide/harvesting/img/harvesters.png
diff --git a/docs/manual/docs/user-guide/harvesting/index.md b/docs/manual/docs/user-guide/harvesting/index.md
@@ -6,7 +6,8 @@ Harvesting is the process of ingesting metadata from remote sources and storing
 
 The following sources can be harvested:
 
--   [GeoNetwork 2.0 Harvester](harvesting-geonetwork.md)
+-   [GeoNetwork 2.1-3.X Harvester](harvesting-geonetwork.md)
+-   [GeoNetwork 2.0 Harvester](harvesting-geonetwork-2.md)
 -   [Harvesting CSW services](harvesting-csw.md)
 -   [Harvesting OGC Services](harvesting-ogcwxs.md)
 -   [Simple URL harvesting (opendata)](harvesting-simpleurl.md)
@@ -134,79 +135,45 @@ The script will add the certificate to the JVM keystore, if you run it as follow
 
     $ ./ssl_key_import.sh https_server_name 443
 
-## The main page
+## Harvesting page
 
-To access the harvesting main page you have to be logged in as an administrator. From the administration page, select the harvest shortcut. The harvesting main page will then be displayed.
+To access the harvesting page you must be logged in with the `Administrator` or `UserAdmin` profile. From the `Admin console` menu, select `Harvesting`. The harvesting page is then displayed.
 
-The page shows a list of the currently defined harvesters and a set of buttons for management functions. The meaning of each column in the list of harvesters is as follows:
+The page shows a list of the currently defined harvesters with information about the status of each harvester:
 
-1.  *Select* Check box to select one or more harvesters. The selected harvesters will be affected by the first row of buttons (activate, deactivate, run, remove). For example, if you select three harvesters and press the Remove button, they will all be removed.
-2.  *Name* This is the harvester name provided by the administrator.
-3.  *Type* The harvester type (eg. GeoNetwork, WebDAV etc\...).
-4.  *Status* An icon showing current status. See [Harvesting Status and Error Icons](index.md#admin_harvesting_status) for the different icons and status descriptions.
-5.  *Errors* An icon showing the result of the last harvesting run, which could have succeeded or not. See [Harvesting Status and Error Icons](index.md#admin_harvesting_status) for the different icons and error descriptions. Hovering the cursor over the icon will show detailed information about the last harvesting run.
-6.  *Run at* and *Every*: Scheduling of harvester runs. Essentially the time of the day + how many hours between repeats and on which days the harvester will run.
-7.  *Last run* The date, in ISO 8601 format, of the most recent harvesting run.
-8.  *Operation* A list of buttons/links to operations on a harvester.
-    -   Selecting *Edit* will allow you to change the parameters for a harvester.
-    -   Selecting *Clone* will allow you to create a clone of this harvester and start editing the details of the clone.
-    -   Selecting *History* will allow you to view/change the harvesting history for a harvester - see [Harvest History](index.md#harvest_history).
+![](img/harvesters.png)
 
-At the bottom of the list of harvesters are two rows of buttons. The first row contains buttons that can operate on a selected set of harvesters. You can select the harvesters you want to operate on using the check box in the Select column and then press one of these buttons. When the button finishes its action, the check boxes are cleared. Here is the meaning of each button:
+The following information is displayed for each harvester:
 
-1.  *Activate* When a new harvester is created, the status is *inactive*. Use this button to make it *active* and start the harvester(s) according to the schedule it has/they have been configured to use.
-2.  *Deactivate* Stops the harvester(s). Note: this does not mean that currently running harvest(s) will be stopped. Instead, it means that the harvester(s) will not be scheduled to run again.
-3.  *Run* Start the selected harvesters immediately. This is useful for testing harvester setups.
-4.  *Remove* Remove all currently selected harvesters. A dialogue will ask the user to confirm the action.
+-   **Last run**: Date of the most recent harvester run.
+-   **Total**: Total number of metadata records found remotely. Records with the same ID are counted as one.
+-   **Updated**: Number of metadata records that are present locally but had to be updated because their last change date differed from the remote one.
+-   **Unchanged**: Number of local metadata records left unchanged because their remote last change date did not change.
 
-The second row contains general purpose buttons. Here is the meaning of each button:
+The following buttons are at the bottom of the list of harvesters:
 
-1.  *Back* Simply returns to the main administration page.
-2.  *Add* This button creates a new harvester.
-3.  *Refresh* Refreshes the current list of harvesters from the server. This can be useful to see if the harvesting list has been altered by someone else or to get the status of any running harvesters.
-4.  *History* Show the harvesting history of all harvesters. See [Harvest History](index.md#harvest_history) for more details.
+1. *Harvest from*: Selects the type of harvester to create.
+2. *Clone*: Creates a new harvester, using the information of an existing harvester.
+3. *Refresh*: Refreshes the list of harvesters.
 
-## Harvesting Status and Error Icons {#admin_harvesting_status}
+### Adding new harvesters
 
-## Harvesting result tips
+To add a new harvester, click the `Harvest from` button. A dropdown list is then shown with all the available harvester protocols.
 
-When a harvester runs and completes, a tool tip showing detailed information about the harvesting process is shown in the **Errors** column for the harvester. If the harvester succeeded then hovering the cursor over the tool tip will show a table, with some rows labelled as follows:
+![](img/add-harvester.png)
 
--   **Total** - This is the total number of metadata found remotely. Metadata with the same id are considered as one.
--   **Added** - Number of metadata added to the system because they were not present locally.
--   **Removed** - Number of metadata that have been removed locally because they are not present in the remote server anymore.
--   **Updated** - Number of metadata that are present locally but that needed to be updated because their last change date was different from the remote one.
--   **Unchanged** - Local metadata left unchanged. Their remote last change date did not change.
--   **Unknown schema** - Number of skipped metadata because their format was not recognised by GeoNetwork.
--   **Unretrievable** - Number of metadata that were ready to be retrieved from the remote server but for some reason there was an exception during the data transfer process.
--   **Bad Format** - Number of skipped metadata because they did not have a valid XML representation.
--   **Does not validate** - Number of metadata which did not validate against their schema. These metadata were harvested with success but skipped due to the validation process. Usually, there is an option to force validation: if you want to harvest these metadata anyway, simply turn/leave it off.
--   **Thumbnails/Thumbnails failed** - Number of metadata thumbnail images added/that could not be added due to some failure.
--   **Metadata URL attribute used** - Number of layers/featuretypes/coverages that had a metadata URL that could be used to link to a metadata record (OGC Service Harvester only).
--   **Services added** - Number of ISO19119 service records created and added to the catalogue (for THREDDS catalog harvesting only).
--   **Collections added** - Number of collection dataset records added to the catalogue (for THREDDS catalog harvesting only).
--   **Atomics added** - Number of atomic dataset records added to the catalogue (for THREDDS catalog harvesting only).
--   **Subtemplates added** - Number of subtemplates (= fragment visible in the catalog) added to the metadata catalog.
--   **Subtemplates removed** - Number of subtemplates (= fragment visible in the catalog) removed from the metadata catalog.
--   **Fragments w/Unknown schema** - Number of fragments which have an unknown metadata schema.
--   **Fragments returned** - Number of fragments returned by the harvester.
--   **Fragments matched** - Number of fragments that had identifiers that in the template used by the harvester.
--   **Existing datasets** - Number of metadata records for datasets that existed when the THREDDS harvester was run.
--   **Records built** - Number of records built by the harvester from the template and fragments.
--   **Could not insert** - Number of records that the harvester could not insert into the catalog (usually because the record was already present eg. in the Z3950 harvester this can occur if the same record is harvested from different servers).
+You can choose the type of harvest you intend to perform. The supported harvesters and details of what to do next are in the following sections:
 
-## Adding new harvesters
+### Harvester History {#harvest_history}
 
-The Add button in the main page allows you to add new harvesters. A drop down list is then shown with all the available harvester protocols.
+Each time a harvester is run, it generates a log file of what was harvested and/or what went wrong (e.g. an exception report). To view the harvester history, select a harvester in the harvesters list and select the `Harvester history` tab in the harvester page:
 
-You can choose the type of harvest you intend to perform and press *Add* to begin the process of adding the harvester. The supported harvesters and details of what to do next are in the following sections:
+![](img/harvester-history.png)
 
-## Harvest History {#harvest_history}
+Once the harvester history is displayed, you can download the log file of a harvester run and delete the harvester history.
 
-Each time a harvester is run, it generates a status report of what was harvested and/or what went wrong (eg. exception report). These reports are stored in a table in the database used by GeoNetwork. The entire harvesting history for all harvesters can be recalled using the History button on the Harvesting Management page. The harvest history for an individual harvester can also be recalled using the History link in the Operations for that harvester.
+### Harvester records
 
-Once the harvest history has been displayed it is possible to:
+After a harvester has run, you can view the list of harvested metadata and some statistics about it. Select a harvester in the harvesters list and select the `Metadata records` tab in the harvester page:
 
--   expand the detail of any exceptions
--   sort the history by harvest date (or in the case of the history of all harvesters, by harvester name)
--   delete any history entry or the entire history
+![](img/harvester-statistics.png)
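
Note on the **Total**, **Updated** and **Unchanged** counters: the definitions in this diff amount to a comparison of last change dates between remote and local records. The sketch below restates that comparison under simplifying assumptions; it is not the harvester implementation. `summarize_harvest` and its inputs (dictionaries mapping a record UUID to a change date string) are hypothetical, and the `added` counter follows the definition in the removed "Harvesting result tips" list.

```python
def summarize_harvest(remote, local):
    """Count harvest outcomes by comparing change dates per UUID.

    remote, local: dicts mapping record UUID -> last change date.
    Hypothetical sketch of the counters described in the documentation.
    """
    stats = {"total": len(remote), "added": 0, "updated": 0, "unchanged": 0}
    for uuid, remote_date in remote.items():
        if uuid not in local:
            stats["added"] += 1      # not present locally yet
        elif local[uuid] != remote_date:
            stats["updated"] += 1    # change date differs from the remote one
        else:
            stats["unchanged"] += 1  # same change date on both sides
    return stats


remote = {"a": "2024-07-01", "b": "2024-07-03", "c": "2024-06-15"}
local = {"a": "2024-07-01", "b": "2024-06-30"}
print(summarize_harvest(remote, local))
```

Because the inputs are keyed by UUID, records with the same identifier are counted once, which matches the description of **Total**.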
diff --git a/docs/manual/mkdocs.yml b/docs/manual/mkdocs.yml
@@ -294,6 +294,7 @@ nav:
       - user-guide/harvesting/harvesting-csw.md
       - user-guide/harvesting/harvesting-filesystem.md
       - user-guide/harvesting/harvesting-geonetwork.md
+      - user-guide/harvesting/harvesting-geonetwork-2.md
       - user-guide/harvesting/harvesting-geoportal.md
       - user-guide/harvesting/harvesting-oaipmh.md
       - user-guide/harvesting/harvesting-ogcwxs.md