Improve crawler source workflow. #281

Open
dblodgett-usgs opened this issue May 4, 2022 · 6 comments

@dblodgett-usgs
Member

Currently, the crawler source TSV file is cumbersome, and adding a new crawler source requires a fairly heavy database operation.

We've always imagined having a UI of some kind that allowed registration of new crawler sources.

Let's work in that direction by implementing a stand-alone crawler source JSON object for each source that can be validated and may grow over time per #63 -- a GitHub Action could then be set up to test the JSON objects and create the crawler source TSV file?

In a future sprint, we could build a UI around the JSON objects so that contributions come through some kind of interface rather than a PR.
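
As a rough illustration of what that GitHub Action could run, here is a minimal sketch in Python. The sources/ directory, output file name, and column order are all assumptions for illustration, not settled names; the keys mirror the per-source JSON example later in this thread.

#!/usr/bin/env python
# Hypothetical CI step: validate per-source JSON files and build a crawler source TSV.
import csv
import json
import pathlib
import sys

# Assumed to mirror the keys of the per-source JSON objects.
COLUMNS = [
    "crawlerSourceId", "sourceName", "sourceSuffix", "sourceUri",
    "featureId", "featureName", "featureUri", "featureReach",
    "featureMeasure", "ingestType", "featureType",
]

def main() -> int:
    sources = []
    for path in sorted(pathlib.Path("sources").glob("*.json")):
        source = json.loads(path.read_text())
        missing = [key for key in COLUMNS if key not in source]
        if missing:
            print(f"{path}: missing keys {missing}", file=sys.stderr)
            return 1
        sources.append(source)

    with open("crawler_source.tsv", "w", newline="") as tsv:
        writer = csv.writer(tsv, delimiter="\t")
        writer.writerow(COLUMNS)
        for source in sources:
            writer.writerow([source[key] for key in COLUMNS])
    return 0

if __name__ == "__main__":
    sys.exit(main())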

@dblodgett-usgs dblodgett-usgs added this to the Spring 2022 milestone May 4, 2022
@EthanGrahn

If we take the route of JSON files representing each source, one option would be to have the crawler update the crawler source table at the start of its run rather than requiring a Liquibase execution first. Having the crawler manage its own table makes sense to me, and avoiding table data loading with Liquibase should help simplify database management long term (manage schema, not data).
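
A minimal sketch of that startup step, assuming per-source JSON files and a Postgres crawler_source table whose columns are snake_case versions of the JSON keys (both assumptions); the real crawler would do the equivalent in its own code, this just shows the idea with a subset of columns.

# Hypothetical startup step: load sources/*.json and upsert into crawler_source.
# Table and column names are assumptions for illustration only.
import json
import pathlib
import psycopg2

UPSERT = """
    INSERT INTO crawler_source (crawler_source_id, source_name, source_suffix, source_uri)
    VALUES (%(crawlerSourceId)s, %(sourceName)s, %(sourceSuffix)s, %(sourceUri)s)
    ON CONFLICT (crawler_source_id) DO UPDATE SET
        source_name = EXCLUDED.source_name,
        source_suffix = EXCLUDED.source_suffix,
        source_uri = EXCLUDED.source_uri
"""

def sync_sources(dsn: str) -> None:
    # Upsert every registered source before the crawl begins.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for path in pathlib.Path("sources").glob("*.json"):
            cur.execute(UPSERT, json.loads(path.read_text()))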

@dblodgett-usgs
Member Author

Good call -- I think this would be a big improvement. Let's keep an eye on it. I'd meant to fit this in last year and it didn't happen. It's for sure a priority.

@dblodgett-usgs dblodgett-usgs removed this from the Spring 2022 milestone Sep 30, 2022
@EthanGrahn

Example JSON file for a single source:

{
	"crawlerSourceId" : 5,
	"sourceName" : "NWIS Surface Water Sites",
	"sourceSuffix" : "nwissite",
	"sourceUri" : "https://www.sciencebase.gov/catalog/file/get/60c7b895d34e86b9389b2a6c?name=usgs_nldi_gages.geojson",
	"featureId" : "provider_id",
	"featureName" : "name",
	"featureUri" : "subjectOf",
	"featureReach" : "nhdpv2_REACHCODE",
	"featureMeasure" : "nhdpv2_REACH_measure",
	"ingestType" : "reach",
	"featureType" : "hydrolocation"
}

The key names map directly to the database columns, but we could simplify them as needed.
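
If the columns really are just snake_case versions of these keys (an assumption that may not hold for every column), a tiny helper could handle the mapping rather than hard-coding it:

import re

def key_to_column(key: str) -> str:
    """Convert a camelCase JSON key to an assumed snake_case column name."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", key).lower()

# e.g. key_to_column("crawlerSourceId") -> "crawler_source_id"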

@EthanGrahn

EthanGrahn commented Oct 28, 2022

Here's a first draft of a schema to validate the crawler source files. Let me know what you think and any suggestions for better descriptions/names, @dblodgett-usgs. It has very loose validation, mostly character limits that match the database table.

{
    "$schema": "http://json-schema.org/draft/2020-12/schema",
    "title": "Crawler Source",
    "description": "A source from which the Crawler can ingest features.",
    "type": "object",
    "properties": {
        "id": {
            "description": "The unique identifier for the source",
            "type": "integer",
            "minimum": 0,
            "maximum": 2147483647
        },
        "name": {
            "description": "A human readable name for the source",
            "type": "string",
            "pattern": "^[0-9a-zA-z_-]{1,500}$"
        },
        "suffix": {
            "description": "Unique suffix for database and service use",
            "type": "string",
            "pattern": "^[0-9a-zA-z_-]{1,1000}$"
        },
        "uri": {
            "description": "Source location to download GeoJSON features",
            "type": "string",
            "pattern": "^.{1,256}$"
        },
        "feature": {
            "description": "Metadata of the features",
            "type": {
                "$ref": "#/$defs/feature"
            }
        },
        "ingestType": {
            "description": "Method used to index feature",
            "type": "string",
            "pattern": "^(reach|point)$"
        }
    },
    "required": [
        "id",
        "name",
        "suffix",
        "uri",
        "feature",
        "ingestType"
    ],
    "$defs": {
        "feature": {
            "type": "object",
            "required": [
                "id",
                "type",
                "name",
                "uri"
            ],
            "properties": {
                "id": {
                    "type": "string",
                    "description": "Key name that maps to the ID of the feature",
                    "pattern": "^.{1,500}$"
                },
                "type": {
                    "type": "string",
                    "description": "Associated location type for this feature",
                    "pattern": "^(hydrolocation|type|varies)$"
                },
                "name": {
                    "type": "string",
                    "description": "Key name that maps to the name of the feature",
                    "pattern": "^.{1,500}$"
                },
                "uri": {
                    "type": "string",
                    "description": "Key name that maps to the URI of the feature",
                    "pattern": "^.{1,256}$"
                },
                "reach": {
                    "type": "string",
                    "description": "Key name that maps to the reachcode of the feature",
                    "pattern": "^.{1,500}$"
                },
                "measure": {
                    "type": "string",
                    "description": "Key name that maps to the measure of the feature",
                    "pattern": "^.{1,500}$"
                }
            }
        }
    }
}
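
For reference, a draft source file could be checked against this schema with the Python jsonschema package (version 4 or newer for draft 2020-12 support); the file names below are just placeholders.

import json
from jsonschema import Draft202012Validator

# Hypothetical file names for the schema and one source definition.
with open("crawler_source.schema.json") as f:
    schema = json.load(f)
with open("sources/nwissite.json") as f:
    source = json.load(f)

validator = Draft202012Validator(schema)
for error in sorted(validator.iter_errors(source), key=str):
    print(error.message)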

@EthanGrahn

Optionally, we could validate the source URI with a HEAD request to check if we get a 200 response.
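
A minimal sketch of that check using the requests library (the timeout is an arbitrary placeholder):

import requests

def uri_is_reachable(uri: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request to the source URI returns HTTP 200."""
    try:
        response = requests.head(uri, allow_redirects=True, timeout=timeout)
    except requests.RequestException:
        return False
    return response.status_code == 200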

@dblodgett-usgs
Member Author

I like that idea. I think we should go ahead and get this implemented.
