Improve crawler source workflow. #281

Open
dblodgett-usgs opened this issue May 4, 2022 · 6 comments

@dblodgett-usgs
Member

Currently, the crawler source TSV file is cumbersome, and adding a new crawler source requires a fairly heavy database operation.

We've always imagined having a UI of some kind that allowed registration of new crawler sources.

Let's work in that direction by implementing a stand-alone crawler source JSON object for each source that can be validated and may grow over time per #63 -- a GitHub Action could then be set up to test the JSON objects and create the crawler source TSV file?

In a future sprint, we could build a UI around the JSON objects so that contributions come through some kind of interface rather than a PR.
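
As a rough illustration of what that GitHub Action could run, here is a minimal sketch in Python. The sources/ directory, output file name, and column order are all assumptions for illustration, not settled names; the keys mirror the per-source JSON example later in this thread.

#!/usr/bin/env python
# Hypothetical CI step: validate per-source JSON files and build a crawler source TSV.
import csv
import json
import pathlib
import sys

# Assumed to mirror the keys of the per-source JSON objects.
COLUMNS = [
    "crawlerSourceId", "sourceName", "sourceSuffix", "sourceUri",
    "featureId", "featureName", "featureUri", "featureReach",
    "featureMeasure", "ingestType", "featureType",
]

def main() -> int:
    sources = []
    for path in sorted(pathlib.Path("sources").glob("*.json")):
        source = json.loads(path.read_text())
        missing = [key for key in COLUMNS if key not in source]
        if missing:
            print(f"{path}: missing keys {missing}", file=sys.stderr)
            return 1
        sources.append(source)

    with open("crawler_source.tsv", "w", newline="") as tsv:
        writer = csv.writer(tsv, delimiter="\t")
        writer.writerow(COLUMNS)
        for source in sources:
            writer.writerow([source[key] for key in COLUMNS])
    return 0

if __name__ == "__main__":
    sys.exit(main())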

@dblodgett-usgs dblodgett-usgs added this to the Spring 2022 milestone May 4, 2022
@EthanGrahn

If we take the route of JSON files representing each source, one option would be to have the crawler update the crawler source table at the start of its run rather than requiring a Liquibase execution first. Having the crawler manage its own table makes sense to me, and avoiding table data loading with Liquibase should help simplify database management long term (manage schema, not data).
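
A minimal sketch of that startup step, assuming per-source JSON files and a Postgres crawler_source table whose columns are snake_case versions of the JSON keys (both assumptions); the real crawler would do the equivalent in its own code, this just shows the idea with a subset of columns.

# Hypothetical startup step: load sources/*.json and upsert into crawler_source.
# Table and column names are assumptions for illustration only.
import json
import pathlib
import psycopg2

UPSERT = """
    INSERT INTO crawler_source (crawler_source_id, source_name, source_suffix, source_uri)
    VALUES (%(crawlerSourceId)s, %(sourceName)s, %(sourceSuffix)s, %(sourceUri)s)
    ON CONFLICT (crawler_source_id) DO UPDATE SET
        source_name = EXCLUDED.source_name,
        source_suffix = EXCLUDED.source_suffix,
        source_uri = EXCLUDED.source_uri
"""

def sync_sources(dsn: str) -> None:
    # Upsert every registered source before the crawl begins.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for path in pathlib.Path("sources").glob("*.json"):
            cur.execute(UPSERT, json.loads(path.read_text()))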

@dblodgett-usgs
Member Author

Good call -- I think this would be a big improvement. Let's keep an eye on it. I'd meant to fit this in last year and it didn't happen. It's for sure a priority.

@dblodgett-usgs dblodgett-usgs removed this from the Spring 2022 milestone Sep 30, 2022
@EthanGrahn

Example JSON file for a single source:

{
	"crawlerSourceId" : 5,
	"sourceName" : "NWIS Surface Water Sites",
	"sourceSuffix" : "nwissite",
	"sourceUri" : "https://www.sciencebase.gov/catalog/file/get/60c7b895d34e86b9389b2a6c?name=usgs_nldi_gages.geojson",
	"featureId" : "provider_id",
	"featureName" : "name",
	"featureUri" : "subjectOf",
	"featureReach" : "nhdpv2_REACHCODE",
	"featureMeasure" : "nhdpv2_REACH_measure",
	"ingestType" : "reach",
	"featureType" : "hydrolocation"
}

The key names map directly to the database columns, but we could simplify them as needed.
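
If the columns really are just snake_case versions of these keys (an assumption that may not hold for every column), a tiny helper could handle the mapping rather than hard-coding it:

import re

def key_to_column(key: str) -> str:
    """Convert a camelCase JSON key to an assumed snake_case column name."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", key).lower()

# e.g. key_to_column("crawlerSourceId") -> "crawler_source_id"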

@EthanGrahn

EthanGrahn commented Oct 28, 2022

Here's a first draft of a schema to validate the crawler source files. Let me know what you think and any suggestions for better descriptions/names, @dblodgett-usgs. It has very loose validation, mostly character limits that match the database table.

{
    "$schema": "http://json-schema.org/draft/2020-12/schema",
    "title": "Crawler Source",
    "description": "A source from which the Crawler can ingest features.",
    "type": "object",
    "properties": {
        "id": {
            "description": "The unique identifier for the source",
            "type": "integer",
            "minimum": 0,
            "maximum": 2147483647
        },
        "name": {
            "description": "A human readable name for the source",
            "type": "string",
            "pattern": "^[0-9a-zA-z_-]{1,500}$"
        },
        "suffix": {
            "description": "Unique suffix for database and service use",
            "type": "string",
            "pattern": "^[0-9a-zA-z_-]{1,1000}$"
        },
        "uri": {
            "description": "Source location to download GeoJSON features",
            "type": "string",
            "pattern": "^.{1,256}$"
        },
        "feature": {
            "description": "Metadata of the features",
            "type": {
                "$ref": "#/$defs/feature"
            }
        },
        "ingestType": {
            "description": "Method used to index feature",
            "type": "string",
            "pattern": "^(reach|point)$"
        }
    },
    "required": [
        "id",
        "name",
        "suffix",
        "uri",
        "feature",
        "ingestType"
    ],
    "$defs": {
        "feature": {
            "type": "object",
            "required": [
                "id",
                "type",
                "name",
                "uri"
            ],
            "properties": {
                "id": {
                    "type": "string",
                    "description": "Key name that maps to the ID of the feature",
                    "pattern": "^.{1,500}$"
                },
                "type": {
                    "type": "string",
                    "description": "Associated location type for this feature",
                    "pattern": "^(hydrolocation|type|varies)$"
                },
                "name": {
                    "type": "string",
                    "description": "Key name that maps to the name of the feature",
                    "pattern": "^.{1,500}$"
                },
                "uri": {
                    "type": "string",
                    "description": "Key name that maps to the URI of the feature",
                    "pattern": "^.{1,256}$"
                },
                "reach": {
                    "type": "string",
                    "description": "Key name that maps to the reachcode of the feature",
                    "pattern": "^.{1,500}$"
                },
                "measure": {
                    "type": "string",
                    "description": "Key name that maps to the measure of the feature",
                    "pattern": "^.{1,500}$"
                }
            }
        }
    }
}
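
For reference, a draft source file could be checked against this schema with the Python jsonschema package (version 4 or newer for draft 2020-12 support); the file names below are just placeholders.

import json
from jsonschema import Draft202012Validator

# Hypothetical file names for the schema and one source definition.
with open("crawler_source.schema.json") as f:
    schema = json.load(f)
with open("sources/nwissite.json") as f:
    source = json.load(f)

validator = Draft202012Validator(schema)
for error in sorted(validator.iter_errors(source), key=str):
    print(error.message)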

@EthanGrahn

Optionally, we could validate the source URI with a HEAD request to check if we get a 200 response.
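
A minimal sketch of that check using the requests library (the timeout is an arbitrary placeholder):

import requests

def uri_is_reachable(uri: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request to the source URI returns HTTP 200."""
    try:
        response = requests.head(uri, allow_redirects=True, timeout=timeout)
    except requests.RequestException:
        return False
    return response.status_code == 200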

@dblodgett-usgs
Member Author

I like that idea. I think we should go ahead and get this implemented.
