
Define & Standardize crawler source data workflow #385

Open
gzt5142 opened this issue Apr 24, 2023 · 0 comments
gzt5142 (Collaborator) commented Apr 24, 2023

The workflow by which crawler sources are defined and included in the ingest/harvest process is only loosely specified. The storage mechanism also needs a formal definition (see #281).

To-Do:

  • Define and document crawler data maintenance workflow
    • How are sources stored and structured? (JSON, CSV, SQL, etc.)
    • How are new sources included?
    • How is an existing data source modified?
    • What is the update cycle?
  • Create standardized Python functions/objects/methods for use across the whole project.
    • pydantic and/or ORM models to define the schema and validation rules (a minimal sketch follows this list).
    • Validation of sources -- define the requirements a source must meet to be ingestible.
  • Adapt the crawler to populate its source table from an arbitrary store (see the Repository Pattern discussion in gzt5142/nldi-crawler-py#33); code to the interface rather than the implementation. A sketch of such an interface also follows this list.
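
To seed the schema discussion, here is a minimal pydantic sketch (pydantic v2 syntax). The field names are assumptions about the kinds of attributes a crawler source needs; they are illustrative only, not a proposed final schema:

```python
from typing import Optional

from pydantic import BaseModel, HttpUrl, field_validator


class CrawlerSource(BaseModel):
    """Illustrative schema for a single crawler source record.

    Field names here are assumptions, not the final schema.
    """

    crawler_source_id: int
    source_name: str
    source_suffix: str
    source_uri: HttpUrl              # endpoint the crawler harvests from
    feature_id: str                  # attribute holding each feature's unique identifier
    feature_name: str
    ingest_type: str = "point"       # e.g. "point" or "reach"
    feature_measure: Optional[str] = None

    @field_validator("source_suffix")
    @classmethod
    def _suffix_lowercase(cls, value: str) -> str:
        # Example validation rule: suffixes are stored as lower-case identifiers.
        return value.lower()
```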
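
For the repository-pattern item, one possible shape for the interface, reusing the CrawlerSource model above. The names (CrawlerSourceRepository, CSVSourceRepository, list_sources, get_source) are hypothetical, not taken from nldi-crawler-py:

```python
import csv
from typing import Iterable, Protocol


class CrawlerSourceRepository(Protocol):
    """Abstract access to crawler-source records.

    The crawler codes against this interface; the backing store
    (CSV, JSON, SQL table, ...) is an implementation detail.
    """

    def list_sources(self) -> Iterable[CrawlerSource]: ...

    def get_source(self, source_id: int) -> CrawlerSource: ...


class CSVSourceRepository:
    """Concrete example: sources maintained in a CSV file."""

    def __init__(self, path: str):
        self._path = path

    def list_sources(self) -> Iterable[CrawlerSource]:
        with open(self._path, newline="", encoding="utf-8") as fh:
            for row in csv.DictReader(fh):
                # pydantic validates and coerces each row as it is loaded
                yield CrawlerSource(**row)

    def get_source(self, source_id: int) -> CrawlerSource:
        for src in self.list_sources():
            if src.crawler_source_id == source_id:
                return src
        raise KeyError(f"no crawler source with id {source_id}")
```

Swapping in a SQL-backed repository later would then require no changes to the ingest code itself.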
gzt5142 self-assigned this Apr 24, 2023