
Define & Standardize crawler source data workflow #385

Open
gzt5142 opened this issue Apr 24, 2023 · 0 comments
gzt5142 (Collaborator) commented Apr 24, 2023

The workflow by which crawler sources are defined and included in the ingest/harvest process is only loosely specified. The storage mechanism also needs a formal definition (see #281).

To-Do:

  • Define and document crawler data maintenance workflow
    • How are sources stored and structured? (JSON, CSV, SQL, etc.)
    • How are new sources included?
    • How is an existing data source modified?
    • What is the update cycle?
  • Create standardized Python functions/objects/methods for use across the whole project.
    • pydantic and/or ORM models to define the schema and validation rules (a minimal sketch follows this list).
    • Validation of sources -- define the requirements a source must meet to be ingestible.
  • Adapt the crawler to populate its source table from an arbitrary store (see the Repository Pattern discussion in gzt5142/nldi-crawler-py#33); code to the interface rather than the implementation. A sketch of such an interface also follows this list.
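
To seed the schema discussion, here is a minimal pydantic sketch (pydantic v2 syntax). The field names are assumptions about the kinds of attributes a crawler source needs; they are illustrative only, not a proposed final schema:

```python
from typing import Optional

from pydantic import BaseModel, HttpUrl, field_validator


class CrawlerSource(BaseModel):
    """Illustrative schema for a single crawler source record.

    Field names here are assumptions, not the final schema.
    """

    crawler_source_id: int
    source_name: str
    source_suffix: str
    source_uri: HttpUrl              # endpoint the crawler harvests from
    feature_id: str                  # attribute holding each feature's unique identifier
    feature_name: str
    ingest_type: str = "point"       # e.g. "point" or "reach"
    feature_measure: Optional[str] = None

    @field_validator("source_suffix")
    @classmethod
    def _suffix_lowercase(cls, value: str) -> str:
        # Example validation rule: suffixes are stored as lower-case identifiers.
        return value.lower()
```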
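
For the repository-pattern item, one possible shape for the interface, reusing the CrawlerSource model above. The names (CrawlerSourceRepository, CSVSourceRepository, list_sources, get_source) are hypothetical, not taken from nldi-crawler-py:

```python
import csv
from typing import Iterable, Protocol


class CrawlerSourceRepository(Protocol):
    """Abstract access to crawler-source records.

    The crawler codes against this interface; the backing store
    (CSV, JSON, SQL table, ...) is an implementation detail.
    """

    def list_sources(self) -> Iterable[CrawlerSource]: ...

    def get_source(self, source_id: int) -> CrawlerSource: ...


class CSVSourceRepository:
    """Concrete example: sources maintained in a CSV file."""

    def __init__(self, path: str):
        self._path = path

    def list_sources(self) -> Iterable[CrawlerSource]:
        with open(self._path, newline="", encoding="utf-8") as fh:
            for row in csv.DictReader(fh):
                # pydantic validates and coerces each row as it is loaded
                yield CrawlerSource(**row)

    def get_source(self, source_id: int) -> CrawlerSource:
        for src in self.list_sources():
            if src.crawler_source_id == source_id:
                return src
        raise KeyError(f"no crawler source with id {source_id}")
```

Swapping in a SQL-backed repository later would then require no changes to the ingest code itself.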
gzt5142 self-assigned this Apr 24, 2023