This repository has been archived by the owner on Sep 20, 2024. It is now read-only.

Ed Scraping: Overall Scraping Roadmap (WIP) #1

Open
estebanruseler opened this issue Feb 19, 2020 · 0 comments

estebanruseler commented Feb 19, 2020

A four-phase roadmap

There are four phases to the scraping approach:

  • 1st phase - One-off catalogue population
  • 2nd phase - Ongoing catalogue population & technical debugging of crashes
  • 3rd phase - Ed can CRUD harvesters/scrapers & non-technical debugging of crashes
  • 4th phase - Move away from scraping and towards structured data-pipeline based on a data strategy

1st phase - One-off catalogue population

Scrape the Dept Ed websites and use data wrangling to populate the ODP with metadata (and possibly data). This will start with a 2-week sprint to validate that this approach is likely to meet the coverage and metadata-quality objectives.

The scraping / data wrangling output will be a data.json that will be ingested by a legacy Harvester.
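The issue doesn't specify which data.json schema the legacy Harvester expects; assuming the common Project Open Data v1.1 catalogue format (the usual shape for CKAN-style harvesters), the scraping output might look like this minimal sketch, with placeholder values throughout:

```python
import json

# Minimal sketch of a data.json catalogue the harvester would ingest,
# assuming the Project Open Data v1.1 schema. All values are placeholders.
catalog = {
    "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
    "@type": "dcat:Catalog",
    "dataset": [
        {
            "@type": "dcat:Dataset",
            "identifier": "dept-ed-example-001",  # hypothetical identifier
            "title": "Example scraped dataset",
            "description": "Metadata assembled by the scraping/wrangling step.",
            "accessLevel": "public",
            "keyword": ["education"],
            "modified": "2020-02-19",
            "publisher": {"@type": "org:Organization", "name": "Dept Ed"},
            "contactPoint": {
                "@type": "vcard:Contact",
                "fn": "Open Data Team",
                "hasEmail": "mailto:opendata@example.gov",
            },
        }
    ],
}

# Write the file the Harvester would pick up.
with open("data.json", "w") as f:
    json.dump(catalog, f, indent=2)
```

Each scraped page would contribute one entry to the `dataset` array; the wrangling step's main job is mapping scraped fields onto this fixed vocabulary.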

This will be a one-off process to support the launch of the catalog with as much coverage as possible.

It will start with a 3-week test (1 week to prepare for the sprint, then the 2-week sprint itself). At the end of this we will decide whether the scraping approach is a viable option to populate the catalog.

2nd phase - Ongoing catalogue population & technical debugging of crashes

To build on the above so that:

  • The pipeline can be run at intervals, or on demand by a developer
  • Before loading data into the portal, it checks for a diff against the current catalogue
  • The scraping pipeline works alongside the "proper" data.json pipeline (no duplication, and the Harvester favors a published data.json file over web scraping)
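The diff check in the second bullet could be as simple as fingerprinting each dataset entry in the new data.json and comparing against the previously loaded catalogue. A sketch, assuming both sides are available as parsed data.json catalogues (the function names here are illustrative, not from any existing codebase):

```python
import hashlib
import json


def dataset_fingerprints(catalog):
    """Map each dataset's identifier to a hash of its full metadata."""
    return {
        ds["identifier"]: hashlib.sha256(
            json.dumps(ds, sort_keys=True).encode()
        ).hexdigest()
        for ds in catalog.get("dataset", [])
    }


def diff_catalogs(old_catalog, new_catalog):
    """Return dataset identifiers that were added, removed, or changed."""
    old = dataset_fingerprints(old_catalog)
    new = dataset_fingerprints(new_catalog)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

Only the `added` and `changed` entries would need to be pushed to the portal on each run, which keeps repeated loads cheap and avoids churning unchanged records.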

This will probably involve using the NG Harvester pipeline, i.e. the back-end NG Harvester infrastructure but none of the front-end customization.

3rd phase - CRUD harvesters/scrapers

The Dept Ed can view, create, update and delete scraped data pipelines through a WUI, and can see logs of anything that has gone wrong.

4th phase - Moving away from scraping

Due to the inherent problems with scraping, the 4th phase is to move away from scraping toward a structured data pipeline based on a department-wide data strategy.
