Download and inspect Ofsted reports for keywords. This code will:
- Download a list of schools (`scrape_search_pages`)
- Download a list of reports associated with those schools (`scrape_school_pages`)
- Download a subset of those reports (`download_report_pdfs`)
- Convert .pdf reports to .txt (`convert_pdfs`)
- Parse .txt for keywords using regular expressions (`scan_reports`)
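The keyword-parsing step can be illustrated with a minimal sketch. This is not the repository's `scan_reports` implementation; `count_keywords` is a hypothetical helper, and the sample text and keywords are made up for illustration.

```ruby
# Hypothetical helper (not from this repository): count case-insensitive,
# whole-word matches of each keyword in a report's text.
def count_keywords(text, keywords)
  keywords.each_with_object({}) do |keyword, counts|
    # Regexp.escape makes the keyword match literally; \b anchors it to
    # word boundaries so "art" does not match inside "start".
    pattern = /\b#{Regexp.escape(keyword)}\b/i
    counts[keyword] = text.scan(pattern).length
  end
end

# Made-up sample text standing in for a converted .txt report.
sample = "Pupils enjoy science. The Science curriculum includes coding."
p count_keywords(sample, ["science", "coding", "music"])
```

In practice the text would presumably be read from each converted .txt file, with the counts collected per report.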
git clone https://github.com/jdkram/ofsted-report-scraper
cd ofsted-report-scraper
gem install bundler
bundle install
- Modify `task.rb` to specify school types, report types, etc.
- Run with `ruby task.rb` (or `caffeinate ruby task.rb` to keep the machine awake for long downloads).
Please note that `scrape_search_pages` and `scrape_school_pages` don't currently handle being interrupted well, as they don't record their progress.
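One way progress could be recorded, sketched here as an assumption and not as something the scraper does today, is to append each completed item to a log file and skip logged items on the next run. `each_unprocessed`, the progress file name, and the school IDs below are all hypothetical.

```ruby
require "set"
require "tmpdir"

# Hypothetical helper (not part of this repository): yield only the items
# that are not already listed in progress_path, and log each item once its
# block has finished.
def each_unprocessed(items, progress_path)
  done = File.exist?(progress_path) ? File.readlines(progress_path, chomp: true).to_set : Set.new
  File.open(progress_path, "a") do |log|
    items.each do |item|
      next if done.include?(item)
      yield item
      log.puts(item)
      log.flush # write immediately so an interruption loses at most one item
    end
  end
end

# Placeholder IDs and a temporary progress file, for illustration only.
progress_file = File.join(Dir.tmpdir, "scrape_progress_#{Process.pid}.txt")
each_unprocessed(%w[100001 100002 100003], progress_file) { |id| puts "would scrape #{id}" }
File.delete(progress_file) # clean up the demo file
```

Because an item is logged only after its block returns, a run interrupted mid-item would repeat that item on restart instead of skipping it.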
`scrape_search_pages` and `scrape_school_pages` both `sleep rand(0.1..0.6)` (a random time between 0.1 and 0.6 seconds) between calls to ease the request rate on the Ofsted site. `download_report_pdfs` sleeps for a slightly longer 1-2 seconds, for no particular reason other than that this step tends to make a large number of consecutive requests.
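The randomised pause described above can be sketched as follows. `each_with_delay` is a hypothetical wrapper and the page names are placeholders; only the `sleep rand(0.1..0.6)` call itself comes from this project.

```ruby
# Hypothetical wrapper (not from this repository): yield each item, sleeping
# for a random duration drawn from delay_range between consecutive calls.
def each_with_delay(items, delay_range)
  items.each_with_index do |item, index|
    # No pause before the first request; rand with a Float range returns
    # a Float inside that range.
    sleep rand(delay_range) unless index.zero?
    yield item
  end
end

# Placeholder page names; a real run would issue an HTTP request here.
pages = ["page-1", "page-2", "page-3"]
each_with_delay(pages, 0.1..0.6) { |page| puts "would fetch #{page}" }
```

Passing `1.0..2.0` instead would give the longer 1-2 second pause used for PDF downloads.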