Node.js app

This folder contains

Further documentation is available in the code.

Architecture

The HTML dataset is expected to be downloaded and available locally.
A page from the dataset is loaded into the Chromium browser using Puppeteer.
The browser then starts downloading assets that are not available locally, such as images, CSS and JavaScript files. These requests are intercepted and, if necessary, replaced with links to the Wayback Machine. Responses are stored offline so they don't need to be requested again later.
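
The interception step could look roughly like the sketch below. It is not the code in this folder: the cache folder name, the snapshot timestamp and all function names (`toWaybackUrl`, `storeAsset`, `planRequest`, `interceptAssets`) are assumptions made for illustration. For simplicity it redirects every uncached HTTP(S) request to the Wayback Machine, and it only shows a `storeAsset` helper rather than the wiring that would capture responses.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// Hypothetical defaults: cache folder and snapshot timestamp (YYYYMMDDhhmmss).
const CACHE_DIR = 'asset-cache';
const SNAPSHOT = '20200101000000';

// Build a Wayback Machine URL. The "id_" suffix asks for the archived
// asset without the archive's own page rewriting.
function toWaybackUrl(url, timestamp = SNAPSHOT) {
  return `https://web.archive.org/web/${timestamp}id_/${url}`;
}

// Map a URL to a stable file name inside the cache folder.
function cachePathFor(url, cacheDir = CACHE_DIR) {
  const digest = crypto.createHash('sha256').update(url).digest('hex');
  return path.join(cacheDir, `${digest}.json`);
}

// Store a response body (string or Buffer) together with its content type.
function storeAsset(url, body, contentType, cacheDir = CACHE_DIR) {
  fs.mkdirSync(cacheDir, { recursive: true });
  const record = { url, contentType, body: Buffer.from(body).toString('base64') };
  fs.writeFileSync(cachePathFor(url, cacheDir), JSON.stringify(record));
}

// Return the cached asset for a URL, or null if it was never stored.
function loadAsset(url, cacheDir = CACHE_DIR) {
  const file = cachePathFor(url, cacheDir);
  if (!fs.existsSync(file)) return null;
  const record = JSON.parse(fs.readFileSync(file, 'utf8'));
  return { contentType: record.contentType, body: Buffer.from(record.body, 'base64') };
}

// Decide what to do with one request: serve from cache, redirect to the
// Wayback Machine, or let it through untouched (file:, data:, archive URLs).
function planRequest(url, cacheDir = CACHE_DIR) {
  if (!/^https?:/.test(url)) return { action: 'continue' };
  const cached = loadAsset(url, cacheDir);
  if (cached) return { action: 'respond', ...cached };
  if (url.startsWith('https://web.archive.org/')) return { action: 'continue' };
  return { action: 'redirect', url: toWaybackUrl(url) };
}

// Apply the plan to every request made by a Puppeteer page.
async function interceptAssets(page, cacheDir = CACHE_DIR) {
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    const plan = planRequest(request.url(), cacheDir);
    if (plan.action === 'respond') {
      return request.respond({ status: 200, contentType: plan.contentType, body: plan.body });
    }
    if (plan.action === 'redirect') return request.continue({ url: plan.url });
    return request.continue();
  });
}
```

Keeping the decision in `planRequest`, separate from the Puppeteer wiring, means the cache and rewrite logic can be exercised without launching a browser.
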
Visual attributes are computed for each element in the page and saved alongside the page in a JSON file, which is later loaded by the Python machine learning code.
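
As a rough illustration of the attribute extraction step, the sketch below collects a few computed properties per element and writes them to JSON. The chosen attributes, the output layout and the function names (`collectVisualAttributes`, `saveVisualAttributes`) are assumptions made for illustration, not necessarily what the code in this folder records.

```javascript
const fs = require('fs');

// Runs inside the page via page.evaluate, so it must be self-contained
// and may only use browser globals such as document and getComputedStyle.
function collectVisualAttributes() {
  return Array.from(document.querySelectorAll('*')).map((el, index) => {
    const style = getComputedStyle(el);
    const rect = el.getBoundingClientRect();
    return {
      index,
      tag: el.tagName.toLowerCase(),
      // Position and size of the element's bounding box, in CSS pixels.
      x: rect.x,
      y: rect.y,
      width: rect.width,
      height: rect.height,
      // A hypothetical selection of computed style properties.
      fontSize: style.fontSize,
      fontWeight: style.fontWeight,
      color: style.color,
      backgroundColor: style.backgroundColor,
    };
  });
}

// Evaluate the collector in a Puppeteer page and write the result to JSON.
// Returns the number of elements written.
async function saveVisualAttributes(page, jsonPath) {
  const elements = await page.evaluate(collectVisualAttributes);
  fs.writeFileSync(jsonPath, JSON.stringify(elements, null, 2));
  return elements.length;
}
```

A call such as `await saveVisualAttributes(page, 'page-0001.json')` (the file name is hypothetical) would produce one JSON array per page, which the Python side could read with `json.load`.
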