Visual extractor

Visual extractor is a Node.js app with a command-line interface.

Prerequisites

  1. The following commands assume the current working directory is js.

    cd js
  2. Install Node.js packages.

    pnpm install
  3. To get started with the extraction, execute:

    pnpm start -- --help

Development

To debug with parameters, open a JavaScript Debug Terminal in Visual Studio Code and start your command with:

cd js
node -r ts-node/register/transpile-only index.ts

To type-check the code, run pnpm test.

More documentation is available in the code. For an overview, see js/README.md.

Cookbook

The following commands extract visuals from specific HTML datasets.

SWDE dataset

Extracting visuals from one website of the SWDE dataset:

pnpm start -- -g 'camera/camera-amazon*/????.htm' -T=1000 -t=500 -j=8 -S

Validating it:

(cd .. && python -m awe.data.validate --visuals -v camera --save-list=data/invalid_pages.txt amazon)

And re-scraping invalid pages:

pnpm start -- -d ../ --files=../data/invalid_pages.txt -T=1000 -j=8 -S

And re-validating them:

(cd .. && python -m awe.data.validate --visuals -v camera --read-list=data/invalid_pages.txt --save-back amazon)
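The four SWDE steps above could be chained into one script. The following is only a sketch, assuming it is started from the js directory; `DRY_RUN`, `run` and `swde_round_trip` are hypothetical names introduced for illustration and are not part of the extractor. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: the four SWDE steps chained together, run from the js directory.
# DRY_RUN, run and swde_round_trip are hypothetical helpers for illustration.
# By default the commands are only printed; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

swde_round_trip() {
  # 1. Extract visuals from one website.
  run pnpm start -- -g 'camera/camera-amazon*/????.htm' -T=1000 -t=500 -j=8 -S
  # 2. Validate and save the list of invalid pages.
  (cd .. && run python -m awe.data.validate --visuals -v camera --save-list=data/invalid_pages.txt amazon)
  # 3. Re-scrape only the pages from that list.
  run pnpm start -- -d ../ --files=../data/invalid_pages.txt -T=1000 -j=8 -S
  # 4. Re-validate the pages from the list.
  (cd .. && run python -m awe.data.validate --visuals -v camera --read-list=data/invalid_pages.txt --save-back amazon)
}

swde_round_trip
```

In dry-run mode the printed lines lose their quoting, so they are meant for review rather than for copy-pasting.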

Apify dataset

Extracting visuals from the Apify dataset (note that some pages require JavaScript to be enabled, hence the -S option):

pnpm start -- -d ../data/apify/alzaEn -g 'pages/localized_html_*.htm' -o -T=1000 -j=8 -SH
pnpm start -- -d ../data/apify/asosEn -g 'pages/localized_html_*.htm' -o -T=1000 -j=8 -SH
pnpm start -- -d ../data/apify/bestbuyEn -g 'pages/localized_html_*.htm' -o -T=1000 -j=8 -SH
pnpm start -- -d ../data/apify/bloomingdalesEn -g 'pages/localized_html_*.htm' -o -T=1000 -j=8 -SH
pnpm start -- -d ../data/apify/conradEn -g 'pages/localized_html_*.htm' -o -T=1000 -j=8 -SH
pnpm start -- -d ../data/apify/etsyEn -g 'pages/localized_html_*.htm' -o -T=1000 -j=8 -SH
pnpm start -- -d ../data/apify/ikeaEn -g 'pages/localized_html_*.htm' -o -T=1000 -j=8 -SH
pnpm start -- -d ../data/apify/notinoEn -g 'pages/localized_html_*.htm' -o -T=1000 -j=8 -Z
pnpm start -- -d ../data/apify/radioshackEn -g 'pages/localized_html_*.htm' -o -T=1000 -j=8 -SH
pnpm start -- -d ../data/apify/tescoEn -g 'pages/localized_html_*.htm' -o -T=1000 -j=8 -SH

And validating them:

(cd .. && python -m awe.data.validate --visuals --max-errors=1 [<website_name>])

To save a list of invalid pages, replace --max-errors=1 with --save-list=data/invalid_pages.txt -q and re-scrape using -d ../ --files=../data/invalid_pages.txt instead of the -d and -g arguments, e.g.:

pnpm start -- -d ../ --files=../data/invalid_pages.txt -o -T=1000 -j=8 -SH
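The per-website Apify extraction commands listed earlier differ only in the dataset directory and the trailing flags, so they could be generated by a loop. This is a minimal sketch; `site_flags` and `extract_cmd` are hypothetical helpers introduced for illustration, and the flags are copied from the commands above (notinoEn uses -Z, the others use -SH).

```shell
#!/bin/sh
# Dry-run sketch: prints one extractor command per Apify website.
# site_flags and extract_cmd are hypothetical helpers, not part of the extractor.

sites="alzaEn asosEn bestbuyEn bloomingdalesEn conradEn etsyEn ikeaEn notinoEn radioshackEn tescoEn"

# Trailing flags per website, as used in the commands listed earlier.
site_flags() {
  case "$1" in
    notinoEn) echo "-Z" ;;
    *) echo "-SH" ;;
  esac
}

# Print the command for one website without running it.
extract_cmd() {
  echo "pnpm start -- -d ../data/apify/$1 -g 'pages/localized_html_*.htm' -o -T=1000 -j=8 $(site_flags "$1")"
}

for site in $sites; do
  extract_cmd "$site"
done
```

Because the script only prints the commands, its output can be reviewed first and then piped to sh from the js directory to run them.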

To add a new website to the above list (and determine which parameters are needed), try extracting a few pages with screenshots:

pnpm start -- -d ../data/apify/alzaEn -g 'pages/localized_html_*.htm' -o -T=1000 -SH -t=1 -m=2

And validate them (including manually running awe/data/set/explore.ipynb):

(cd .. && python -m awe.data.validate --visuals --skip-without-visuals --max-errors=1 alzaEn)

To take 3 screenshots of a website:

pnpm start -- -d ../data/apify/alzaEn -g 'pages/localized_html_*.htm' -oRH -t=1 -T=1000 -m=3 -S

To blend JSON and HTML into XML (not used yet):

pnpm start -- -d ../data/apify/alzaEn -g 'pages/localized_html_*.htm' -B