The data loading code assumes that the SWDE dataset is in folder data/swde and the Apify dataset is in folder data/apify.
Only the former is currently open-source.
Upon first use (e.g., as part of training),
the datasets will be preprocessed into SQLite files.
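The "preprocess on first use" behavior can be pictured with the following minimal sketch. It is not the project's actual implementation: the function name `open_preprocessed`, the `pages` table, and its columns are hypothetical and chosen only to illustrate the idea of building an SQLite file once and reusing it afterwards.

```python
import sqlite3
from pathlib import Path


def open_preprocessed(db_path, html_root):
    """Open an SQLite cache, building it from HTML files on first use.

    Hypothetical helper: the real preprocessing and schema may differ.
    """
    db_path = Path(db_path)
    html_root = Path(html_root)
    first_use = not db_path.exists()
    conn = sqlite3.connect(db_path)
    if first_use:
        # Assumed schema: one row per page, keyed by its relative path.
        conn.execute("CREATE TABLE pages (path TEXT PRIMARY KEY, html TEXT)")
        rows = (
            (
                page.relative_to(html_root).as_posix(),
                page.read_text(encoding="utf-8", errors="replace"),
            )
            # Layout assumed from this README: <vertical>/<website>/<page>.htm
            for page in sorted(html_root.glob("*/*/*.htm"))
        )
        conn.executemany("INSERT INTO pages VALUES (?, ?)", rows)
        conn.commit()
    return conn
```

In this sketch, later calls skip the build step because the database file already exists, so deleting the file is what would trigger a rebuild.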
To get the original SWDE dataset, follow these steps.
- Download swde.zip from the Internet Archive:

      wget https://web.archive.org/web/20210630013015id_/https://codeplexarchive.blob.core.windows.net/archive/projects/swde/swde.zip
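- Extract HTML files into data/swde/data/<vertical>/<website>/<page>.htm:

      mkdir -p data/swde/data
      # Unpack the outer archive, then the inner one that holds the per-vertical .7z files.
      unzip swde.zip -d data/swde
      rm swde.zip
      unzip data/swde/sourceCode/sourceCode.zip -d data/swde/data
      rm data/swde/sourceCode/sourceCode.zip
      # Extract each vertical into a directory named after its archive.
      for f in data/swde/data/*.7z
      do
        d=${f%.7z}
        # The 7-Zip -o switch takes the output directory with no space after it.
        7zz x "$f" -o"$d"
        rm "$f"
      done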
- Extract visuals using the visual extractor. Alternatively, download pre-extracted visuals along with HTML files:

      git clone https://github.com/jjonescz/swde-visual data/swde
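After following the steps above, a quick sanity check of the resulting layout can help catch a failed extraction. The sketch below is not part of the project; the function name `count_swde_pages` is hypothetical, and it only assumes the data/swde/data/<vertical>/<website>/<page>.htm layout described above.

```python
from collections import Counter
from pathlib import Path


def count_swde_pages(root="data/swde/data"):
    """Count .htm pages per vertical under <root>/<vertical>/<website>/."""
    counts = Counter()
    # The glob pattern mirrors the documented layout: <vertical>/<website>/<page>.htm
    for page in Path(root).glob("*/*/*.htm"):
        vertical = page.parent.parent.name
        counts[vertical] += 1
    return dict(counts)
```

Calling `count_swde_pages()` from the repository root returns a dictionary mapping each vertical to its number of pages. An empty dictionary suggests that the archives were not extracted into the expected folders.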
Pre-trained GloVe embeddings are downloaded automatically before first use. To pre-download them, execute:

    python -m awe.data.glove