Content Extraction

Requirements

The scripts require yarn (for the retriever) and pipenv (for the extractor).

Retriever

cd ./retriever

# Install dependencies
yarn install

# It is assumed that data.json is at the project root
node index.js '../data.json' '/tmp/extraction'

The file data.json is a list of entries, each with id, webId and url fields.
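Before running the retriever, it can help to confirm that data.json has the expected shape. The following is a minimal sketch, not part of the project: the helper names `find_invalid_entries` and `check_data_file` are hypothetical, the sample values are made up, and it assumes only that the file is a top-level JSON list whose entries carry the three fields named above. It makes no assumption about the field types.

```python
import json

# Field names taken from the README; everything else here is illustrative.
REQUIRED_FIELDS = ("id", "webId", "url")


def find_invalid_entries(entries):
    """Return (index, missing_fields) for each entry lacking a required field."""
    problems = []
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            # A non-object entry is treated as missing every field.
            problems.append((index, list(REQUIRED_FIELDS)))
            continue
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            problems.append((index, missing))
    return problems


def check_data_file(path):
    """Load a JSON file and report entries that lack required fields."""
    with open(path, encoding="utf-8") as handle:
        entries = json.load(handle)
    if not isinstance(entries, list):
        raise ValueError("expected a top-level JSON list")
    return find_invalid_entries(entries)


# Hypothetical sample data; the second entry has no webId.
sample = [
    {"id": 1, "webId": "a1", "url": "https://example.com/"},
    {"id": 2, "url": "https://example.com/page"},
]
print(find_invalid_entries(sample))  # [(1, ['webId'])]
```

To check the real file, call `check_data_file('../data.json')` from inside the retriever or extractor directory; an empty list means every entry has all three fields.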

Extractor

cd ./extractor
# Install dependencies
pipenv install
# If you are not inside pipenv shell, run `pipenv shell`
python main.py '../data.json' '/tmp/extraction' '/tmp/output.json'

The final output will be saved in /tmp/output.json.