A set of services to spider and extract metadata (abstracts, titles, authors, PDF links) from given URLs.

Services may be run individually as command-line applications, or as services that run on a schedule (using PM2).
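For scheduled runs, a PM2 process file can restart a service on a cron schedule. The sketch below is illustrative only, not this project's actual configuration: the service name, script path, and schedule are all assumptions.

```js
// ecosystem.config.js — a minimal sketch of a scheduled PM2 service.
// "spider-service" and the script path are hypothetical placeholders.
module.exports = {
  apps: [
    {
      name: 'spider-service',
      script: 'dist/spider-service.js',
      autorestart: false,          // do not respawn on normal exit
      cron_restart: '0 * * * *',   // PM2 restarts the process hourly (assumed schedule)
    },
  ],
};
```

Such a file is started with `pm2 start ecosystem.config.js`.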
This project provides a set of services that take the URL of a webpage for a research paper and extract metadata for that paper. Metadata includes the abstract, title, authors, and a URL for a PDF version of the paper.

Spidering is done with an automated Chrome browser. Once a URL is loaded into the browser, a series of extraction rules is applied. The extractor checks for commonly used metadata schemas, including Highwire Press, Dublin Core, and OpenGraph, along with non-standard variations of the most common schemas and a growing list of journal- and domain-specific methods of embedding metadata in the head or body of a webpage. Once a suitable schema is identified, the metadata fields are saved to the local file system and returned to the caller. If changes to the spider and/or extractor cause re-running the system to produce different results, that fact is returned along with the results.
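As a rough illustration of the schema-checking step (not the project's actual code), the sketch below loads a page with puppeteer and probes Highwire Press (`citation_*`), Dublin Core (`DC.*`), and OpenGraph (`og:*`) meta tags in priority order; the function name and returned field names are assumptions.

```js
// Minimal sketch: extract paper metadata from a page's <meta> tags.
const puppeteer = require('puppeteer');

async function extractMetadata(url) {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });

    return await page.evaluate(() => {
      // All content values matching a meta-tag selector.
      const meta = (selector) =>
        Array.from(document.querySelectorAll(selector))
          .map((el) => el.content)
          .filter(Boolean);

      // First non-empty value across schemas, in priority order.
      const first = (...selectors) => selectors.flatMap((s) => meta(s))[0];

      return {
        title: first(
          'meta[name="citation_title"]',   // Highwire Press
          'meta[name="DC.Title"]',         // Dublin Core
          'meta[property="og:title"]',     // OpenGraph
        ),
        abstract: first(
          'meta[name="citation_abstract"]',
          'meta[name="DC.Description"]',
          'meta[property="og:description"]',
        ),
        authors: meta('meta[name="citation_author"]'),
        pdfUrl: first('meta[name="citation_pdf_url"]'),
      };
    });
  } finally {
    await browser.close();
  }
}
```

A real extractor would additionally apply the non-standard and domain-specific rules described above, and persist the extracted fields before returning them.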
- node >= v16.15
- rush
  ```bash
  npm install -g @microsoft/rush
  ```
- pm2
  ```bash
  npm install -g pm2
  ```
- docsify-cli
  ```bash
  npm install -g docsify-cli
  ```
- Chrome dependencies (runs headless via puppeteer)
  ```bash
  sudo apt install libnss3-dev libatk1.0-0 libatk-bridge2.0-0 libcups2 libgbm1 libpangocairo-1.0-0 libgtk-3-0
  ```
- HTML Tidy
  ```bash
  sudo apt install tidy
  ```
- dotenv (optional, helps with development)
- MongoDB, tested against v5.0
Run `rush` commands from the project root:

- Initial installation
  ```bash
  rush install
  ```
- Update dependencies after a version bump in package.json
  ```bash
  rush update
  ```
- Build / full rebuild
  ```bash
  rush build
  rush build:watch
  rush rebuild
  ```
- Run tests
  ```bash
  rush test
  ```