Gathering

The gathering module is the main module of the system and enables communication between all other parts. It consists of a scraper, a document store, a database and the central backend. Its code can be found in the gathering directory.

The scraper crawls World Health Organization and arXiv web pages that contain documents, downloads those documents and ingests them into the database. When the system operator starts the gathering module, it looks for new publications and adds them to the database. These publications are then downloaded asynchronously in the background. The downloaded files are PDFs, so each one is split into pages and every page is converted to the PNG format.
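
The two-phase flow described above (register new publications first, download them later) can be illustrated with a minimal sketch. It is not the module's actual code: the `publication` table, its columns, the function names and the injected `fetch` callable are all hypothetical, and a thread pool stands in for whatever background mechanism the real scraper uses.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical publication table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE publication ("
        "url TEXT PRIMARY KEY, status TEXT NOT NULL, size INTEGER)"
    )
    return db


def register_new(db: sqlite3.Connection, urls: Iterable[str]) -> List[str]:
    """Phase 1: record publications not seen before as 'pending'."""
    added = []
    for url in urls:
        cur = db.execute(
            "INSERT OR IGNORE INTO publication (url, status) VALUES (?, 'pending')",
            (url,),
        )
        if cur.rowcount == 1:  # 0 means the URL was already known
            added.append(url)
    db.commit()
    return added


def download_pending(
    db: sqlite3.Connection, fetch: Callable[[str], bytes], workers: int = 4
) -> int:
    """Phase 2: fetch every pending publication concurrently."""
    pending = [
        row[0]
        for row in db.execute("SELECT url FROM publication WHERE status = 'pending'")
    ]
    # Only the fetches run in worker threads; the database is touched
    # from the calling thread alone.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        payloads = list(pool.map(fetch, pending))
    for url, data in zip(pending, payloads):
        db.execute(
            "UPDATE publication SET status = 'downloaded', size = ? WHERE url = ?",
            (len(data), url),
        )
    db.commit()
    return len(pending)
```

Keeping registration separate from downloading means a restart only needs to re-run the second phase for rows still marked as pending. Splitting the PDFs into PNG pages would follow as a further step and is left out of this sketch.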

Documents are stored in the document store on a hard drive. An nginx web server exposes the document store and serves the files to other modules.

The database stores metadata about publications and pages, OCR data, annotations made by users, experiments and all other data used by the various parts of the system.

The central backend serves as an interface between the database and other modules. It exposes all its functionality through a web API and is also responsible for user authentication and authorization. The central backend includes the administration module, which lets the system operator modify the system's configuration and manage users.

Current deployment

The gathering module is currently deployed on the MiNI cluster. Its services are available at the following addresses:

http://pz-gathering.mini.pw.edu.pl/admin/ – administration panel
http://pz-gathering.mini.pw.edu.pl/api/ – API
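
Other modules reach the central backend through the API address above. As a minimal sketch, a client might build its request URLs as follows. Only the API root comes from this page; the `publications/` resource and the `page` parameter are hypothetical placeholders, and authentication is omitted because its scheme is not described here.

```python
from urllib.parse import urlencode, urljoin

# API root taken from the deployment list above.
API_ROOT = "http://pz-gathering.mini.pw.edu.pl/api/"


def api_url(resource: str, **params: str) -> str:
    """Build a URL under the API root for a (hypothetical) resource path."""
    # Strip a leading slash so urljoin appends to /api/ instead of replacing it.
    url = urljoin(API_ROOT, resource.lstrip("/"))
    if params:
        url += "?" + urlencode(params)
    return url


# Hypothetical resource name and query parameter, for illustration only.
example = api_url("publications/", page="2")
```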
