These tools can be used only internally within the SBB in order to:
- perform NER+EL on the digitalized collection of the SBB given a running NER+EL system
- augment the ALTO-XML files of the digitalized collections with NER+EL information
- extract BERT pre-training data from the ALTO-XML files of the digital collections of the SBB.
See Makefile for entire processing chain.
Clone this project and the SBB-utils.
Setup virtual environment:
virtualenv --python=python3.6 venv
Activate virtual environment:
source venv/bin/activate
Upgrade pip:
pip install -U pip
Install packages together with their dependencies in development mode:
pip install -e sbb_utils
pip install -e sbb_tools
altotool --help
Usage: altotool [OPTIONS] SOURCE_DIR OUTPUT_FILE
Extract text from a bunch of ALTO XML files into one big CSV(.csv) or
SQLITE3(.sqlite3) file.
SOURCE_DIR: The directory that contains subfolders with the ALTO xml
files. OUTPUT_FILE: Write the extracted fulltext to this file (either .csv
or .sqlite3).
Options:
--processes INTEGER number of parallel processes. default: 6.
--help Show this message and exit.
corpusentropy --help
Usage: corpusentropy [OPTIONS] ALTO_FULLTEXT_FILE ENTROPY_FILE
Read the documents of the corpus from ALTO_FULLTEXT_FILE where each line
of the .csv file describes one page.
Foreach page compute its character entropy rate and store the result as a
pickled pandas DataFrame in ENTROPY_FILE.
Options:
--chunksize INTEGER size of chunks used for processing alto-csv-file
--processes INTEGER number of parallel processes. default: 6.
--help Show this message and exit.
corpuslanguage --help
Usage: corpuslanguage [OPTIONS] ALTO_FULLTEXT_FILE LANGUAGE_FILE
Read the documents of the corpus from ALTO_FULLTEXT_FILE where each line
of the .csv file describes one page.
Foreach page classify its language by means of langid. Store the
classification results as a pickled pandas DataFrame in LANGUAGE_FILE.
Options:
--chunksize INTEGER size of chunks used for processing alto-csv-file
--processes INTEGER number of parallel processes. default: 6.
--help Show this message and exit.
batchner --help
Usage: batchner [OPTIONS] FULLTEXT_SQLITE_FILE SELECTION_FILE MODEL_NAME
NER_ENDPOINT...
Reads the text content per page of digitalized collections from sqlite
file FULLTEXT_SQLITE_FILE.
Considers only a subset of documents that is defined by SELECTION_FILE.
Performs NER on the text content using the REST endpoint[s] NER_ENDPOINT
....
Writes the NER results back to another sqlite file whose name is equal to
FULLTEXT_SQLITE_FILE + '-ner-' or to the file specified in the --outfile
option.
Writes results in chunks of size <chunksize>.
Suppress proxy with option --noproxy.
Options:
--chunksize INTEGER size of chunks used for processing. default: 10**4
--noproxy disable proxy. default: enabled.
--processes INTEGER number of parallel processes, default: number of NER
endpoints.
--outfile PATH Write results to this file. default: derive name from
fulltext sqlite file.
--help Show this message and exit.
batchel --help
Usage: batchel [OPTIONS] SQLITE_FILE LANG_FILE EL_ENDPOINTS
Performs entity linking on all the PPNs resp. files whose NER-tagging is
contained in the input SQLITE_FILE. Stores the linking results in the same
file in a table 'entity_linking'.
SQLITE_FILE: File that has been produced by batchner, i.e., a file that
contains all the NER results per PPN and page in table named 'tagged'.
LANG_FILE: Pickled pandas DataFrame that specifies the language of all
files per PPN:
ppn filename language
0 PPN646426230 00000045.xml fr
1 PPN646426230 00000218.xml fr
2 PPN646426230 00000394.xml fr
3 PPN646426230 00000071.xml fr
4 PPN646426230 00000317.xml fr
see also ->corpuslanguage --help
EL_ENDPOINTS: JSON structure that defines EL-endpoints per language:
{ "de": "http://b-lx0053.sbb.spk-berlin.de/sbb-tools/de-ned" }
Suppress proxy by option --noproxy.
Options:
--chunk-size INTEGER size of chunks sent to EL-Linking system. Default:
100.
--noproxy disable proxy. default: proxy is enabled.
--start-from-ppn TEXT
--help Show this message and exit.
alto-annotator --help
Usage: alto-annotator [OPTIONS] TAGGED_SQLITE_FILE SOURCE_DIR DEST_DIR
Read NER tagging results from TAGGED_SQLITE_FILE. Read ALTO XML files in
subfolders of directory SOURCE_DIR. Annotate the XML content with NER
information and write the annotated ALTO XML back to the same directory
structure in DEST_DIR.
Options:
--processes INTEGER number of parallel processes. default: 0.
--help Show this message and exit.