Yet another BORME Parser.
BORME (Boletín Oficial del Registro Mercantil) is the Official Bulletin of the Commercial Registry.
This program translates BORME PDF files to JSON.
YABORMEParser has two parsers that extract the PDF file data to a JSON file:
- Parser: reads PDF files and writes raw JSON files.
- Parser2: reads raw JSON files and writes processed JSON files.
YABORMEParser requires the following to run:
- Python 2.7+
- pdfminer 20140328+
- ply 3.4+
Building the pip package also requires:
- PyPandoc 1.1.3+
YABORMEParser is easiest to use when installed with pip:
pip install yabormeparser
This installs two scripts:
- yabormeparser1: parses and analyzes a BORME PDF file. The result is a *.RAW.json file. It uses PDFMiner.
- yabormeparser2: transforms a *.RAW.json file into a more structured *.json file. It uses PLY.
yabormeparser1 -i BORME-A-2009-100-49.pdf
Most of the time this produces a BORME-A-2009-100-49.RAW.json file. If an error occurs, it produces BORME-A-2009-100-49.RAW.patch.TMP instead. In that case:
- Rename it to BORME-A-2009-100-49.RAW.patch.
- Edit the renamed file to fix the error.
- Save it, exit the editor, and run the parser again with the patch file:
yabormeparser1 -i BORME-A-2009-100-49.pdf -p BORME-A-2009-100-49.RAW.patch
Remember to save the BORME-A-2009-100-49.RAW.patch file in the code repository.
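The file names in this workflow all derive from the input name. The following is a minimal sketch of that naming, assuming the parser writes its outputs next to the input PDF. The helper `parser1_paths` is hypothetical and is not part of YABORMEParser.

```python
import os


def parser1_paths(pdf_path):
    """Return the file names yabormeparser1 is described as producing.

    Hypothetical helper: it only derives names from the input path and
    assumes the outputs are written next to the input PDF.
    """
    base, _ext = os.path.splitext(pdf_path)
    return {
        "raw_json": base + ".RAW.json",        # normal result
        "patch_tmp": base + ".RAW.patch.TMP",  # written when an error happens
        "patch": base + ".RAW.patch",          # renamed, hand-edited patch
    }


paths = parser1_paths("BORME-A-2009-100-49.pdf")
print("%s -> %s" % (paths["patch_tmp"], paths["patch"]))
```

A wrapper script could use these names to detect whether a run left a patch file behind that still needs editing.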
To get more options:
yabormeparser1 -h
The second script works like the first one, but takes the RAW JSON file as input:
yabormeparser2 -i BORME-A-2009-100-49.RAW.json
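It produces BORME-A-2009-100-49.json or, if an error occurs, BORME-A-2009-100-49.patch.TMP. As with the first parser, rename and edit the patch file, then run the parser again with it: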
yabormeparser2 -i BORME-A-2009-100-49.RAW.json -p BORME-A-2009-100-49.patch
More options:
yabormeparser2 -h
The JSON files carry version numbers. Whenever a parser changes the output format or any of the data it produces, the version number must change too, so that stored data stays consistent.
These version numbers are independent of the package version.
The first parser version is defined in yabormeparser/parser.py:
RAW_FILE_VERSION = u'1'
In the *.RAW.json file:
"raw_version": "1"
In yabormeparser/parser2.py:
RAW_FILE_VERSION = parser.RAW_FILE_VERSION
# Thousands part of the file version: the part that corresponds to this
# parser (parser2).
TH_FILE_VERSION = u"7"
# The file version depends on both parsers, so a change in parser one is
# reflected in the file version even when parser two does not change.
FILE_VERSION = u"%i" % (int(RAW_FILE_VERSION) + 1000 * int(TH_FILE_VERSION))
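As a worked example of the formula above: with `RAW_FILE_VERSION = u'1'` and `TH_FILE_VERSION = u"7"`, the file version is 1 + 1000 × 7 = 7001. The sketch below reproduces that arithmetic and splits a file version back into its two parts. The names `compose_version` and `split_version` are hypothetical, and the split assumes the raw version stays below 1000.

```python
def compose_version(raw_version, th_version):
    """Combine both parser versions with the same formula as FILE_VERSION."""
    return u"%i" % (int(raw_version) + 1000 * int(th_version))


def split_version(file_version):
    """Recover (raw_version, th_version); assumes raw_version < 1000."""
    th_version, raw_version = divmod(int(file_version), 1000)
    return u"%i" % raw_version, u"%i" % th_version


assert compose_version(u"1", u"7") == u"7001"
assert split_version(u"7001") == (u"1", u"7")
```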
In the *.json file:
"raw_version": "1",
"version": "7001"
Any program that uses these scripts must check whether a JSON file's version is older than the current one. If it is, the program must delete the JSON file and create a new one.
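A minimal sketch of such a check follows. It assumes the processed JSON file stores its version under the top-level "version" key, as in the example above, and that the caller supplies the current version (for instance the value of `FILE_VERSION` in yabormeparser/parser2.py). The function names `is_outdated` and `remove_if_outdated` are hypothetical.

```python
import json
import os


def is_outdated(data, current_version):
    """Return True if parsed JSON data carries an older file version."""
    # Versions are strings such as "7001"; compare them as integers.
    # Data without a "version" key is treated as outdated.
    return int(data.get("version", 0)) < int(current_version)


def remove_if_outdated(json_path, current_version):
    """Delete json_path when its version is old; return True if deleted."""
    with open(json_path) as handle:
        data = json.load(handle)
    if is_outdated(data, current_version):
        os.remove(json_path)
        return True
    return False
```

After `remove_if_outdated` returns True, the caller would run both parsers again to regenerate the file.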
Put the patch files under patches/FILE_VERSION_<THE_PARSER2_FILE_VERSION>.
If you change the version, create a tag in the repository named FILE_VERSION_<THE_PARSER2_FILE_VERSION>, for example, FILE_VERSION_7001.
To contribute to YABORMEParser, clone this repo locally and commit your code on a separate branch.
If you download the code without installing the package, you can run the parsers as modules:
python -m yabormeparser.parser -i examples/BORME-A-2009-100-49.pdf
python -m yabormeparser.parser2 -i examples/BORME-A-2009-100-49.RAW.json
To fix a bug, first create a test that fails because of that bug.
When you finish, run all Python unit tests:
python -m unittest discover tests
And run all integration tests:
mkdir tmp
cp examples/*.pdf tmp
cp examples/*.patch tmp
bin/parser_dir.sh tmp/
The last part of the output must be:
PDFs 15
JSONs 15
ERRORs 0
Now the second parser:
bin/parser_dir.sh tmp/
rm -rf tmp
The result must be similar to:
RAWs 15
JSONs 15
ERRORs 0
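These summaries can also be checked by a script. The sketch below parses the summary lines, assuming each one is a label followed by a count, as in the examples above. `parse_summary` is a hypothetical helper and is not part of the repository.

```python
def parse_summary(output, labels=("PDFs", "RAWs", "JSONs", "ERRORs")):
    """Collect 'LABEL COUNT' lines from the script output into a dict."""
    summary = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in labels and parts[1].isdigit():
            summary[parts[0]] = int(parts[1])
    return summary


summary = parse_summary("PDFs 15\nJSONs 15\nERRORs 0\n")
assert summary == {"PDFs": 15, "JSONs": 15, "ERRORs": 0}
```

A run can then be treated as passing when `summary["ERRORs"]` is 0 and the input and output counts match.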
BORMEMining is a program that uses YABORMEParser.
BORMEMining:
- downloads BORME PDFs from the BORME website,
- parses BORME PDFs to JSON (in parallel to use all CPUs),
- records state of files, PDFs, and JSONs in a database,
- helps apply patches to problematic PDFs or RAW JSON files...
The database and JSON files of BORMEMining are used by BORMEMiningWeb to provide a web interface and a REST API.