Stream JSON documents or a CSV file to a backend.
Currently we support MongoDB and Elasticsearch. More backends could be added easily using the node etl driver.
npm install -g git+https://[email protected]:kyv/stream2db.git
Since ellison hashes documents before sending them over the wire, the streams are checked for data corruption.
stream2db https://excel2json.herokuapp.com/https://compranetinfo.funcionpublica.gob.mx/descargas/cnet/Contratos2013.zip
If you do not provide an ID field (--id), a random ID will be generated. If you do set an ID, new documents with the same ID will replace their predecessors.
stream2db -i CODIGO_CONTRATO https://excel2json.herokuapp.com/https://compranetinfo.funcionpublica.gob.mx/descargas/cnet/Contratos2013.zip
You can use a CSV file as your data source.
stream2db -d cargografias ~/Downloads/Cargografias\ v5\ -\ Nuevos_Datos_CHEQUEATON.csv
You can set some options on the command line.
stream2db -h|--help
--backend DATA BACKEND Backend to save data to. [mongo|elastic]
--db INDEX|DB Name of the index (elastic) or database (mongo) where data is written.
--type TYPE|COLLECTION Mapping type (elastic) or collection (mongo).
--id ID Specify a field to be used as _id. If hash is specified, the object hash will be used.
--uris URIS Space-separated list of URLs to stream.
--host HOST Host to stream to. Defaults to localhost.
--port PORT Port to stream to. Defaults to 9500 (elastic) or 27017 (mongo).
--converter JAVASCRIPT MODULE Pass data through a predefined conversion function.
--help Print this usage guide.
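For illustration only (the source URL and database name below are placeholders), several of these options can be combined in a single invocation:
stream2db --backend elastic --db contratos --type contrato --id CODIGO_CONTRATO --uris https://example.com/contratos.json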
The --verbose flag triggers the debugging mode of the DB driver. In Elasticsearch this is set to log: trace. The mongo driver allows for configuration by way of environment variables.
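For instance, to watch what the driver is doing while loading some data (the URL here is a placeholder):
stream2db --verbose -d test https://example.com/data.json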
You can add arbitrary data conversion by exporting a default function from a file in the converters directory and passing the name of that file with the --converter option. A conversion to OCDS has been added as an example. To use it you would add --converter ocds to your command line.
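As a rough sketch only (the file name and the transformation are hypothetical, and the signature is assumed to be one parsed document in, transformed document out; check the bundled ocds converter for the real contract), a converter in converters/lowercase-keys.js might look like this:

// converters/lowercase-keys.js -- hypothetical example, not part of the repo.
// Assumes a converter receives one parsed document and returns the
// transformed document; see the bundled ocds converter for the real signature.
export default function convert (doc) {
  const out = {}
  for (const key of Object.keys(doc)) {
    out[key.toLowerCase()] = doc[key]
  }
  return out
}

You would then pass --converter lowercase-keys on the command line, just as with ocds.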
As we are targeting local data management, we have not yet added DB authorization. This will be added to the parameters.
Strings are normalized and trimmed.
We do very simple type coercion. Numbers should work. Anything else you want to do can be easily implemented with a converter.
We add the field hash to the indexed document. You can use it however you like.
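For example, sketched here under the assumption of an Elasticsearch backend with a db (index) named contratos, you could check whether a document with a given content hash has already been indexed (host, port and index name are placeholders; requires Node 18+ run as an ES module):

// Hypothetical lookup by content hash; adjust host, port and index
// to match the --host, --port and --db values you used when streaming.
const hash = 'PASTE_A_HASH_HERE'
const res = await fetch('http://localhost:9200/contratos/_search', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ query: { term: { hash } } })
})
const body = await res.json()
console.log(body.hits.hits.length, 'documents carry this hash')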
We produce a Docker image which you can use with the *CronJob.yaml files found here to run this code as a CronJob on Kubernetes.