Skip to content
/ stream2db Public

Stream a file or url to elasticSearch or mongodb

Notifications You must be signed in to change notification settings

kyv/stream2db

Repository files navigation

Stream2db

Stream json documents or a csv file to a backend. Currently we support mongodb and elasticsearch. More backends could be added easily using the node etl driver.

install

npm install -g git+https://[email protected]:kyv/stream2db.git

Examples

Import compranet from streaming source to elasticsearch.

Since ellison hashes documents before sending them over the wire, those streams will get checked for data corruption.

stream2db https://excel2json.herokuapp.com/https://compranetinfo.funcionpublica.gob.mx/descargas/cnet/Contratos2013.zip

Use CODIGO_CONTRATO as _id

If you do not provide an ID field (--id) a random ID will be generated. If you do set the ID new documents with the same ID will replace their predecessors.

stream2db -i CODIGO_CONTRATO https://excel2json.herokuapp.com/https://compranetinfo.funcionpublica.gob.mx/descargas/cnet/Contratos2013.zip

Import CSV into cargografias index on elasticsearch

You can use a csv file as your data source.

stream2db -d cargografias ~/Downloads/Cargografias\ v5\ -\ Nuevos_Datos_CHEQUEATON.csv

Options

You can set some options on the commandline.

stream2db -h|--help
--backend DATA BACKEND   Backend to save data to. [mongo|elastic]
--db INDEX|DB            Name of the index (elastic) or database (mongo) where data is written
--type TYPE|COLLECTION   Mapping type (elastic) or collection (mongo).
--id ID                  Specify a field to be used as _id. If hash is specified the object hash will be used
--uris URIS              Space separated list of urls to stream
--host HOST              Host to stream to. Default is localhost
--port PORT              Port to stream to. Defaults to 9500 (elastic) or 27017 (mongo)
--converter JAVASCRIPT MODULE   Pass data trough some predefined conversion function
--help                   Print this usage guide.

Debugging

The --verbose flag triggers debugging mode of the DB driver. En elasticsearch this is set to log: trace. The mongo driver allows for configuration by way of variables in the enviornment.

conversion

You can add arbitrary data conversion using by exporting a default function from some file in the converters directory and passing the name of that file with the option --converter. A conversion to OCDS has been added as an example. To use it you would add --converter ocds to your commandline.

Notes

As we are targeting local data management, we have not yet added DB authorization. This will get added to the parameters.

Cleanup

strings are normalized and trimmed.

Type coercion

We do very simple type coercion. Numbers should work. Anything else you want to do can be easily implemented with a converter.

Hashes

We add the field hash to the indexed document. You can use it however you like.

k8s

We produce a docker image which you can use with the *CronJob.yaml files found here to run this code as a cronJob on kubernetes.

About

Stream a file or url to elasticSearch or mongodb

Resources

Stars

Watchers

Forks

Packages

No packages published