Dataset converter pipeline tool. Transforms dataset csv files into parquet files.
- Simple automatic deployment
- Extendable plugin system
- Flexible configuration
- Containerized setup
- Copy `.env.example` to `.env` and fill in the options
- Download the `.csv` files to the `./data` folder
- Create the necessary config files. See Configuration for more details.
- Run the application:

```sh
$ docker-compose up
```
- It will look for all `.yml` config files; for each configured dataset it produces an optimized parquet file and a pickle file containing the pandas dtypes. The generated files are placed in the `./data` folder.
For each config file found, the tool keeps the file name set in the config and creates the following files:
- A pickle file containing a Python dict with a `column: dtype` entry for each column.
- A parquet binary file, compressed in 7z format, built from the processed dataframe.
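For illustration, here is how the generated pair might be read back. The file names below are assumptions following the naming rule above, and the snippet assumes the 7z archive has already been extracted:

```python
import pickle

import pandas as pd

# Hypothetical output names for a config called "sales.yml".
# If 7z compression is enabled, extract the archive first.
df = pd.read_parquet("./data/sales.parquet")

with open("./data/sales.pkl", "rb") as f:
    dtypes = pickle.load(f)  # dict mapping column name -> pandas dtype

# Re-apply the stored dtypes to the loaded dataframe.
df = df.astype(dtypes)
```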
A plugin system is available, making it possible to call additional procedures that modify the dataset files. A plugin has a method named `apply`, which receives a pandas DataFrame object and returns it at the end of the method. A plugin can be configured to run either right after a file is loaded, before the main processing is done, or afterwards.
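As a minimal sketch, a plugin could look like the following; the class name and file location are illustrative assumptions, and only the `apply` contract comes from the description above:

```python
# plugins/drop_empty_rows.py -- hypothetical example plugin.
import pandas as pd


class DropEmptyRows:
    """Drops rows that are entirely empty from the dataset."""

    def apply(self, df: pd.DataFrame) -> pd.DataFrame:
        # Receives the loaded DataFrame, transforms it, and returns it,
        # as the plugin contract requires.
        return df.dropna(how="all")
```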
The application includes the following packages:
- numpy
- pandas
- pyyaml
- pyarrow
Any extra dependencies can be added to the `requirements.txt` file in the plugins folder; they will be installed when the application starts.
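For example, a plugin that needs extra libraries could list them there (the package names below are purely illustrative):

```
# plugins/requirements.txt -- installed at application startup
scikit-learn
unidecode
```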
A sample plugin is provided as a template to get you started.
A collection of ready-to-use addons (configs and plugins) can be found here.
For each `.csv` file, create a `.yml` file with the same name. A sample configuration file is provided.
Type: string
Table filename. This argument is required.
Type: string
Default: `;`
Delimiter character to use. This argument is optional.
Type: boolean
Default: true
Whether to compress the parquet files to 7z format. When compression is on, the uncompressed parquet files are deleted after compression. This argument is optional.
Type: boolean
Default: false
Whether to export to separate files in chunks or to a single parquet file. This argument is optional.
Type: number
Chunk size, useful when files are large. This argument is optional. Omitting this argument loads the whole file at once.
Type: sequence
List of columns to load from the `.csv` file. This argument is optional. Omitting this argument loads all columns from the file.
Type: sequence
List of columns to parse as datetimes. This argument is optional.
Type: sequence
List of plugins to be applied to the dataframe object. Plugins are called in the same order they appear in the configuration file. This argument is optional.
- before: Plugins called just after a file/chunk is loaded into memory.
- after: Plugins called at the end of the downcasting process, before files are exported.
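Putting the options together, a config might look like the sketch below. The key names (`filename`, `delimiter`, `compress`, `chunked`, `chunksize`, `columns`, `datetime_columns`, `plugins`) are assumptions for illustration; check the provided sample configuration file for the exact names:

```yaml
# sales.yml -- hypothetical config; key names are illustrative assumptions.
filename: sales.csv    # required: table filename
delimiter: ";"         # optional, default ";"
compress: true         # optional, default true: 7z-compress the parquet output
chunked: false         # optional, default false: export a single parquet file
chunksize: 100000      # optional: omit to load the whole file at once
columns:               # optional: omit to load all columns
  - order_id
  - order_date
  - total
datetime_columns:      # optional: columns parsed as datetimes
  - order_date
plugins:               # optional: called in the order listed
  before:
    - drop_empty_rows  # runs just after a file/chunk is loaded
  after:
    - normalize_text   # runs after downcasting, before export
```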