The analytics runner loads and stores data from various data collections using the Loader module and, through the Pipeline module, performs feature extraction, model building and prediction. Currently, the script focuses on moving data between a shared MariaDB database and a local QMiner database (QMinerDB). It extracts features from a given data set, enriches the data with weather features, creates models, uploads model configurations to the shared database, fits all the models, makes predictions and uploads the predictions back to the shared database.
In other words, it combines the Loader and Pipeline modules with logging and error-handling mechanisms.
The script executes sub-tasks in the following order:
- Data preparation — downloads data from MariaDB, obtains weather data and stores it to QMinerDB.
- Weather transformation — extracts all weather features.
- Other transformation — extracts all other features, using transformation configuration files.
- Models preparation — creates new models or reuses matching model configurations in the local QMinerDB.
- Model — can be executed in:
  - Fit-init mode — fits a new model with the historical data.
  - Fit mode — fits the existing model with the recent data.
  - Predict mode — makes predictions using the existing model.
- Report — shows a short report of the run.
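For example, a typical daily run can chain these sub-tasks in a single invocation (a sketch; the date and configuration path are placeholders for your own setup):

```bash
# Download data and weather, extract features, fit on the new records,
# predict and upload the results for a single date (all options are documented below).
node analytics/runner.js \
  --prepare-data \
  --transform \
  --fit \
  --predict \
  --upload \
  --date=2020-06-15 \
  --conf=./usecase/example/analytics_runner.json
```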
Runner configuration file
A JSON configuration file that specifies all model and loader configurations needed to run the analytics runner successfully. See Configuration section.
Transformation configuration file
A JSON configuration file that specifies feature extraction. It is used in the pipeline module. See pipeline module documentation.
Model/pipeline configuration file
A JSON configuration file that specifies input extraction, model building and prediction in a single pipeline configuration. It is used in the pipeline module. See pipeline module documentation.
Loader configuration file
A JSON configuration file that specifies how to move data between different data collections conveniently. Currently supports moving data between TSV files, MariaDB, ArangoDB and QMinerDB. The configuration file is used in the loader module. See loader module documentation.
Example of the runner configuration file:

```json
{
  "use_case": "Prediction by categories - example",
  "models_configs_dir": "./usecase/example/models/predict",
  "models_dst_dir": "../data/usecase/example/models/",
  "transformations": {
    "weather": "./usecase/common/transformations/weather_transformation.json",
    "other": [
      "./usecase/example/transformations/categories_transformation.json"
    ]
  }
}
```
When runner.js is executed and no configuration file is specified with the `--conf` option, the default runner configuration file analytics/config/analytics_runner_default.json is used.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `use_case` | String | No | Name of the use case. |
| `models_configs_dir` | String | Yes | Directory with the model configuration files. |
| `models_dst_dir` | String | Yes | Directory to save the model database and expected results. |
| `transformations` | Object | Yes | Transformations to be executed on the data. |
| `transformations.weather` | String | Yes | Transformation configuration file to build weather features. |
| `transformations.other` | List | Yes | List of transformation configuration files to extract use-case-specific features from the data. |
Some options, when `init` is specified, look for paths with the `_init.json` suffix. Transformation configuration files with the `_init.json` suffix must be provided in the same directory as the original configuration file. Note that this path is resolved automatically and should not be listed in the runner configuration file.
Example: ./usecase/example/transformations/categories_transformation.json as the original configuration file and ./usecase/example/transformations/categories_transformation_init.json for the initial fit.
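The resolution presumably amounts to a simple suffix substitution, as in this sketch (an assumption about the behaviour, not the actual implementation):

```js
// Derive the _init.json path from the original transformation configuration path.
// 'confPath' is a hypothetical variable holding the configured path.
const initPath = confPath.replace(/\.json$/, '_init.json');
// './usecase/example/transformations/categories_transformation.json'
// -> './usecase/example/transformations/categories_transformation_init.json'
```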
Initialization transformation configuration files are usually used only once, to extract features from the historical data and to fit the initial model. Afterwards, to make predictions and to fit on recent data, the transformation configuration file for recent data is used.
```bash
node analytics/runner.js [<model_conf_paths>] [<options>]
```
The optional `model_conf_paths` argument lists paths to the model configuration files used in the pipeline module. It can be provided as multiple model configuration file paths or as a single path to a directory containing the model configuration files. If specified, these model configuration files are used instead of the ones in the `conf.models_configs_dir` directory. See Configuration section.
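Both forms look like this (a sketch; the model file names are hypothetical):

```bash
# Several explicit model configuration files ...
node analytics/runner.js ./usecase/example/models/predict/model_a.json ./usecase/example/models/predict/model_b.json --predict=conf --conf=./usecase/example/analytics_runner.json
# ... or a whole directory of model configuration files.
node analytics/runner.js ./usecase/example/models/predict/ --predict=conf --conf=./usecase/example/analytics_runner.json
```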
All the following options can be passed to the script. They are divided into separate sections for better clarity. Unless specified otherwise, the default value of each of the following options is false and the matching sub-task will not be executed.
`--conf=<path/to/runner/config/file.json>`
Path to the runner configuration file. Default: analytics_runner_default.json. See Configuration section.
`--update-weather[=force]`
Download weather data for a specific date. Weather data for a specific date are stored into a db-weather-<YYYY-MM-DD>.tsv file. See the weather_update.sh script and modify it as necessary. If `force` is specified, existing TSVs are overwritten.
Note: Option `--date` must be set, otherwise the default value is used.
Prerequisite: To get raw weather data with the weather_update.sh script, you must have the weather-data API installed and properly configured.
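For example, re-downloading the weather for one date might look like this (a sketch; the date is a placeholder):

```bash
# Overwrites db-weather-2020-06-15.tsv if it already exists.
node analytics/runner.js --update-weather=force --date=2020-06-15
```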
`--upload-weather[=init]`
Upload raw weather data from TSV files to QMinerDB and MariaDB. By default, execution uses the configuration files weather_store.json and weather_qminer.json. If `init` is specified, the configuration file weather_store_init.json is used and `config.paths.weatherInitTsv` from config.js is used as the source TSV file to initialize the raw weather QMinerDB.
Prerequisite: weather_store.json, weather_qminer.json, weather_store_init.json and the `config.paths.weatherInitTsv` TSV file.
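Typical invocations (a sketch; the date is a placeholder):

```bash
# Upload the TSV for one specific date.
node analytics/runner.js --upload-weather --date=2020-06-15
# Initialize the raw weather QMinerDB from the config.paths.weatherInitTsv file.
node analytics/runner.js --upload-weather=init
```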
`--update-data[=init]`
Download data, namely products and product states, from the shared MariaDB for the specific date. The loader module uses the loader configuration file prod_load.json with a modified source query that filters on the `--date` value. If `init` is specified, a new database is created and the existing one is overwritten; the loader module then loads data using the configurations in prod_load_init.json.
Note: If `init` is specified, option `--date` is ignored.
Prerequisite: prod_load.json and prod_load_init.json.
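For example (a sketch; the date is a placeholder):

```bash
# Download the products and product states recorded on 2020-06-15.
node analytics/runner.js --update-data --date=2020-06-15
# Recreate the local database from all historical records (--date is ignored).
node analytics/runner.js --update-data=init
```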
`--prepare-data[=init]`
Same effect as passing options `--update-weather`, `--upload-weather` and `--update-data` together.
`--transform-weather[=init|new]`
Extract weather features using the pipeline module. The path to the configuration file is given in the parameter `transformations.weather` of the provided runner configuration file. Default: weather_transformation.json.
If `init` is specified, the weather features are built using the path from `transformations.weather` with the `_init.json` suffix (e.g. ./weather_transformation_init.json).
If `new` is specified, the `transformations.weather` file is used as a template: its `start_date` parameter is set to the `--date` option and its `end_date` to 7 days after.
Note: Option `--date` is optional. However, if not given, yesterday's date is used.
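For example, to extend the weather features with a new week of data (a sketch; the date is a placeholder):

```bash
# start_date is set to 2020-06-15 and end_date to 7 days after.
node analytics/runner.js --transform-weather=new --date=2020-06-15 --conf=./usecase/example/analytics_runner.json
```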
`--transform-other[=init]`
Extract all other features, commonly specific to the use-case. Paths to the configuration files are given in the list `transformations.other` of the provided runner configuration file.
`--transform[=init]`
Same effect as passing options `--transform-other` and `--transform-weather` together.
`--upload[=false|true]`
Upload predictions and raw weather data to the shared database.
`--fit[=init]`
Fit models to the queried data. To extract the queried input data, `input_extraction.params.search_query` from the model configuration file is used. However, `search_query.Timestamp` is set to `--date`, so only the records for the given date are selected.
If `init` is specified and the parameter `input_extraction.params.init_use_timestamp` in the model configuration file is set to false or not present, the parameter `search_query.Timestamp` is removed from the model configuration file and all records are queried without `Timestamp` constraints.
If `init` is specified and `input_extraction.params.init_use_timestamp` is set to true, the untouched `search_query` is used.
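The relevant part of a model configuration file might look like this (a minimal sketch; the store name and the query fields other than `Timestamp` are assumptions, see the pipeline module documentation for the actual schema):

```json
{
  "input_extraction": {
    "params": {
      "init_use_timestamp": false,
      "search_query": {
        "$from": "featuresStore",
        "Timestamp": "2020-06-15"
      }
    }
  }
}
```

With `--fit`, the `Timestamp` value is replaced by the `--date` option; with `--fit=init` and `init_use_timestamp` set to false, the `Timestamp` field is dropped so the initial fit sees all records.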
`--predict[=init|conf]`
Make predictions for each model on the queried input data. To extract the queried input data, `input_extraction.params.search_query` from the model configuration file is used. However, `search_query.Timestamp` is set to `--date`, so only the records for the given date are selected.
If `init` is specified, the parameter `search_query` is removed from the pipeline's configuration file and all records are queried from that store.
If `conf` is specified, the untouched `search_query` is used.
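For example (a sketch; the date is a placeholder):

```bash
# Predict only on the records for 2020-06-15.
node analytics/runner.js --predict --date=2020-06-15 --conf=./usecase/example/analytics_runner.json
# Predict using the search_query exactly as written in the model configuration file.
node analytics/runner.js --predict=conf --conf=./usecase/example/analytics_runner.json
```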
`-d <specific_date>`
`--date=<specific_date>`
The specific date used to download new products, update weather, transform weather, fit and predict.
Note: Option `--date` is optional. However, if not given, yesterday's date is used.
`-v [false|true]`
`--verbose[=false|true]`
Verbose logging. Default: true.
`-r [false|true]`
`--report[=false|true]`
Show a short report at the end of the script. Default: value of the `--verbose` option.
`--clear-logs`
Remove all logs from previous executions of the script.
`--prepare-models[=use|upload|update|local-update]`
Create or update model configuration files in the shared database and the local QMinerDB using locally stored model configuration files. The script searches for model configuration files in the directory given as the parameter `conf.models_configs_dir` in the runner configuration file.
If `use` is specified, the models already in the local QMinerDB are used.
If `upload` is specified, model configuration files are uploaded to the shared database and the local QMinerDB; existing ones are skipped.
If `update` is specified, model configuration files are updated in the shared database and the local QMinerDB.
If `local-update` is specified, model configuration files are updated only in the local QMinerDB.
Note: The parameter `models_configs_dir` set in the runner configuration file must point to a directory with at least one model configuration file, or the `model_conf_paths` option must be provided.
To run this example you need:
- The usecase/common/loader directory with preconfigured configuration files:
  - weather_store.json — stores weather data for a specific date from a TSV file to the shared database.
  - weather_store_init.json — stores historical weather data from a TSV file to the shared database.
  - weather_qminer.json — stores weather data for a specific date from a TSV file to the raw weather data QMinerDB.
  - model_store_dupl.json — stores model configuration files to the shared database.
  - model_update.json — updates model configuration files in the shared database.
  - model_load.json — loads model configurations from the shared MariaDB to QMinerDB.
  - pred_store.json — stores predictions to the shared database.
  - prod_load_init.json — downloads products and product states from the shared database.
  - prod_load.json — downloads products and product states for a specific date from the shared database.
  Note: Fill in the database credentials.
- A QMinerDB with historical data. Default: ../../data/dbs/ceBbDb/. The `input_db_history` parameter can be modified in categories_transformation.json and categories_transformation_init.json.
- A QMinerDB with raw weather data. Default: ../../data/dbs/weatherDb/. The `input_db` parameter can be modified in weather_transformation.json and weather_transformation_init.json.
- The weather-data API installed, if you want to update the weather database and weather features, or at least TSV files (db-weather-<YYYY-MM-DD>.tsv) with the raw weather data you are interested in.
Before fitting models and making predictions you need raw weather data and weather features for all the dates you are interested in. The default location of the raw weather QMinerDB is ../../data/dbs/weatherDb/ and the location of the weather features QMinerDB is ../../data/common/features/weatherFeaturesDb/.
In case the weather databases are not initialized, execute:
```bash
node ./analytics/runner.js --upload-weather=init --transform-weather=init --conf=./usecase/example/analytics_runner.json
```
Obtaining weather data and extracting weather features are time-consuming operations. To avoid executing these operations repeatedly, use and update the latest copies of the existing databases (the raw weather QMinerDB and the weather features QMinerDB).
To update the weather database with new data, execute:
```bash
node ./analytics/runner.js --date=<new_date> --upload-weather --transform-weather=new --conf=./usecase/example/analytics_runner.json
```
If you want to update weather data over a date interval, use analytics_runner.sh. See the example in the analytics runner section.
```bash
node ./analytics/runner.js ./usecase/example/models/init/ --transform-other=init --fit=init --prepare-models=upload --conf=./usecase/example/analytics_runner.json
```
This command runs transformations on the historical data using the categories_transformation_init.json transformation configuration file and creates all the models described in usecase/example/models/init/. The directory `models_configs_dir` given in the runner configuration file is ignored. Finally, all models are fitted using the historical data and their model configuration files. The models are also uploaded to the shared database. To leave out uploading the model configuration files, provide `--prepare-models=local-update` instead.
To make this easier, the command can be divided into two separate commands: one for executing the transformations and one for the initial fit of the models, as shown below.
Note: Transformations are performed once and are used for all models in the use-case.
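A split version might look like this (a sketch based on the command above):

```bash
# 1) Extract features from the historical data once.
node ./analytics/runner.js --transform-other=init --conf=./usecase/example/analytics_runner.json
# 2) Create, upload and initially fit all models from the given directory.
node ./analytics/runner.js ./usecase/example/models/init/ --fit=init --prepare-models=upload --conf=./usecase/example/analytics_runner.json
```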
```bash
node ./analytics/runner.js --update-data=init --conf=./usecase/example/analytics_runner.json
```
This command downloads all records from the `products` and `product_states` tables of the shared database. Currently, this command is use-case specific to the tables in the shared database.
Note: You need to provide prod_load_init.json. See loader configuration files.
```bash
node ./analytics/runner.js --transform-other --conf=./usecase/example/analytics_runner.json
```
This extracts all features from the newly obtained data.
```bash
node ./analytics/runner.js ./usecase/example/models/predict/ --prepare-models=upload --conf=./usecase/example/analytics_runner.json
```
This uploads the model configuration files to the shared database, skipping any that already exist.
```bash
node ./analytics/runner.js ./usecase/example/models/predict/ --prepare-models=update --conf=./usecase/example/analytics_runner.json
```
This updates the model configuration files in the shared database, replacing any that already exist.
If the model configuration file used for prediction differs from the one used for fitting the model, it is necessary to update the model configuration file in the database. The two usually differ in the `input_extraction` parameter.
To make predictions on the new dataset and upload them to the shared database, specify the `--upload` option.
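For example (a sketch; the date is a placeholder):

```bash
# Predict on the records for 2020-06-15 and upload the predictions to the shared database.
node ./analytics/runner.js --predict --upload --date=2020-06-15 --conf=./usecase/example/analytics_runner.json
```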
If we want to predict using the `input_extraction.params.search_query` from the model configuration file and predict on all extracted records at once, execute:
```bash
node ./analytics/runner.js --prepare-models=update --predict=conf --conf=./usecase/example/analytics_runner.json
```
This queries all records using the `input_extraction` search query from the model configuration file. Note that the parameter `input_extraction.params.search_query` must be set correctly in the pipeline configuration file, especially `Timestamp`.
In case we would like to predict over a date interval, use the analytics_runner.sh script:
```bash
bash ./scripts/analytics_runner.sh --interval <from> <to> --prepare-data --other-transformations --predict
```
The script extracts inputs for each date separately and predicts on each set. The search query in the model configuration file is ignored.
The analytics_runner.sh script wraps commands similar to the ones used in the previous example. It is useful when runner.js needs to be run over a date interval.
For example, if we would like to update the raw weather QMinerDB and calculate new weather transformations between dates `<from>` and `<to>`, we can run:
```bash
bash ./scripts/analytics_runner.sh --interval <from> <to> --upload-weather --weather-transformations=new
```
or, for short:
```bash
bash ./scripts/analytics_runner.sh --interval <from> <to> --weather
```
For more shortcuts see analytics_runner.sh.
The script runs runner.js separately for each date between `<from>` and `<to>`, inclusive, with the `--upload-weather --transform-weather=new` parameters, setting `--date` accordingly for each run.
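Conceptually, that is equivalent to a loop like this (a sketch; the dates are placeholders and the actual script may differ):

```bash
# Run runner.js once per date in the interval.
for d in 2020-06-15 2020-06-16 2020-06-17; do
  node ./analytics/runner.js --date="$d" --upload-weather --transform-weather=new
done
```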
Similarly, the analytics_runner.sh script can be used to run any other supported command that uses the `--date` parameter.