Analytics runner

The analytics runner facilitates loading and storing data from various data collections using the Loader module and, with the Pipeline module, enables feature extraction, model building and making predictions. Currently, the script is focused on moving data between a shared MariaDB database and a local QMiner database (QMinerDB). It extracts features from a given data set, enriches the data with weather features, creates models, uploads model configurations to the shared database, fits all the models, makes predictions and uploads the predictions back to the shared database.

In other words, it combines the Loader and Pipeline modules with logging and error-handling mechanisms.

⚠️ Note: Because of the diversity of use cases, it is advised to use Loader and Pipeline separately rather than through the analytics runner wrapper.

Features

The script executes sub-tasks in the following order:

  • Data preparation — downloads data from MariaDB, obtains weather data and stores it to QMinerDB.
  • Weather transformation — extracts all weather features.
  • Other transformation — extracts all other features, using transformation configuration files.
  • Models preparation — creates new models or uses matching model configurations in the local QMinerDB.
  • Model — can be executed in one of the following modes:
    • Fit-init mode — fits a new model with the historical data.
    • Fit mode — fits the existing model with the recent data.
    • Predict mode — makes predictions using the existing model.
  • Report — prints a short report of the executed sub-tasks.

Naming convention

Runner configuration file

A JSON configuration file that specifies all the model and loader configurations needed to run the analytics runner successfully. See the Configuration section.

Transformation configuration file

A JSON configuration file that specifies feature extraction. It is used in the pipeline module. See pipeline module documentation.

Model/pipeline configuration file

A JSON configuration file that specifies input extraction, model building and making model predictions in a single pipeline configuration file. It is used in the pipeline module. See pipeline module documentation.

Loader configuration file

A JSON configuration file that specifies how to move data between different data collections conveniently. Currently supports moving data between TSV files, MariaDB, ArangoDB and QMinerDB. The configuration file is used in the loader module. See loader module documentation.

Configuration

Example of the runner configuration file:

{
    "use_case": "Prediction by categories - example",
    "models_configs_dir": "./usecase/example/models/predict",
    "models_dst_dir": "../data/usecase/example/models/",
    "transformations": {
        "weather": "./usecase/common/transformations/weather_transformation.json",
        "other": [
            "./usecase/example/transformations/categories_transformation.json"
        ]
    }
}

When runner.js is executed and no configuration file is specified with the --conf option, the default runner configuration file analytics/config/analytics_runner_default.json is used.

Parameters

Parameter | Type | Required | Description
use_case | String | No | Name of the use case.
models_configs_dir | String | Yes | Directory with the model configuration files.
models_dst_dir | String | Yes | Directory to save the model database and expected results.
transformations | Object | Yes | Transformations to be executed on the data.
transformations.weather | String | Yes | Transformation configuration file to build weather features.
transformations.other | List | Yes | List of transformation configuration files to extract features from the data that are specific to the use case.

Some options, when init is specified, look for paths with the _init.json suffix. Transformation configuration files with the _init.json suffix must be provided in the same directory as the original configuration file. Note that this path is resolved automatically and should not be provided in the runner configuration file.

Example: ./usecase/example/transformations/categories_transformation.json as the original configuration file and ./usecase/example/transformations/categories_transformation_init.json for the initial fit.

Initialization transformation configuration files are usually used only once, to extract features from the historical data and to fit the initial model. Afterwards, the transformation configuration file for recent data is used to fit on recent data and to make predictions.

Execution

node analytics/runner.js [<model_conf_paths>] [<options>]

Model configuration paths

The model_conf_paths argument gives paths to the model configuration files that are used in the pipeline module. It can be provided as multiple model configuration file paths or as a single path to a directory containing the model configuration files. If specified, these model configuration files are used instead of the ones in the conf.models_configs_dir directory. See the Configuration section.
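For illustration, both forms could look like this (the individual file names are hypothetical; the directory reuses the example use case from the workflow below):

node ./analytics/runner.js ./usecase/example/models/predict/model_a.json ./usecase/example/models/predict/model_b.json --predict
node ./analytics/runner.js ./usecase/example/models/predict/ --predict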

Options

All the following options can be passed to the script. They are divided into separate sections for clarity. Unless specified otherwise, the default value of each option is false and the matching sub-task is not executed.

Configuration file

--conf=<path/to/runner/config/file.json>

Path to the runner configuration file. Default: analytics_runner_default.json. See Configuration section.

Data preparation

--update-weather[=force]

Download weather data for a specific date. Weather data for a specific date are stored in a db-weather-<YYYY-MM-DD>.tsv file. See the weather_update.sh script and modify it as necessary.

If force is specified, existing TSVs will be overwritten.

Note: Option --date must be set or the default value is used.

Prerequisite: To get raw weather data with the weather_update.sh script, you must have the weather-data API installed and properly configured.
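For example, to (re)download weather data for a single, hypothetical date and overwrite any existing TSV:

node ./analytics/runner.js --update-weather=force --date=2021-03-15 --conf=./usecase/example/analytics_runner.json

This would produce (or overwrite) db-weather-2021-03-15.tsv.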

--upload-weather[=init]

Upload raw weather data from TSV files to QMinerDB and MariaDB. By default, execution uses the configuration files weather_store.json and weather_qminer.json. If init is specified, the configuration file weather_store_init.json is used and config.paths.weatherInitTsv in config.js is used as the source TSV file to initialize the raw weather QMinerDB.

Prerequisite: weather_store.json, weather_qminer.json, weather_store_init.json and the config.paths.weatherInitTsv TSV file.

--update-data[=init]

Download data, namely products and product states, from the shared MariaDB for the specific date. The loader module uses the loader configuration file prod_load.json with a modified source query that contains the --date value.

If init is specified, a new database is created and the existing one is overwritten. The loader module loads data using configurations in prod_load_init.json.

Note: If init is specified, option --date is ignored.

Prerequisite: prod_load.json and prod_load_init.json.

--prepare-data[=init]

Has the same effect as specifying the options --update-weather, --upload-weather and --update-data together.
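For example, to run all three data-preparation sub-tasks for a single, hypothetical date:

node ./analytics/runner.js --prepare-data --date=2021-03-15 --conf=./usecase/example/analytics_runner.json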

Transformations

--transform-weather[=init|new]

Extract weather features using the pipeline module. The path to the configuration file is given in the parameter transformations.weather in the provided runner configuration file. Default: weather_transformation.json.

If init is specified, the weather features are built using the path in the parameter transformations.weather with the suffix _init.json (e.g. ./weather_transformation_init.json).

If new is specified, the transformations.weather file is used as a template: the start_date parameter in the configuration is set to the --date option and end_date to 7 days after.

Note: Option --date is optional. However, if not given, yesterday's date is used.
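For illustration, with --date=2021-03-08 (a hypothetical date), the new mode would produce a configuration whose date parameters look roughly like this; all other parameters are copied from the template unchanged, and the exact field layout is an assumption:

{
    "start_date": "2021-03-08",
    "end_date": "2021-03-15"
}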

--transform-other[=init]

Extract all other features, commonly specific to the use-case. Paths to configuration files are given in the list transformations.other in the provided runner configuration file.

--transform[=init]

Has the same effect as specifying the options --transform-other and --transform-weather together.

Modes

--upload[=false|true]

Upload predictions and raw weather data to the shared database.

--fit[=init]

Fit models to queried data.

To extract the queried input data, input_extraction.params.search_query in the model configuration file is used. However, search_query.Timestamp is set to --date and consequently only the records for the given date are selected.

If init is specified and the parameter input_extraction.params.init_use_timestamp in the model configuration file is set to false or is not present, the parameter search_query.Timestamp is removed from the model configuration file. All records are queried without Timestamp constraints.

If init is specified and input_extraction.params.init_use_timestamp is set to true, the search_query is used untouched.
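A minimal sketch of the relevant fragment of a model configuration file (the store name, query layout and date are hypothetical; only the parameters discussed above are shown):

{
    "input_extraction": {
        "params": {
            "init_use_timestamp": false,
            "search_query": {
                "$from": "FeaturesStore",
                "Timestamp": "2021-03-15"
            }
        }
    }
}

With --fit=init and init_use_timestamp set to false (or absent), the Timestamp constraint above would be dropped and all records queried; with plain --fit, Timestamp would be overwritten with the --date value.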

--predict[=init|conf]

Make predictions for each model on queried input data.

To extract the queried input data, input_extraction.params.search_query in the model configuration file is used. However, search_query.Timestamp is set to --date and consequently only the records for the given date are selected.

If init is specified, the parameter search_query is removed from the pipeline configuration file. All records are queried from that store.

If conf is specified, the search_query is used untouched.

-d <specific_date>
--date=<specific_date>

Sets the specific date used to download new products, update weather, transform weather, fit and predict.

Note: Option --date is optional. However, if not given, yesterday's date is used.
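Putting the mode options together, predictions for a single, hypothetical date could be made and uploaded with:

node ./analytics/runner.js --predict --date=2021-03-15 --upload --conf=./usecase/example/analytics_runner.json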

Miscellaneous

-v [false|true]
--verbose[=false|true]

Verbose logging. Default: true.

-r [false|true]
--report[=false|true]

Show a short report at the end of the script. Default: the value of the --verbose option.

--clear-logs

Remove all logs from previous executions of the script.

--prepare-models[=use|upload|update|local-update]

Create or update model configuration files in the shared database and the local QMinerDB using locally stored model configuration files. The script searches for model configuration files in the directory given as the parameter conf.models_configs_dir in the runner configuration file.

If use is specified, the models in the local QMinerDB are used.

If upload is specified, upload model configuration files to the shared database and the local QMinerDB. If one exists, skip it.

If update is specified, update model configuration files in the shared database and the local QMinerDB.

If local-update is specified, update model configuration files in the local QMinerDB.

Note: The parameter models_configs_dir set in the runner configuration file must point to a directory with at least one model configuration file, or the model_conf_paths argument must be provided.

Workflow example

1. Prerequisite

To run this example you need:

1.1. Loader configuration files

See usecase/common/loader directory with preconfigured configuration files:

  • weather_store.json — stores weather data for a specific date from a TSV file to the shared database.
  • weather_store_init.json — stores historical weather data from a TSV file to the shared database.
  • weather_qminer.json — stores weather data for a specific date from a TSV file to raw weather data QMinerDB.
  • model_store_dupl.json — stores model configuration files to the shared database.
  • model_update.json — updates model configuration files in the shared database.
  • model_load.json — loads models configurations from shared MariaDB to QMinerDB.
  • pred_store.json — stores predictions to the shared database.
  • prod_load_init.json — downloads products and product states from the shared database.
  • prod_load.json — downloads products and product states for a specific date from the shared database.

Note: Fill in database credentials.

1.2. Raw historical data

QMinerDB with historical data. Default: ../../data/dbs/ceBbDb/.

You can modify the input_db_history parameter in categories_transformation.json and categories_transformation_init.json.

1.3. Raw weather database

QMinerDB with raw weather data. Default: ../../data/dbs/weatherDb/.

You can modify the input_db parameter in weather_transformation.json and weather_transformation_init.json.

1.4. Weather-data API

If you want to update the weather database and weather features, you need to install the weather-data API, or at least have TSV files (db-weather-<YYYY-MM-DD>.tsv) with the raw weather data you are interested in.

2. Weather preparation

Before fitting models and making predictions, you need raw weather data and weather features for all the dates you are interested in. The default location of the raw weather QMinerDB is ../../data/dbs/weatherDb/ and the location of the weather features QMinerDB is ../../data/common/features/weatherFeaturesDb/.

In case the weather databases are not initialized, execute:

node ./analytics/runner.js --upload-weather=init --transform-weather=init --conf=./usecase/example/analytics_runner.json

Obtaining weather data and extracting weather features are time-consuming operations. To avoid repeating them, use and update the latest copies of the existing databases (the raw weather QMinerDB and the weather features QMinerDB).

To update the weather database with new data, execute:

node ./analytics/runner.js --date=<new_date> --upload-weather --transform-weather=new --conf=./usecase/example/analytics_runner.json

If you want to update weather data over a date interval, use analytics_runner.sh. See the section Using analytics_runner.sh script below.

3. Initial transformations and model fit

node ./analytics/runner.js ./usecase/example/models/init/ --transform-other=init --fit=init --prepare-models=upload --conf=./usecase/example/analytics_runner.json

This command runs transformations on the historical data using the categories_transformation_init.json transformation configuration file and creates all the models described in usecase/example/models/init/. The directory models_configs_dir given in the runner configuration file is ignored. Finally, all models are fitted using the historical data and their model configuration files.

The models are also uploaded to the shared database. If you want to leave out uploading the model configuration files, provide --prepare-models=local-update instead.

To make it easier, one can divide the command into two separate commands: one for executing the transformations and one for the initial fit of the models, as sketched below.
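For instance, the split could look like this (same paths and options as in the combined command above):

node ./analytics/runner.js --transform-other=init --conf=./usecase/example/analytics_runner.json
node ./analytics/runner.js ./usecase/example/models/init/ --fit=init --prepare-models=upload --conf=./usecase/example/analytics_runner.json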

Note: Transformations are performed once and are used for all models in the use-case.

4. Download recent data

node ./analytics/runner.js --update-data=init --conf=./usecase/example/analytics_runner.json

This command downloads all records from the products and product_states tables in the shared database. Currently, this command is use-case specific and tied to these tables in the shared database.

Note: You need to provide prod_load_init.json. See loader configuration files.

5. Transformations of recent data

node ./analytics/runner.js --transform-other --conf=./usecase/example/analytics_runner.json 

This extracts all features from the newly obtained data.

6. Upload/update model configurations

node ./analytics/runner.js ./usecase/example/models/predict/ --prepare-models=upload --conf=./usecase/example/analytics_runner.json

Upload model configuration files to the shared database. If one exists, skip it.

node ./analytics/runner.js ./usecase/example/models/predict/ --prepare-models=update --conf=./usecase/example/analytics_runner.json

Update model configuration files in the shared database. If one exists, replace it.

If the model configuration file for prediction differs from the one used for fitting the model, it is necessary to update the model configuration file in the database. The cases usually differ in the input_extraction parameter.

7. Predictions

Make predictions on the new dataset and upload them to the shared database by specifying the --upload option. If we want to predict using the input_extraction.params.search_query in the model configuration file and predict on all extracted records at once, execute:

node ./analytics/runner.js --prepare-models=update --predict=conf --conf=./usecase/example/analytics_runner.json

This queries all records using the input_extraction search query from the model configuration file. Note that the parameter input_extraction.params.search_query must be set correctly in the pipeline configuration file, especially Timestamp.

If we would like to predict on a date interval, we can use the analytics_runner.sh script:

bash ./scripts/analytics_runner.sh --interval <from> <to> --prepare-data --transform-other --predict

The script extracts inputs for each date separately and predicts on each set. The search query in the model configuration file is ignored.

Using analytics_runner.sh script

The analytics_runner.sh script wraps commands similar to those used in the previous example. It is useful when runner.js needs to be run over a date interval.

For example, if we would like to update the raw weather QMinerDB and calculate new weather transformations between dates <from> and <to>, we can run:

bash ./scripts/analytics_runner.sh --interval <from> <to> --upload-weather --transform-weather=new

or, for short:

bash ./scripts/analytics_runner.sh --interval <from> <to> --weather

For more shortcuts see analytics_runner.sh.

The script runs runner.js separately for each date between <from> and <to>, inclusive, with the --upload-weather --transform-weather=new parameters, and sets the --date parameter accordingly.

Similarly, we can use the analytics_runner.sh script to run other supported commands that use the --date parameter.
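For instance, to prepare data, extract features and make and upload predictions for every day of a hypothetical week (assuming the wrapper forwards the --upload option to runner.js), one might run:

bash ./scripts/analytics_runner.sh --interval 2021-03-08 2021-03-14 --prepare-data --transform-other --predict --upload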