https://www.datascience-contest.com
The Korea National Oil Corporation was interested in purchasing shale gas wells in the United States and wanted to predict their production in order to select the wells that maximize profit.
A combination of LightGBM regression and exponential smoothing is used to predict production, and 0-1 integer programming with Gurobi is used to select the wells that maximize profit. Performance is evaluated with sMAPE (symmetric Mean Absolute Percentage Error). Our team achieved one of the best results, with a percentage error of 25.54% against the winning score of 19.49%.
Unfortunately, the training and exam datasets are confidential, so they are not included in this repository.
- trainSet.csv - Data of 280 shale gas wells for training models
- examSet.csv - Data of 44 shale gas wells for prediction
The task is to predict the monthly average gas production of the 44 shale gas wells in examSet.csv for the next 6 months.
Performance evaluation is based on sMAPE (symmetric Mean Absolute Percentage Error):

sMAPE = (100 / n) * Σ |Fi - Ai| / ((|Ai| + |Fi|) / 2), summed over i = 1, …, n

where:
- Fi - predicted monthly average gas production of the ith gas well over the next 6 months
- Ai - actual monthly average gas production of the ith gas well over the next 6 months
- n - number of gas wells (44 in this problem)
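For reference, a minimal NumPy sketch of this metric, assuming 1-D arrays of predicted and actual values:

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric MAPE in percent for 1-D arrays of equal length."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100 / len(actual) * np.sum(
        np.abs(forecast - actual) / ((np.abs(actual) + np.abs(forecast)) / 2)
    )

# e.g. smape([100, 200], [110, 190]) ≈ 7.33
```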
A budget of $15,000,000 is allocated. The task is to select gas wells among the 44 wells to maximize profit after predicting their monthly average gas production (a sketch of the optimization follows the variable list below):
- Ai - actual monthly average gas production of the ith gas well over the next 6 months
- Pi - price of the ith gas well
- Ps - shale gas price ($5 per 1 Mcf)
- Ci - monthly operation cost of the ith gas well
- Xi - decision variable to purchase the ith gas well (Xi = 1 if the ith gas well is purchased, else Xi = 0)
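A minimal gurobipy sketch of the selection problem. The objective here is an assumption: profit is taken as 6 months of revenue minus 6 months of operation costs, net of the purchase price, with total purchase prices bounded by the budget; the exact model used is given later in this document.

```python
from gurobipy import Model, GRB, quicksum

def select_wells(A, P, C, Ps=5.0, budget=15_000_000, months=6):
    """A: predicted monthly average production (Mcf), P: well prices ($),
    C: monthly operation costs ($); sequences of equal length."""
    n = len(A)
    m = Model("well-selection")
    x = m.addVars(n, vtype=GRB.BINARY, name="x")  # Xi: buy well i or not
    # Assumed per-well profit: 6 months of revenue minus costs, net of price
    profit = [months * A[i] * Ps - months * C[i] - P[i] for i in range(n)]
    m.setObjective(quicksum(profit[i] * x[i] for i in range(n)), GRB.MAXIMIZE)
    # Budget constraint on total purchase price
    m.addConstr(quicksum(P[i] * x[i] for i in range(n)) <= budget, name="budget")
    m.optimize()
    return [i for i in range(n) if x[i].X > 0.5]  # indices of purchased wells
```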
The wells are divided into new wells and old wells. New wells have no data on gas production, non-gas production, or hours operated per month; this data is available for old wells.
Therefore, regression is used to predict the monthly average production of the new wells for their first 6 months, while exponential smoothing is used to forecast the monthly average production of the old wells over the next 6 months.
After EDA (Exploratory Data Analysis) and feature engineering, the following tree-based ensemble regression models are tested (see the sketch after this list):

- BaggingRegressor (n_estimators=50)
- RandomForestRegressor (n_estimators=50)
- XGBRegressor (max_depth=5, objective='reg:squarederror')
- LGBMRegressor
- VotingRegressor (estimators=[bagging, random_forest, xgb, lgbm], n_jobs=-1)

Validation split: train_test_split(test_size=0.2, random_state=42)
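A minimal sketch of the comparison, with synthetic data standing in for the confidential features (the real pipeline builds X and y from the engineered well features):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, VotingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# Synthetic stand-in for the confidential well features and production target
X, y = make_regression(n_samples=280, n_features=10, noise=10.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

bagging = BaggingRegressor(n_estimators=50)
random_forest = RandomForestRegressor(n_estimators=50)
xgb = XGBRegressor(max_depth=5, objective='reg:squarederror')
lgbm = LGBMRegressor()
voting = VotingRegressor(
    estimators=[('bagging', bagging), ('rf', random_forest),
                ('xgb', xgb), ('lgbm', lgbm)],
    n_jobs=-1,
)

for name, model in [('Bagging', bagging), ('RandomForest', random_forest),
                    ('XGB', xgb), ('LGBM', lgbm), ('Voting', voting)]:
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    # sMAPE as defined above
    score = 100 / len(y_val) * np.sum(
        np.abs(pred - y_val) / ((np.abs(y_val) + np.abs(pred)) / 2)
    )
    print(f'{name}: sMAPE = {score:.2f}%')
```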
LGBMRegressor turns out to be the best-performing model, with the minimum sMAPE.
LGBMRegressor hyperparameters after tuning with Ray Tune's grid search algorithm:

- boosting_type='gbdt'
- learning_rate=0.1
- max_bin=250
- max_depth=-1
- min_data_in_leaf=20
- num_iterations=100
- num_leaves=20

Training is GPU-accelerated; a sketch of the tuning loop follows.
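This is a minimal sketch assuming Ray Tune's legacy tune.run/tune.report API, reusing the split from the comparison sketch above; the search space shown is illustrative, not the one actually searched:

```python
import numpy as np
from ray import tune
from lightgbm import LGBMRegressor

def train_lgbm(config):
    # X_train, X_val, y_train, y_val: the split from the comparison sketch
    model = LGBMRegressor(
        boosting_type='gbdt',
        max_depth=-1,
        # device='gpu' would enable GPU training with a GPU build of LightGBM
        **config,
    )
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    smape = 100 / len(y_val) * np.sum(
        np.abs(pred - y_val) / ((np.abs(y_val) + np.abs(pred)) / 2)
    )
    tune.report(smape=smape)  # report the validation metric to Ray Tune

analysis = tune.run(
    train_lgbm,
    config={
        'learning_rate': tune.grid_search([0.05, 0.1, 0.2]),
        'max_bin': tune.grid_search([200, 250, 300]),
        'num_leaves': tune.grid_search([20, 31, 50]),
    },
)
print(analysis.get_best_config(metric='smape', mode='min'))
```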
The following exponential smoothing models are tested:

- SimpleExpSmoothing
  - smoothing_level=0.2
  - smoothing_level=0.6
  - optimized smoothing level
- Holt
  - additive model
  - multiplicative model
  - damped additive model
  - damped multiplicative model
- ExponentialSmoothing (use_boxcox=True)
  - additive model
  - damped additive model
For each well, the model with the minimum SSE (Sum of Squared Errors) on that well's history is selected, so different wells may be forecast by different models (see the sketch below).
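A minimal sketch of the per-well selection, assuming the statsmodels implementations of the candidates above and a positive-valued pandas Series of monthly production per well (parameter names follow recent statsmodels versions, where use_boxcox and damped_trend are constructor arguments):

```python
from statsmodels.tsa.api import SimpleExpSmoothing, Holt, ExponentialSmoothing

def forecast_well(series, horizon=6):
    """Fit every candidate on one well's history, pick the minimum-SSE fit,
    and forecast the next `horizon` months."""
    fits = [
        SimpleExpSmoothing(series).fit(smoothing_level=0.2, optimized=False),
        SimpleExpSmoothing(series).fit(smoothing_level=0.6, optimized=False),
        SimpleExpSmoothing(series).fit(),                         # optimized level
        Holt(series).fit(),                                       # additive
        Holt(series, exponential=True).fit(),                     # multiplicative
        Holt(series, damped_trend=True).fit(),                    # damped additive
        Holt(series, exponential=True, damped_trend=True).fit(),  # damped multiplicative
        ExponentialSmoothing(series, trend='add', use_boxcox=True).fit(),
        ExponentialSmoothing(series, trend='add', damped_trend=True,
                             use_boxcox=True).fit(),
    ]
    best = min(fits, key=lambda res: res.sse)  # per-well selection by SSE
    return best.forecast(horizon)
```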
The following 0-1 integer programming model is used: