This repository contains the 1st place solution for the web CTR prediction competition.
- CPU: i7-11799K (8 cores)
- RAM: 32GB
- GPU: NVIDIA GeForce RTX 3090 Ti
In addition to the requirements provided by the competition, hydra-core==1.3.0 is required. For PyTorch, see https://pytorch.org/get-started/previous-versions/ and reinstall the version that matches your environment.
You can create an environment with all required libraries by running:
$ conda env create --file environment.yaml
Model training and prediction are run with the following shell commands:
$ python -m scripts.covert_to_parquet
$ sh scripts/shell/sampling_dataset.sh
$ sh scripts/shell/lgb_experiment.sh
$ sh scripts/shell/cb_experiment.sh
$ sh scripts/shell/xdeepfm_experiment.sh
$ sh scripts/shell/fibinet_experiment.sh
$ python -m scripts.ensemble
For example, an experiment script looks like this:
MODEL_NAME="lightgbm"
SAMPLING=0.45
for seed in 517 1119
do
python -m scripts.train \
data.train=train_sample_${SAMPLING}_seed${seed} \
models=${MODEL_NAME} \
models.results=5fold-ctr-${MODEL_NAME}-${SAMPLING}-seed${seed}
python -m scripts.predict \
models=${MODEL_NAME} \
models.results=5fold-ctr-${MODEL_NAME}-${SAMPLING}-seed${seed} \
output.name=5fold-ctr-${MODEL_NAME}-${SAMPLING}-seed${seed}
done
Simple is better than complex
Negative sampling is very important in recommendation systems, and it is especially effective when the full dataset is too large to train on. In my experiments, I used seeds 414 and 602 for a 40% negative sample, and seeds 517 and 1119 for a 45% negative sample.
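A minimal sketch of this kind of negative sampling, assuming a binary label column named `clicked` (the actual column name in the competition data may differ):

```python
import pandas as pd


def negative_sampling(df: pd.DataFrame, rate: float, seed: int) -> pd.DataFrame:
    """Keep every positive row and a random `rate` fraction of the negative rows."""
    positives = df[df["clicked"] == 1]                       # "clicked" is an assumed label column
    negatives = df[df["clicked"] == 0].sample(frac=rate, random_state=seed)
    return (
        pd.concat([positives, negatives])
        .sample(frac=1, random_state=seed)                   # shuffle the combined sample
        .reset_index(drop=True)
    )


# e.g. the 45% negative sample with seed 517 mentioned above
# train_sample_045_seed517 = negative_sampling(train, rate=0.45, seed=517)
```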
I label-encoded each categorical feature, referring to the kaggler implementation, and trained on the encoded values.
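A hand-rolled sketch of that label encoding (not the kaggler library itself; `cat_features` is a hypothetical list of categorical column names):

```python
import pandas as pd


def label_encode(train: pd.DataFrame, test: pd.DataFrame, cat_features: list[str]) -> None:
    """Map each category seen in train to an integer; unseen test categories become -1."""
    for col in cat_features:
        mapping = {cat: idx for idx, cat in enumerate(train[col].dropna().unique())}
        train[col] = train[col].map(mapping).fillna(-1).astype("int32")
        test[col] = test[col].map(mapping).fillna(-1).astype("int32")
```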
I also encoded how frequently each category occurs and trained on these count features.
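A sketch of such count (frequency) encoding under the same assumptions:

```python
import pandas as pd


def add_count_features(train: pd.DataFrame, test: pd.DataFrame, cat_features: list[str]) -> None:
    """For each categorical column, add how often each category appears in the training data."""
    for col in cat_features:
        counts = train[col].value_counts()
        train[f"{col}_count"] = train[col].map(counts).fillna(0).astype("int32")
        test[f"{col}_count"] = test[col].map(counts).fillna(0).astype("int32")
```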
Gauss Rank ranks the values of each numerical feature and maps the ranks onto a normal distribution. Normalizing the numerical features this way improved model performance; in my experiments it outperformed MinMaxScaler.
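A minimal sketch of the Gauss Rank transform (one possible implementation, not necessarily the exact one used here); scikit-learn's `QuantileTransformer(output_distribution="normal")` achieves a similar effect:

```python
import numpy as np
from scipy.special import erfinv


def gauss_rank(x: np.ndarray, epsilon: float = 1e-6) -> np.ndarray:
    """Rank the values, rescale the ranks to (-1, 1), and map them through erfinv."""
    rank = np.argsort(np.argsort(x))                       # dense ranks 0 .. n-1
    scaled = 2.0 * rank / (len(x) - 1) - 1.0               # spread ranks over [-1, 1]
    scaled = np.clip(scaled, -1 + epsilon, 1 - epsilon)    # keep erfinv finite at the ends
    return erfinv(scaled)
```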
Considering the characteristics of tabular data, I devised a strategy of training GBDT models and NN models and then ensembling them.
- LightGBM
  - With count features
  - StratifiedKFold: 5
- CatBoost
  - Use GPU
  - cat_features parameter not used
  - With count features
  - StratifiedKFold: 5
- xDeepFM
  - With Gauss Rank
  - StratifiedKFold: 5
- FiBiNET
  - With Gauss Rank
  - StratifiedKFold: 5
  - Long training and inference time
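All four models share the same 5-fold StratifiedKFold setup. A minimal sketch of such a fold loop, shown with LightGBM and assuming AUC as the validation metric (`X`, `y`, `X_test`, and `params` are placeholders, not the repository's actual `scripts.train` code):

```python
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def train_5fold(X: pd.DataFrame, y: pd.Series, X_test: pd.DataFrame, params: dict, seed: int = 517):
    """Train one model per fold; return out-of-fold and fold-averaged test predictions."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    oof = np.zeros(len(X))
    test_preds = np.zeros(len(X_test))
    for trn_idx, val_idx in skf.split(X, y):
        model = lgb.LGBMClassifier(**params)
        model.fit(X.iloc[trn_idx], y.iloc[trn_idx],
                  eval_set=[(X.iloc[val_idx], y.iloc[val_idx])])
        oof[val_idx] = model.predict_proba(X.iloc[val_idx])[:, 1]
        test_preds += model.predict_proba(X_test)[:, 1] / skf.n_splits
    print(f"CV AUC: {roc_auc_score(y, oof):.4f}")
    return oof, test_preds
```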
I used the concept of log-odds from logistic regression to construct the ensemble:
- It performed better than the other ensembles I tried (rank, voting).
- Since each model's predictions are probabilities, I mapped them through the logit function, averaged them in log-odds space, and mapped the average back with the inverse (sigmoid) to form the ensemble (a sketch follows the result tables below).
- Each model's result

Model | CV | Public LB | Private LB |
---|---|---|---|
LightGBM-0.45 sampling | 0.7850 | 0.7863 | 0.7866 |
FiBiNET-0.45 sampling | 0.7833 | 0.7861 | 0.7862 |
xDeepFM-0.45 sampling | 0.7819 | 0.7866 | 0.7867 |
wide&deep-0.45 sampling | 0.7807 | 0.7835 | 0.7837 |
AutoInt-0.45 sampling | 0.7813 | 0.7846 | 0.7848 |
CatBoost-0.45 sampling | 0.7765 | 0.7773 | 0.7778 |
- Ensemble result

Method | Public LB | Private LB |
---|---|---|
Rank Ensemble | 0.7889 | - |
Average Ensemble | 0.7892 | - |
Weighted average Ensemble | 0.7891 | - |
Sigmoid Ensemble | 0.7903 | 0.7905 |
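A minimal sketch of that logit-space (sigmoid) ensemble, assuming each model's submission is an array of predicted click probabilities:

```python
import numpy as np
from scipy.special import expit, logit


def sigmoid_ensemble(predictions: list[np.ndarray], epsilon: float = 1e-7) -> np.ndarray:
    """Average the predicted probabilities in log-odds space, then map back with the sigmoid."""
    clipped = [np.clip(p, epsilon, 1 - epsilon) for p in predictions]  # keep logit finite
    return expit(np.mean([logit(p) for p in clipped], axis=0))


# e.g. blended = sigmoid_ensemble([lgb_pred, cb_pred, xdeepfm_pred, fibinet_pred])
```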
- Day-based cross-validation
- Day feature
- CatBoost with the cat_features parameter
- XGBoost with GPU
- Hash features: requires more RAM
- DeepFM
- LightGBM DART
- LightGBM: A Highly Efficient Gradient Boosting Decision Tree
- Wide & Deep Learning for Recommender Systems
- FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction
- xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems
- CatBoost: a high-performance open-source library for gradient boosting on decision trees
- Efficient Click-Through Rate Prediction for Developing Countries via Tabular Learning
- Label Encoder
- Gauss Rank
- Sigmoid Ensemble