This repository contains the 1st place solution for the web CTR prediction competition.
- CPU: i7-11799K (8 cores)
- RAM: 32GB
- GPU: NVIDIA GeForce RTX 3090 Ti
In addition to the requirements provided by the competition, hydra-core==1.3.0 is required. For PyTorch, see https://pytorch.org/get-started/previous-versions/ and reinstall the version that matches your environment.
You can create an environment with all required libraries by running:
$ conda env create --file environment.yaml
Model training and prediction are run with the following shell commands:
$ python -m scripts.covert_to_parquet
$ sh scripts/shell/sampling_dataset.sh
$ sh scripts/shell/lgb_experiment.sh
$ sh scripts/shell/cb_experiment.sh
$ sh scripts/shell/xdeepfm_experiment.sh
$ sh scripts/shell/fibinet_experiment.sh
$ python -m scripts.ensemble
For example, an experiment script looks like this:
MODEL_NAME="lightgbm"
SAMPLING=0.45
for seed in 517 1119
do
python -m scripts.train \
data.train=train_sample_${SAMPLING}_seed${seed} \
models=${MODEL_NAME} \
models.results=5fold-ctr-${MODEL_NAME}-${SAMPLING}-seed${seed}
python -m scripts.predict \
models=${MODEL_NAME} \
models.results=5fold-ctr-${MODEL_NAME}-${SAMPLING}-seed${seed} \
output.name=5fold-ctr-${MODEL_NAME}-${SAMPLING}-seed${seed}
done
Simple is better than complex
Negative sampling is very important in recommendation systems, and it is especially effective when the full dataset is too large to train on. In my experiments, I used seeds 414 and 602 for a 40% negative sample, and seeds 517 and 1119 for a 45% negative sample.
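A minimal sketch of this kind of negative sampling, assuming a binary label column named `clicked` (the actual column name in the competition data may differ):

```python
import pandas as pd


def negative_sampling(df: pd.DataFrame, rate: float, seed: int) -> pd.DataFrame:
    """Keep every positive row and a random `rate` fraction of the negative rows."""
    positives = df[df["clicked"] == 1]                       # "clicked" is an assumed label column
    negatives = df[df["clicked"] == 0].sample(frac=rate, random_state=seed)
    return (
        pd.concat([positives, negatives])
        .sample(frac=1, random_state=seed)                   # shuffle the combined sample
        .reset_index(drop=True)
    )


# e.g. the 45% negative sample with seed 517 mentioned above
# train_sample_045_seed517 = negative_sampling(train, rate=0.45, seed=517)
```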
I label-encoded each categorical feature, referring to the kaggler implementation, and trained on the encoded values.
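A hand-rolled sketch of that label encoding (not the kaggler library itself; `cat_features` is a hypothetical list of categorical column names):

```python
import pandas as pd


def label_encode(train: pd.DataFrame, test: pd.DataFrame, cat_features: list[str]) -> None:
    """Map each category seen in train to an integer; unseen test categories become -1."""
    for col in cat_features:
        mapping = {cat: idx for idx, cat in enumerate(train[col].dropna().unique())}
        train[col] = train[col].map(mapping).fillna(-1).astype("int32")
        test[col] = test[col].map(mapping).fillna(-1).astype("int32")
```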
I also encoded how frequently each category occurs and trained on these count features.
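A sketch of such count (frequency) encoding under the same assumptions:

```python
import pandas as pd


def add_count_features(train: pd.DataFrame, test: pd.DataFrame, cat_features: list[str]) -> None:
    """For each categorical column, add how often each category appears in the training data."""
    for col in cat_features:
        counts = train[col].value_counts()
        train[f"{col}_count"] = train[col].map(counts).fillna(0).astype("int32")
        test[f"{col}_count"] = test[col].map(counts).fillna(0).astype("int32")
```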
Gauss Rank ranks the values of each numerical feature and maps the ranks onto a normal distribution. Normalizing the numerical features this way improved model performance; in my experiments it outperformed MinMaxScaler.
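A minimal sketch of the Gauss Rank transform (one possible implementation, not necessarily the exact one used here); scikit-learn's `QuantileTransformer(output_distribution="normal")` achieves a similar effect:

```python
import numpy as np
from scipy.special import erfinv


def gauss_rank(x: np.ndarray, epsilon: float = 1e-6) -> np.ndarray:
    """Rank the values, rescale the ranks to (-1, 1), and map them through erfinv."""
    rank = np.argsort(np.argsort(x))                       # dense ranks 0 .. n-1
    scaled = 2.0 * rank / (len(x) - 1) - 1.0               # spread ranks over [-1, 1]
    scaled = np.clip(scaled, -1 + epsilon, 1 - epsilon)    # keep erfinv finite at the ends
    return erfinv(scaled)
```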
Considering the characteristics of tabular data, I devised a strategy of training GBDT models and NN models and then ensembling them.
- LightGBM
  - With count features
  - StratifiedKFold: 5
- CatBoost
  - Use GPU
  - cat_features parameter not used
  - With count features
  - StratifiedKFold: 5
- xDeepFM
  - With Gauss Rank
  - StratifiedKFold: 5
- FiBiNET
  - With Gauss Rank
  - StratifiedKFold: 5
  - Long training and inference time
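All four models share the same 5-fold StratifiedKFold setup. A minimal sketch of such a fold loop, shown with LightGBM and assuming AUC as the validation metric (`X`, `y`, `X_test`, and `params` are placeholders, not the repository's actual `scripts.train` code):

```python
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def train_5fold(X: pd.DataFrame, y: pd.Series, X_test: pd.DataFrame, params: dict, seed: int = 517):
    """Train one model per fold; return out-of-fold and fold-averaged test predictions."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    oof = np.zeros(len(X))
    test_preds = np.zeros(len(X_test))
    for trn_idx, val_idx in skf.split(X, y):
        model = lgb.LGBMClassifier(**params)
        model.fit(X.iloc[trn_idx], y.iloc[trn_idx],
                  eval_set=[(X.iloc[val_idx], y.iloc[val_idx])])
        oof[val_idx] = model.predict_proba(X.iloc[val_idx])[:, 1]
        test_preds += model.predict_proba(X_test)[:, 1] / skf.n_splits
    print(f"CV AUC: {roc_auc_score(y, oof):.4f}")
    return oof, test_preds
```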
I used the concept of log-odds from logistic regression to construct the ensemble:
- It performed better than the other ensembles I tried (rank, voting).
- Since each model's predictions are probabilities, I mapped them through the logit function, averaged them in log-odds space, and mapped the average back with the inverse (sigmoid) to form the ensemble (a sketch follows the result tables below).
- Each model's result

Model | CV | Public LB | Private LB |
---|---|---|---|
LightGBM-0.45 sampling | 0.7850 | 0.7863 | 0.7866 |
FiBiNET-0.45 sampling | 0.7833 | 0.7861 | 0.7862 |
xDeepFM-0.45 sampling | 0.7819 | 0.7866 | 0.7867 |
wide&deep-0.45 sampling | 0.7807 | 0.7835 | 0.7837 |
AutoInt-0.45 sampling | 0.7813 | 0.7846 | 0.7848 |
CatBoost-0.45 sampling | 0.7765 | 0.7773 | 0.7778 |
- Ensemble result

Method | Public LB | Private LB |
---|---|---|
Rank Ensemble | 0.7889 | - |
Average Ensemble | 0.7892 | - |
Weighted average Ensemble | 0.7891 | - |
Sigmoid Ensemble | 0.7903 | 0.7905 |
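A minimal sketch of that logit-space (sigmoid) ensemble, assuming each model's submission is an array of predicted click probabilities:

```python
import numpy as np
from scipy.special import expit, logit


def sigmoid_ensemble(predictions: list[np.ndarray], epsilon: float = 1e-7) -> np.ndarray:
    """Average the predicted probabilities in log-odds space, then map back with the sigmoid."""
    clipped = [np.clip(p, epsilon, 1 - epsilon) for p in predictions]  # keep logit finite
    return expit(np.mean([logit(p) for p in clipped], axis=0))


# e.g. blended = sigmoid_ensemble([lgb_pred, cb_pred, xdeepfm_pred, fibinet_pred])
```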
- Day-based cross-validation
- Day feature
- CatBoost with the cat_features parameter
- XGBoost with GPU
- Hash features: requires more RAM
- DeepFM
- LightGBM DART
- LightGBM: A Highly Efficient Gradient Boosting Decision Tree
- Wide & Deep Learning for Recommender Systems
- FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction
- xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems
- CatBoost: a high-performance open-source library for gradient boosting on decision trees
- Efficient Click-Through Rate Prediction for Developing Countries via Tabular Learning
- Label Encoder
- Gauss Rank
- Sigmoid Ensemble