Wrong predictions with quantile distribution #149

Open
khotilov opened this issue Dec 16, 2017 · 0 comments

I tried to switch from gbm to gbm3 for quantile regression, but gbm3 produced incorrect quantile predictions. The example below reproduces the problem using the attached toy data (X.zip).

library(data.table)

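# Return the first tree of a fitted model as a data frame, for either package.
get_tree <- function(g) {
  # gbm fits have class "gbm"; anything else is assumed to be a gbm3 fit
  t <- if (inherits(g, "gbm")) gbm::pretty.gbm.tree(g)
       else gbm3::pretty_gbm_tree(g)
  t$Node <- as.integer(rownames(t))
  # drop the rows for missing-value nodes (MissingNode is a 0-based node index)
  mis <- t$MissingNode + 1
  t <- t[-mis,]
  t$MissingNode <- NULL
  # tree predictions are relative to the initial estimate initF
  t$RealPrediction <- t$Prediction + g$initF
  t
}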

X <- readRDS('X.rds')
str(X)

params <- list(y ~ ., data=X, distribution = list(name="quantile", alpha=0.9), n.trees = 1,
               interaction.depth = 2, n.minobsinnode = 250, shrinkage = 1, bag.fraction = 1)
g0 <- do.call(gbm::gbm, params)
g3 <- do.call(gbm3::gbm, params)

get_tree(g0)
get_tree(g3)

# overall 90% quantile:
quantile(X$y, 0.9, type = 2)
# true 90% quantiles inside the splits:
X[, quantile(y, 0.9, type = 2), .(s1 = a<14.5, s2 = a>=14.5 & b < 0.133)]

# Check the empirical CDF's in the 4th node for g0 and g3:
X[a>=14.5 & b >= 0.133, ecdf(y)(c(1.716003, 1.529294))]

The output of it is

> get_tree(g0)
  SplitVar SplitCodePred LeftNode RightNode ErrorReduction Weight  Prediction Node RealPrediction
0        0    14.5000000        1         2      0.6721752   1567  0.03466136    0       1.984051
1       -1     0.1333954       -1        -1      0.0000000    854  0.13339536    1       2.082785
2        1     0.5000000        3         4      1.4643795    713 -0.08359787    2       1.865792
3       -1     0.1578200       -1        -1      0.0000000    273  0.15781996    3       2.107210
4       -1    -0.2333867       -1        -1      0.0000000    440 -0.23338666    4       1.716003
> get_tree(g3)
  SplitVar SplitCodePred LeftNode RightNode ErrorReduction Weight   Prediction Node RealPrediction
0        0    14.5000000        1         2      0.6721752   1567  0.009314278    0       1.958704
1       -1     0.1333954       -1        -1      0.0000000    854  0.133395364    1       2.082785
2        1     0.5000000        3         4      1.4643795    713 -0.112242207    2       1.837148
3       -1     0.1578200       -1        -1      0.0000000    273  0.157819963    3       2.107210
4       -1    -0.4200960       -1        -1      0.0000000    440 -0.420095993    4       1.529294
> 
> # overall 90% quantile:
> quantile(X$y, 0.9, type = 2)
    90% 
1.94939 
> # true 90% quantiles inside the splits:
> X[, quantile(y, 0.9, type = 2), .(s1 = a<14.5, s2 = a>=14.5 & b < 0.133)]
      s1    s2       V1
1: FALSE FALSE 1.716003
2: FALSE  TRUE 2.107210
3:  TRUE FALSE 2.082785
> 
> # Check the empirical CDF's in the 4th node for g0 and g3:
> X[a>=14.5 & b >= 0.133, ecdf(y)(c(1.716003, 1.529294))]
[1] 0.8909091 0.5727273

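The true 90% quantiles inside the splits match the leaf predictions from g0 exactly. The node 4 leaf in g3, however, is wrong: its prediction of 1.529294 sits at an empirical CDF value of about 0.57 within that node, so it corresponds to roughly the 57% quantile rather than the 90% one.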
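For readers without the attached data, the ECDF check used above can be illustrated on synthetic data. This is only a sketch under stated assumptions: `y` is a simulated stand-in for the 440 responses in node 4 (not the values from X.rds), and `pinball_loss` is a hypothetical helper written for this illustration, not a function from gbm or gbm3. It shows that a constant leaf value at the empirical 90% quantile has an ECDF value of about 0.9 and a lower quantile (pinball) loss than a value near the 57% quantile.

```r
# Hypothetical helper: mean pinball (quantile) loss of a constant prediction q.
pinball_loss <- function(y, q, alpha) {
  r <- y - q
  mean(ifelse(r >= 0, alpha * r, (alpha - 1) * r))
}

set.seed(1)
y <- rnorm(440)   # synthetic stand-in for one leaf's responses (assumption)
alpha <- 0.9

# Candidate leaf values: the empirical 90% quantile and a value near the 57% quantile.
q_true <- unname(quantile(y, alpha, type = 2))
q_low  <- unname(quantile(y, 0.57, type = 2))

# ECDF level of each candidate within the leaf.
ecdf(y)(c(q_true, q_low))

# The 90% quantile should give the smaller pinball loss at alpha = 0.9.
c(loss_true = pinball_loss(y, q_true, alpha),
  loss_low  = pinball_loss(y, q_low, alpha))
```

Applied to a real fit, the same comparison could be run per leaf by replacing `y` with the responses that fall into that leaf and `q_low` with the leaf's reported RealPrediction.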
