h2o.shap_summary_plot feature data normalization issue with binary variables #16407

Open
laura-vangalen opened this issue Oct 3, 2024 · 2 comments

@laura-vangalen

H2O version, Operating System and Environment
Windows 10. R version 4.4.1. h2o R package version 3.44.0.3

Actual behavior
When I use h2o.shap_summary_plot to plot the output of a random forest model that has only numeric variables, the binary (0/1) numeric variables are not normalized to 0 and 1. They come out as roughly 0.5 and 1, so they are plotted as purple and pink rather than blue and pink. If I include a factor variable in the model, they are normalized to 0 and 1 and plotted as blue and pink. Numeric variables seem to be normalized differently depending on whether the model contains factor variables.

Expected behavior
I would expect the binary variables to be treated the same regardless of what other variables are in the model.

Steps to reproduce

library(h2o)
h2o.init()
example <- data.frame(
  NumericVar = rnorm(100, mean = 50, sd = 10), # Numeric variable (normal distribution)
  BinaryVar = sample(c(0, 1), 100, replace = TRUE), # Binary variable (0, 1)
  BinaryVar2 = sample(c(0, 1), 100, replace = TRUE) # Binary variable (0, 1)
)
example$CorrelatedVar = example$NumericVar * 0.8 + rnorm(100, mean = 0, sd = 5)  # Add response variable

# run and plot model that contains numeric values only. The binary numeric variables don't get normalized to 0 and 1.
regressionMatrix <- as.h2o(example)
rfModel <- h2o.randomForest(training_frame = regressionMatrix,
                            y = "CorrelatedVar",
                            ntrees = 500,
                            mtries = 3,
                            sample_rate = 0.632,
                            min_rows = 2,
                            seed = 42,
                            max_depth = 20)
p1=h2o.shap_summary_plot(
  model = rfModel,
  newdata = regressionMatrix
)
p1

# change one variable to a factor, then run and plot the model. The remaining binary numeric variable does get normalized to 0 and 1
example$BinaryVar2=as.factor(example$BinaryVar2) # change one of the binary variables to a factor

regressionMatrix <- as.h2o(example)
rfModel <- h2o.randomForest(training_frame = regressionMatrix,
                            y = "CorrelatedVar",
                            ntrees = 500,
                            mtries = 3,
                            sample_rate = 0.632,
                            min_rows = 2,
                            seed = 42,
                            max_depth = 20)
p2=h2o.shap_summary_plot(
  model = rfModel,
  newdata = regressionMatrix
)
p2

Screenshots
"p1" plot - only numeric variables in the model. Neither binary variable is normalized to 0 and 1; the values are closer to 0.5 and 1
image

"p2" plot - "BinaryVar2" has been changed to a factor. Now the remaining numeric binary variable is normalized to 0 and 1
image

Why is this happening? How can I get the plot to properly normalize binary variables to be 0 and 1 even when I don't have features that are factors?

@tomasfryda
Contributor

This looks like a bug. Thank you for reporting it!

Why is this happening?

We try to show the values of individual columns using one color scheme. To make the coloring more robust to outliers, we use quantiles of the points instead of their actual values. This should be relatively robust for continuous values (a single outlier won't squeeze all the other points into one color). Another advantage is that you can, to some extent, compare values across columns: the same quantile gets the same color regardless of the actual value.
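
To illustrate why a binary column lands near 0.5 and 1, here is a minimal base-R sketch of the quantile step, assuming it is an empirical CDF evaluated at each point (stats::ecdf). The column values are made up for illustration; nothing here calls h2o.

```r
# A binary column with 48 zeros and 52 ones (illustrative values)
binary_col <- c(rep(0, 48), rep(1, 52))

# Empirical CDF evaluated at each point: the share of values <= x
quantile_values <- stats::ecdf(binary_col)(binary_col)

# Zeros map to 0.48 (the share of zeros), ones map to 1
print(sort(unique(quantile_values)))
```

With a roughly balanced 0/1 column, the lower value therefore sits near the middle of the color scale instead of at its start.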

How can I get the plot to properly normalize binary variables to be 0 and 1 even when I don't have features that are factors?

I would suggest using factors as the models might benefit from the information that the column contains discrete values.
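
A rough sketch of that conversion on the R side, before the data frame is uploaded with as.h2o. The helper is_binary is a hypothetical name introduced here, and the 0/1 check is an assumption about what counts as a binary column in your data.

```r
set.seed(1)
example <- data.frame(
  NumericVar = rnorm(100, mean = 50, sd = 10),
  BinaryVar  = sample(c(0, 1), 100, replace = TRUE),
  BinaryVar2 = sample(c(0, 1), 100, replace = TRUE)
)

# Hypothetical helper: TRUE for numeric columns whose non-missing values are all 0 or 1
is_binary <- function(col) {
  is.numeric(col) && all(stats::na.omit(col) %in% c(0, 1))
}

# Convert every binary numeric column to a factor in one step
binary_cols <- names(example)[vapply(example, is_binary, logical(1))]
example[binary_cols] <- lapply(example[binary_cols], as.factor)

str(example)
```

The resulting data frame can then be passed to as.h2o in the same way as in the second half of the reproduction script.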

But if you want to change how the values are normalized, you can use the following code. I changed it so that it doesn't use quantiles for columns with fewer than 32 unique values.

.uniformize <- function(col) {
  if (is.factor(col)) {
    return(.min_max(as.numeric(col) / nlevels(col)))
  }
  if (is.character(col) || all(is.na(col))) {
    if (is.character(col) && !all(is.na(col))) {
      fct <- as.factor(col)
      return(.min_max(as.numeric(fct) / nlevels(fct)))
    }
    return(rep_len(0, length(col)))
  }
  res <- col
  if (length(unique(col)) >= 32) # don't uniformize for low number of unique values
    res <- stats::ecdf(col)(col)
  res[is.na(res)] <- 0
  return(res)
}

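# Give the replacement the h2o namespace as its environment so that
# internal helpers such as .min_max can still be found when it runs
environment(.uniformize) <- asNamespace("h2o")
assignInNamespace(".uniformize", .uniformize, "h2o")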
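
For a quick check of the rule without starting an H2O cluster, the numeric branch can be copied into a standalone function. This is only a sketch: uniformize_numeric and its min_unique argument are names made up here, and the factor and character branches are left out.

```r
# Standalone copy of the numeric branch only; not the h2o internal itself
uniformize_numeric <- function(col, min_unique = 32) {
  res <- col
  # Use quantiles only when the column has enough distinct values
  if (length(unique(col)) >= min_unique)
    res <- stats::ecdf(col)(col)
  res[is.na(res)] <- 0
  res
}

set.seed(42)
binary_col <- sample(c(0, 1), 100, replace = TRUE)
continuous_col <- rnorm(100, mean = 50, sd = 10)

# Binary values pass through unchanged
binary_out <- uniformize_numeric(binary_col)
print(range(binary_out))

# Continuous values are replaced by their quantiles, which lie in (0, 1]
continuous_out <- uniformize_numeric(continuous_col)
print(range(continuous_out))
```

One thing the sketch makes visible: a low-cardinality column is passed through as-is, so values outside 0 to 1 (for example counts from 1 to 5) are not rescaled by this branch.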

@laura-vangalen
Author

Thanks for your quick response. Another hack I found to change how the values are normalized was to add a fake character variable. This variable then got automatically deleted when running the model, but the normalizing still worked in the way I wanted. But thanks for the code, that is much better.
