Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow visualize() to give density curve on non theoretical distribution #539

Closed
nicolasfoss opened this issue Aug 12, 2024 · 4 comments
Closed

Comments

@nicolasfoss
Copy link

nicolasfoss commented Aug 12, 2024

Intro

Hello! First, thank you for taking time to review this issue. This is my first time posting an issue on Github so I might be making some mistakes. Please let me know how to improve this so I can make the material more helpful or to save you time.

Issue

In cases where you may want to overlay a density curve over the histogram produced by visualize(method = "simulation"), you cannot run visualize(method = "both") or you will get an error. It might be a helpful option to see shading under a density curve instead of thr histogram shading. See below for a reprex and example of what I propose be available for the empirical distribution as well via method = "simulation". This would be quite helpful for cases when you are looking at categorical data, as the error you get (see below) mentions that.

I will admit...

I do realize there is documentation showing how to use visualize() with both the theoretical and empirical distributions, but my use case here did not apply to the documentation / vignette.

# visualize oob
## set random number seed

set.seed(123)

## create data
faux_data <- tibble(dead = sample(
  x = factor(c("yes", "no"), levels = c("yes", "no")), 
  size = 100, replace = T),
  gender = sample(
    x = factor(c("male", "female"), levels = c("male", "female")),
    size = 100,
    replace = T
    ))

## summarize data

faux_summary <- faux_data %>% 
  summarize(N = n(), 
            prop_dead = mean(dead == "yes"), 
            count_dead = sum(dead == "yes"), 
            .by = gender
            ) # females died more

## get difference

faux_diff <- faux_data %>% 
  specify(dead ~ gender, success = "yes") %>% 
  calculate(stat = "diff in props", order = c("male", "female")) %>%  # negative diff shows females died more
  pull()

## prop test
faux_test <- faux_data %>% 
  prop_test(dead ~ gender, order = c("male", "female")) # not significantly different

## model
faux_model <- faux_data %>% 
  specify(dead ~ gender, success = "yes") %>% 
  hypothesise(null = "independence") %>% 
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "diff in props", order = c("male", "female"))

## visualize oob performance

faux_model %>% visualize() # this works fine, and you can shade the areas where the obs. stat needs to fall to be significant

## try to get the density curve - error
faux_model %>% visualize(method = "both") 

It would be great for visualize to produce the density curve for non-theoretical distributions like the one produced above here is a home grown example.

# solution
calculate_critical_values <- function(data, stat_col, alpha = 0.05) {
  data %>%
    summarise(
      lower = quantile({{stat_col}}, probs = alpha / 2),
      upper = quantile({{stat_col}}, probs = 1 - (alpha / 2))
    )
}

## get critical values

faux_crit <- faux_model %>% 
  calculate_critical_values(stat_col = stat, alpha = 0.05)

lower_crit <- faux_crit$lower

upper_crit <- faux_crit$upper

## modify df to classify stat by critical values
faux_model_mod <- faux_model %>% 
  mutate(area = if_else(stat <= lower_crit | stat >= upper_crit, TRUE, FALSE))

## Calculate density

density_data <- density(faux_model$stat)
density_df <- data.frame(x = density_data$x, y = density_data$y)

## plot the distribution of differences between proportions

{
  
  faux_model_diff_plot <- faux_model_mod %>% 
    ggplot(aes(x = stat, label = round(faux_diff, digits = 5))) + 
    geom_histogram(aes(y = after_stat(density)), bins = 15, fill = "lightgray", color = "black", alpha = 0.5) +
    geom_line(data = density_df, aes(x = x, y = y), color = "blue", linewidth = 1.25, alpha = 0.5) +
    geom_area(data = density_df %>% filter(x <= lower_crit), aes(x = x, y = y), fill = "coral1", alpha = 0.5) +
    geom_area(data = density_df %>% filter(x >= upper_crit), aes(x = x, y = y), fill = "coral1", alpha = 0.5) +
    geom_area(data = density_df %>% filter(x <= upper_crit, x >= lower_crit), aes(x = x, y = y), fill = "dodgerblue", alpha = 0.5) +
    ggplot2::annotate(geom = "segment", x = faux_diff, y = 0, xend = faux_diff, yend = Inf,
                      color = "coral1",
                      linewidth = 2.25) + 
    ggplot2::annotate(
      geom = "text",
      x = faux_diff - (faux_diff * 0.1),
      y = 3,
      label = paste0("diff = ", pretty_number(faux_diff, n_decimal = 5)),
      size = 8,
      family = "Work Sans",
      angle = 90
    ) + 
    labs(x = "Differences in Proportions",
         y = "Count / Kernel Density",
         title = "Differences in Proportions of Deceased Males and Females",
         subtitle = "Simulation-Based Null Distribution from 1,000 Permutated Samples",
         caption = paste0("- p = ", round(faux_test$p_value, digits = 3))
    ) + 
    theme_clean()

}

Thanks!

Thanks again for your time and I look forward to interacting with you.

@simonpcouch
Copy link
Collaborator

Thanks for the issue! Hope you don't mind that I edited your issue description to adjust code formatting.

I'll first say that stat = "z" is a "theorized" test statistic supported by infer that can apply those distributional assumptions to plot a curve for you:

library(infer)
library(tidyverse)
   
set.seed(123)

faux_data <- tibble(dead = sample(
   x = factor(c("yes", "no"), levels = c("yes", "no")), 
   size = 100, replace = T),
   gender = sample(
      x = factor(c("male", "female"), levels = c("male", "female")),
      size = 100,
      replace = T
   ))

faux_summary <- faux_data %>% 
   summarize(N = n(), 
             prop_dead = mean(dead == "yes"), 
             count_dead = sum(dead == "yes"), 
             .by = gender
   )

faux_diff <- faux_data %>% 
   specify(dead ~ gender, success = "yes") %>% 
   calculate(stat = "z", order = c("male", "female")) %>%
   pull()

faux_test <- faux_data %>% 
   prop_test(dead ~ gender, order = c("male", "female")) 

faux_model <- faux_data %>% 
   specify(dead ~ gender, success = "yes") %>% 
   hypothesise(null = "independence") %>% 
   generate(reps = 1000, type = "permute") %>% 
   calculate(stat = "z", order = c("male", "female"))


faux_model %>% visualize()

faux_model %>% visualize(method = "both")
#> Warning: Check to make sure the conditions have been met for the theoretical method.
#> infer currently does not check these for you.

Created on 2024-08-19 with reprex v2.1.1

In general, I think we feel that histograms best represent the empirical distributions that can be created with infer pipelines, and likely won't extend that functionality in future releases of the package. That said, the outputs of visualize() are just ggplot objects, and you're able to add elements to them as you please! With your example and a call to geom_density():

library(infer)
library(tidyverse)
   
set.seed(123)

faux_data <- tibble(dead = sample(
   x = factor(c("yes", "no"), levels = c("yes", "no")), 
   size = 100, replace = T),
   gender = sample(
      x = factor(c("male", "female"), levels = c("male", "female")),
      size = 100,
      replace = T
   ))

faux_summary <- faux_data %>% 
   summarize(N = n(), 
             prop_dead = mean(dead == "yes"), 
             count_dead = sum(dead == "yes"), 
             .by = gender
   )

faux_diff <- faux_data %>% 
   specify(dead ~ gender, success = "yes") %>% 
   calculate(stat = "diff in props", order = c("male", "female")) %>%
   pull()

faux_test <- faux_data %>% 
   prop_test(dead ~ gender, order = c("male", "female")) 

faux_model <- faux_data %>% 
   specify(dead ~ gender, success = "yes") %>% 
   hypothesise(null = "independence") %>% 
   generate(reps = 1000, type = "permute") %>% 
   calculate(stat = "diff in props", order = c("male", "female"))


faux_model %>% 
   visualize() +
   geom_density(aes(x = stat))

Created on 2024-08-19 with reprex v2.1.1

@nicolasfoss
Copy link
Author

@simonpcouch Thanks so much for the guidance.

It looks like I can get what I was looking for already! I had not seen stat = "z" and I will make sure to dig deeper next time.

Have a great week, it was a pleasure learning with you at posit::conf.

@simonpcouch
Copy link
Collaborator

You're very welcome! You as well.

Copy link

github-actions bot commented Sep 3, 2024

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Sep 3, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants