Capturing vector transformation parameters #127

Open
realauggieheschmeyer opened this issue Jul 27, 2022 · 2 comments

Comments

@realauggieheschmeyer
Contributor

Both `log_interval_vec()` and `standardize_vec()` print the auto-detected parameters they use to scale the target variable.

For example:

```
log_interval_vec():
 Using limit_lower: 0
 Using limit_upper: 12
 Using offset: 1

Standardization Parameters
mean: -3.0500341071016
standard deviation: 1.22764358571979
```

However, there is currently no native way to capture these parameters other than reading the printed text and saving the information by hand. That isn't a problem for one-off analyses, but it prevents these functions from being used in an automated forecasting workflow: the target variable can be scaled automatically, but without a way to store and later access the parameters, predictions on the transformed variable cannot be mapped back to the original scale without human intervention.
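To illustrate what the workflow needs, here is a minimal sketch of the back-transformation, assuming the five parameters from the printed output above have been captured somewhere. The function name and argument names are my own for illustration; this is not timetk's implementation.

```r
# Sketch only: invert standardization, then invert the log-interval map.
# Parameter names mirror the printed output (limit_lower, limit_upper,
# offset, mean, sd); the function itself is a hypothetical helper.
inv_log_interval <- function(x_scaled, limit_lower, limit_upper, offset, mu, sigma) {
  x_std <- x_scaled * sigma + mu                      # undo standardize_vec()
  e     <- exp(x_std)
  z     <- (limit_upper * e + limit_lower) / (1 + e)  # undo log((z - lower) / (upper - z))
  z - offset                                          # remove the offset
}
```

With the parameters stored per group, this could be mapped over new predictions to recover the original scale automatically.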

It would be nice to have a helper function that could be run before mutating the target variable, extracting the relevant parameters and saving them for later use in the workflow.

Below is the code I wrote to capture these parameters manually:

```r
library(dplyr)

# Reproduce the limits that log_interval_vec() auto-detects
log_params <- ticket_volume_pad_tbl %>%
  group_by(department, ticket_type) %>%
  summarize(
    limit_lower = 0,
    limit_upper = (max(tickets) * 1.1) + 1,
    .groups = "drop"
  )

# Apply the log-interval transformation manually (offset = 1), then
# compute the standardization parameters on the transformed values
standardization_params <- ticket_volume_pad_tbl %>%
  left_join(log_params, by = c("department", "ticket_type")) %>%
  mutate(
    tickets_scaled = log(((tickets + 1) - limit_lower) / (limit_upper - (tickets + 1)))
  ) %>%
  group_by(department, ticket_type) %>%
  summarize(
    mean = mean(tickets_scaled),
    standard_deviation = sd(tickets_scaled),
    .groups = "drop"
  )

# One row of transformation parameters per group
log_params %>%
  left_join(standardization_params, by = c("department", "ticket_type"))
```

If it's helpful, I can try my hand at converting the above into a function, but I'd love some guidance on how to style it to fit in with the existing timetk functions.
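For discussion, here is one hypothetical shape such a helper might take for a single vector. The name `log_interval_params_vec()`, modeled on timetk's `*_vec()` naming, and the returned list are my inventions, not existing API; the `limit_upper = "auto"` default mimics the auto-detection described above.

```r
# Hypothetical sketch: capture the parameters that log_interval_vec() and
# standardize_vec() would auto-detect, without transforming the data itself.
log_interval_params_vec <- function(x, limit_lower = 0,
                                    limit_upper = "auto", offset = 0) {
  if (identical(limit_upper, "auto")) {
    limit_upper <- max(x, na.rm = TRUE) * 1.1 + offset  # assumed auto rule
  }
  x_trans <- log(((x + offset) - limit_lower) / (limit_upper - (x + offset)))
  list(
    limit_lower = limit_lower,
    limit_upper = limit_upper,
    offset      = offset,
    mean        = mean(x_trans, na.rm = TRUE),
    sd          = sd(x_trans, na.rm = TRUE)
  )
}
```

A grouped version could then `summarize()` this into one parameter row per group, which is exactly what the manual code above produces.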

@spsanderson
Copy link

I did something similar, but it was strictly for my own use; see here:

https://github.com/spsanderson/healthyverse_tsa/blob/master/00_scripts/data_manipulation_functions.R

@realauggieheschmeyer
Contributor Author

In addition to automated workflows, the manual nature of this process would also be problematic if you had a large number of groups in your data. Just imagine trying to forecast retail SKUs and having to manually log hundreds or thousands of parameters 😰
