Roll-back tracking full language objects for internal calls #568

Merged: 8 commits, Sep 12, 2024

@yjunechoe yjunechoe commented Sep 11, 2024

This is a small amendment to #543, which inflated the serialized size of agents because validation_set$capture_stack$pb_call stored language objects with environments attached.
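
To illustrate the kind of problem described above, here is a minimal base-R sketch. It is not pointblank's actual code: `make_pb_call()`, `handler`, and `big` are hypothetical names, and the call is built by hand so that it carries a closure. The point is only that `object.size()` ignores environments while `serialize()` writes them out, which is why such an object can look small in memory and still be large on disk.

```r
# Hypothetical stand-in for an internal function; NOT pointblank's actual code.
make_pb_call <- function() {
  big <- runif(1e5)                    # ~800 kB of data in this function's environment
  handler <- function(x) length(big)   # closure whose environment holds `big`
  as.call(list(handler, 1))            # call object with the closure embedded
}

full_call <- make_pb_call()
fn_name <- "tbl_val_comparison"

# object.size() does not include the environment of a function,
# so the call object looks small in memory
object.size(full_call)

# serialize() writes the closure's environment too, so `big` comes along
size_call <- length(serialize(full_call, NULL))
size_name <- length(serialize(fn_name, NULL))
c(call = size_call, name = size_name)
```

Under these assumptions the serialized call is dominated by `big`, while the serialized string stays at a few dozen bytes.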

Now, we simply track the name of the internal function as a string (e.g. "tbl_val_comparison") instead of the full call (e.g. tbl_val_comparison(...)). No user-visible changes.
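
As a rough sketch of what "tracking the name instead of the call" can look like in base R (the PR's actual implementation is not shown here, and `call_fn_name()` is a hypothetical helper), the function name can be pulled out of a call object as a plain string:

```r
# Hypothetical helper: return the name of the function being called, as a string.
call_fn_name <- function(call) {
  head <- call[[1]]
  # For pkg::fn(...) or pkg:::fn(...), the head is itself a call;
  # its third element is the function symbol
  if (is.call(head)) head <- head[[3]]
  as.character(head)
}

# Calls built by hand for illustration; quote() does not evaluate them
plain_call <- quote(tbl_val_comparison(table = tbl, operator = ">"))
ns_call    <- quote(pointblank:::tbl_val_comparison(table = tbl))

plain_name <- call_fn_name(plain_call)
ns_name    <- call_fn_name(ns_call)

# A character vector carries no environment, so it round-trips unchanged
round_trip <- unserialize(serialize(plain_name, NULL))
```

A string has no attached environment, so nothing from the calling context can leak into the saved agent through it.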

What used to be 13 kB -> 208 kB (in memory -> serialized) is now 1 kB -> 133 B. Reprex from #567:

# Setup
library(pointblank)
agent <- 
  create_agent(
    tbl = pointblank::small_table,
    tbl_name = "small_table",
    label = "`create_agent()` example.",
    actions = action_levels(
      warn_at = 0.10,
      stop_at = 0.25,
      notify_at = 0.35
    )
  )

agent <-
  agent %>% 
  col_exists(columns = c(date, date_time)) %>%
  col_vals_regex(
    columns = b,
    regex = "[0-9]-[a-z]{3}-[0-9]{3}"
  ) %>%
  rows_distinct() %>%
  col_vals_gt(columns = d, value = 100) %>%
  col_vals_lte(columns = c, value = 5) %>%
  col_vals_between(
    columns = c,
    left = vars(a), right = vars(d),
    na_pass = TRUE
  ) %>%
  interrogate()
  
  
# In memory
pb_call <- lapply(agent$validation_set$capture_stack, `[[`, "pb_call")
scales::label_bytes()(
  as.integer(object.size(pb_call))
)
#> [1] "1 kB"

# Serialized
f <- tempfile()
saveRDS(pb_call, f)
scales::label_bytes()(
  file.size(f)
)
#> [1] "133 B"

# Equivalence
identical(pb_call, readRDS(f))
#> [1] TRUE

pb_call
#> [[1]]
#> [1] "tbl_col_exists"
#> 
#> [[2]]
#> [1] "tbl_col_exists"
#> 
#> [[3]]
#> [1] "tbl_val_regex"
#> 
#> [[4]]
#> [1] "tbl_rows_distinct"
#> 
#> [[5]]
#> [1] "tbl_val_comparison"
#> 
#> [[6]]
#> [1] "tbl_val_comparison"
#> 
#> [[7]]
#> [1] "tbl_vals_between"

Rider: fixes a typo in examples of ?create_agent

@yjunechoe yjunechoe linked an issue Sep 11, 2024 that may be closed by this pull request

yjunechoe commented Sep 11, 2024

@rich-iannone I think this fixes the original issue and should be ready to go!

One Q for this PR while I'm on it - do we need a test for memory size at serialization? I suppose it'd only exist to guard against regressions like this, but could be useful since it can easily go unnoticed - though I'm unsure how it's best implemented (some upper bound size?) or whether it's stable enough to be useful as a test at all (across OS, R setups, etc.).

rich-iannone (Member) commented

> @rich-iannone I think this fixes the original issue and should be ready to go!

This is amazing!! Thank you so much for working with the user to track this down (and of course for the PR).

> One Q for this PR while I'm on it - do we need a test for memory size at serialization? I suppose it'd only exist to guard against regressions like this, but could be useful since it can easily go unnoticed - though I'm unsure how it's best implemented (some upper bound size?) or whether it's stable enough to be useful as a test at all (across OS, R setups, etc.).

This seems somewhat brittle as a test. There might be a convenient way to monitor the size of serialized agents through a separate workflow that just reports the sizes as an artifact (which we could check every now and then).

yjunechoe (Collaborator, Author) commented

> This seems somewhat brittle as a test. There might be a convenient way to monitor the size of serialized agents through a separate workflow that just reports the sizes as an artifact (which we could check every now and then).

Thanks - this makes sense! For now I've just added a manual test, manual_tests/tests_agent_serialization_size.R, that at least checks for the worst-case scenario: the entire user environment being snapshotted and written out as part of the agent when using saveRDS().
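
The contents of that manual test are not shown in this thread. As a rough sketch of the idea only, assuming a hypothetical helper `rds_size()`, stand-in objects instead of a real agent, and an arbitrary 50 kB bound, a worst-case check could compare the saved size of an object against a loose upper limit while a large "ballast" object sits in the surrounding environment:

```r
# Hypothetical helper: size in bytes of `x` as written by saveRDS()
rds_size <- function(x) {
  f <- tempfile(fileext = ".rds")
  on.exit(unlink(f))
  saveRDS(x, f)
  file.size(f)
}

# Build two stand-in objects next to a large "ballast" vector.
# A formula keeps a reference to the environment it was created in,
# so saving it also saves `ballast`; a plain string does not.
make_objects <- function(n = 1e5) {
  ballast <- runif(n)
  list(
    leaky = ~ some_column,          # carries this function's environment
    safe  = "tbl_val_comparison"    # carries nothing extra
  )
}

objs <- make_objects()
max_bytes <- 50 * 1024              # arbitrary loose bound, an assumption

leaky_size <- rds_size(objs$leaky)
safe_size  <- rds_size(objs$safe)

# The check a manual test might make: fail loudly if the environment leaked
stopifnot(safe_size < max_bytes)
```

A generous bound like this is meant to catch only the catastrophic case of a whole environment being written out, which sidesteps most of the brittleness concern about exact sizes varying across OS and R setups.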


@rich-iannone rich-iannone left a comment


LGTM!

@rich-iannone rich-iannone merged commit 8287cc4 into rstudio:main Sep 12, 2024
12 checks passed
rich-iannone (Member) commented

Thanks again!

@yjunechoe yjunechoe deleted the rollback-pb_call-tracking branch September 17, 2024 20:42
Successfully merging this pull request may close these issues.

File size increase in serialising interrogated agents