File size increase in serialising interrogated agents #567

Closed
jl5000 opened this issue Sep 11, 2024 · 15 comments · Fixed by #568

jl5000 commented Sep 11, 2024

There seems to be some kind of regression in the dev version of {pointblank} that causes interrogated agents saved to disk to be over an order of magnitude larger than those produced by the version on CRAN.

I am using {pointblank} with {targets} and saving interrogated agents to disk using {qs}. I have a large validation pipeline, which I unfortunately cannot share, which creates a target on disk with a size of ~14 MB. When using the dev version, this file size explodes to ~500 MB. In both cases I am ensuring tbl_checked is not extracted (either through the non-public API or the new argument in interrogate()).
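
To give a rough sense of the setup, here is a minimal sketch of the kind of pipeline described above (the data and validation steps are placeholders, not the actual pipeline, and I've left out the tbl_checked handling):

# _targets.R -- sketch only; `my_data` and the validation steps stand in
# for the real (private) pipeline
library(targets)
library(pointblank)

list(
  tar_target(
    checked_agent,
    create_agent(tbl = my_data) |>
      col_vals_not_null(columns = vars(id)) |>
      interrogate(),
    format = "qs"   # target is serialized to disk with {qs}
  )
)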

Has there been anything added to the validation object which could account for the explosion in file size?

@yjunechoe
Collaborator

Thanks for the report - and just to be clear, this is a regression somewhere between 0.12.1 and the most recent commit on main (a48b998)?

For the record, here is the full diff between 0.12.1 and now: v0.12.1...main

I don't have a good guess at the moment, but would you be open to helping us narrow it down with a crude bisect between 0.12.1 and now?

  1. ~6 months ago at 4d967d2: remotes::install_github("rstudio/pointblank", "4d967d26ecd1342dab66eb51b133bf363e51c419")
  2. ~4 months ago at 374e9f8: remotes::install_github("rstudio/pointblank", "374e9f8aab420c9e3c23160abf44af49a5993df0")
  3. ~1 month ago at 51968d8: remotes::install_github("rstudio/pointblank", "51968d87dba6873d4edcbf1e29390e04251d9ba2")

Also, if you could provide any more information about the file size explosion, that'd be great. For example, can you tell whether the extra size is due to a larger agent$validation_set or to some other element of the agent object (for example, by saving just the validation set data frame via {qs} or saveRDS() and seeing if the issue persists)?
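
Something along these lines should do it (a sketch only, assuming `agent` is the interrogated agent object from your pipeline):

# Size up agent$validation_set in isolation; `agent` is assumed to be an
# interrogated agent from the pipeline
vs <- agent$validation_set

# In-memory footprint
print(object.size(vs), units = "MB")

# Serialized footprint via {qs} and base RDS
f_qs  <- tempfile(fileext = ".qs")
f_rds <- tempfile(fileext = ".rds")
qs::qsave(vs, f_qs)
saveRDS(vs, f_rds)
file.size(f_qs) / 1e6   # MB on disk with qs::qsave()
file.size(f_rds) / 1e6  # MB on disk with saveRDS()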

jl5000 commented Sep 11, 2024

That's correct.

I can confirm that the issue seems to be in agent$validation_set. The validation_set object in memory is 0.2 MB (CRAN) vs 0.5 MB (Dev). When using qs::qsave() this becomes 7.85 MB (CRAN) vs 479 MB (Dev).

I will try to do the bisect, but I am on a work machine without Rtools, so it may prove difficult.

@yjunechoe
Collaborator

Thanks - that helps narrow things down. Could you confirm whether this happens just with qs::qsave(), or is it reproducible with saveRDS() as well?

And by "bisect" I just meant "run install_github() with an argument pointing to specific commits" - I presume this is the same way you installed the dev version? Shouldn't need git or any external tool for this!

jl5000 commented Sep 11, 2024

saveRDS() also has the exact same issue. I've done the bisection, and the regression seems to appear ~6 months ago (4d967d2). I'm quite confused by this, as that was right after the CRAN submission.

@yjunechoe
Collaborator

I'm puzzled as well! How about going further back in time to an earlier commit?

jl5000 commented Sep 11, 2024

Right, apologies - something must not have refreshed properly between installs. I can confirm now that the regression actually occurred ~1-4 months ago (51968d8 is the first commit where it regresses).

yjunechoe commented Sep 11, 2024

Got it! So I think that narrows down the problem space to: 374e9f8...51968d8

That's helpful. I'll try to come up with a reprex. Thanks!

Sorry, one more quick question @jl5000: do you use a col_vals_expr() anywhere in the agent? If so, is serialization still affected if you remove that step from the agent?
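
For reference, a minimal sketch of the kind of reprex I have in mind (using the built-in small_table; the expression itself is arbitrary):

library(pointblank)

# Sketch: an agent with a col_vals_expr() step, interrogated, then
# serialized to check the on-disk size
agent <- create_agent(tbl = small_table) |>
  col_vals_expr(expr = ~ d > 100) |>
  interrogate()

f <- tempfile(fileext = ".rds")
saveRDS(agent, f)
file.size(f) / 1e6  # size on disk in MB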

jl5000 commented Sep 11, 2024

There are several calls to col_vals_expr(). However, when I remove them all I still get the issue.

jl5000 commented Sep 11, 2024

Another bisection: c54b035 works OK. So the problem space is now c54b035...51968d8

jl5000 commented Sep 11, 2024

362b706 also works. Problem space: 362b706...51968d8

@yjunechoe
Collaborator

Great sleuthing! I think I may have the answer, traceable to #543, where we started to track a validation_set$capture_stack$pb_call element to show more informative eval errors in the agent report:

# In memory
pb_call <- lapply(agent$validation_set$capture_stack, `[[`, "pb_call")
scales::label_bytes()(
  as.integer(object.size(pb_call))
)
#> [1] "13 kB"

# Serialized
f <- tempfile()
saveRDS(pb_call, f)
scales::label_bytes()(
  file.size(f)
)
#> [1] "208 kB"

@yjunechoe
Collaborator

I hadn't considered the memory implications of serializing the call language object, but this makes sense!

I'll run a few more things to double-check, and if this is correct I'll start working on a (soft) rollback. We definitely don't need to track the full language object in there.
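
For illustration only (not necessarily the exact change that will land in the PR), the idea would be to keep just the printed form of the call rather than the language object itself:

# Illustration: store the deparsed text of the call instead of the call
# object -- a character string serializes compactly, while a captured call
# can carry along much more than its printed form
store_call_text <- function(call_obj) {
  paste(deparse(call_obj), collapse = " ")
}

store_call_text(quote(col_vals_expr(agent, expr = ~ d > 100)))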

@yjunechoe
Collaborator

@jl5000 Can you see if the open PR fixes the issue for you?

remotes::install_github("rstudio/pointblank", remotes::github_pull(568))

jl5000 commented Sep 11, 2024

Yes, that seems to work!

@yjunechoe
Collaborator

Awesome! Thanks for your help debugging this :)
