Generating large data sets is slower than I thought #205

Closed
kgoldfeld opened this issue Jun 5, 2023 · 8 comments
Labels: feature (feature request or enhancement)

Comments

@kgoldfeld
Owner

@assignUser I've been playing around with some code to address Issue #126, and I started to notice that generating large data sets is slower than I thought. Really slow. I've tried this with different versions of simstudy, data.table, and R, and nothing seems to make a difference. (I did this because I didn't think things were so slow in earlier versions.) Here is some code to replicate the slowness:

library(simstudy)

d1 <- defData(varname = "x1", formula = "..test_1", variance = 1)
d1 <- defData(d1, varname = "x2", formula = "..test_1", variance = 1)
d1 <- defData(d1, varname = "x3", formula = "..test_1", variance = 1)
d1 <- defData(d1, varname = "x4", formula = "..test_1", variance = 1)

test_1 <- 5

dd <- genData(1000000, d1, id = "site")

I've started to check out profvis to generate some profiles of the process, so that I can see where the bottleneck is. Here are some results (with pauses built in), and here are other results (without pauses). The problem is that the delay seems to be generally related to data.table code, but I haven't drilled down to confirm exactly where - the profiling results don't seem to pinpoint it. I was wondering if you have experience with profiling and can better interpret the results.
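
Roughly, the profiling setup looks like this (a minimal sketch; it assumes the d1 and test_1 definitions from the snippet above are in scope):

library(simstudy)
library(profvis)

test_1 <- 5

# profvis() samples the R call stack while the expression runs and
# opens an interactive flame graph showing where the time goes
profvis({
  dd <- genData(1000000, d1, id = "site")
})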

@kgoldfeld added the feature label Jun 5, 2023
@assignUser
Collaborator

I think that this is a memory-management issue within data.table, as the time increase is linked to the overall memory size of the resulting dt:

r$> bench::mark(genData(1000000, d1))
# A tibble: 1 × 13
  expression              min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result               memory                  time           gc
  <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>               <list>                  <list>         <list>
1 genData(1e+06, d1)    11.1s    11.1s    0.0900     684MB     16.0     1   178      11.1s <dt [1,000,000 × 5]> <Rprofmem [16,857 × 3]> <bench_tm [1]> <tibble [1 × 3]>

r$> d1 <- defData(varname = "x1", formula = "..test_1", variance = 1)

r$> bench::mark(genData(1000000, d1))
# A tibble: 1 × 13
  expression              min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result               memory                 time           gc
  <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>               <list>                 <list>         <list>
1 genData(1e+06, d1)    2.45s    2.45s     0.408     134MB     18.0     1    44      2.45s <dt [1,000,000 × 2]> <Rprofmem [3,271 × 3]> <bench_tm [1]> <tibble [1 × 3]>

r$> bench::mark(genData(4000000, d1))
# A tibble: 1 × 13
  expression              min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result               memory                  time           gc
  <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>               <list>                  <list>         <list>
1 genData(4e+06, d1)     9.9s     9.9s     0.101     534MB     18.1     1   179       9.9s <dt [4,000,000 × 2]> <Rprofmem [16,487 × 3]> <bench_tm [1]> <tibble [1 × 3]>

I am not familiar with dt's internals, so I don't really have a more specific idea of what's going on. Profiling this issue will probably be difficult, as it is probably handled in C/C++ within dt.
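
One thing that might be worth trying short of C-level profiling (just a thought - I'm not sure how much it would surface here): data.table's verbose mode prints per-step timings for each data.table call.

library(data.table)

# with this option set, every [.data.table call prints internal step timings
options(datatable.verbose = TRUE)
dd <- genData(1000000, d1)
options(datatable.verbose = FALSE)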

I will see if I can find out more....

@assignUser
Collaborator

Actually it might be caused by us copying data.tables around instead of using reference semantics... I would have to check, but I think we do use copy within .generate. We should not do that :D

This is a bit related to #50. We should only copy once and use that one copy internally with reference semantics (copy in genData, not in .generate or .gen*).
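
As a toy illustration of the two patterns (made-up function names, not simstudy code):

library(data.table)

dt <- data.table(id = 1:5)

# copies the whole table on every call - the pattern to avoid internally
addVarCopy <- function(d) {
  d <- copy(d)
  d[, x := rnorm(.N)]
  d
}

# modifies the caller's table in place via reference semantics
addVarRef <- function(d) {
  d[, x := rnorm(.N)]
  invisible(d)
}

With one up-front copy in genData, the internal helpers could all keep working on the same table instead of reallocating it for every variable.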

@assignUser
Collaborator

Hmm, no, that's not it - removing the copy from the switch statement for the normal distribution doesn't change anything... we don't actually modify the data, so something (dt/C++) might be clever enough to only actually make a copy when changes happen...

This is consistent with the results above too: we copy once for each variable (so 4 copies vs. 1), which, if copying were the issue, should have produced a noticeable difference...

@kgoldfeld
Owner Author

The issue is most certainly with .evalWith.

@kgoldfeld
Owner Author

I don't think we need the "keyby" below - removing it speeds things up considerably. That seems to do the trick, but I need to verify to make sure.

 res <- with(e, {
      expr <- parse(text = as.character(formula2parse))
      tryCatch(
        expr = dtSim[, newVar := eval(expr)],     ### <-----------------------
        # expr = dtSim[, newVar := eval(expr), keyby = def_id],
        error = function(err) {
          if (grepl("RHS length must either be 1", gettext(err), fixed = T)) {
            dtSim[, newVar := eval(expr)]
          } else {
            stop(gettext(err))
          }
        }
      )
      copy(dtSim$newVar)
    })
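
A rough way to see the cost of the grouped assignment in isolation (a standalone sketch, not the .evalWith internals; by = id is used here as a stand-in for keyby = def_id, and with one row per id it forces one eval per group):

library(data.table)
library(bench)

dt <- data.table(id = 1:1000000)

bench::mark(
  grouped    = copy(dt)[, y := rnorm(.N), by = id],  # one group per row
  vectorized = copy(dt)[, y := rnorm(.N)],           # single vectorized call
  check = FALSE                                      # random draws differ
)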

@kgoldfeld
Owner Author

The only problem seems to be vectorized arguments - this can probably be handled as a special (slower) case.
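
One shape that special case could take (a hypothetical sketch, not the actual fix - evalFastOrGrouped and newVar are illustrative names): try the fast ungrouped assignment first, and only fall back to row-by-row evaluation when it errors because the RHS comes back the wrong length.

library(data.table)

evalFastOrGrouped <- function(dtSim, expr) {
  tryCatch(
    # fast path: evaluate the formula once over the whole table
    dtSim[, newVar := eval(expr)],
    error = function(err) {
      # slow path: a vectorized argument can make the RHS the wrong
      # length for a single assignment, so evaluate once per row
      dtSim[, newVar := eval(expr), by = seq_len(nrow(dtSim))]
    }
  )
  dtSim[]
}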

@kgoldfeld
Owner Author

I have fixed this so that it handles vectorized arguments as well as matrix arguments - though it is not pretty. It is currently in a branch related to Issue #126.

@kgoldfeld changed the title from "Generating large data sets is slower than I though" to "Generating large data sets is slower than I thought" May 24, 2024
@kgoldfeld
Owner Author

This seems to be resolved - data generation is no longer running slowly.
