Generating large data sets is slower than I thought #205

Closed
kgoldfeld opened this issue Jun 5, 2023 · 8 comments
Labels: feature (feature request or enhancement)

Comments

@kgoldfeld
Owner

@assignUser I've been playing around with some code to address Issue #126, and I started to notice that generating large data sets is slower than I thought. Really slow. I've tried this with different versions of simstudy, data.table, and R, and nothing seems to make a difference. (I did this because I didn't think things were so slow in earlier versions.) Here is some code to replicate the slowness:

library(simstudy)

d1 <- defData(varname = "x1", formula = "..test_1", variance = 1)
d1 <- defData(d1, varname = "x2", formula = "..test_1", variance = 1)
d1 <- defData(d1, varname = "x3", formula = "..test_1", variance = 1)
d1 <- defData(d1, varname = "x4", formula = "..test_1", variance = 1)

test_1 <- 5

dd <- genData(1000000, d1, id = "site")

I've started to check out profvis to generate some profiles of the process, so that I can see where the bottleneck is. Here are some results (with pauses built in), and here are other results (without pauses). The problem is that the delay seems to be generally related to data.table code, but I haven't drilled down to confirm exactly where - the profiling results don't seem to pinpoint it. I was wondering if you have experience with profiling and can better interpret the results.
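
Roughly, the profiling setup looks like this (a minimal sketch; it assumes the d1 and test_1 definitions from the snippet above are in scope):

library(simstudy)
library(profvis)

test_1 <- 5

# profvis() samples the R call stack while the expression runs and
# opens an interactive flame graph showing where the time goes
profvis({
  dd <- genData(1000000, d1, id = "site")
})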

@kgoldfeld added the feature label Jun 5, 2023
@assignUser
Collaborator

I think that this is a memory-management issue within data.table, as the time increase is linked to the overall memory size of the resulting dt:

r$> bench::mark(genData(1000000, d1))
# A tibble: 1 × 13
  expression              min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result               memory                  time           gc
  <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>               <list>                  <list>         <list>
1 genData(1e+06, d1)    11.1s    11.1s    0.0900     684MB     16.0     1   178      11.1s <dt [1,000,000 × 5]> <Rprofmem [16,857 × 3]> <bench_tm [1]> <tibble [1 × 3]>

r$> d1 <- defData(varname = "x1", formula = "..test_1", variance = 1)

r$> bench::mark(genData(1000000, d1))
# A tibble: 1 × 13
  expression              min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result               memory                 time           gc
  <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>               <list>                 <list>         <list>
1 genData(1e+06, d1)    2.45s    2.45s     0.408     134MB     18.0     1    44      2.45s <dt [1,000,000 × 2]> <Rprofmem [3,271 × 3]> <bench_tm [1]> <tibble [1 × 3]>

r$> bench::mark(genData(4000000, d1))
# A tibble: 1 × 13
  expression              min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result               memory                  time           gc
  <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>               <list>                  <list>         <list>
1 genData(4e+06, d1)     9.9s     9.9s     0.101     534MB     18.1     1   179       9.9s <dt [4,000,000 × 2]> <Rprofmem [16,487 × 3]> <bench_tm [1]> <tibble [1 × 3]>

I am not familiar with dt's internals, so I don't really have a more specific idea of what's going on. Profiling this issue will probably be difficult, as it is probably handled in C/C++ within dt.
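
One thing that might be worth trying short of C-level profiling (just a thought - I'm not sure how much it would surface here): data.table's verbose mode prints per-step timings for each data.table call.

library(data.table)

# with this option set, every [.data.table call prints internal step timings
options(datatable.verbose = TRUE)
dd <- genData(1000000, d1)
options(datatable.verbose = FALSE)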

I will see if I can find out more....

@assignUser
Collaborator

Actually it might be caused by us copying data.tables around instead of using reference semantics... I would have to check, but I think we do use copy within .generate. We should not do that :D

This is a bit related to #50. We should only copy once and use that one copy internally with reference semantics (copy in genData, not in .generate or .gen*).
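
As a toy illustration of the two patterns (made-up function names, not simstudy code):

library(data.table)

dt <- data.table(id = 1:5)

# copies the whole table on every call - the pattern to avoid internally
addVarCopy <- function(d) {
  d <- copy(d)
  d[, x := rnorm(.N)]
  d
}

# modifies the caller's table in place via reference semantics
addVarRef <- function(d) {
  d[, x := rnorm(.N)]
  invisible(d)
}

With one up-front copy in genData, the internal helpers could all keep working on the same table instead of reallocating it for every variable.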

@assignUser
Collaborator

Hmm, no, that's not it - removing the copy from the switch statement for the normal distribution doesn't change anything... we don't actually modify the data, so something (dt/C++) might be clever enough to only actually make a copy when changes happen...

This is consistent with the results above too: we copy once for each variable (so 4 copies vs. 1), which, if copying were the issue, should have produced a noticeable difference...

@kgoldfeld
Owner Author

The issue is most certainly with .evalWith.

@kgoldfeld
Owner Author

I don't think we need the "keyby" below - removing it speeds things up considerably. That seems to do the trick, but I need to verify to make sure.

 res <- with(e, {
      expr <- parse(text = as.character(formula2parse))
      tryCatch(
        expr = dtSim[, newVar := eval(expr)],     ### <-----------------------
        # expr = dtSim[, newVar := eval(expr), keyby = def_id],
        error = function(err) {
          if (grepl("RHS length must either be 1", gettext(err), fixed = T)) {
            dtSim[, newVar := eval(expr)]
          } else {
            stop(gettext(err))
          }
        }
      )
      copy(dtSim$newVar)
    })
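
A rough way to see the cost of the grouped assignment in isolation (a standalone sketch, not the .evalWith internals; by = id is used here as a stand-in for keyby = def_id, and with one row per id it forces one eval per group):

library(data.table)
library(bench)

dt <- data.table(id = 1:1000000)

bench::mark(
  grouped    = copy(dt)[, y := rnorm(.N), by = id],  # one group per row
  vectorized = copy(dt)[, y := rnorm(.N)],           # single vectorized call
  check = FALSE                                      # random draws differ
)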

@kgoldfeld
Owner Author

The only problem seems to be vectorized arguments - this can probably be handled as a special (slower) case.
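
One shape that special case could take (a hypothetical sketch, not the actual fix - evalFastOrGrouped and newVar are illustrative names): try the fast ungrouped assignment first, and only fall back to row-by-row evaluation when it errors because the RHS comes back the wrong length.

library(data.table)

evalFastOrGrouped <- function(dtSim, expr) {
  tryCatch(
    # fast path: evaluate the formula once over the whole table
    dtSim[, newVar := eval(expr)],
    error = function(err) {
      # slow path: a vectorized argument can make the RHS the wrong
      # length for a single assignment, so evaluate once per row
      dtSim[, newVar := eval(expr), by = seq_len(nrow(dtSim))]
    }
  )
  dtSim[]
}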

@kgoldfeld
Owner Author

I have fixed this so that it handles vectorized arguments as well as matrix arguments - though it is not pretty. It is currently in a branch related to Issue #126.

@kgoldfeld changed the title from "Generating large data sets is slower than I though" to "Generating large data sets is slower than I thought" May 24, 2024
@kgoldfeld
Owner Author

This seems to be resolved - data generation is no longer running slowly.
