Generating large data sets is slower than I thought #205
I think this is a memory management issue within `data.table`, as the time increase is linked to the overall memory size of the resulting dt:

```r
r$> bench::mark(genData(1000000, d1))
# A tibble: 1 × 13
  expression         min      median   `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result               memory                  time           gc
  <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>               <list>                  <list>         <list>
1 genData(1e+06, d1) 11.1s    11.1s       0.0900     684MB     16.0     1   178      11.1s <dt [1,000,000 × 5]> <Rprofmem [16,857 × 3]> <bench_tm [1]> <tibble [1 × 3]>

r$> d1 <- defData(varname = "x1", formula = "..test_1", variance = 1)
r$> bench::mark(genData(1000000, d1))
# A tibble: 1 × 13
  expression         min      median   `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result               memory                  time           gc
  <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>               <list>                  <list>         <list>
1 genData(1e+06, d1) 2.45s    2.45s        0.408     134MB     18.0     1    44      2.45s <dt [1,000,000 × 2]> <Rprofmem [3,271 × 3]>  <bench_tm [1]> <tibble [1 × 3]>

r$> bench::mark(genData(4000000, d1))
# A tibble: 1 × 13
  expression         min      median   `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result               memory                  time           gc
  <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>               <list>                  <list>         <list>
1 genData(4e+06, d1) 9.9s     9.9s         0.101     534MB     18.1     1   179       9.9s <dt [4,000,000 × 2]> <Rprofmem [16,487 × 3]> <bench_tm [1]> <tibble [1 × 3]>
```

I am not familiar with dt's internals, so I don't really have a more specific idea of what's going on. Profiling this issue will probably be difficult, as the work is probably handled in C/C++ within dt. I will see if I can find out more...
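For reference, a minimal sketch of the setup used in the benchmarks above (the value assigned to `test_1` is a placeholder here; `..test_1` is a double-dot reference to a variable in the calling environment):

```r
library(simstudy)
library(bench)

test_1 <- 0  # hypothetical value for the variable referenced by "..test_1"

d1 <- defData(varname = "x1", formula = "..test_1", variance = 1)

# Compare timings as n grows; in the results above, run time scaled
# with the memory footprint of the resulting data.table.
bench::mark(
  n1m = genData(1000000, d1),
  n4m = genData(4000000, d1),
  check = FALSE
)
```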
Actually, it might be caused by us copying dt's around instead of using reference semantics... I would have to check, but I think we do use ... This is a bit related to #50. We should only copy once and use that one copy internally with reference semantics. (copy in ...)
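To illustrate the copy-vs-reference distinction being discussed (this is generic `data.table` behavior, not simstudy's actual internals): `:=` modifies a table in place, while `copy()` duplicates the whole object, so copying per variable would scale cost with the number of variables.

```r
library(data.table)

dt <- data.table(id = 1:5, x = rnorm(5))

# By-reference update: no new table is allocated
dt[, y := x * 2]

# An explicit copy duplicates the full table
dt2 <- copy(dt)
dt2[, z := 1]  # dt is unaffected

# address() shows the two names point at different objects
identical(address(dt), address(dt2))  # FALSE: dt2 is a separate copy
```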
Hmm, no, that's not it; removing the ... didn't help. This makes sense for the results above too, as we copy once for each var, so 4 vs. 1, which, if that were the issue, should have shown a noticeable change...
The issue is most certainly with ...
I don't think we need the `keyby` below; removing it speeds things up considerably. That seems to do the trick, but I need to verify to make sure.

```r
res <- with(e, {
  expr <- parse(text = as.character(formula2parse))
  tryCatch(
    expr = dtSim[, newVar := eval(expr)], ### <-----------------------
    # expr = dtSim[, newVar := eval(expr), keyby = def_id],
    error = function(err) {
      if (grepl("RHS length must either be 1", gettext(err), fixed = TRUE)) {
        dtSim[, newVar := eval(expr)]
      } else {
        stop(gettext(err))
      }
    }
  )
  copy(dtSim$newVar)
})
```
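A hypothetical micro-benchmark of the `keyby` point: grouping by a per-row id like `def_id` creates one group per row, so the grouped assignment does a million tiny evaluations instead of one vectorized one (table and column names here mirror the snippet above but the data is made up):

```r
library(data.table)
library(bench)

n <- 1e6
dtSim <- data.table(def_id = seq_len(n), x = rnorm(n))

# Plain := evaluates the RHS once over the whole column;
# keyby = def_id evaluates it once per row-group.
bench::mark(
  plain   = copy(dtSim)[, newVar := x + 1],
  grouped = copy(dtSim)[, newVar := x + 1, keyby = def_id],
  check = FALSE
)
```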
The only problem seems to be vectorized arguments; this can probably be addressed as a special (slower) case.
I have fixed this so that it handles vectorized arguments as well as matrix arguments, though it is not pretty. It is currently in a branch related to Issue #126.
This seems to be resolved and is no longer running slowly.
@assignUser I've been playing around with some code to address Issue #126, and I started to notice that generating large data sets is slower than I thought. Really slow. I've tried this with different versions of `simstudy`, `data.table`, and `R`, and nothing seems to make a difference. (I did this because I didn't think things were so slow in earlier versions.) Here is some code to replicate the slowness:

I've started to check out `profvis` to generate some profiles of the process, so that I can see where the bottleneck is. Here are some results (with pauses built in), and here are other results (without pauses). The problem is that the delay seems to be generally related to `data.table` code, but I haven't drilled down to confirm exactly where; the profiling results don't seem to indicate it. I was wondering if you have experience with profiling and can better interpret the results.