Added an atime test for performance improvement in forderv when reusing existing key and index attributes #6555
base: master
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #6555 +/- ##
=======================================
Coverage 98.62% 98.62%
=======================================
Files 79 79
Lines 14450 14450
=======================================
Hits 14251 14251
Misses 199 199
View full report in Codecov by Sentry.
Generated via commit 7800a62
Download link for the artifact containing the test results: atime-results.zip
.ci/atime/tests.R
Outdated
"forderv improved in #4386" = atime::atime_test(
N = 10^seq(3, 8), # 1e9 exceeds the runner's memory (process gets killed)
setup = {
options(datatable.forder.auto.index = TRUE, datatable.forder.reuse.sorting = TRUE)
I wonder if setting these options will affect other tests?
I checked where this is defined / documented, and for auto.index I found:
grep -nH --null "auto[.]index" ../*/*
../R/data.table.R:3309: if (!getOption("datatable.auto.index")) return(NULL)
../R/onLoad.R:87: "datatable.auto.index"="TRUE", # DT[col=="val"] to auto add index so 2nd time faster
../R/setkey.R:144: if (isTRUE(getOption("datatable.forder.auto.index"))) return(invisible())
grep: ../inst/include: Is a directory
grep: ../inst/po: Is a directory
grep: ../inst/tests: Is a directory
../man/datatable-optimize.Rd:102:\code{options(datatable.auto.index = FALSE)}. To switch off using existing
Binary file ../src/data_table.dll matches
../src/forder.c:1628:// isTRUE(getOption("datatable.auto.index"))
../src/forder.c:1630: // for now temporarily 'forder.auto.index' not 'auto.index' to disabled it by default
../src/forder.c:1633: SEXP opt = GetOption(install("datatable.forder.auto.index"), R_NilValue);
../src/forder.c:1637: error("'datatable.forder.auto.index' option must be TRUE or FALSE"); // # nocov
../src/forder.c:1771: GetAutoIndex()) { // disabled by default, use datatable.forder.auto.index=T to enable, do not export/document, use for debugging only
Binary file ../src/forder.o matches
grep: ../vignettes/css: Is a directory
../vignettes/datatable-benchmarking.Rmd:67:options(datatable.auto.index=TRUE)
../vignettes/datatable-benchmarking.Rmd:72:- `auto.index=FALSE` disables building index automatically when doing subset on non-indexed data, but if indices were created before this option was set, or explicitly by calling `setindex` they still will be used for optimization.
../vignettes/datatable-secondary-indices-and-auto-indexing.Rmd:318:* Auto indexing can be disabled by setting the global argument `options(datatable.auto.index = FALSE)`.
grep: ../vignettes/plots: Is a directory
Above I see two options, datatable.forder.auto.index and datatable.auto.index:
- datatable.auto.index is documented in Rd and two vignettes
- forder.c comments say "datatable.forder.auto.index=T to enable, do not export/document, use for debugging only"
Which option does your test require setting? Can you please add a comment to clarify why?
And for reuse.sorting I found:
grep -nH --null "reuse[.]sorting" ../*/*
../R/setkey.R:151:forderv = function(x, by=seq_along(x), retGrp=FALSE, retStats=retGrp, sort=TRUE, order=1L, na.last=FALSE, reuseSorting=getOption("datatable.reuse.sorting", NA)) {
In your test you have datatable.forder.reuse.sorting, but in the code there is no "forder" in the option name: it is datatable.reuse.sorting.
So I guess this option is not required to get your test result. Where did you get that option name?
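The mismatch can be reproduced without data.table: options() accepts any name, so a misspelled option is stored silently and code that reads the correctly spelled name never sees it. A minimal base R sketch, where read_reuse_sorting is a hypothetical stand-in for the getOption() default shown in the forderv signature above:

```r
# Hypothetical stand-in for the default argument in forderv's signature:
#   reuseSorting = getOption("datatable.reuse.sorting", NA)
read_reuse_sorting <- function() getOption("datatable.reuse.sorting", NA)

# Clear the real option and set only the name used in the test.
old <- options(datatable.reuse.sorting = NULL,
               datatable.forder.reuse.sorting = TRUE)
misspelled_result <- read_reuse_sorting()  # falls back to NA

# Now set the name that the code actually reads.
options(datatable.reuse.sorting = TRUE)
correct_result <- read_reuse_sorting()     # TRUE

options(old)  # restore both options to their previous values
```

Under this reading, setting datatable.forder.reuse.sorting in setup would be a no-op for forderv, which matches the guess that the option is not needed for the test result.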
Yes, I don't need datatable.reuse.sorting, so do you want me to add a comment as to why datatable.forder.auto.index is required?
> I wonder if setting these options will affect other tests?
I hope not! (we'll have to see from the results we get here)
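One way to keep such a setting from leaking into other tests is to set the option inside a function and restore the previous value on exit. A base R sketch; with_forder_auto_index is a hypothetical helper, not part of atime or data.table:

```r
# Hypothetical helper: evaluate `code` with the option enabled, then restore.
with_forder_auto_index <- function(code) {
  old <- options(datatable.forder.auto.index = TRUE)
  on.exit(options(old), add = TRUE)  # runs even if `code` throws an error
  force(code)                        # `code` is evaluated lazily, i.e. here
}

before <- getOption("datatable.forder.auto.index")
inside <- with_forder_auto_index(getOption("datatable.forder.auto.index"))
after  <- getOption("datatable.forder.auto.index")
```

Whether this fits the way atime evaluates setup and expr is an open question; the sketch only shows the save-and-restore idiom.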
Doesn't look like it's affecting the other tests.
Your original test had 3 calls to forderv.
I was wondering if we need both retGrp=TRUE and FALSE in expr? But for retGrp=T in setup and then retGrp=F in expr, we see linear Fast, as in the plot below.
Looking at datatable.verbose=T in this case, for example with the code below, I see "using existing index", which to me suggests that this should be constant (not linear as observed), so is this a performance bug?
> set.seed(1);library(data.table);options(datatable.verbose=T);dt <- data.table(index = sample(N), values = sample(N));data.table:::forderv(dt, "index", retGrp = TRUE);cat("------\n");data.table:::forderv(dt, "index", retGrp = FALSE)
forder.c received 10 rows and 2 columns
forderReuseSorting: opt=-1, took 0.000s
[1] 4 5 7 2 6 9 3 10 1 8
attr(,"starts")
[1] 1 2 3 4 5 6 7 8 9 10
attr(,"maxgrpn")
[1] 1
attr(,"anyna")
[1] 0
attr(,"anyinfnan")
[1] 0
attr(,"anynotascii")
[1] 0
attr(,"anynotutf8")
[1] 0
------
forder.c received 10 rows and 2 columns
forderReuseSorting: opt=-1, took 0.000s
[1] 4 5 7 2 6 9 3 10 1 8
I think we need to explain in the comments what is expected to happen (when expr is expected to be linear vs. constant).
So here is a CI result that shows 3 constant Fast vs 1 linear Fast: https://asset.cml.dev/f6abccc722845026bf081f0c9aa93b071bef4a4a?cml=png&cache-bypass=5e8ae8d9-fc0e-4b9b-9831-12442334b5d5
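A rough way to put a number on "linear vs constant" is the slope of log(time) against log(N): close to 1 for linear growth and close to 0 for constant time. The sketch below uses base R only; loglog_slope is a hypothetical helper, and the timings are synthetic values built from a formula, not measurements from this PR.

```r
# Hypothetical helper: slope of a log-log fit of timings against data size.
loglog_slope <- function(N, seconds) {
  fit <- lm(log10(seconds) ~ log10(N))
  unname(coef(fit)[2])
}

N <- 10^seq(3, 8)                          # same sizes as in the atime test
linear_seconds   <- 1e-8 * N               # synthetic: time proportional to N
constant_seconds <- rep(1e-4, length(N))   # synthetic: time independent of N

linear_slope   <- loglog_slope(N, linear_seconds)    # about 1
constant_slope <- loglog_slope(N, constant_seconds)  # about 0
```

Applied to real timings, a slope near 1 for the case that reports "using existing index" would support the suspicion that the index is not actually being reused.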
for(retGrp in retGrp_values){
data.table:::forderv(dt, "index", retGrp = eval(str2lang(retGrp)))
index.list[[retGrp]] <- attr(dt, "index")
}
}),
expr = substitute({
data.table:::forderv(dt, "index", retGrp = RETGRP) # Reusing the index and computing group info.
}, list(RETGRP=str2lang(retGrp_expr))),
setattr(dt, "index", index.list[[retGrp_setup]])
data.table:::forderv(dt, "index", retGrp = retGrp_expr) # Reusing the index and computing group info.
}, list(
retGrp_setup=retGrp_setup,
retGrp_expr=str2lang(retGrp_expr)
In this refactor I compute both retGrp=T and F in setup (using CRAN data.table), and store the "index" attribute in index.list.
Then in expr I set the "index" attribute to one of the two indices we computed in setup (using either retGrp=T or F), and then we call forderv using a git version of data.table.
I observe a different result than before: Fast is always linear (for any combination of retGrp=T or F), never constant.
I don't understand what is happening, or which version of this test we should prefer (with or without setattr).
Can someone who knows forderv better please investigate/explain what is expected time complexity here?
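For readers unfamiliar with the substitute() pattern in the diff, here is a base R sketch of how a string such as "TRUE" ends up as a literal inside the benchmarked expression. show_retGrp is a hypothetical stand-in for forderv, used only so that the example runs without data.table.

```r
# Hypothetical stand-in for forderv: simply returns its retGrp argument.
show_retGrp <- function(x, retGrp) retGrp

retGrp_expr <- "TRUE"

# substitute() replaces the placeholder symbol RETGRP with the parsed value,
# so the resulting expression contains the literal TRUE, not a variable name.
expr <- substitute({
  show_retGrp(dt, retGrp = RETGRP)
}, list(RETGRP = str2lang(retGrp_expr)))

dt <- data.frame(index = 3:1)
expr_text  <- deparse(expr[[2]])  # the call inside the braces
expr_value <- eval(expr)
```

The point of the pattern is that the expression handed to the benchmark is self-contained: it does not look up retGrp_expr at evaluation time.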
Not sure what exactly the question is about.
Hi @jangorecki we are trying to adapt your examples here #4386 (comment) to be a performance test, but we are not sure about what is the right way to test the new functionality vs the old.
I guess I don't understand the difference between retGrp=TRUE and FALSE. Is that documented somewhere?
Rerunning the same call should be constant. As for the other two cases, I cannot look into them in much detail at the moment, but possibly when computing retGrp=F while having retGrp=T we may still need to run forder on the index itself to get the original order from groups (should be fast but not constant). One of the PRs had fdistinct(), a very small function that used those, so the exact logic can be looked up there easily.
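As a rough illustration of what the group information adds: in the verbose output earlier in the thread, the "starts" and "maxgrpn" attributes appear only for the retGrp = TRUE call. The base R sketch below computes analogous values from an ordering. It is an analogy under that reading, not a description of how forderv is implemented.

```r
x <- c(3, 1, 3, 2, 1)

# Ordering only, analogous to what retGrp = FALSE returns.
o <- order(x)
sorted <- x[o]

# Group information, analogous to the extra attributes with retGrp = TRUE.
starts  <- which(!duplicated(sorted))             # first position of each group
maxgrpn <- max(diff(c(starts, length(x) + 1L)))   # size of the largest group
```

Here the ordering is 2 5 4 1 3, the groups start at positions 1, 3 and 4 of the sorted vector, and the largest group has 2 elements.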
Closes #6320