Added an atime test for performance improvement in forderv when reusing existing key and index attributes #6555

Open
wants to merge 10 commits into
base: master

Anirban166
Member

Closes #6320


codecov bot commented Oct 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.62%. Comparing base (b4538a0) to head (7800a62).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6555   +/-   ##
=======================================
  Coverage   98.62%   98.62%           
=======================================
  Files          79       79           
  Lines       14450    14450           
=======================================
  Hits        14251    14251           
  Misses        199      199           



github-actions bot commented Oct 3, 2024

Comparison Plot

Generated via commit 7800a62

Download link for the artifact containing the test results: ↓ atime-results.zip

| Task | Duration |
|---|---|
| R setup and installing dependencies | 3 minutes and 18 seconds |
| Installing different package versions | 1 minutes and 32 seconds |
| Running and plotting the test cases | 2 minutes and 34 seconds |

"forderv improved in #4386" = atime::atime_test(
N = 10^seq(3, 8), # 1e9 exceeds the runner's memory (process gets killed)
setup = {
options(datatable.forder.auto.index = TRUE, datatable.forder.reuse.sorting = TRUE)
Member

I wonder if setting these options will affect other tests?
I checked where this is defined / documented and I found for auto.index :

grep  -nH --null "auto[.]index" ../*/*
../R/data.table.R:3309:    if (!getOption("datatable.auto.index")) return(NULL)
../R/onLoad.R:87:       "datatable.auto.index"="TRUE",          # DT[col=="val"] to auto add index so 2nd time faster
../R/setkey.R:144:  if (isTRUE(getOption("datatable.forder.auto.index"))) return(invisible())
grep: ../inst/include: Is a directory
grep: ../inst/po: Is a directory
grep: ../inst/tests: Is a directory
../man/datatable-optimize.Rd:102:\code{options(datatable.auto.index = FALSE)}. To switch off using existing
Binary file ../src/data_table.dll matches
../src/forder.c:1628:// isTRUE(getOption("datatable.auto.index"))
../src/forder.c:1630:  // for now temporarily 'forder.auto.index' not 'auto.index' to disabled it by default
../src/forder.c:1633:  SEXP opt = GetOption(install("datatable.forder.auto.index"), R_NilValue);
../src/forder.c:1637:    error("'datatable.forder.auto.index' option must be TRUE or FALSE"); // # nocov
../src/forder.c:1771:        GetAutoIndex()) { // disabled by default, use datatable.forder.auto.index=T to enable, do not export/document, use for debugging only
Binary file ../src/forder.o matches
grep: ../vignettes/css: Is a directory
../vignettes/datatable-benchmarking.Rmd:67:options(datatable.auto.index=TRUE)
../vignettes/datatable-benchmarking.Rmd:72:- `auto.index=FALSE` disables building index automatically when doing subset on non-indexed data, but if indices were created before this option was set, or explicitly by calling `setindex` they still will be used for optimization.
../vignettes/datatable-secondary-indices-and-auto-indexing.Rmd:318:* Auto indexing can be disabled by setting the global argument `options(datatable.auto.index = FALSE)`.
grep: ../vignettes/plots: Is a directory

above I see two options, datatable.forder.auto.index and datatable.auto.index

  • datatable.auto.index is documented in Rd and two vignettes
  • forder.c comments say "datatable.forder.auto.index=T to enable, do not export/document, use for debugging only"

which option does your test require setting? can you please add a comment to clarify why?

and I found for reuse.sorting:

grep  -nH --null "reuse[.]sorting" ../*/*
../R/setkey.R:151:forderv = function(x, by=seq_along(x), retGrp=FALSE, retStats=retGrp, sort=TRUE, order=1L, na.last=FALSE, reuseSorting=getOption("datatable.reuse.sorting", NA)) {

Your test sets datatable.forder.reuse.sorting, but the option name in the code has no "forder" in it: datatable.reuse.sorting. So I guess this option is not required to get your test result. Where did you get that option name?
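
To illustrate why a misnamed option has no effect, here is a minimal base R sketch. It assumes a fresh session in which datatable.reuse.sorting has not been set, and it does not load data.table; it only mimics the `getOption("datatable.reuse.sorting", NA)` lookup that appears in the forderv signature above.

```r
# Set the name used in the test under review; per the grep output above,
# nothing in the package reads an option with this name.
old <- options(datatable.forder.reuse.sorting = TRUE)

# This mirrors the lookup in forderv's default argument. The differently
# named option set above is not consulted, so the fallback NA is returned
# (assuming datatable.reuse.sorting itself is unset in this session).
reuse <- getOption("datatable.reuse.sorting", NA)
print(reuse)

# Restore the previous state so the sketch leaves no trace.
options(old)
```

R does not warn about unknown option names, so a typo like this fails silently.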

Member Author

Yes, I don't need datatable.reuse.sorting. Do you want me to add a comment explaining why datatable.forder.auto.index is required?

Member Author

> I wonder if setting these options will affect other tests?

I hope not! (we'll have to see from the results we get here)

Member Author

Doesn't look like it's affecting the other tests.
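
One way to rule out leakage into other tests, rather than inferring it from the results, is to restore the previous option values after the code that needs them. The sketch below uses only base R; `with_temp_options` is a hypothetical helper name, not part of data.table or atime, and whether atime's setup/expr evaluation would allow wrapping code this way is not confirmed here.

```r
# Hypothetical helper: set options, evaluate code, then restore old values.
with_temp_options <- function(opts, code) {
  old <- options(opts)               # options() returns the previous values
  on.exit(options(old), add = TRUE)  # restored even if `code` errors
  force(code)                        # `code` is lazy, so it runs with opts set
}

before <- getOption("datatable.forder.auto.index")
inside <- with_temp_options(
  list(datatable.forder.auto.index = TRUE),
  getOption("datatable.forder.auto.index")
)
after <- getOption("datatable.forder.auto.index")

print(inside)
print(identical(before, after))
```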


tdhock commented Oct 3, 2024

your original test had 3 calls to forderv.

  • first in setup with retGrp default(=FALSE)
  • then in expr with retGrp=FALSE
  • then again in expr with retGrp=TRUE

I was wondering if we need both retGrp=TRUE and FALSE in expr?
If we have TRUE only (or FALSE only) I get this result (constant instead of linear for Fast)
image
I coded a for loop over retGrp=T or F in setup, and also in expr (4 test cases total).
For three of these we see constant Fast (as above).

But for retGrp=T in setup then retGrp=F in expr, we see linear Fast, as in the plot below:
image
is this expected @jangorecki @MichaelChirico ?

looking at datatable.verbose=T in this case, for example like the code below, I see "using existing index" which to me suggests that this should be constant (not linear as observed), so is this a performance bug??

> set.seed(1);library(data.table);options(datatable.verbose=T);N <- 10;dt <- data.table(index = sample(N), values = sample(N));data.table:::forderv(dt, "index", retGrp = TRUE);cat("------\n");data.table:::forderv(dt, "index", retGrp = FALSE)
forder.c received 10 rows and 2 columns
forderReuseSorting: opt=-1, took 0.000s
 [1]  4  5  7  2  6  9  3 10  1  8
attr(,"starts")
 [1]  1  2  3  4  5  6  7  8  9 10
attr(,"maxgrpn")
[1] 1
attr(,"anyna")
[1] 0
attr(,"anyinfnan")
[1] 0
attr(,"anynotascii")
[1] 0
attr(,"anynotutf8")
[1] 0
------
forder.c received 10 rows and 2 columns
forderReuseSorting: opt=-1, took 0.000s
 [1]  4  5  7  2  6  9  3 10  1  8

I think we need to explain in the comments what is expected to happen (when expr is expected to be linear vs constant).
I wonder if we need to use something like data.table:::setattr(L, "index", NULL)?
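
The four setup/expr combinations described above can also be timed locally with a rough sketch like the one below. It is not the atime test from this PR. It assumes data.table is installed, uses the internal `data.table:::forderv` and the undocumented `datatable.forder.auto.index` option exactly as they appear in this thread, and picks `N = 1e5` and a single `system.time` call per case arbitrarily, so the timings are noisy and only indicative.

```r
library(data.table)

# Undocumented option discussed above; the forder.c comments mark it as
# debugging-only, so treat this as an experiment rather than supported usage.
options(datatable.forder.auto.index = TRUE)

N <- 1e5  # arbitrary size chosen for this sketch
set.seed(1)

# One row per (retGrp in setup, retGrp in expr) combination.
combos <- expand.grid(setup = c(TRUE, FALSE), expr = c(TRUE, FALSE))
combos$seconds <- NA_real_

for (i in seq_len(nrow(combos))) {
  # Fresh table each time, so no index carries over from a previous case.
  dt <- data.table(index = sample(N), values = sample(N))
  # "setup" call, not timed.
  data.table:::forderv(dt, "index", retGrp = combos$setup[i])
  # "expr" call, timed.
  combos$seconds[i] <- system.time(
    data.table:::forderv(dt, "index", retGrp = combos$expr[i])
  )[["elapsed"]]
}

print(combos)
```

Repeating this over several values of N would show whether each case grows with N or stays flat.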


tdhock commented Oct 3, 2024

so here is a CI result that shows 3 constant Fast vs 1 linear Fast https://asset.cml.dev/f6abccc722845026bf081f0c9aa93b071bef4a4a?cml=png&cache-bypass=5e8ae8d9-fc0e-4b9b-9831-12442334b5d5

Comment on lines 33 to 43
for(retGrp in retGrp_values){
data.table:::forderv(dt, "index", retGrp = eval(str2lang(retGrp)))
index.list[[retGrp]] <- attr(dt, "index")
}
}),
expr = substitute({
data.table:::forderv(dt, "index", retGrp = RETGRP) # Reusing the index and computing group info.
}, list(RETGRP=str2lang(retGrp_expr))),
setattr(dt, "index", index.list[[retGrp_setup]])
data.table:::forderv(dt, "index", retGrp = retGrp_expr) # Reusing the index and computing group info.
}, list(
retGrp_setup=retGrp_setup,
retGrp_expr=str2lang(retGrp_expr)
Member

In this refactor I compute both retGrp=T and F in setup (using CRAN data.table), and store the "index" attribute in index.list.

Then in expr I set the "index" attribute to one of the two indices we computed in setup (using either retGrp=T or F), and then we call forderv using a git version of data.table.
I observe a different result than before: Fast is always linear (for any combination of retGrp=T or F), never constant.

I don't understand what is happening, or which version of this test we should prefer (with or without setattr).
Can someone who knows forderv better please investigate/explain what is expected time complexity here?

@jangorecki

Not sure what exactly the question is about.


tdhock commented Oct 4, 2024

Hi @jangorecki, we are trying to adapt your examples here #4386 (comment) into a performance test, but we are not sure what the right way is to test the new functionality vs the old.
Could you please fill in the TODOs below with "linear" or "constant" ?
(is it true that we expect constant time if old index is used, and linear time if new index must be computed?)

  • If we first run forderv(retGrp=TRUE) then running forderv(retGrp=TRUE) should take TODO time. (TDH guess TODO=constant?)
  • If we first run forderv(retGrp=TRUE) then running forderv(retGrp=FALSE) should take TODO time. (TDH guess TODO=constant?)
  • If we first run forderv(retGrp=FALSE) then running forderv(retGrp=TRUE) should take TODO time. (TDH guess TODO=linear?)
  • If we first run forderv(retGrp=FALSE) then running forderv(retGrp=FALSE) should take TODO time. (TDH guess TODO=constant?)

I guess I don't understand the difference between retGrp=TRUE and FALSE, is that documented somewhere?
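
Once timings exist for several values of N, "constant" versus "linear" can be checked numerically instead of by eye. The sketch below fits a slope on a log-log scale using base R only; `loglog_slope` is a hypothetical helper, and both timing vectors are synthetic placeholders made up for illustration, not measurements from this PR.

```r
# Hypothetical helper: estimate the exponent k in seconds ~ N^k
# as the slope of a least-squares fit of log10(seconds) on log10(N).
loglog_slope <- function(N, seconds) {
  fit <- lm(log10(seconds) ~ log10(N))
  unname(coef(fit)[2])
}

N <- 10^seq(3, 7)

# Synthetic timings (NOT measured): one flat curve, one proportional to N.
constant_like <- rep(2e-5, length(N))
linear_like <- 1e-8 * N

slope_constant <- loglog_slope(N, constant_like)  # near 0 for a flat curve
slope_linear <- loglog_slope(N, linear_like)      # near 1 for a linear curve

print(c(constant = slope_constant, linear = slope_linear))
```

With real, noisy timings the slopes would not be exact, so any threshold separating the two cases would be a judgment call.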


jangorecki commented Oct 4, 2024

Rerunning the same call should be constant. As for the other two cases, I cannot look into them in much detail at the moment, but possibly when computing retGrp=F while having retGrp=T we may still need to run forder on the index itself to get the original order from groups (should be fast but not constant). One of the PRs had fdistinct(), a very small function that used those, so the exact logic can be looked up there easily.

Development

Successfully merging this pull request may close these issues.

Add an atime performance regression test for forder caching
3 participants