Added an atime test for performance improvement in forderv when reusing existing key and index attributes #6555

Anirban166 · 2024-10-03T19:38:23Z

.ci/atime/tests.R

codecov · 2024-10-03T19:48:48Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.62%. Comparing base (b4538a0) to head (7800a62).

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #6555   +/-   ##
=======================================
  Coverage   98.62%   98.62%           
=======================================
  Files          79       79           
  Lines       14450    14450           
=======================================
  Hits        14251    14251           
  Misses        199      199


github-actions · 2024-10-03T19:52:55Z

Generated via commit 7800a62

Download link for the artifact containing the test results: ↓ atime-results.zip

Task	Duration
R setup and installing dependencies	3 minutes and 18 seconds
Installing different package versions	1 minute and 32 seconds
Running and plotting the test cases	2 minutes and 34 seconds

tdhock · 2024-10-03T20:02:56Z

.ci/atime/tests.R

+"forderv improved in #4386" = atime::atime_test(
+ N = 10^seq(3, 8), # 1e9 exceeds the runner's memory (process gets killed)
+ setup = {
+ options(datatable.forder.auto.index = TRUE, datatable.forder.reuse.sorting = TRUE)


I wonder if setting these options will affect other tests?
I checked where this is defined / documented and I found for auto.index :

grep -nH --null "auto[.]index" ../*/*
../R/data.table.R:3309: if (!getOption("datatable.auto.index")) return(NULL)
../R/onLoad.R:87: "datatable.auto.index"="TRUE", # DT[col=="val"] to auto add index so 2nd time faster
../R/setkey.R:144: if (isTRUE(getOption("datatable.forder.auto.index"))) return(invisible())
grep: ../inst/include: Is a directory
grep: ../inst/po: Is a directory
grep: ../inst/tests: Is a directory
../man/datatable-optimize.Rd:102:\code{options(datatable.auto.index = FALSE)}. To switch off using existing
Binary file ../src/data_table.dll matches
../src/forder.c:1628:// isTRUE(getOption("datatable.auto.index"))
../src/forder.c:1630: // for now temporarily 'forder.auto.index' not 'auto.index' to disabled it by default
../src/forder.c:1633: SEXP opt = GetOption(install("datatable.forder.auto.index"), R_NilValue);
../src/forder.c:1637: error("'datatable.forder.auto.index' option must be TRUE or FALSE"); // # nocov
../src/forder.c:1771: GetAutoIndex()) { // disabled by default, use datatable.forder.auto.index=T to enable, do not export/document, use for debugging only
Binary file ../src/forder.o matches
grep: ../vignettes/css: Is a directory
../vignettes/datatable-benchmarking.Rmd:67:options(datatable.auto.index=TRUE)
../vignettes/datatable-benchmarking.Rmd:72:- `auto.index=FALSE` disables building index automatically when doing subset on non-indexed data, but if indices were created before this option was set, or explicitly by calling `setindex` they still will be used for optimization.
../vignettes/datatable-secondary-indices-and-auto-indexing.Rmd:318:* Auto indexing can be disabled by setting the global argument `options(datatable.auto.index = FALSE)`.
grep: ../vignettes/plots: Is a directory

above I see two options, datatable.forder.auto.index and datatable.auto.index

datatable.auto.index is documented in Rd and two vignettes

forder.c comments say "datatable.forder.auto.index=T to enable, do not export/document, use for debugging only"

which option does your test require setting? can you please add a comment to clarify why?

and I found for reuse.sorting:

grep -nH --null "reuse[.]sorting" ../*/*
../R/setkey.R:151:forderv = function(x, by=seq_along(x), retGrp=FALSE, retStats=retGrp, sort=TRUE, order=1L, na.last=FALSE, reuseSorting=getOption("datatable.reuse.sorting", NA)) {

in your test you have datatable.forder.reuse.sorting but in the code there is no forder in the option name, datatable.reuse.sorting ... so I guess this option is not required to get your test result... where did you get that option name?
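A minimal base-R sketch of why the extra "forder" in the name matters: getOption() looks up the exact string, so setting datatable.forder.reuse.sorting leaves datatable.reuse.sorting untouched, and the reuseSorting default in the forderv signature quoted above falls back to NA. The variable names wrong_name and right_name are only for illustration, and the sketch explicitly unsets datatable.reuse.sorting first so it does not depend on session state.

```r
# Set the name used in the test and make sure the name read by forderv is unset.
# options() returns the previous values so they can be restored afterwards.
old <- options(
  datatable.forder.reuse.sorting = TRUE,
  datatable.reuse.sorting = NULL
)

# The option with "forder" in its name is set...
wrong_name <- getOption("datatable.forder.reuse.sorting")

# ...but the name that the forderv default reads is still unset,
# so getOption() returns the fallback value NA.
right_name <- getOption("datatable.reuse.sorting", NA)

print(wrong_name)
print(right_name)

options(old)  # restore the previous values
```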

Yes I don't need datatable.reuse.sorting, so do you want me to add a comment as to why datatable.forder.auto.index is required?

I wonder if setting these options will affect other tests?

I hope not! (we'll have to see from the results we get here)

It doesn't look like it's affecting the other tests.
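If leaking options between tests ever becomes a concern, one possible pattern (a sketch only, not something this PR does) is to save the previous value and restore it on exit. The helper name with_forder_auto_index is hypothetical, and whether atime setup blocks need this at all is not established in this thread.

```r
# Hypothetical helper: evaluate `code` with datatable.forder.auto.index = TRUE,
# then restore whatever value the option had before.
with_forder_auto_index <- function(code) {
  old <- options(datatable.forder.auto.index = TRUE)
  on.exit(options(old), add = TRUE)
  force(code)  # `code` is a lazy promise, so it is evaluated after the option is set
}

before <- getOption("datatable.forder.auto.index")
inside <- with_forder_auto_index(getOption("datatable.forder.auto.index"))
after <- getOption("datatable.forder.auto.index")

print(inside)                    # value seen while the helper is running
print(identical(before, after))  # the option is back to its previous value
```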

tdhock · 2024-10-03T20:59:24Z

your original test had 3 calls to forderv.

first in setup with retGrp default(=FALSE)
then in expr with retGrp=FALSE
then again in expr with retGrp=TRUE

I was wondering if we need both retGrp=TRUE and FALSE in expr?
If we have TRUE only (or FALSE only) I get this result (constant instead of linear for Fast)

I coded a for loop over doing retGrp=T or F in setup, and also in expr. (4 test cases total)
For three of these we see constant Fast (as above)

But for retGrp=T in setup then retGrp=F in expr, we see linear Fast, as in the plot below:

is this expected @jangorecki @MichaelChirico ?

looking at datatable.verbose=T in this case, for example like the code below, I see "using existing index" which to me suggests that this should be constant (not linear as observed), so is this a performance bug??

> set.seed(1);library(data.table);options(datatable.verbose=T);dt <- data.table(index = sample(N), values = sample(N));data.table:::forderv(dt, "index", retGrp = TRUE);cat("------\n");data.table:::forderv(dt, "index", retGrp = FALSE)
forder.c received 10 rows and 2 columns
forderReuseSorting: opt=-1, took 0.000s
 [1]  4  5  7  2  6  9  3 10  1  8
attr(,"starts")
 [1]  1  2  3  4  5  6  7  8  9 10
attr(,"maxgrpn")
[1] 1
attr(,"anyna")
[1] 0
attr(,"anyinfnan")
[1] 0
attr(,"anynotascii")
[1] 0
attr(,"anynotutf8")
[1] 0
------
forder.c received 10 rows and 2 columns
forderReuseSorting: opt=-1, took 0.000s
 [1]  4  5  7  2  6  9  3 10  1  8

I think we need to explain in the comments what is expected to happen (when is expr expected to be linear vs constant).
I wonder if we need to use something like data.table:::setattr(L, "index", NULL) ??
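For reference, a small sketch of what clearing the index attribute looks like using exported data.table functions (setindexv, indices, setattr). It only shows that the attribute is removed; it does not show how forderv reacts, which is the open question here. The variable names before and after are illustrative.

```r
library(data.table)

set.seed(1)
dt <- data.table(index = sample(10), values = sample(10))

setindexv(dt, "index")      # build a secondary index on the "index" column
before <- indices(dt)       # names of the indices currently stored on dt

setattr(dt, "index", NULL)  # drop the stored index attribute, by reference
after <- indices(dt)        # no indices are left

print(before)
print(after)
```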

tdhock · 2024-10-03T21:53:05Z

so here is a CI result that shows 3 constant Fast vs 1 linear Fast https://asset.cml.dev/f6abccc722845026bf081f0c9aa93b071bef4a4a?cml=png&cache-bypass=5e8ae8d9-fc0e-4b9b-9831-12442334b5d5

tdhock · 2024-10-03T21:57:20Z

.ci/atime/tests.R

+ for(retGrp in retGrp_values){
+ data.table:::forderv(dt, "index", retGrp = eval(str2lang(retGrp)))
+ index.list[[retGrp]] <- attr(dt, "index")
+ }
+ }),
 expr = substitute({
- data.table:::forderv(dt, "index", retGrp = RETGRP) # Reusing the index and computing group info.
- }, list(RETGRP=str2lang(retGrp_expr))),
+ setattr(dt, "index", index.list[[retGrp_setup]]) 
+ data.table:::forderv(dt, "index", retGrp = retGrp_expr) # Reusing the index and computing group info.
+ }, list(
+ retGrp_setup=retGrp_setup,
+ retGrp_expr=str2lang(retGrp_expr)


in this refactor I compute both retGrp=T and F in setup (using CRAN data.table), and store the "index" attribute in index.list

then in expr I set the "index" attribute to one of the two indices we computed in setup (using either retGrp=T or F).
and then we call forderv using a git version of data.table...
and I observe a different result than before --- Fast is always linear (for any combination of retGrp=T or F), never constant.

I don't understand what is happening, or which version of this test we should prefer (with or without setattr).
Can someone who knows forderv better please investigate/explain what is expected time complexity here?

jangorecki · 2024-10-04T09:29:24Z

Not sure what exactly the question is about.

tdhock · 2024-10-04T12:56:26Z

Hi @jangorecki we are trying to adapt your examples here #4386 (comment) to be a performance test, but we are not sure what the right way is to test the new functionality vs the old.
Could you please fill in the TODOs below with "linear" or "constant" ?
(is it true that we expect constant time if old index is used, and linear time if new index must be computed?)

If we first run forderv(retGrp=TRUE) then running forderv(retGrp=TRUE) should take TODO time. (TDH guess TODO=constant?)
If we first run forderv(retGrp=TRUE) then running forderv(retGrp=FALSE) should take TODO time. (TDH guess TODO=constant?)
If we first run forderv(retGrp=FALSE) then running forderv(retGrp=TRUE) should take TODO time. (TDH guess TODO=linear?)
If we first run forderv(retGrp=FALSE) then running forderv(retGrp=FALSE) should take TODO time. (TDH guess TODO=constant?)
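A rough local way to probe the four combinations above, separate from the atime test itself: run forderv once with the "setup" value of retGrp, then time a second call with the "expr" value, for a few sizes. This is a sketch under assumptions: it relies on the unexported data.table:::forderv and on the debugging-only option datatable.forder.auto.index discussed earlier, whether an index is actually stored and reused depends on the installed data.table version, and single system.time() measurements at these sizes are noisy. The combos data frame and its column names are illustrative.

```r
library(data.table)

# Debugging-only option per the forder.c comment quoted earlier in this thread;
# on versions that do not know it, setting it has no effect.
old <- options(datatable.forder.auto.index = TRUE)

# All combinations of retGrp in the first ("setup") and second ("expr") call.
combos <- expand.grid(
  setup = c(TRUE, FALSE),
  expr = c(TRUE, FALSE),
  N = c(1e4, 1e5, 1e6)
)
combos$seconds <- NA_real_

for (i in seq_len(nrow(combos))) {
  set.seed(1)
  N <- combos$N[i]
  dt <- data.table(index = sample(N), values = sample(N))
  # First call, playing the role of the test's setup.
  data.table:::forderv(dt, "index", retGrp = combos$setup[i])
  # Second call, playing the role of the test's expr; only this one is timed.
  combos$seconds[i] <- system.time(
    data.table:::forderv(dt, "index", retGrp = combos$expr[i])
  )[["elapsed"]]
}

options(old)  # restore the previous option value
print(combos)
```

Comparing how seconds grows with N within each (setup, expr) pair gives a first impression of constant vs linear behavior; atime does this more carefully with repeated timings.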

I guess I don't understand the difference between retGrp=TRUE and FALSE, is that documented somewhere?

jangorecki · 2024-10-04T16:05:34Z

Rerunning the same call should be constant. As for the other two cases, I cannot look into the details at the moment, but possibly when computing retGrp=F when we already have retGrp=T we may still need to run forder on the index itself to get the original order from groups (should be fast but not constant). One of the PRs included fdistinct(), a very small function that used those, so the exact logic can be looked up there easily.

Added the test with minor changes to comments

5a15dd2

Anirban166 requested a review from tdhock as a code owner October 3, 2024 19:38

Narrow down on N

7e822ad

tdhock reviewed Oct 3, 2024

tdhock reviewed Oct 3, 2024

Go with default N spec

4ecb240

Set seed for reproducibility given the use of sample()

b0c10f5

tdhock reviewed Oct 3, 2024

Anirban166 and others added 2 commits October 3, 2024 13:05

rm sort reuse option

ba72deb

for loop over retGrp=T,F

c519de1

Toby Dylan Hocking added 4 commits October 3, 2024 17:11

historical sha1s from commits page instead of merge commit

f24984a

rm > from test name since that fails on CI

6e69305

remove extra blank lines

401f3d7

use setattr-> Fast always linear

7800a62

tdhock reviewed Oct 3, 2024

tdhock mentioned this pull request Oct 4, 2024

historical references not shown for some tests tdhock/atime#64

Open
