Problem
Comparing large (> 1e6 elements) integer vectors can be prohibitively slow when one of the vectors has been shuffled.
RE: #116 I tried setting options(diffobj.max.diffs = 50), but this didn't help.
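For reference, the attempted workaround looked roughly like this (a sketch; it assumes the v1 and v2 vectors from the reprex below):
options(diffobj.max.diffs = 50) # the option referenced in #116; setting it had no effect here
waldo::compare(v1, v2)          # still takes minutes on the shuffled vectors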
Reprex
v1 <- rep(1:50L, 1e6L / 50L)
v2 <- rep(1:50L, 1e6L / 50L)
v2 <- v2[sample.int(length(v2))]
bench::system_time(
  waldo::compare(v1, v2)
)
#> process    real
#>   4.87m   4.87m
Use Case
Integer indexes are commonly used in data.table for RDBMS-style relational modelling. Because of this slow integer-vector comparison, comparing such data.tables is unreasonably slow (~5 minutes × number of columns).
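For illustration, the shape of the use case is roughly this (a minimal sketch with made-up sizes and column names, not my real tables):
library(data.table)
n <- 1e6L
dt1 <- data.table(id = 1:n)           # integer index column in its original order
dt2 <- data.table(id = sample.int(n)) # same ids in a different order, e.g. after a join
waldo::compare(dt1, dt2)              # each differing integer column pays the cost shown above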
This prevents adoption of testthat >= 3.0 in packages using largeish tables. (EDIT: I guess this isn't technically true, since examples in tests should be small, but it would be nice to be able to compare large tables).
The problem also exists for base::data.frame, but since joins there are so much slower, the use case isn't as compelling.
Interestingly, the bottleneck is not the comparison itself (which only takes 0.4s) but building up the display of the differences, which suggests that the max.diffs argument needs to be applied earlier.
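One way to see that split (my own sketch, not from the original thread; it assumes the v1 and v2 from the reprex above and the profvis package):
bench::system_time(which(v1 != v2))      # the raw element-wise comparison finishes quickly
profvis::profvis(waldo::compare(v1, v2)) # the profile should show most of the time in building the diff output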
I think this is an edge case that's mostly unimportant for testing, since you need both a very large vector and a very large number of differences:
v1 <- v2 <- rep(1:50L, 1e6L / 50L)
v2[1:1e4] <- v2[sample.int(1e4)]
bench::system_time(
  waldo::compare(v1, v2)
)
#> process    real
#>   1.84s   1.84s
Environment
Note: Ubuntu is running in WSL2, but I also tested the code in R 4.1.3 on Windows 11 and the issue still exists.