Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comparing large integer vectors slow #142

Closed
ChristopherEeles opened this issue Apr 7, 2022 · 3 comments
Closed

Comparing large integer vectors slow #142

ChristopherEeles opened this issue Apr 7, 2022 · 3 comments
Labels
feature a feature request or enhancement

Comments

@ChristopherEeles
Copy link

ChristopherEeles commented Apr 7, 2022

Problem

Comparing large (>1e6) integer vectors can be prohibitively slow after shuffling one vector.

RE: #116 I tried setting options(diffobj.max.diffs = 50), but this didn't help.

Reprex

v1 <- rep(1:50L, 1e6L / 50L)
v2 <- rep(1:50L, 1e6L / 50L)

v2 <- v2[sample.int(length(v2)]

bench::system_time(
    waldo::compare(v1, v2)
)
# process    real 
# 4.87m   4.87m 

Use Case

Integer indexes are commonly used in data.table for RDBMS style relational modelling. As a result of this slow comparison for integer vectors, comparing such data.tables is unreasonably slow (~5 minutes * number of columns).

This prevents adoption of testthat >= 3.0 in packages using largeish tables. (EDIT: I guess this isn't technically true, since examples in tests should be small, but it would be nice to be able to compare large tables).

The problem also exists for base::data.frame but since joins are so much slower the use case isn't as compelling.

Environment

R Under development (unstable) (2021-12-19 r81394)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] fansi_1.0.3     utf8_1.2.2      crayon_1.5.1    waldo_0.4.0    
 [5] lifecycle_1.0.1 magrittr_2.0.3  pillar_1.7.0    bench_1.1.2    
 [9] rlang_1.0.2     cli_3.2.0       diffobj_0.3.5   vctrs_0.4.0    
[13] ellipsis_0.3.2  rematch2_2.1.2  tools_4.2.0     glue_1.6.2     
[17] compiler_4.2.0  pkgconfig_2.0.3 tibble_3.1.6   

Note: Ubuntu is running in WSL2, but I also tested the code in R 4.1.3 on Windows 11 and the issue still exists.

@MichaelChirico
Copy link
Contributor

it's probably preferable to delegate to all.equal for cases like this -- not sure how general a solution waldo can be expected to handle.

expect_true(all.equal(x, y, ...))

will leverage data.table's method & efficiency

@hadley
Copy link
Member

hadley commented Apr 8, 2022

@MichaelChirico waldo already uses identical() to speedily detect the simple case where both elements are equal.

I reduced the problem to a smaller example which only takes around 4s:

v1 <- rep(1:50L, 1e5L / 50L)
v2 <- rep(1:50L, 1e5L / 50L)
v2 <- v2[sample.int(length(v2))]
waldo::compare(v1, v2)

Interestingly the bottleneck problem is not the comparison (which only takes 0.4s) but building up the the display of the differences, which suggests that the max.diffs argument needs to be applied earlier.

I think this mostly reflects an edge case that is mostly unimportant for testing — you need to have both a very large vector and a very large number of differences:

v1 <- v2 <- rep(1:50L, 1e6L / 50L)
v2[1:1e4] <- v2[sample.int(1:1e4)]

bench::system_time(
  waldo::compare(v1, v2)
)
#> process    real 
#>   1.84s   1.84s

Created on 2022-04-08 by the reprex package (v2.0.1)

@hadley hadley added the feature a feature request or enhancement label Mar 23, 2023
@hadley
Copy link
Member

hadley commented Oct 21, 2024

Per my last comment, I'm going to close this and declare it out of scope for waldo.

@hadley hadley closed this as completed Oct 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

3 participants