
furrr much slower than purrr on nested data #234

Closed
elgabbas opened this issue May 12, 2022 · 5 comments

elgabbas commented May 12, 2022

Hello,

I would like to use the furrr package to run some row-wise analysis on my data. I find that furrr is slower than purrr, which was also reported by @hadley here: #41.

Here is a reprex:

require(dplyr); require(furrr); require(purrr); require(tidyr)
future::plan(multisession)

# Create some large dataset
Data <- lapply(1:50000, function(x) {
  mtcars0 <- sample(mtcars)  # note: sample() on a data frame shuffles its columns
  mtcars0$ID <- x
  mtcars0
}) %>% do.call(what = rbind) %>% as_tibble()

Here, I apply a simple function to each row of the data. There is a time difference, but not a huge one.

SimpleFun <- function(x){ x*sample(1:100,1) }
tictoc::tic()
Data %>% mutate(disp2 = map_dbl(disp, SimpleFun))
tictoc::toc()
# 2.004 sec elapsed

tictoc::tic()
Data %>% mutate(disp2 = future_map_dbl(disp, SimpleFun, .progress = TRUE))
tictoc::toc()
# 21.126 sec elapsed

However, when applying another simple function to a nested version of the dataset, purrr works fine but furrr takes ages to run (if it finishes at all), which is weird.

SimpleFun2 <- function(x){sample(x)}

tictoc::tic()
Data %>% group_by(ID) %>% nest() %>% 
  mutate(data2 = map(data, SimpleFun2))
tictoc::toc()
# 5.011 sec elapsed

tictoc::tic()
Data %>% group_by(ID) %>% nest() %>% 
  mutate(data2 = future_map(data, SimpleFun2))
tictoc::toc()
# did not work

I tested this on two different R installations (Windows with R 4.2.0 and furrr 0.3.0, and RStudio Server with R 3.6 and furrr 0.2.3).

Is there a reason for this? Any advice to make a parallel analysis of nested datasets faster?

Cheers,
Ahmed

@DavisVaughan
Collaborator

For the first one, it is the progress bar that is taking so long. The progress bar should mainly be used for things where each individual iteration takes a relatively large amount of time; otherwise, the overhead of the progress bar outweighs its usefulness.

Also note that the progress bar is deprecated, and should not really be used anymore. I will eventually remove it in favor of the progressr package.
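As a rough sketch of the progressr alternative mentioned above (assuming the progressr package is installed; `slow_fun` is a made-up stand-in for real per-element work): you wrap the call in `with_progress()` and signal progress from inside the mapped function.

```r
library(furrr)
library(progressr)

future::plan(multisession, workers = 2)

# Hypothetical stand-in for work that takes noticeable time per element
slow_fun <- function(x) {
  Sys.sleep(0.05)
  x * 2
}

with_progress({
  p <- progressor(steps = 20)
  result <- future_map_dbl(1:20, function(x) {
    p()  # signal one unit of progress; progressr relays it from the workers
    slow_fun(x)
  })
})
```

Because progressr signals conditions rather than writing output directly, it composes with future-based parallelism without the per-iteration overhead the deprecated `.progress` argument adds.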

I'm not surprised that furrr is slower here. When the total time is < 5 seconds or so, I expect map() to basically beat future_map() every time.

require(dplyr); require(furrr); require(purrr); require(tidyr)
future::plan(multisession, workers = 3)

# Create some large dataset
Data <- as_tibble(mtcars)
Data <- vctrs::vec_rep(Data, 50000)
Data$ID <- vctrs::vec_rep_each(1:50000, nrow(mtcars))

Data
#> # A tibble: 1,600,000 × 12
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb    ID
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4     1
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4     1
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2     1
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4     1
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2     1
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2     1
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4     1
#> # … with 1,599,990 more rows

tictoc::tic()
xx <- Data %>% mutate(disp2 = map_dbl(disp, identity))
tictoc::toc()
#> 0.943 sec elapsed

tictoc::tic()
xx <- Data %>% mutate(disp2 = future_map_dbl(disp, identity))
tictoc::toc()
#> 1.538 sec elapsed

Created on 2022-05-12 by the reprex package (v2.0.1)

I'll address the second question in a moment...

@DavisVaughan
Collaborator

For the second question, you just forgot to ungroup() after the nest(). If you give nest() a grouped data frame, it remains grouped after the nesting (for better or worse). This prevents future_map() from doing what it is good at: partitioning the data over the workers. Because there are 50,000 groups, future_map() is called 50,000 times. This also makes map() run slower.

It is exactly the problem outlined in the Common Gotchas vignette.

require(dplyr); require(furrr); require(purrr); require(tidyr)
future::plan(multisession, workers = 3)

# Create some large dataset
Data <- as_tibble(mtcars)
Data <- vctrs::vec_rep(Data, 50000)
Data$ID <- vctrs::vec_rep_each(1:50000, nrow(mtcars))

NestedData <- Data %>% 
  group_by(ID) %>% 
  nest() %>% 
  ungroup()

tictoc::tic()
xx <- mutate(NestedData, data2 = map(data, identity))
tictoc::toc()
#> 0.105 sec elapsed

tictoc::tic()
xx <- mutate(NestedData, data2 = future_map(data, identity))
tictoc::toc()
#> 9.069 sec elapsed

This is an acceptable overhead to me, because future_map() has to shuffle the nested data frames to and from the workers.

@akarito

akarito commented Jul 13, 2024

On my computer, furrr is slower than purrr using the same code. I don't know why?

Test

@DavisVaughan
Collaborator

Look closer at #234 (comment)

I'm already showing an example where furrr is slower. That's perfectly normal when you are sending over large datasets to each worker and then running an extremely cheap function on each one of them.

When doing parallel work, there can be large costs to sending "big" datasets over to the workers, which is not something that sequential evaluation has to do.
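A minimal sketch of that trade-off (the names `expensive` and `payload` are illustrative, not from the thread): when each call does enough work, the cost of shipping data to the workers is amortized and the parallel version can pull ahead; with a cheap function like `identity`, transfer costs dominate instead.

```r
library(purrr)
library(furrr)

future::plan(multisession, workers = 2)

# Simulated per-element work that outweighs the cost of shipping the data
expensive <- function(x) {
  Sys.sleep(0.05)
  sum(x)
}

payload <- replicate(40, runif(1000), simplify = FALSE)

t_seq <- system.time(res_seq <- map_dbl(payload, expensive))[["elapsed"]]
t_par <- system.time(res_par <- future_map_dbl(payload, expensive))[["elapsed"]]

# Same results either way; only the timing differs.
```

Timings will vary by machine, but the point is the ratio of work to data size, not the absolute numbers.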

@DavisVaughan
Collaborator

Also, in the future, we'd prefer that you open new issues rather than commenting on old ones. That makes it easier for us to keep track!
