furrr much slower than purrr on nested data #234
Comments
For the first one, it is the progress bar that is taking so long. The progress bar should mainly be used when each individual iteration takes a relatively long time; otherwise its overhead outweighs its usefulness. Also note that the progress bar is deprecated and should not really be used anymore. I will eventually remove it in favor of the progressr package.

I'm not surprised that furrr is slower here; when the total time is < 5 seconds or so, I expect it.

require(dplyr); require(furrr); require(purrr); require(tidyr)
future::plan(multisession, workers = 3)
# Create some large dataset
Data <- as_tibble(mtcars)
Data <- vctrs::vec_rep(Data, 50000)
Data$ID <- vctrs::vec_rep_each(1:50000, nrow(mtcars))
Data
#> # A tibble: 1,600,000 × 12
#> mpg cyl disp hp drat wt qsec vs am gear carb ID
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 1
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 1
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 1
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 1
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 1
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 1
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 1
#> # … with 1,599,990 more rows
tictoc::tic()
xx <- Data %>% mutate(disp2 = map_dbl(disp, identity))
tictoc::toc()
#> 0.943 sec elapsed
tictoc::tic()
xx <- Data %>% mutate(disp2 = future_map_dbl(disp, identity))
tictoc::toc()
#> 1.538 sec elapsed

Created on 2022-05-12 by the reprex package (v2.0.1)

I'll address the second question in a moment...
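On the progress bar point above: a minimal sketch of reporting progress through progressr instead of furrr's deprecated progress bar. The function `slow_square()`, the input size, and the worker count are made up for illustration; the pattern of creating a `progressor()` inside `with_progress()` and calling it once per iteration is the one progressr documents, but treat this as a sketch rather than the recommended setup for any particular workload.

```r
library(furrr)
library(progressr)

# Two workers is an arbitrary choice for this sketch.
future::plan(future::multisession, workers = 2)

# Hypothetical stand-in for work that is slow enough to justify progress updates.
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

inputs <- 1:10

with_progress({
  # One progress step per element of `inputs`.
  p <- progressor(steps = length(inputs))
  squares <- future_map_dbl(inputs, function(x) {
    p()  # signal that one step has finished
    slow_square(x)
  })
})

# Return to sequential evaluation so no workers are left running.
future::plan(future::sequential)
```

As with the old progress bar, this only pays off when each iteration is expensive; for something as cheap as `identity()` the signalling itself would dominate.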
For the second question, you just forgot to ungroup(). It is exactly the problem outlined in the Common Gotchas vignette.

require(dplyr); require(furrr); require(purrr); require(tidyr)
future::plan(multisession, workers = 3)
# Create some large dataset
Data <- as_tibble(mtcars)
Data <- vctrs::vec_rep(Data, 50000)
Data$ID <- vctrs::vec_rep_each(1:50000, nrow(mtcars))
NestedData <- Data %>%
group_by(ID) %>%
nest() %>%
ungroup()
tictoc::tic()
xx <- mutate(NestedData, data2 = map(data, identity))
tictoc::toc()
#> 0.105 sec elapsed
tictoc::tic()
xx <- mutate(NestedData, data2 = future_map(data, identity))
tictoc::toc()
#> 9.069 sec elapsed

This is an acceptable overhead to me, because it has to shuffle the nested data frames to and from the workers.
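Why the `ungroup()` call in the reprex above matters can be seen without any parallel backend. The sketch below assumes that `mutate()` on a grouped data frame evaluates its expression once per group, so a mapping function placed inside it is invoked once per group instead of once overall. `counting_map()` is a hypothetical helper written only for this illustration, and the data is cut down to four groups.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical helper: behaves like purrr::map() but counts how often it is called.
counter <- new.env()
counter$calls <- 0
counting_map <- function(.x, .f) {
  counter$calls <- counter$calls + 1
  map(.x, .f)
}

# Four small groups instead of 50,000, so the example runs instantly.
small <- as_tibble(mtcars)
small$ID <- rep(1:4, each = 8)

# nest() on a grouped data frame keeps the grouping on ID.
nested_grouped <- small %>% group_by(ID) %>% nest()
nested_ungrouped <- ungroup(nested_grouped)

counter$calls <- 0
out_grouped <- mutate(nested_grouped, data2 = counting_map(data, identity))
calls_grouped <- counter$calls

counter$calls <- 0
out_ungrouped <- mutate(nested_ungrouped, data2 = counting_map(data, identity))
calls_ungrouped <- counter$calls

cat("calls when grouped:  ", calls_grouped, "\n")
cat("calls when ungrouped:", calls_ungrouped, "\n")
```

If the mapping function were `future_map()`, every one of those extra calls would pay the cost of setting up and collecting futures, which is presumably why a data frame left grouped by 50,000 IDs appears to run forever.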
Look closer at #234 (comment). I'm already showing an example where furrr is slower. That's perfectly normal when you are sending large datasets to each worker and then running an extremely cheap function on each of them. Parallel work can incur large costs from sending "big" datasets to the workers, which sequential evaluation never has to do.
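A rough way to get a feel for that transfer cost, using only base R. This is a sketch under stated assumptions: a `serialize()`/`unserialize()` round trip stands in for shipping each piece to a worker and back, which ignores the actual inter-process communication and so likely understates the real overhead. The group count is reduced to 2,000 so it runs quickly.

```r
# Scaled-down version of the data in this thread (2,000 groups, not 50,000).
n_groups <- 2000
data <- mtcars[rep(seq_len(nrow(mtcars)), times = n_groups), ]
rownames(data) <- NULL
data$ID <- rep(seq_len(n_groups), each = nrow(mtcars))

# One small data frame per ID, similar in spirit to a nested list-column.
pieces <- split(data[names(data) != "ID"], data$ID)

# Cost of the actual work: identity() on every piece.
work_time <- system.time(
  results <- lapply(pieces, identity)
)[["elapsed"]]

# Cost of a serialize/unserialize round trip for every piece.
transfer_time <- system.time(
  round_trip <- lapply(pieces, function(piece) unserialize(serialize(piece, NULL)))
)[["elapsed"]]

# Total number of bytes that would have to be sent one way.
bytes_sent <- sum(vapply(
  pieces,
  function(piece) length(serialize(piece, NULL)),
  integer(1)
))

cat("work:            ", work_time, "sec\n")
cat("round trip:      ", transfer_time, "sec\n")
cat("bytes serialized:", bytes_sent, "\n")
```

When the round trip takes longer than the work itself, as is likely with a function as cheap as `identity()`, adding workers cannot make the job faster; parallelism starts to help only once the per-piece computation outweighs the cost of moving the piece.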
Also, in the future, we'd prefer that you open new issues rather than commenting on old ones. It is easier for us to keep track of!
Hello,

I would like to use the furrr package to analyze data row-wise. I find that using furrr is slower than purrr, which was also reported by @hadley here: #41.

Here is a reprex.

Here, I apply a simple function to each row of the data. There is a time difference, but not a huge one.

However, when applying another simple function to a nested dataset, it works fine using purrr but takes ages to run (if it works at all) using furrr, which is weird.

I tested this on two different R installations (Windows with R 4.2.0 and furrr 0.3.0; RStudio Server with R 3.6 and furrr 0.2.3).

Is there a reason for this? Any advice on making a parallel analysis of nested datasets faster?

Cheers,
Ahmed