Discussion: About reproducible parallel RNG streams #41
So, several things are going on here. The underlying issue is that foreach left it to the backends to handle certain things, including parallel RNG. This is very unfortunate because it makes it really hard to write foreach code that works as-is with different foreach adapters. Ideally, foreach would have implemented parallel RNG itself so that it would work the same everywhere. (This is the approach I'm taking in the future framework, e.g. future.apply.) Such a discussion is probably better suited I did consider having doFuture have built-in support for parallel RNG, much like future.apply. If I did (or end up doing so), then you could do:
library(foreach)
doFuture::registerDoFuture()
set.seed(123)
y <- foreach(i = rep(3, 2)) %dopar% {
runif(i)
}
and you'd get numerically identical results regardless of the future backend used (as you hoped for in Example 1). Maybe I should just add that; it's pretty easy to do. The downside is that the same doParallel::registerDoParallel(4L) instead. Currently, all existing foreach adapters require doRNG for parallel RNG. So, you need to do something like what you propose in Example 2:
doParallel::registerDoParallel(4L)
doRNG::registerDoRNG(123)
Using doRNG is basically the de facto standard (= the only solution available), which I didn't want to break. At least, I didn't want to break it, or introduce yet another standard, without carefully considering its consequences. However, maybe one can argue that if
doParallel::registerDoParallel(4L)
doRNG::registerDoRNG()
remain in control of the end user, and not the developer, then it would be straightforward for the user to switch to:
future::plan("multisession", workers = 4L)
doFuture::registerDoFuture()
and
y <- foreach(i = rep(3, 2)) %dopar% {
runif(i)
}
would work in both cases. I'll try to think about this more, but right now I'm tempted to add built-in support for parallel RNG to doFuture. It would certainly lower friction for both developers and end users. |
Yes, I think you found a nice balance there. Personally, I think: doRNG::registerDoRNG()
y <- foreach(...) %dopar% { ... } is nicer and less confusing than: library(doRNG)
y <- foreach(...) %dorng% { ... } One reason is the developer does not have to update from |
Yeah, maybe more people will find it here since I think the other GH repo is not as well known as doFuture. We can't change it anymore now, so let's go with it 😄 From my point of view, I would like to see something like library(foreach)
doFuture::registerDoFuture(reproducible = TRUE)
set.seed(123)
y <- foreach(i = rep(3, 2)) %dopar% {
runif(i)
}
Not sure if this is somewhat overkill, but it would make it very clear, just from reading the code, that something reproducible is going on here. In my dreams, I would like to see this argument everywhere, including {parallel}, combined with a help page for the functions explaining what's going on. Being strict about this, it would of course also be nice if all {future} packages went this way too, to be consistent.
|
Quick comment: I think there's room for more than one definition of reproducible, e.g. numerical and statistical. In the future framework (as in future.apply), I aimed at supporting numerically identical RNG reproducibility, so that the backend or the number of workers does not matter. However, that comes with some overhead, and in many statistical analyses / simulations it should be sufficient to use valid parallel RNG per worker (not per element), cf. futureverse/future.apply#20. The latter is basically what's used in |
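To illustrate the "per element" idea, here is a minimal base-R sketch. It is an assumption-laden illustration of the general technique (one L'Ecuyer-CMRG stream per element via parallel::nextRNGStream()), not the actual future.apply implementation; the names `streams` and `draw` are made up for this example.

```r
# Sketch only: one L'Ecuyer-CMRG stream per *element*, so the result for
# element i does not depend on which worker evaluates it or in which order.
RNGkind("L'Ecuyer-CMRG")
set.seed(123)

n <- 2L
streams <- vector("list", n)   # hypothetical name: one RNG state per element
s <- .Random.seed
for (i in seq_len(n)) {
  streams[[i]] <- s
  s <- parallel::nextRNGStream(s)
}

# Hypothetical helper: install the element's own stream, then draw from it.
draw <- function(i, size = 3L) {
  assign(".Random.seed", streams[[i]], envir = globalenv())
  runif(size)
}

# Evaluate the elements in two different orders; a parallel backend could
# hand them to workers in any order in the same way.
y_forward  <- lapply(1:2, draw)
y_backward <- rev(lapply(2:1, draw))
identical(y_forward, y_backward)
```

Because each element carries its own pre-generated stream, the chunking of elements onto workers becomes irrelevant. The cost is that all n streams have to be generated up front, which is the overhead mentioned above.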
Actually, when thinking more about it, it might be a bad idea to use:
instead of (*) One could, of course, imagine the developer calling |
I see. Well, there are probably pros and cons for both.
I wonder what steps might be most promising to not be too pushy to the devs but also pointing out our concerns/ideas. |
The official GitHub foreach repo is https://github.com/RevolutionAnalytics/foreach/. It used to be hosted on the R-Forge SVN server but moved to GitHub around mid-2019, I think. At that time, maintenance was also handed over to Hong Ooi. Drop an issue asking if they can add
I would drop an issue at https://github.com/renozao/doRNG/issues with a wish for clarification/thoughts from the maintainer on whether to use
Yes, this is unfortunate. It would be useful to know the history of doRNG in relation to foreach, e.g. were they talking to each other, or did doRNG fill a gap that the foreach folks didn't want to fill themselves, ... Knowing a bit more about this would help in understanding the current design and philosophy better, which in turn helps when you propose improvements. (Personally, I think foreach should have built-in support for parallel RNGs such that code and results would not depend on the backend used.) |
After you created the 'Native support for a reproducible parallel RNG streams?' issue over at foreach, I think we can close this one here. The foreach issue covers the problems and suggestions needed to go forward there. For what it's worth, with future 1.21.0 now on CRAN, the next release of doFuture will produce a more informative and specific warning message when one forgets to declare the use of the RNG:
> library(doFuture)
> registerDoFuture()
> y <- foreach(x=1:2) %dopar% { rnorm(x) }
Warning message:
UNRELIABLE VALUE: One of the foreach() iterations ('doFuture-1') unexpectedly generated random numbers without
declaring so. There is a risk that those random numbers are not statistically sound and the overall results
might be invalid. To fix this, use '%dorng%' from the 'doRNG' package instead of '%dopar%'. This ensures that
proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check,
set option 'future.rng.onMisuse' to "ignore". |
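A minimal sketch of the fix that the warning suggests, assuming the doFuture and doRNG packages are installed. The sequential plan is chosen here only to keep the example lightweight; any other future backend could be substituted.

```r
library(doFuture)  # also attaches foreach and future
library(doRNG)     # provides the %dorng% operator

registerDoFuture()
plan(sequential)   # assumption: any future backend could be used here

# %dorng% declares that the loop body uses the RNG, so parallel-safe
# L'Ecuyer-CMRG streams are set up for the iterations.
set.seed(123)
y1 <- foreach(x = 1:2) %dorng% { rnorm(x) }

# Re-seeding and re-running is how one would check reproducibility.
set.seed(123)
y2 <- foreach(x = 1:2) %dorng% { rnorm(x) }

identical(y1, y2)  # compare the two runs
```

Compared with calling doRNG::registerDoRNG() and keeping %dopar%, this variant requires the developer to change the operator in the code, which is the trade-off discussed earlier in this thread.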
Continuing our mail discussion here.
The variety of options, especially when using the foreach framework, seems overwhelming.
Is it expected that "Example 1" and "Example 2" work in the same way even though the foreach operators are different?
Is "Example 2" the canonical way from your point of view right now?
Example 1
Created on 2019-12-25 by the reprex package (v0.3.0)
Example 2
Created on 2019-12-25 by the reprex package (v0.3.0)