Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: About reproducible parallel RNG streams #41

Closed
pat-s opened this issue Dec 25, 2019 · 8 comments
Closed

Discussion: About reproducible parallel RNG streams #41

pat-s opened this issue Dec 25, 2019 · 8 comments

Comments

@pat-s
Copy link

pat-s commented Dec 25, 2019

Continuing our mail discussion here.

The variety of options, especially when using the foreach framework seems overhwelming.

  • Is is expected that "Example 1" and "Example 2" work in the same way even though the foreach operators are different?

  • Is "Example 2" the canonical way from your point of view right now?

Example 1
library("doFuture")
#> Loading required package: globals
#> Loading required package: future
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: parallel

# register parallel backend
registerDoFuture()
plan(multisession, workers = 2)

set.seed(123, "L'Ecuyer-CMRG") 
r15 = foreach(i = rep(3, 2)) %dopar% {
  runif(i)
}

set.seed(123, "L'Ecuyer-CMRG") 
r16 = foreach(i = rep(3, 2)) %dopar% {
  runif(i)
}

all.equal(r15, r16)
#> [1] "Component 1: Mean relative difference: 0.2142185"
#> [2] "Component 2: Mean relative difference: 0.5201712"

Created on 2019-12-25 by the reprex package (v0.3.0)

Example 2
library("doFuture")
#> Loading required package: globals
#> Loading required package: future
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: parallel
library("doRNG")
#> Loading required package: rngtools
#> Loading required package: pkgmaker
#> Loading required package: registry
#> 
#> Attaching package: 'pkgmaker'
#> The following object is masked from 'package:base':
#> 
#>     isFALSE

# register parallel backend
registerDoFuture()
plan(multisession, workers = 2)

registerDoRNG(123)
r15a = foreach(i = rep(3, 2)) %dopar% {
  runif(i)
}

registerDoRNG(123)
r16a = foreach(i = rep(3, 2)) %dopar% {
  runif(i)
}
all.equal(r15a, r16a)
#> [1] TRUE

Created on 2019-12-25 by the reprex package (v0.3.0)

@HenrikBengtsson
Copy link
Collaborator

So, several things go on here. The underlying issue is that foreach left it to the backends to handle certain things, including parallel RNG. This is very unfortunate because it is really hard to write foreach code that works as-is with different foreach adapters. Ideally, foreach would have implemented parallel RNG itself such that it would work the same everywhere. (This is the approach I'm taking in the future framework, e.g. future.apply). Such a discussion is probably better suited
over at https://github.com/RevolutionAnalytics/foreach/issues.

I did consider having doFuture have built-in support for parallel RNG much like future.apply. If I did (or end up doing), then you could do:

library(foreach)
doFuture::registerDoFuture()
set.seed(123) 
y <- foreach(i = rep(3, 2)) %dopar% {
  runif(i)
}

and you'd get numerically identical results regardless of future-backend used (as you hoped for in Example 1). Maybe I should just add that - it's pretty easy to do. The downside is that the same foreach() will not use proper parallel RNG if you use:

doParallel::registerDoParallel(4L)

instead. Currently, all existing foreach adapters requires doRNG for parallel RNG. So, you need to do something like you propose in Example 2:

doParallel::registerDoParallel(4L)
doRNG::registerDoRNG(123)

Using doRNG is basically a de-facto standard (= the only solution available) that I didn't want to break. At least, I didn't want to break it or introduce yet another standard, without carefully consider it's consequences.

Howeber, maybe one can argue that if

doParallel::registerDoParallel(4L)
doRNG::registerDoRNG()

remain in control of the end user, and not the developer, then it would be straightforward for the user to switch to:

future::plan("multisession", workers = 4L)
doFuture::registerDoFuture()

and

y <- foreach(i = rep(3, 2)) %dopar% {
  runif(i)
}

would work in both cases.

I'll try to think about this more but right now I tempted to add built-in support for parallel RNG to doFuture. It would certainly lower friction for both developers and end users.

@HenrikBengtsson
Copy link
Collaborator

Is "Example 2" the canonical way from your point of view right now?

Yes, I think you found a nice balance there. Personally, I think:

doRNG::registerDoRNG()
y <- foreach(...) %dopar% { ... }

is nicer and less confusing than:

library(doRNG)
y <- foreach(...) %dorng% { ... }

One reason is the developer does not have to update from %dopar% to %dorng% when needing parallel RNGs. Also, the latter is error prone because it's easy to forget to do that update and there will not be an warnings or errors telling you you forgot. Third, as argue in my previous comment, sticking with %dopar% everywhere will make it possible to add built-in support for parallel RNG in doFuture without having to update the foreach code.

@pat-s
Copy link
Author

pat-s commented Jan 1, 2020

Such a discussion is probably better suited over at RevolutionAnalytics/foreach/issues.

Yeah, maybe more people will find it here since I think the other GH repo is not as well known as doFuture. We can't change it anymore now, so let's go with it 😄

From my point of view, I would like to see something like

library(foreach)
doFuture::registerDoFuture(reproducible = TRUE)
set.seed(123) 
y <- foreach(i = rep(3, 2)) %dopar% {
  runif(i)
}

Not sure if this is somewhat of an overkill but it would make it very clear by reading only the code that something reproducible is going on here.
I kinda miss this explicit statement in lots of scripts because it is often not obvious to readers that the "L'Ecuyer-CMRG" RNG kind will do this (wherever it might be used, in the script or internally in a package).
I'll think of doing so in {parallelMap} and the more I think about it, the more I like it.

In my dreams I would like to see this argument everywhere, including {parallel} combined with a help page in the functions explaining what's going on.
This would be a very clear and simple explanation of what's going on, even behind the scenes. What do you think?

Being strict with this, it would be of course also be nice if all {future} packages would go this way then to be consistent.
The thinking is again: "Whenever you want to have reproducible numbers when going parallel, set reproducible = TRUE in whatever {future} related function you are calling". future.seed is already there but maybe not really descriptive by its name to the end user?
I know that changing the argument name is not really an option for backward comp and adding another one might again make the situation more complex :/
(I don't know, just some random thoughts of mine - maybe a bit to revolutionary 😄 )


dopar vs dorng

I see no good reason why the second one should even exist (just from a "do we need it" point of view).
I think for simplicity and to keep things tidy, we should just go with one operator to indicate one is going parallel and deal with the reproducibility in some other way than switching to a different operator.
I haven't seen any people using it in the wild so I am not sure how "famous" it really is.
I also was not aware of it before you mentioned it lately to me - but that is of course not a robust reference of its popularity.

@HenrikBengtsson
Copy link
Collaborator

Quick comment: I think there's room for more than one definition of reproducible, e.g. numerical and statistical. In the future framework (as in future.apply), I aimed at supporting numerically identical RNG reproducibility so backend or number workers does not matter. However, that comes with some overhead and in many statistical analyses / simulations it should be sufficient use valid parallel RNG per worker (not per element), cf. futureverse/future.apply#20. The latter is basically what's used in parallel::mclapply(..., mc.set.seed = TRUE).

@HenrikBengtsson HenrikBengtsson changed the title Discussion: About reproducible parallel streams Discussion: About reproducible parallel RNG streams Jan 8, 2020
@HenrikBengtsson
Copy link
Collaborator

Actually, when thinking more about it, it might be a bad idea to use:

doRNG::registerDoRNG()

instead of %dorng%. My reason for thinking so now is that %dorng% is in control of the developer and they are the person who knows whether parallel RNG streams are needed or not. With doRNG::registerDoRNG() we leave it to the user to make sure correct RNG settings are used, which should not be their responsibility (*).

(*) One could, of course, imagine the developer calling doRNG::registerDoRNG() internally, but I'm not sure if one can undo that with on.exit() afterward. Also, using %dorng% will make it much more clear in the code when parallel RNG is needed.

@pat-s
Copy link
Author

pat-s commented Jan 13, 2020

I see.

Well, there are probably pros and cons for both.
Part of the problem is that

  • {doRNG} is not popular at all (currently)
  • {foreach} does not talk about/list it explicitly in their package as a backend
  • {foreach} does not have a dev repo on Github (at least its somewhat hidden and not linked in the DESCRIPTION file)
  • {doRNG} does not give a nice overview in terms of usage and what might be the best practice from their point of view

I wonder what steps might be most promising to not be too pushy to the devs but also pointing out our concerns/ideas.

@HenrikBengtsson
Copy link
Collaborator

  • {foreach} does not have a dev repo on Github (at least its somewhat hidden and not linked in the DESCRIPTION file)

The official GitHub foreach repos is https://github.com/RevolutionAnalytics/foreach/. It used to be hosted on the R-Forge SVN server but moved to GitHub about mid 2019, I think. At this time, maintenance was also handed over too Hong Ooi. Drop an issue asking if they can add URL and BugReports fields to the DESCRIPTION file so it'll show up on CRAN.

  • {doRNG} does not give a nice overview in terms of usage and what might be the best practice from their point of view

I would drop an issue at https://github.com/renozao/doRNG/issues with a wish for clarification/thoughts from the maintainer on whether to use registerDoRNG() or %dorng%. It might help clarify the question if you scribble down what you currently think are the pros and cons too. Hopefully, the outcome of such a discussion will find it's way into the documentation/help pages.

  • {foreach} does not talk about/list it explicitly in their package as a backend

Yes, this is unfortunate. It would be useful to know the history of doRNG in relationship to foreach, e.g. where they talking to each other or did doRNG fill a gap that the foreach folks didn't want to fill in themselves, ... Knowing a bit more about this would help understand the current design and philosophy better which in turn help when you propose improvements.

(Personally, I think foreach should have built-in support for parallel RNGs such that code and results would not depend on the backend used.

@HenrikBengtsson
Copy link
Collaborator

After you created the 'Native support for a reproducible parallel RNG streams?' issue over at foreach, I think we can close this one here. The foreach issue covers the problems and suggestions needed to go forward there.

For what's it worth, with future 1.21.0 now on CRAN, the next release of doFuture will produce a more informative and specific warning message when one forgets to declare the use of the RNG;

> library(doFuture)
> registerDoFuture()
> y <- foreach(x=1:2) %dopar% { rnorm(x) }
Warning message:
UNRELIABLE VALUE: One of the foreach() iterations ('doFuture-1') unexpectedly generated random numbers without 
declaring so. There is a risk that those random numbers are not statistically sound and the overall results 
might be invalid. To fix this, use '%dorng%' from the 'doRNG' package instead of '%dopar%'. This ensures that
proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, 
set option 'future.rng.onMisuse' to "ignore". 

@HenrikBengtsson HenrikBengtsson added this to the Next release milestone Dec 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants