Non-uniform random floats (_randommodule.c) #31606
The original implementation produces output with some bias in the lower bits of the mantissa (more 0's than 1's). That's because we try to fit 2^53 pigeons into 2^52 available holes, so rounding to even is unavoidable. My solution is to generate just 2^52 pigeons, using the expression (randbits_53 | 1) to produce only the 2^52 odd numbers in the range [1, 2^53-1]. Zero can no longer be generated, but the distribution of mantissa bits becomes more uniform.
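For readers following along, here is a minimal Python model of the two schemes being compared. It is only a sketch under assumptions: CPython's `_randommodule.c` assembles its 53 bits in C from two 32-bit Mersenne Twister outputs, whereas this model simply calls `getrandbits(53)`, and the names `random_current` and `random_proposed` are made up for illustration.

```python
from random import getrandbits

T53 = 2 ** 53


def random_current() -> float:
    """Model of the existing scheme: k / 2**53 with k uniform in range(2**53)."""
    return getrandbits(53) / T53


def random_proposed() -> float:
    """Model of the PR's scheme: force the low bit on, so k is always odd."""
    return (getrandbits(53) | 1) / T53


if __name__ == "__main__":
    print(random_current(), random_proposed())
```

In this model the proposed variant can return only half as many distinct values, and 0.0 is not among them.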
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA).
Recognized GitHub username
We couldn't find a bugs.python.org (b.p.o) account corresponding to the following GitHub usernames: This might be simply due to a missing "GitHub Name" entry in one's b.p.o account settings. This is necessary for legal reasons before we can look at this contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. You can check yourself to see if the CLA has been received. Thanks again for the contribution, we look forward to reviewing it! |
Sorry, but this is misguided. There is no rounding of any kind going on. Each integer in range(2**53) is equally likely, and dividing it by 2**53 is an exact operation: the infinite-precision result is exactly representable as a 754 double. Note that while the double storage format has 52 mantissa bits, the value it represents has 53 mantissa bits: a leading 1 bit is implicit (except for tiny subnormal numbers, which are irrelevant here). |
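One way to check the exactness claim without trusting float arithmetic is to compare against exact rationals. This is a small sketch: `division_is_exact` is a hypothetical helper name, and the sample set (a few edge cases plus random 53-bit integers) is an arbitrary choice.

```python
from fractions import Fraction
from random import getrandbits

T53 = 2 ** 53


def division_is_exact(k: int) -> bool:
    """True if the double k / 2**53 equals the rational k / 2**53 exactly."""
    x = k / T53                      # the float actually produced
    return Fraction(x) == Fraction(k, T53)


# Edge cases plus a batch of random 53-bit integers.
samples = [0, 1, 2, T53 - 2, T53 - 1] + [getrandbits(53) for _ in range(10_000)]
all_exact = all(division_is_exact(k) for k in samples)

if __name__ == "__main__":
    print("every division exact:", all_exact)
```

`Fraction(x)` converts a float to the exact rational it stores, so any rounding in the division would show up as a mismatch.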
Sorry, your links don't work for me. They land on "bpo-38576: Disallow control characters in hostnames in http.client #18995". [OOPS! I wrote some nonsense here - deleted it - sorry] Demonstration that nothing is lost to rounding:
See? No info is rounded away. |
Really I did not get that.
|
Note that there's no reason to expect the mantissa bits in the storage format to be uniformly distributed. For example, the float 1.0 has 52 zero bits in the storage format's explicit mantissa. The float 3.0 has 51 trailing zero bits in the storage format's explicit mantissa. And so on. |
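The two examples above can be checked by pulling the explicit 52-bit mantissa field out of the raw storage format. This is a sketch using the standard `struct` module; the helper names `explicit_mantissa` and `trailing_zero_bits` are made up for illustration.

```python
import struct

MANTISSA_BITS = 52
MANTISSA_MASK = (1 << MANTISSA_BITS) - 1


def explicit_mantissa(x: float) -> int:
    """Return the 52 stored mantissa bits of an IEEE 754 double."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return bits & MANTISSA_MASK


def trailing_zero_bits(x: float) -> int:
    """Count trailing zero bits in the stored mantissa (52 if it is all zeros)."""
    m = explicit_mantissa(x)
    if m == 0:
        return MANTISSA_BITS
    return (m & -m).bit_length() - 1  # position of the lowest set bit


if __name__ == "__main__":
    for value in (1.0, 3.0, 0.75):
        print(value, format(explicit_mantissa(value), "052b"), trailing_zero_bits(value))
```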
What I mean is: take all possible integers in [0, 2**53) and divide each by 2**53. I plotted the distribution of the bits of the generated numbers. |
The method Python uses is the same as was shipped with the original Mersenne Twister code, so is a de facto industry standard. We're not going to change it without superb reason, and so far here there's no reason at all 😉. Note that all the buildbot tests failed here: that's because you changed I have no idea what you did, and since your links don't work as intended for me, apparently no way to find out. Here's a simple test of the distribution of the last 3 bits you can run yourself:

    from random import random

    counters = [0] * 8
    T53 = 2.0 ** 53
    for x in range(1000000):
        r = random()
        i = int(r * T53)          # recover the 53-bit integer behind r
        assert i / T53 == r       # the round trip loses nothing
        counters[i & 7] += 1      # tally the last 3 bits
    for c in counters:
        print(c)

Here's output from one run:
By eyeball it looks fine. Please move this to bugs.python.org if you want to pursue it. That's the place for extended discussions, not here. And attach the actual code you used to the issue report. About your other code, there may be some niche demand for the possibility of generating random floats that can be less than 1/2**53, but, again, we cannot change the output of |
BTW, current docs already show a way to get a uniform distribution across the entire range of representable doubles in |
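The recipe being referred to is, I believe, the `FullRandom` class from the Recipes section of the `random` module documentation; that identification is an assumption, since the comment above does not name it. A sketch along the lines of that recipe:

```python
from math import ldexp
from random import Random


class FullRandom(Random):
    """Random floats in [0.0, 1.0) that can reach every representable double."""

    def random(self) -> float:
        # 53-bit significand with the top bit set: a value in [2**52, 2**53).
        mantissa = 0x10_0000_0000_0000 | self.getrandbits(52)
        exponent = -53
        x = 0
        # Each leading zero bit in the random stream lowers the exponent by one,
        # giving the exponent a geometric distribution.
        while not x:
            x = self.getrandbits(32)
            exponent += x.bit_length() - 32
        return ldexp(mantissa, exponent)


if __name__ == "__main__":
    fr = FullRandom(12345)
    print([fr.random() for _ in range(3)])
```

The design choice is to pick the exponent geometrically and the significand uniformly, so small floats that `k / 2**53` can never produce become reachable.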
Here is my code showing the bias:
Here is a sample output
My explanation: Sorry if I flooded this discussion. |
is extracting the last 52 bits of the raw double storage format. As before, there's no reason to expect that to be uniformly distributed. To the contrary, it "should be" highly skewed. Let's look at the last byte instead, sticking to built-in Python functions:

    from random import random
    from collections import defaultdict

    counters = defaultdict(int)
    for x in range(1000000):
        r = random()
        string = r.hex()  # e.g., '0x1.c000000000000p+3'
        i = string.index('p')
        lastbyte = string[i - 1]  # last hex digit of the mantissa
        counters[lastbyte] += 1
    for byte, c in sorted(counters.items()):
        print(byte, c)

with highly skewed output:
|
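Why the last hex digit "should be" skewed can be worked out exactly. If k is uniform in range(2**53) and has fewer than 53 bits, normalization shifts it left and pads the low end of the stored mantissa with zeros. The sketch below derives the expected digit probabilities from that reasoning and compares the share of odd digits against a seeded sample; the function names and the seed are arbitrary choices for illustration.

```python
from fractions import Fraction
from random import Random


def last_hex_digit(x: float) -> str:
    """Last hex digit of the mantissa as printed by float.hex()."""
    s = x.hex()
    return s[s.index("p") - 1]


def expected_distribution() -> dict:
    """Exact digit probabilities for k / 2**53 with k uniform in range(2**53)."""
    probs = {format(d, "x"): Fraction(0) for d in range(16)}
    # shift = number of zero bits normalization appends below k's own bits.
    for shift in range(4):
        p_shift = Fraction(1, 2 ** (shift + 1))  # P(k.bit_length() == 53 - shift)
        digits = range(0, 16, 2 ** shift)        # digits still reachable
        for d in digits:
            probs[format(d, "x")] += p_shift / len(digits)
    probs["0"] += Fraction(1, 16)                # shift >= 4 (or k == 0): digit is 0
    return probs


rng = Random(2022)
n = 200_000
odd_share = sum(last_hex_digit(rng.random()) in "13579bdf" for _ in range(n)) / n

if __name__ == "__main__":
    for digit, p in expected_distribution().items():
        print(digit, p)
    print("observed share of odd digits:", odd_share)
```

Under this model each odd digit has probability 1/32 while digit 0 has probability 3/16, so a skew is the expected outcome and not evidence of rounding.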
Finally :)
A nice uniform distribution:
|
This PR is stale because it has been open for 30 days with no activity. If the CLA is not signed within 14 days, it will be closed. See also https://devguide.python.org/pullrequest/#licensing |
On one hand, this changes the expected value of random() from
On the other hand, it makes
Consistency across versions and implementations, and the loss of one bit of entropy per float, feel like bigger issues to me though, so my suggestion would be to leave the code as is. Anyone needing integer precision should use integer operations. |
I'm closing this, because there's really no chance it would be adopted. As mentioned before, we'd need extremely strong reason to change what Note that while the mean would theoretically (although probably not measurably, in our lifetime) become 0.5, it would no longer be possible for |
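The remark about the mean can be made concrete with exact rational arithmetic. This sketch brute-forces a scaled-down 10-bit analogue of both schemes and then states the 53-bit value of the current scheme in closed form; `exact_mean` and the choice of 10 bits are illustrative assumptions.

```python
from fractions import Fraction


def exact_mean(values, bits: int) -> Fraction:
    """Exact mean of k / 2**bits over the given integers k."""
    values = list(values)
    return Fraction(sum(values), len(values) * 2 ** bits)


SMALL = 10
mean_all_small = exact_mean(range(2 ** SMALL), SMALL)      # every k, like the current scheme
mean_odd_small = exact_mean(range(1, 2 ** SMALL, 2), SMALL)  # odd k only, like the proposal

# Closed form for the real 53-bit case: the mean of k / 2**53 over range(2**53).
mean_current_53 = Fraction(2 ** 53 - 1, 2 ** 54)

if __name__ == "__main__":
    print(mean_all_small, mean_odd_small)
    print("gap from 1/2 at 53 bits:", Fraction(1, 2) - mean_current_53)
```

The current scheme's mean falls short of 0.5 by exactly 2**-54, which is why the difference would be practically unmeasurable, and the odd-only scheme reaches 0.5 at the cost of never producing 0.0.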