
Non-uniform random floats (_randommodule.c) #31606

Closed

Conversation

amrali-eg

The original implementation produces output that has some bias in the lower bits of the mantissa (more 0's than 1's). That's because we try to fit 2^53 pigeons into 2^52 available holes, so rounding to even is unavoidable. My solution is to generate just 2^52 pigeons, using the expression (randbits_53 | 1) to produce only the 2^52 odd numbers in the range [1, 2^53-1]. Zero can no longer be generated, but the distribution of mantissa bits becomes more uniform.

@the-knights-who-say-ni

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA).

Recognized GitHub username

We couldn't find a bugs.python.org (b.p.o) account corresponding to the following GitHub usernames:

@amrali-eg

This might be simply due to a missing "GitHub Name" entry in one's b.p.o account settings. This is necessary for legal reasons before we can look at this contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

You can check yourself to see if the CLA has been received.

Thanks again for the contribution, we look forward to reviewing it!

@tim-one
Member

tim-one commented Feb 27, 2022

Sorry, but this is misguided. There is no rounding of any kind going on. Each integer in range(2**53) is equally likely, and dividing it by 2**53 is an exact operation: the infinite-precision result is exactly representable as a 754 double.

Note that while the double storage format has 52 mantissa bits, the value it represents has 53 mantissa bits: a leading 1 bit is implicit (except for tiny subnormal numbers, which are irrelevant here).
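For concreteness, that scheme can be sketched in pure Python along these lines (an illustrative transcription, not the actual C in _randommodule.c; the function name is made up for this example):

import random

def random53(rng=random):
    # Combine two truncated 32-bit outputs into a uniform 53-bit integer;
    # dividing it by 2**53 is exact, so nothing is lost to rounding.
    a = rng.getrandbits(32) >> 5   # keep the top 27 bits
    b = rng.getrandbits(32) >> 6   # keep the top 26 bits
    return (a * 67108864.0 + b) / 9007199254740992.0  # (a*2**26 + b) / 2**53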

@tim-one
Member

tim-one commented Feb 28, 2022

Sorry, your links don't work for me. They land on "bpo-38576: Disallow control characters in hostnames in http.client #18995".

[OOPS! I wrote some nonsense here - deleted it - sorry]

Demonstration that nothing is lost to rounding:

>>> i = 2**52 + 1
>>> i.bit_length()
53
>>> float(i)
4503599627370497.0
>>> i
4503599627370497

See? No info is rounded away.

@amrali-eg
Author

amrali-eg commented Feb 28, 2022

I really didn't get that.
Anyway, what do you think about generating floats like this:

import random
from sys import float_info

def fRandom():
    mantissa = random.getrandbits(52)
    significand = 1.0 + mantissa * float_info.epsilon  # uniform in [1.0, 2.0)
    x = 0
    while not x:
        x = random.getrandbits(32)
        x = x & -x  # bitmask for rightmost 1-bit (exponential)
    return significand / (2 * x)

@tim-one
Member

tim-one commented Feb 28, 2022

Note that there's no reason to expect the mantissa bits in the storage format to be uniformly distributed. For example, the float 1.0 has 52 zero bits in the storage format's explicit mantissa. The float 3.0 has 51 trailing zero bits in the storage format's explicit mantissa. And so on.
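float.hex() makes this easy to see:

>>> (1.0).hex()   # all 52 explicit mantissa bits are zero
'0x1.0000000000000p+0'
>>> (3.0).hex()   # 1.5 * 2**1: one explicit mantissa bit set, 51 trailing zeros
'0x1.8000000000000p+1'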

@amrali-eg
Author

amrali-eg commented Feb 28, 2022

What I mean is: for all possible numbers in [0, 2**53), when divided by 2**53, I plotted the distribution of the bits of the generated numbers.
I also used 'detect-significand-bias.py' from the Rademacher Floating Point Library (on GitLab) to check the bits.

@tim-one
Member

tim-one commented Feb 28, 2022

The method Python uses is the same as was shipped with the original Mersenne Twister code, so is a de facto industry standard. We're not going to change it without superb reason, and so far here there's no reason at all 😉. Note that all the buildbot tests failed here: that's because you changed random()'s output. That's not something that can be done either without superb reason.

I have no idea what you did, and since your links don't work as intended for me, apparently no way to find out. Here's a simple test of the distribution of the last 3 bits you can run yourself:

from random import random
counters = [0] * 8
T53 = 2.0 ** 53
for x in range(1000000):
    r = random()
    i = int(r * T53)
    assert i / T53 == r
    counters[i & 7] += 1
for c in counters:
    print(c)

Here's output from one run:

124729
124854
125701
125369
124791
124641
124833
125082

By eyeball it looks fine.

Please move this to bugs.python.org if you want to pursue it. That's the place for extended discussions, not here. And attach the actual code you used to the issue report.

About your other code, there may be some niche demand for the possibility of generating random floats that can be less than 1/2**53, but, again, we cannot change the output of random.random() without superb reason. People expect to be able to seed the generator to a known value, and get exactly the same results across Python releases.

@tim-one
Member

tim-one commented Feb 28, 2022

BTW, current docs already show a way to get a uniform distribution across the entire range of representable doubles in [0, 1):

https://docs.python.org/3/library/random.html#recipes
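The idea of that recipe is roughly as follows (see the linked docs for the authoritative version): draw a 53-bit significand, then a geometrically distributed exponent.

from random import Random
from math import ldexp

class FullRandom(Random):
    def random(self):
        # 53-bit significand with the leading bit forced on: [2**52, 2**53)
        mantissa = 0x10_0000_0000_0000 | self.getrandbits(52)
        exponent = -53
        x = 0
        while not x:
            # every run of 32 zero bits pushes the exponent down by another 32
            x = self.getrandbits(32)
            exponent += x.bit_length() - 32
        return ldexp(mantissa, exponent)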

@amrali-eg
Author

amrali-eg commented Feb 28, 2022

Here is my code showing the bias:

import numpy as np
from random import random

def significand(xs: np.ndarray):  # extract the low 52 stored mantissa bits of each double
    ss = np.abs(xs).view(np.uint64) & (2**52 - 1)
    return ss

bits = 8
counters = [0] * bits
lst = []
for x in range(100000):
    lst.append(random())

xs = np.array(lst, dtype='float64')
ss = significand(xs)
for x in range(100000):
    mantissa = int(ss[x])
    for j in range(bits):
        counters[j] += (mantissa >> j) & 1
for c in counters:
    print(c)

Here is a sample output

25143
37622
43724
47063
48531
49187
49664
49867


My explanation:
dividing an even number by a power of 2 shifts the mantissa bits to the left during normalization, leaving 0's on the right. So I thought it would be better to avoid even numbers and use only odd numbers to prevent that behavior.
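For instance:

>>> ((2**53 - 1) / 2**53).hex()   # odd numerator: lowest stored mantissa bit is 1
'0x1.fffffffffffffp-1'
>>> ((2**53 - 2) / 2**53).hex()   # even numerator: normalization leaves a trailing 0
'0x1.ffffffffffffep-1'
>>> (2**52 / 2**53).hex()         # numerator divisible by 2**52: all 52 stored bits are 0
'0x1.0000000000000p-1'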

Sorry if I flooded this discussion.
Thanks for your time

@tim-one
Member

tim-one commented Feb 28, 2022

ss = np.abs(xs).view(np.uint64) & (2**52 - 1)

is extracting the last 52 bits of the raw double storage format. As before, there's no reason to expect that to be uniformly distributed. To the contrary, it "should be" highly skewed.

Let's look at the last byte instead, sticking to built-in Python functions:

from random import random
from collections import defaultdict
counters = defaultdict(int)
for x in range(1000000):
    r = random()
    string = r.hex() # e.g., '0x1.c000000000000p+3'
    i = string.index('p')
    lastbyte = string[i - 1]
    counters[lastbyte] += 1
for byte, c in sorted(counters.items()):
    print(byte, c)

with highly skewed output:

0 187891
1 31447
2 62451
3 31247
4 93683
5 31066
6 62588
7 31030
8 125074
9 31042
a 62429
b 31676
c 93564
d 31127
e 62338
f 31347

@amrali-eg
Author

Finally :)

import random
from collections import defaultdict

def unbiased_random():
    return (random.getrandbits(53) | 1) / 9007199254740992.0  # 2**53

counters = defaultdict(int)
for x in range(1000000):
    #r = random.random()
    r = unbiased_random()
    string = r.hex() # e.g., '0x1.c000000000000p+3'
    i = string.index('p')
    lastbyte = string[i - 1]
    counters[lastbyte] += 1
for byte, c in sorted(counters.items()):
    print(byte, c)

A nice uniform distribution:

0 62907
1 62211
2 62538
3 62681
4 62223
5 62398
6 62230
7 62183
8 62914
9 62635
a 62780
b 62323
c 62133
d 62253
e 62846
f 62745

@github-actions

This PR is stale because it has been open for 30 days with no activity. If the CLA is not signed within 14 days, it will be closed. See also https://devguide.python.org/pullrequest/#licensing

github-actions bot added the stale label (Stale PR or inactive for long period of time) on Mar 31, 2022
@sweeneyde
Member

On one hand, this changes the expected value of random() from 1/2 - 2**-54 to 1/2 exactly:

>>> from statistics import mean
>>> mean(x/4 for x in range(4)) # 3/8
0.375
>>> mean(x/8 for x in range(8)) # 7/16
0.4375
>>> mean(x/2**26 for x in range(2**26)) # (2**26-1)/2**27
0.4999999925494194
>>> mean((x|1)/4 for x in range(4))
0.5
>>> mean((x|1)/8 for x in range(8))
0.5
>>> mean((x|1)/2**26 for x in range(2**26))
0.5

# On my hypothetical superfast computer
>>> mean(x/2**53 for x in range(2**53)) # (2**53-1)/2**54
0.49999999999999994
>>> import math
>>> math.nextafter(0.5, 0.0)
0.49999999999999994
>>> mean((x|1)/2**53 for x in range(2**53)) # exactly 1/2
0.5

On the other hand, it makes 0.0 impossible, which probably isn't ideal.

>>> (0 | 1) / 9007199254740992.0
1.1102230246251565e-16

Consistency across versions and implementations, and the loss of one bit of entropy per float, feel like bigger issues to me though, so my suggestion would be to leave the code as is. Anyone needing integer precision should use integer operations.

@tim-one
Member

tim-one commented Apr 1, 2022

I'm closing this, because there's really no chance it would be adopted. As mentioned before, we'd need extremely strong reason to change what random() returns across releases, and there's at best a weak "theoretical wish list" kind of benefit here, but also stronger drawbacks.

Note that while the mean would theoretically (although probably not measurably, in our lifetime) become 0.5, it would no longer be possible for random() to return 0.5. Or 0.25, 0.75, 0.125, or any other rational of the form I / 2**53 for I in range(0, 2**53, 2). It's not just 0.0 that would vanish from the possible outputs. Of course half the possible outputs would vanish regardless, but this systemically excludes the "simplest" outputs.
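For example, 0.5 itself would become unreachable, because its numerator is even:

>>> int(0.5 * 2**53)                  # 0.5 corresponds to the even numerator 2**52
4503599627370496
>>> (4503599627370496 | 1) / 2**53    # the | 1 trick skips right past it
0.5000000000000001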

tim-one closed this on Apr 1, 2022