[FEATURE REQUEST] Add np.random Module #569

Closed · MichaelRinger opened this issue Jan 3, 2023 · 53 comments
Labels: enhancement (New feature or request)

@MichaelRinger commented Jan 3, 2023

The module is useful for machine learning related applications.
I would be glad if someone could add the np.random.randn function, since I don't know how to do it in C.

MichaelRinger added the enhancement (New feature or request) label Jan 3, 2023
@v923z (Owner) commented Jan 4, 2023

Do you need only the random.randn function? I would actually go for random.standard_normal, and wrap it with random.randn, as is done in numpy. See also the comments in Quick start.
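
For reference, numpy's wrapper relationship is essentially this (a sketch, runnable under CPython numpy):

import numpy as np

def randn(*args):
    # numpy's random.randn is a thin wrapper:
    # randn(d0, d1, ...) == standard_normal(size=(d0, d1, ...))
    return np.random.standard_normal(size=args if args else None)

print(randn(3, 2))   # same distribution as np.random.randn(3, 2)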

@MichaelRinger (Author) commented

Atm I only need the random.randn function. Yeah, creating random.standard_normal and adding the alias random.randn, like it's done in numpy, is better.

@v923z (Owner) commented Jan 8, 2023

I would vote for the new nomenclature in https://numpy.org/doc/stable/reference/random/generator.html#numpy.random.Generator.

However, a more fundamental question is what we should do with the None seed.

If None, then fresh, unpredictable entropy will be pulled from the OS.

Where should we pull this information from, if there is no underlying OS?

@MichaelRinger (Author) commented Jan 9, 2023

There already is a function random.seed(n=None) in micropython's random standard library that generates a seed by getting a random number from the hardware.

When no argument (or None) is passed in it will (if supported by the port) initialise the PRNG with a true random number (usually a hardware generated random number).

The None case only works if MICROPY_PY_URANDOM_SEED_INIT_FUNC is enabled by the port, otherwise it raises ValueError.
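
On a port that supports it, the usage would look roughly like this (a sketch; the ValueError case follows from the docs quoted above):

import random

try:
    random.seed()        # n=None: seed the PRNG from the hardware RNG, if the port supports it
except ValueError:
    random.seed(1234)    # no MICROPY_PY_URANDOM_SEED_INIT_FUNC on this port; fall back to a fixed seed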

@v923z (Owner) commented Jan 9, 2023

OK, but if that is the case, then you can already use this via the vectorize method. Is that prohibitively expensive?
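
For illustration, the vectorize workaround might look like this (a sketch, assuming ulab's np.vectorize and micropython's random module):

import random
from ulab import numpy as np

# wrap the scalar PRNG so it can fill a whole ndarray element-wise
vrandom = np.vectorize(lambda _: random.random())
noise = vrandom(np.zeros((4, 4)))   # one PRNG call per element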

Another question is, if we were to re-implement this in ulab, what would happen, if a particular port does not support a hardware random number generator?

@v923z (Owner) commented Jan 22, 2023

@MichaelRinger Do you have comments on the second question?

@MichaelRinger (Author) commented

@MichaelRinger Do you have comments on the second question?
I am using the vectorize method of numpy in combination with the base random function in micropython to solve this.
Thanks for the help.

@v923z (Owner) commented Jan 29, 2023

@MichaelRinger Did you close this by accident? I was actually going to add the module...

@ammar-bitar commented

Hi @v923z, sorry to bump this, but did you end up adding the module? I can't seem to find it.

@v923z (Owner) commented May 16, 2023

@ammar-bitar No, I haven't added it, because the OP closed the issue, and there was no response to the question after that. Do you need the module? If so, which methods? It would not be too much work to add, but it was not clear to me whether there is interest.

@mcskatkat commented

I'd like to upvote this. I think numpy is half crippled without a vectorized RNG.
The micropython random module is a usable workaround only when speed is not an issue.

Not sure though which RNG has decent entropy while being small enough in code and work memory sizes.

Specifically, the ESP32 that I'm interested in has a hardware RNG that, according to the docs, yields 32-bit integers in <100 CPU cycles.

v923z reopened this Jul 19, 2023
@v923z (Owner) commented Jul 19, 2023

micropython uses yasmarang, which fares quite well in random number tests and is cheap.

I opened a discussion some time ago here: https://github.com/orgs/micropython/discussions/10550, and I think I could pick up the thread from there, and implement the module in a reasonable time.

In the meantime, as suggested above, you can use the vectorize function. That could easily save you a factor of two in execution time.

@mcskatkat commented Jul 19, 2023

Thanks. I wasn't aware of vectorize.

As for yasmarang, I think it really is too spartan... a 32-bit state is not enough, even for a microcontroller. Especially seeing as in the other thread you were worried about quality (of seeding).

I think that any hardware capable of running a python-like interpreter with per-object overhead and a GC, can afford a few 10s of bytes for an RNG with several 100 bits of state. Even more so if the system is capable of holding ndarrays. No need to be frugal about it.

If you do end up implementing an RNG, I implore you to forget micropython's poor choice of yasmarang and consider the much more modern xoroshiro, that has a 32-bit variant with a 128-bit state and a 64-bit variant with a 256-bit state, useful for the fortunate who have double precision floats.

Looking at the next() function in the xoroshiro source, it looks like code size is comparable to yasmarang, and can be tuned with choice of inlining and constant folding.
On a microcontroller, the sub-ns/bit performance they promise is not attainable, but it will still be fast.
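
For reference, the 32-bit member of that family with a 128-bit state is xoshiro128+; its next() is only adds, XORs, shifts and a rotation. A sketch adapted from the public-domain reference code (seeding omitted):

#include <stdint.h>

static inline uint32_t rotl(const uint32_t x, int k) {
    return (x << k) | (x >> (32 - k));
}

static uint32_t s[4];   /* 128-bit state; must be seeded to anything but all zeros */

uint32_t next(void) {
    const uint32_t result = s[0] + s[3];
    const uint32_t t = s[1] << 9;

    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];
    s[2] ^= t;
    s[3] = rotl(s[3], 11);

    return result;
}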

Also I would advise against taking micropython's (following original Python's) path of having one global static generator state. If we already paid the code size overhead, why hardwire it to a single piece of memory...

Thank you so much for the effort you put into ulab.

@v923z (Owner) commented Jul 19, 2023

As for yasmarang, I think it really is too spartan... a 32-bit state is not enough, even for a microcontroller. Especially seeing as in the other thread you were worried about quality (of seeding).

OK, then I would start here https://www.pcg-random.org/.

I think that any hardware capable of running a python-like interpreter with per-object overhead and a GC, can afford a few 10s of bytes for an RNG with several 100 bits of state. Even more so if the system is capable of holding ndarrays. No need to be frugal about it.

If you do end up implementing an RNG, I implore you to forget micropython's poor choice of yasmarang and consider the much more modern xoroshiro, that has a 32-bit variant with a 128-bit state and a 64-bit variant with a 256-bit state, useful for the fortunate who have double precision floats.

I think the actual choice of the RNG is irrelevant from the viewpoint of implementation effort. If you don't like a particular RNG, we can always replace it later by a better one. But could we use https://github.com/numpy/numpy/blob/3d8cfd68d8df1af4e25e8dffd0a155fda20dc770/numpy/random/src/pcg64/pcg64.c#L64 as a starting point?

Also I would advise against taking micropython's (following original Python's) path of having one global static generator state. If we already paid the code size overhead, why hardwire it to a single piece of memory...

Again, I feel that this is a minor implementation detail that we could change, if necessary. Do you happen to know how numpy deals with the question?

Also, do you need distributions, too, or just a uniform RNG? Which methods of the `random` module would be important in your application?

@mcskatkat commented

I think the actual choice of the RNG is irrelevant from the viewpoint of implementation effort. If you don't like a particular RNG, we can always replace it later by a better one. But could we use https://github.com/numpy/numpy/blob/3d8cfd68d8df1af4e25e8dffd0a155fda20dc770/numpy/random/src/pcg64/pcg64.c#L64 as a starting point?

You are right of course about the actual RNG being immaterial to the ulab code that provides the user facing functionality.
The problem with PCG specifically is that at its heart are 64-bit by 64-bit multiplications used to implement even larger 128 bit multiplications. This is not cheap for microcontrollers that may struggle to get 32 bit multiplications done quickly. XOR based RNGs are cheap on virtually any hardware.

Anyway as you said the RNG itself is interchangeable and you can also choose to support multiple and have them compiled selectively via configuration file for code size efficiency (see BitGenerator below).

Again, I feel that this is a minor implementation detail that we could change, if necessary.

True again. But change after deployment is itself expensive. Your discussion thread in the micropython repo proves just that. There is nothing in the yasmarang and seeding code in micropython that prevents your desired use pattern (obtain random seed) or that prevents using yasmarang with multiple generator states instead of a single static one. But their choice not to export the proper interface makes this later extension difficult.

Do you happen to know how numpy deals with the question?

Sure. Old numpy.random had the RandomState class that you could instantiate as many times as you need to create new generator states. This is now considered "legacy".

Modern numpy.random revised this (at the painful cost of breaking compatibility, see above, cost of change after deployment) so that there is clear architectural separation between generating random bits (BitGenerator class) and the subsequent generation of distributions from those bits (Generator class).
The discussion we had above about yasmarang, xoroshiro and PCG is all about BitGenerators. And you can choose which one(s) to implement.

The whole point of the Generator class is to be invariant to BitGenerator. And the nice thing is that the ulab version of Generator can start narrow and get expanded later as needed. I doubt many microcontroller users will find the dirichlet function necessary.

I would start with the following Generator methods:

  • .integers()
  • .random(), the classic [0.0, 1.0)
  • .bytes() is essentially the same as .integers(), possibly slightly more efficient
  • .normal() is slightly harder and usually twice the effort of .random()
  • .uniform() is a minor extension of .random()

In all of the above I think it's also important to support the out= keyword argument, so that memory allocation is not forced on users who already own some memory and just want some computation done into it.

BTW I think this out= feature is generally missing in ulab.
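
For comparison, numpy's Generator.random already accepts out= (this runs under CPython numpy, not ulab):

import numpy as np

rng = np.random.default_rng(1234)
buf = np.empty((4, 4))   # caller-owned memory
rng.random(out=buf)      # filled in place, no new allocation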

If you feel like being even more generous to users, these are the second-tier methods. Useful, but not as often as the ones above, and they require a lot more code. I think they are entirely optional in a microcontroller environment:

  • .choice()
  • .shuffle(), .permutation(), .permuted()
  • .binomial(), .poisson()

@v923z (Owner) commented Jul 20, 2023

You are right of course about the actual RNG being immaterial to the ulab code that provides the user facing functionality. The problem with PCG specifically is that at its heart are 64-bit by 64-bit multiplications used to implement even larger 128 bit multiplications. This is not cheap for microcontrollers that may struggle to get 32 bit multiplications done quickly. XOR based RNGs are cheap on virtually any hardware.

OK, this is a good point.

Anyway as you said the RNG itself is interchangeable and you can also choose to support multiple and have them compiled selectively via configuration file for code size efficiency (see BitGenerator below).

Yes, this would be a reasonable approach. This could even be done automatically, based on whether you have floats or doubles on a microcontroller.

True again. But change after deployment is itself expensive. Your discussion thread in the micropython repo proves just that. There is nothing in the yasmarang and seeding code in micropython that prevents your desired use pattern (obtain random seed) or that prevents using yasmarang with multiple generator states instead of a single static one. But their choice not to export the proper interface makes this later extension difficult.

I understand that, and there is no disagreement in this regard. But we don't have to stick with micropython's interface, in fact, we should conform to numpy's. I would, perhaps, suggest that we hash out the interface first, and settle on the particular generator later.

I would start with the following Generator methods:

  • .integers()
  • .random(), the classic [0.0, 1.0)
  • .bytes() is essentially the same as .integers(), possibly slightly more efficient
  • .normal() is slightly harder and usually twice the effort of .random()
  • .uniform() is a minor extension of .random()

OK.

In all of the above I think it's also important to support the out= keyword argument, so that memory allocation is not forced on users who already own some memory and just want some computation done into it.
BTW I think this out= feature is generally missing in ulab.

I agree. You're not the first who noticed this: #592

Just out of curiosity, what is your use case? One reason I didn't originally consider the random module is that it's hard to test on a microcontroller, given that you can't easily plot/analyse generated data.

@mcskatkat commented

My use case is quite trivial, I just need some randomness for real-time graphics with many objects moving on the screen.

This is roughly similar to game programming, where to maintain a reasonable frame rate I need to be quick about computing many object state updates. Without the array processing that ulab allows, using Python loops it would probably be way too slow. So ulab allows me to maintain states for 100s of tiny objects, with floating point precision, and still have enough time to convert final screen positions to integers and render objects.

So for my use case, just having .uniform() is enough. .normal() would be nice to have.

One reason I didn't originally consider the random module is that it's hard to test on a microcontroller, given that you can't easily plot/analyse generated data.

I don't see why you'd need to plot / analyse, and you definitely don't have to do that on-device.

The thing with PRNGs is that you can easily seed them with a known value, generate some sequence data and store it to a file on an SD card. If you want to be really fancy you can send that out over Ethernet or Wifi if so equipped.
This data just needs to be bit-identical to the well known PRNG sequence, which can be generated on a laptop/desktop machine from the PRNG's reference code.

As for the Generator part of it, you create a BitGenerator subclass that produces a pre-stored or known-seed sequence. Feed that into numpy's internal Generator class and get reference results.
Then you do the same with the ulab implementation of Generator, again save that to a file and analyze on a "real" computer. Results need to be identical down to floating point roundoff errors, where applicable.

I think this would be practical to test up to sequences of several 100GB using a cheap SD card and a lot of patience to wait. I'm not sure it's necessary to test large volumes of data. Once a PRNG follows the sequence, it will not stray from it, unless its code is insanely written.

For the Generator part of things that deals with floating point - basically what's required is porting the code in numpy's Generator from Python/C to micropython-friendly C. There should be no interface or algorithm changes whatsoever. Just thinning down to supported methods and dtypes. In testing you may want to ensure that some floating point edge cases (denormalized, infinities, boundaries etc) are reached, where relevant. But I think for a first version there is not even a need to bother with that.

@v923z (Owner) commented Jul 23, 2023

My use case is quite trivial, I just need some randomness for real-time graphics with many objects moving on the screen.

This is roughly similar to game programming, where to maintain a reasonable frame rate I need to be quick about computing many object state updates.

In that case, the quality of the generator is not really an issue, I guess. In any case, here is a first implementation: https://github.com/v923z/micropython-ulab/tree/random. The out keyword is not yet supported, but you can do

MicroPython v1.20.0-311-gea1a5e43d on 2023-07-23; linux [GCC 12.2.0] version
Use Ctrl-D to exit, Ctrl-E for paste mode
>>> from ulab import numpy as np
>>> rng = np.random.Generator()
>>> rng.random(12)
array([1.688118533849092e-10, 0.7703293013781235, 0.6074833036730274, ..., 0.2697689388821915, 0.6484892791277977, 0.6955782450470177], dtype=float64)
>>> rng.random(size=(5, 5))
array([[0.7724060907456224, 0.1599236337293051, 0.7339406510831528, 0.5274410894040502, 0.6714481027287427],
       [0.6492265779888077, 0.06322272656419292, 0.4808784435223349, 0.8659584576783896, 0.6510361989559311],
       [0.7035330373879642, 0.1143488863528934, 0.3228607724092158, 0.9696441515163603, 0.3353713223418759],
       [0.4890397567261386, 0.8449387836308111, 0.4607909039388876, 0.1653575696739978, 0.4966210702938057],
       [0.3707932026109159, 0.7451473233433786, 0.1918072763583848, 0.7469201739384557, 0.7688914647671239]], dtype=float64)

@mcskatkat commented Jul 23, 2023

Great. That was fast... Looks good.

This certainly goes the numpy way about having as many generator states as needed. Just need a way to seed them differently from each other :-)

And while you're right that the quality of the generator is not an issue, its speed is, so I would love to have an option to use an XOR based BitGenerator under the hood of Generator in addition to the PCG multiplier based one. Doesn't need to go all the way with separating to two classes as in numpy.

Something like np.random.Generator(seed: Union[None, int32, tuple[int32,...]], bitgen: Literal['pcg', 'xoroshiro'] = 'pcg')
The seed tuple is either truncated or repeated to match the BitGenerator's state size.
When seed is an int, treat it as a tuple of length 1.
When seed is None I'm not sure what to do... perhaps require it not to be. This leaves the problem of proper seeding to the user, who probably knows better what is 'proper' for the application.

@v923z (Owner) commented Jul 24, 2023

Great. That was fast... Looks good.

This certainly goes the numpy way about having as many generator states as needed. Just need a way to seed them differently from each other :-)

Sure. I haven't yet had time to add that, but it should be trivial.

And while you're right that the quality of the generator is not an issue, its speed is, so I would love to have an option to use an XOR based BitGenerator under the hood of Generator in addition to the PCG multiplier based one. Doesn't need to go all the way with separating to two classes as in numpy.

PCG64 has a single integer multiplication, and bit shifts. Are you concerned about the time it takes to perform the multiplication?

Something like np.random.Generator(seed: Union[None, int32, tuple[int32,...]], bitgen: Literal['pcg', 'xoroshiro'] = 'pcg') The seed tuple is either truncated or repeated to match the BitGenerator's state size. When seed is an int, treat it as a tuple of length 1. When seed is None I'm not sure what to do... perhaps require it not to be. This leaves the problem of proper seeding to the user, who probably knows better what is 'proper' for the application.

Well, we've got to be compatible with numpy. Unfortunately, the documentation on this is rather sketchy, and we can't expect micropython to support the typing module, so I'm afraid Union and the like are out of the question. I'd also like to be flash-aware. What you suggest can be done in a simple for loop in python as

seeds = [1, 2, 3, 4, 5]
generators = [random.Generator(seed) for seed in seeds]

so I wouldn't waste flash space for having that in C. In fact, if you really want to, you can also hook into https://github.com/v923z/micropython-ulab/tree/master/snippets, and extend the C code with python, pretty much as is done in numpy.

@mcskatkat commented Jul 24, 2023

PCG64 has a single integer multiplication, and bit shifts. Are you concerned about the time it takes to perform the multiplication?

Yes. I'm working with ESP32 which has an Xtensa LX6 CPU.
I'm not sure exactly what configuration of Xtensa was licensed by Espressif for the chip but with the LX6 core, a hardware multiplier is optional, and when implemented it is either a 32-bit or a 16-bit hardware integer multiplier (16×16→32). IIRC these are not single cycle and not pipelined.
PCG needs 64×64 multiplication which may require 16 times 16×16 or 4 times 32×32 (or perhaps half that if only the lower half of the result is needed).
In contrast, XOR is single cycle and pipelined on almost any architecture.
Perhaps I'm too worried about this...
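
For a feel of the cost: the low 64 bits of a 64×64 product decompose into three 32×32→64 multiplies (a sketch in portable C; PCG64's 128-bit multiply needs the high halves on top of this):

#include <stdint.h>

/* low 64 bits of a*b, built from 32x32->64 multiplies */
uint64_t mul64_lo(uint64_t a, uint64_t b) {
    uint32_t al = (uint32_t)a, ah = (uint32_t)(a >> 32);
    uint32_t bl = (uint32_t)b, bh = (uint32_t)(b >> 32);
    uint64_t lo    = (uint64_t)al * bl;
    uint64_t cross = (uint64_t)al * bh + (uint64_t)ah * bl;  /* ah*bh only affects the high half */
    return lo + (cross << 32);
}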

Well, we've got to be compatible with numpy. Unfortunately, the documentation on this is rather sketchy, and we can't expect micropython to support the typing module, so I'm afraid Union and the like are out of the question.

My bad, I was trying to be succinct so I used typing annotations for communicating between us, not as a suggestion to actually use typing in the code.

To really be compatible with numpy you'd have to implement a BitGenerator class. Then each Generator instance would have to keep a reference to an instance of BitGenerator, and would have to delegate to it every time it wants random bits.
This is great flexibility but too inefficient for an embedded scenario.

Here is the gist of my suggestion above; it makes things match numpy almost exactly (slightly different from the previous suggestion):

  • Create a PCG64 class as in ulab. Unlike numpy, this class will not actually be used to produce random bits. It will only be used to indicate the bit-generation algorithm, and optionally pass a seeded state.

  • The constructor for PCG64 follows the numpy guideline for general BitGenerator, i.e. it takes one of the following:

    • None (default), indicating some default seeding method, preferably using HW random or clock-based seed
    • int or a sequence (tuple?) of int: numbers from which to generate the initial RNG state. Since different RNGs have different state sizes, there needs to be a mapping from these ints to the RNG state. Numpy uses a class called SeedSequence. I think this is too much for embedded. I suggested truncation/repetition of the int sequence, as needed to fill the state buffer (sketched after this list).
      In any case, this only seeds a single generator; it is not meant to create multiple generator instances.
  • This leaves an opening to later add different subtypes of BitGenerator, e.g. Philox or xoroshiro. Again, for efficiency, they will not really be independent of Generator. Instead, they only tell the Generator constructor which algorithm to use, and optionally pass an initial state.

  • Inside the Generator constructor, you examine the object passed. If it is one of the implemented BitGenerators, you initialize the Generator by allocating an appropriately sized state buffer and setting a pointer to the algorithm's next function that yields 32 random bits on request. Everything else in the implementation of Generator is then agnostic to the BitGenerator used.
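
A sketch of the truncation/repetition seeding suggested above (helper name hypothetical):

#include <stddef.h>
#include <stdint.h>

/* Fill an RNG state buffer from user-supplied 32-bit seeds:
   repeat the seeds if there are too few, ignore the excess if there are too many. */
void seed_fill(uint32_t *state, size_t state_words, const uint32_t *seeds, size_t nseeds) {
    for(size_t i = 0; i < state_words; i++) {
        state[i] = seeds[i % nseeds];
    }
    /* note: an all-zero state must be avoided for XOR-shift generators */
}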

@mcskatkat commented

I just read a little more in the numpy docs; apparently this is exactly what numpy does too: they don't delegate to the BitGenerator via Python calls.

Instead, the concrete BitGenerator (e.g. PCG64) provides C function pointers that yield the next 32 bits (or more, in full blown numpy).
The Generator class just stores a reference to the PCG64 object (to keep it alive). When it needs to use it, it fetches the function pointer (and probably a state buffer pointer), and then uses it to make C-level calls when it needs random bits:

PCG64 provides a capsule containing function pointers that produce doubles, and unsigned 32- and 64-bit integers. These are not directly consumable in Python and must be consumed by a Generator or similar object that supports low-level access.
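
Concretely, the C-level interface behind that capsule is roughly this struct (from numpy's bitgen.h):

#include <stdint.h>

typedef struct bitgen {
    void *state;                        /* opaque per-instance generator state */
    uint64_t (*next_uint64)(void *st);
    uint32_t (*next_uint32)(void *st);
    double (*next_double)(void *st);
    uint64_t (*next_raw)(void *st);
} bitgen_t;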

@v923z (Owner) commented Jul 24, 2023

Yes. I'm working with ESP32 which has an Xtensa LX6 CPU. I'm not sure exactly what configuration of Xtensa was licensed by Espressif for the chip but with the LX6 core, a hardware multiplier is optional, and when implemented it is either a 32-bit or a 16-bit hardware integer multiplier (16×16→32). IIRC these are not single cycle and not pipelined. PCG needs 64×64 multiplication which may require 16 times 16×16 or 4 times 32×32 (or perhaps half that if only the lower half of the result is needed). In contrast, XOR is single cycle and pipelined on almost any architecture. Perhaps I'm too worried about this...

We could replace the RNG, if you insist.

To really be compatible with numpy you'd have to implement a BitGenerator class. Then each Generator instance would have to keep a reference to an instance of BitGenerator, and would have to delegate to it every time it wants random bits. This is great flexibility but too inefficient for an embedded scenario.

I don't really care about compatibility to the very last bit. It should be compatible in the standard use cases, and only in the user-facing interface. In fact, using an XOR-based RNG would already break compatibility, for numpy has PCG64 as default.

  • Create a PCG64 class as in ulab. Unlike numpy, this class will not actually be used to produce random bits. It will only be used to indicate the bit-generation algorithm, and optionally pass a seeded state.

I don't want to implement everything, just because it's done so in numpy. We have to keep in mind that MCUs are significantly smaller, in terms of flash, RAM, everything.

  • Inside the Generator constructor, you examine the object passed. If it is one of the implemented BitGenerators, you initialize the Generator by allocating an appropriately sized state buffer and setting a pointer to the algorithm's next function that yields 32 random bits on request. Everything else in the implementation of Generator is then agnostic to the BitGenerator used.

Is a function pointer to an inline function going to work? I thought that you wanted to inline the function, so that you can save on execution time. Now, using a pointer to a function is not going to speed up the code.

@mcskatkat commented

We could replace the RNG, if you insist.

I don't. I'm just hoping we can keep it open to additions.

I don't really care about compatibility to the very last bit. It should be compatible in the standard use cases, and only in the user-facing interface. In fact, using an XOR-based RNG would already break compatibility, for numpy has PCG64 as default.

I agree. I would keep PCG64 and add an optional xoroshiro. If you leave it open to extension, I might find time to add that later myself (under a #if config).
Still, specifically in this case, it seems that maintaining an identical user-facing interface is not difficult or expensive.

Is a function pointer to an inline function going to work? I thought that you wanted to inline the function, so that you can save on execution time. Now, using a pointer to a function is not going to speed up the code.

No, obviously a function called via a pointer cannot be inlined. Good point.
I wonder what numpy does about that. Perhaps vectorization to avoid per-element call overhead.

Anyway, since there is no intention to make this open to adding BitGenerator classes at runtime, there really isn't a need to use function pointers. The Generator code could switch on the actual type of BitGenerator from the small finite set of implemented ones and make inlined calls to the appropriate function.

@v923z (Owner) commented Jul 24, 2023

We could replace the RNG, if you insist.

I don't. I'm just hoping we can keep it open to additions.

I don't mind having multiple RNG implementations, and as you pointed out, performance could be platform-dependent.

I agree. I would keep PCG64 and add an optional xoroshiro. If you leave it open to extension, I might find time to add that later myself (under a #if config). Still, specifically in this case, it seems that maintaining an identical user-facing interface is not difficult or expensive.

No, that should be trivial. Either we simply decide this at compile time with the #if directive, or we find a way of switching between various implementations at run time. In any case, it would be great if you could report on execution time/performance when all the dust settles.

One thing to keep in mind, however, is that it's not only the flash for the RNG that you have to pay, but for the loops, too. If you want to inline the RNG, then you've got to write out the loops for each case that you implement. So, you would have something like this (pseudocode):

if(RNG == PCG64) {
    i = 0
    while(i < ndarray.shape[3]) {
        j = 0
        while(j < ndarray.shape[2]) {
            k = 0
            while(k < ndarray.shape[1]) {
                l = 0
                while(l < ndarray.shape[0]) {
                    // PCG64 implementation here
                    *array = ...
                    array += ndarray.strides[0]
                    l++
                }
                array += ndarray.strides[1]
                k++
            }
            array += ndarray.strides[2]
            j++
        }
        array += ndarray.strides[3]
        i++
    }
} else if(RNG == xoroshiro) {
    i = 0
    while(i < ndarray.shape[3]) {
        j = 0
        while(j < ndarray.shape[2]) {
            k = 0
            while(k < ndarray.shape[1]) {
                l = 0
                while(l < ndarray.shape[0]) {
                    // xoroshiro implementation here
                    *array = ...
                    array += ndarray.strides[0]
                    l++
                }
                array += ndarray.strides[1]
                k++
            }
            array += ndarray.strides[2]
            j++
        }
        array += ndarray.strides[3]
        i++
    }
}

There is no other way of avoiding the function call overhead.

@mcskatkat commented

you've got to write out the loops for each case that you implement

That would be true if the inner operation was on the order of one or two cycles. Since we are dealing with next() functions that are much more complex, we can sacrifice a branch in the inner loop. Using switch is usually well optimized:

int const algo_index = algo_index_for_bitgen_type(figure_out_type(bitgen_argument));
void * const state_ptr = get_state_pointer(bitgen_argument);
while(...)
  while (...)
    while (...) {
        register uint32_t bits;
        switch(algo_index) {
            case 0: bits = algo_0_next(state_ptr); break;
            case 1: bits = algo_1_next(state_ptr); break;
            ...
        }
        *array = do_something_with(bits);
        array += inner_stride;   // advance to the next element
    }

@v923z (Owner) commented Jul 24, 2023

That would be true if the inner operation was on the order of one or two cycles. Since we are dealing with next() functions that are much more complex, we can sacrifice a branch in the inner loop.

OK.

@mcskatkat commented Jul 24, 2023

The switch can actually always be placed outside the innermost while, which would amortize it better and might enable use of zero overhead loops:

while(...)
  while (...)
    while (...)
        switch(algo_index) {
            case 0:  for(i=0; i<inner_extent; ++i, array += inner_stride) *array = do_something_with(algo_0_next(state_ptr)); break; 
            case 1:  for(i=0; i<inner_extent; ++i, array += inner_stride) *array = do_something_with(algo_1_next(state_ptr)); break; 
        }

Also, you could (and perhaps already do) mark ndarrays as contiguous when they are, and then run lower-dimensionality loops on them. That would save time on loop code when users run something on arrays of shape (1000, 2, 1) and strides (2, 1, 1), by looping on shape (2000,) with stride (1,) instead. But that is all second order; I don't think it's really necessary.
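
A sketch of that contiguity fast path (helper names hypothetical):

/* if the data is contiguous, collapse the strided loops into one flat loop */
if(ndarray_is_contiguous(ndarray)) {
    size_t n = ndarray_total_elements(ndarray);
    for(size_t i = 0; i < n; i++) {
        array[i] = do_something_with(next(state_ptr));
    }
} else {
    /* fall back to the general strided loops above */
}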

@v923z (Owner) commented Jul 25, 2023

You can specify the seed now, an integer, or a tuple of integers.
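
A usage sketch against that branch (assuming the seed is the constructor's first positional argument):

from ulab import numpy as np

rng1 = np.random.Generator(123456)         # single integer seed
rng2 = np.random.Generator((1, 2, 3, 4))   # tuple of integers
print(rng1.random(3))
print(rng2.random(3))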

@mcskatkat commented

Thanks. I'm going on vacation for a while. Will test this when I get back.

@JanekBaumgartRelativity commented Jan 7, 2024

Hello all, I see there was quite a lot of progress here 6 months ago. Are there any expectations as to when it could reach the closed/prebuilt stage? I would love to use this.

@v923z (Owner) commented Jan 7, 2024

I think the only outstanding issue is the out keyword argument. In the case of random that's rather trivial, so I might be able to add that today, and create a PR.

@JanekBaumgartRelativity commented

Awesome. I don't want to rush anything, so I'll patiently wait for "random". Glad that this is not dead. Thanks! ;)

v923z mentioned this issue Jan 9, 2024
@v923z (Owner) commented Jan 9, 2024

I think #654 is pretty close to what you wanted, so we could in principle merge that, and add new features in separate PRs.

@JanekBaumgartRelativity commented

Amazing work! Yeah, that would unblock me for a start; if I find anything missing I will let you know, but this is HUGE! Thanks a lot!

@v923z (Owner) commented Jan 13, 2024

Completed in #654.

v923z closed this as completed Jan 13, 2024
@MicroControleurMonde commented

Hi Zoltán,

I was looking for information on Yasmarang for my home-made RNG when I came across this thread.

I have implemented an RNG on the ESP32 by directly reading the appropriate register from micropython, and I was hooked, so I implemented another one for the ESP8266 and one for the RP2040. When you like something, you don't count the cost ^^

The thing is, I'm not a great specialist in random generators and I'm floundering a bit...

I've published my source code, documentation, datasets and tests, so if you could have a look and give me your impressions, I'd be delighted. Could you tell me whether these RNGs seem sound (i.e. whether they generate sufficiently unpredictable and random sequences), and whether they meet the entropy criteria expected to guarantee a good level of security and diversity?

PS: I'm more into drones and geomatics.

https://github.com/MicroControleurMonde

@v923z (Owner) commented Nov 28, 2024

@MicroControleurMonde I can't claim to be an expert on the question of how good a random number generator is. As you can also see from this thread, at the time of the implementation, @mcskatkat had a rather strong opinion on the issue, so it might be worthwhile to ping them.

In any case, the site https://www.pcg-random.org/ gives a rather good overview of the features one has to watch out for. There is also a link to a publication that is in-depth and detailed. I think you would find that useful.

@MicroControleurMonde commented

Thank you for the swift response, Zoltán.

Yes, I've started looking at pcg-random.org. A little stressful for me... I'm going to start digging around.

I just received my Pyboard v1.1 board, with its hardware RNG, today. I'm going to do a test series and have a little fun with it this weekend.

@mcskatkat any advice for a poor geomatician like me?

@MicroControleurMonde commented

Some quick feedback regarding the Pyboard and its hardware RNG:

Time spent to generate 200'000 numbers: 108.295601 seconds
Numbers generated per second: 1846.80

I'm surprised; it's going at a good speed!
The ESP32 has similar performance, according to what I measured.
Let me run the Ent and Dieharder tests.

@MicroControleurMonde commented Nov 29, 2024

Entropy = 3.458180 ... not good enough!

https://github.com/MicroControleurMonde/PyBOARD_RNG

@mcskatkat commented Dec 1, 2024

@MicroControleurMonde

I'm not an expert on the mathematical properties of RNGs either. Serious RNG research is a very deep field in which many PhDs have been written.

What I do know is that random registers on devices that have them are not intended to be used to generate streams of random numbers.

The common practice is to use them to seed a mathematically robust pseudo-random generation algorithm, thus providing some amount of true randomness into what is otherwise always a deterministic sequence.
The statistical properties of a pseudo-RNG may be mathematically provable. If they are, and the state-space of the RNG is large enough, then seeding the RNG with a state obtained by sampling a truly random physical phenomenon will yield sequences that can be considered good approximations of truly random data.
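
In micropython terms, that practice looks roughly like this (a sketch; os.urandom availability is port-dependent):

import os
import random

# one true-random draw from the hardware seeds the deterministic PRNG;
# everything afterwards is a reproducible, statistically well-behaved stream
random.seed(int.from_bytes(os.urandom(4), 'little'))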

As for the speed of ~2k numbers per second: you didn't mention how many bits are in each number, but even if those are 128-bit numbers, 500 microseconds per scalar cannot be considered fast. Personally, I wrote a simulation that ran on a GPU and produced ~10 billion high-quality, 64-bit, accurately Gaussian-distributed random numbers per second. That's about 5M times faster, and it was on hardware that would now be ~13 years old.

Also note that physical-phenomenon random registers often have very specific instructions on how they need to be operated. This usually involves allowing some design-determined time to pass between taking samples, so the physical phenomenon can develop sufficiently to avoid sample correlation. This could explain why, if you generate sequences of numbers by repeated sampling of the register, you get unsatisfactory performance in entropy tests.

I couldn't figure out exactly what you are trying to accomplish in your project, but I hope this helps.

@v923z (Owner) commented Dec 1, 2024

@mcskatkat Thanks, this is useful insight!

@MicroControleurMonde commented

Yes indeed it is. Thank you very much, @mcskatkat.
I'm preparing my answer with my findings.

@MicroControleurMonde commented Dec 1, 2024

You are not an expert in the mathematical properties of RNGs, of course, but you still have an excellent understanding of the subject. It helps.
Yes, it is a vast field.

Concerning the random registers of MCUs equipped with an RNG, such as the ESP32 and the STM32F405, here is what I found and understood:

  • In the context of random hardware registers (TRNG), the generation of 500 or 100,000 values has no significant impact on the quality or randomness of the values produced.

  • The statistical tests commonly used to evaluate the quality of a TRNG, such as those of FIPS 140-2 or NIST SP 800-22, are applied on large samples of generated numbers (generally several million bits).

  • The exact number of values generated does not have a significant impact on the overall statistical properties of the generator, as long as the sample is large enough for testing.

  • TRNGs exploit physical phenomena, such as thermal noise or quantum fluctuations, which are inherently unpredictable and do not depend on the number of values generated.

  • The throughput of a TRNG may be influenced by hardware limitations (processing speed, power consumption, etc.), but this does not affect the random nature of the numbers generated. As long as the device can supply a sufficient quantity of data, the random quality of each number remains constant.

  • The CAcert website offers a free evaluation of RNG. The minimum sample size is 12 MB (or more if possible, maximum 1 GB). I used it to cross-check my own tests. They use the Ent and Diehard tests.
    https://www.cacert.at/random/

Number of bits for each number

I thought I had given it, silly me ...

  • for RNG on RP2040, I generated 64-bit numbers.
  • for RNG on Pyboard (STM32), 32-bit numbers.
  • for RNG on ESP32, 32-bit numbers.
  • for RNG on ESP8266, 64-bit numbers.

In the case of the RP2040 and the ESP8266, I 'tweaked' a generator based on reading the ADC register, mixed with an SHA256 sauce (whitening).

My project:

To test the most common MCUs available to me in terms of performance and production.

It's far from my field of geomatics, but given that my drones, lidars and GPS cards are based on these little monsters, it's a good idea to check out what they've got and what they're worth. A small ESP32 or RP2040 MCU mounted on a board costs only about US$5... and in fact it allows you to do a lot of things, but it also has its own limitations.

If I want to do triple-axis (X,Y,Z) coordinates collection with a bit of real-time processing + radio transmission, it's better to go for more robust Cortex ARMs like the STM32, for example.

The speed of the TRNG on an ESP32 or a Pyboard is still very good. Don't forget that the point is still to test the relevance of small 32-bit MCUs, so inevitably we're going to find ourselves very far from a 64-bit CPU or a small GPU. That's normal.

Note:

~10 billion high-quality, 64-bit, precisely distributed Gaussian random numbers per second
I'm in admiration... hats off!

@MicroControleurMonde commented Dec 1, 2024

Today, I ported the PCG32 that you suggested and generated a sample of 2,000,000 values.
It works on the Pyboard and RP2040, and should work on the ESP32 as well.

Here are the results:

Target Values: 2000002
Total Values: 1999998
Total Bytes: 7999992
Elapsed Time (sec): 685
Throughput (bytes/sec): 11678.8203
Throughput (values/sec): 2919.7051
Throughput (bits/sec): 93430.5625
Sample size: 20.6 MB
32 bits per random value

If I have enough time, I will test my batch tonight and will push the results tomorrow.

Cheers ^^

@v923z (Owner) commented Dec 1, 2024

Today, I ported the PCG32 that you suggested and generated a sample of 2,000,000 values.

The random module is based on PCG32. Or what do you mean when saying that you ported the PCG32?

@mcskatkat commented

@MicroControleurMonde

I still can't say I understand where you are going with this.
It is generally a non-goal for TRNGs to be fast. In normal use I would be surprised if TRNGs are sampled more than once per second. Even once per day, perhaps.

They are put there for the sole purpose of letting a tiny bit of controlled non-determinism seep into computer systems, that otherwise spend a lot of energy and design effort on suppressing the natural randomness of the physical world they operate in, as they strive to produce deterministic, predictable and repeatable results.

given that my drones, lidars and GPS cards are based on these little monsters, it's a good idea to check out what they've got and what they're worth. A small ESP32 or RP2040 MCU mounted on a board costs only about US$5... and in fact it allows you to do a lot of things, but it also has its own limitations.
If I want to do triple-axis (X,Y,Z) coordinates collection with a bit of real-time processing + radio transmission, it's better to go for more robust Cortex ARMs like the STM32, for example.

If you just want to experiment with compute on MCUs (which is indeed surprisingly good per $), there may be more promising paths such as learning to use their dedicated HW accelerators. You don't need to go to ARM for that. Xtensa (the CPU on ESP32) is actually very capable and very extensible, as its name implies.
For example, the ESP32-S3 has a 128-bit SIMD vector unit. I don't know the exact architecture (it really depends on arbitrary choices made by Espressif), but assuming there is only one unit serving the two Xtensa cores, and it can only attain 0.5 instructions/cycle, that is still 0.5Gop/s for 32-bit scalars, for ~$5...

If producing one output (random numbers or anything else) requires, say, 16 compute operations, it should come close to ~30M 32-bit values per second when running carefully designed software.

@MicroControleurMonde commented Dec 1, 2024

Which random module, Zoltán? The one from micropython itself? Oops, sorry, I didn't pay attention.

@MicroControleurMonde commented

@mcskatkat

Quickly: I will give a more elaborate answer tomorrow.

It all started several weeks ago when I looked at the block diagram of my ESP32-S. That's when I wondered what the Espressif chip's hard-coded functions were worth. Digging around in the source code, I couldn't find an RNG function for the ESP module. So I went to the Espressif docs to find the information and wrote a little library that reads the correct register. It all started from there.

To my great surprise, the Xtensa LX6 core is extremely swift for a chip that was produced in 2016. However, it does not have a SIMD vector unit or hardware acceleration. Correct me if I'm wrong.

I can't get my hands on an ESP32-S3 card any time soon here.

@v923z (Owner) commented Dec 1, 2024

Which random module, Zoltán? The one from micropython itself? Oops, sorry, I didn't pay attention.

No, the one in ulab.

@MicroControleurMonde commented Dec 2, 2024

Which random module, Zoltán? The one from micropython itself? Oops, sorry, I didn't pay attention.

No, the one in ulab.

Oh OK, I understand now.
Nope, I didn't touch anything related to ulab.
I just made a simple port of PCG32, from that one: Minimal C Implementation.
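
For reference, the core of that minimal implementation (pcg32_random_r from pcg_basic.c at pcg-random.org):

#include <stdint.h>

typedef struct { uint64_t state; uint64_t inc; } pcg32_random_t;

uint32_t pcg32_random_r(pcg32_random_t *rng) {
    uint64_t oldstate = rng->state;
    /* advance the internal LCG state; inc must be odd (the seeding routine ensures this) */
    rng->state = oldstate * 6364136223846793005ULL + rng->inc;
    /* output function XSH-RR: xorshift the high bits, then apply a random rotation */
    uint32_t xorshifted = (uint32_t)(((oldstate >> 18u) ^ oldstate) >> 27u);
    uint32_t rot = (uint32_t)(oldstate >> 59u);
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31u));
}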

@MicroControleurMonde commented Dec 2, 2024

@mcskatkat

"The speed of TRNGs is not generally an objective."

Yes, we agree. I would also be surprised if an MCU with a TRNG were subjected to hundreds of thousands of random number generation requests per day... It doesn't make much sense, from what I understand.

Now there's the reality of tests like NIST, Dieharder and others... And there, on the other hand, you need a minimum sample volume to test things!

"...which, ... spend a lot of energy and design effort to remove natural randomness"

This is so true. I've plotted the disparity of the data over various random draws, and the problem is that the dispersion is pretty bad. The sources of entropy aren't that entropic, lol. On a more serious note, the variation intervals are quite small.

As part of what I do, I also require the ability to 'encrypt' communications. This explains my search for performance, as I'm collecting GPS/topo data at the same time. And from what I can see, the performance level should be sufficient with an ESP32.
I think I might take a look at the S3 one of these days.

[Attached image: scatter plot X]
