Python Accelerations

Here's a whiteboard diagram we came up with which seemed to communicate our current thinking on how we are proposing doing Python performance acceleration options on large vector operations.

It's kind of "choose your own adventure". For any given vector computation, one can go as far to the right as one is capable. One can write a numpy implementation and call it good for now. One can go further and do a numba implementation. Or go further and do a ctypes compiled c extension. Or go a bit further and make this faster with OpenMP. Or go even further and make it faster with a SIMD implementation.

Some notes:

At the end of the day, from numpy to simd, these are all just ways of operating on contiguous arrays of numbers. All these solutions operate on the same datatypes really.
numba works on for loops, not numpy operations.
In theory, if you have GPU hardware, you can go from numba to numba+cuda with just the cuda extension to numba, a decorator and "a little setup code". We have yet to truly demonstrate this for ourselves though.
numba doesn't really start to show its worth unless you use prange.
Going from numpy to numba or ctypes requires starting to thinking about datatypes.
Developing a ctypes extension requires a compiler. (Numba has a built in compiler.)
Speeding up a C implementation with OpenMP is truly simple.
OpenMP naturally uses all the cores it finds, without recompiling, which is very useful for making sure one is leveraging all available hardware, but also means that performance gains tested & recorded will depend on what machine it is being run on.
Note that the benefits of parallelization can be mostly eliminated if there is a critical section (e.g., updating a common counter) somewhere in the for loop being parallelized. Though this can usually be fixed by using thread-local storage and then adding an accumulator section at the end. GPT can help with this.
SIMD acceleration requires a certain low-level programming comfort. Working code can all be provided by GPT/Copilot, but it can take some trial and error.
We expect to provide SIMD implementations for AVX2, AVX512 and SSE. One can compile all of these together but one can only test on the hardware one has. Testing all 3 implementations requires getting a bit fancy, though presumably GHA can support this.
We are actively working on a reference, stress test which demonstrates performance numbers on each of these implementations.

Examples

Let's look at some sample code for a function that ages everyone in our model. We have some expansion slots for the "not yet born" with negative age so we co a condition check in addition to an addition, which makes our code at least not totally trivial. We'll also show some sample wallclock times for each one, though a lot can vary depending on what hardware one runs on. For now we're use a 16-core DeepNote VM with 128GB of RAM and run with an array of 1e7 integers uniformly distributed between -8365 and +24365.

Naive Python

def naive_one(data: np.ndarray) -> None:
    one_day = np.float32(1.0 / 365)
    for i in range(data.shape[0]):
        if data[i] >= 0:
            data[i] += one_day
    return

Duration: 15.91715s

Numpy

def vectorized(data: np.ndarray) -> None:
    one_day = np.float32(1.0 / 365)
    data[data >= 0] += one_day
    return

Duration: 0:00:00.09175s

Numba

@nb.njit((nb.float32[:],), parallel=True, nogil=True)
def numbaized(data: np.ndarray) -> None:
    one_day = np.float32(1.0 / 365)
    for i in nb.****prange****(data.shape[0]):
        if data[i] >= 0:
            data[i] += one_day
    return

Duration: 0.00314s 30x faster than numpy!

Compiled C (ctypes extension)

Python part:

# 'import' the module
update_ages_lib = ctypes.CDLL('./update_ages.so')
# define the interface to the function (datatypes)
update_ages_lib.update_ages_vanilla.argtypes = [
    ctypes.c_size_t,  # start_idx
    ctypes.c_size_t,  # stop_idx
    np.ctypeslib.ndpointer(dtype=np.float32, flags='C_CONTIGUOUS')
]
# Invoke
def compiled_c(data: np.ndarray) -> None:
    update_ages_lib.update_ages_vanilla(
        0,
        len( data ),
        data
    )

C Part:

const float one_day = 1.0f/365.0f;
void update_ages_vanilla(unsigned long int start_idx, unsigned long int stop_idx, float *ages) {
    for (size_t i = start_idx; i <= stop_idx; i++) {
        if( ages[i] >= 0 )
        {
            ages[i] += one_day;
        }
    }
}

Duration: 0.028020s

Slower than Numba without parallelization, but faster than numpy

C with OpenMP

#include <omp.h>

const float one_day = 1.0f/365.0f;
void update_ages_vanilla(unsigned long int start_idx, unsigned long int stop_idx, float *ages) {
    #pragma omp parallel for simd
    for (size_t i = start_idx; i <= stop_idx; i++) {
        if( ages[i] >= 0 )
        {
            ages[i] += one_day;
        }
    }
}

Duration: 0.00268s

15% faster than numba.

With -march=native and -ffast-math... Duration: 0.000648s!

SIMD (avx2)

// avxv2
void update_ages(unsigned long int start_idx, unsigned long int stop_idx, float *ages) {
    #pragma omp parallel for
    for (size_t i = start_idx; i <= stop_idx; i += 8) {
        // Load age values into SIMD registers
        __m256 ages_vec = _mm256_loadu_ps(&ages[i]);

        // Create a vector with the one_day value replicated 8 times
        __m256 one_day_vec = _mm256_set1_ps(one_day);

        // Mask for elements greater than or equal to zero
        __m256 mask = _mm256_cmp_ps(ages_vec, _mm256_setzero_ps(), _CMP_GE_OQ);

        // Increment age values by one_day for elements greater than or equal to zero
        ages_vec = _mm256_blendv_ps(ages_vec, _mm256_add_ps(ages_vec, one_day_vec), mask);

        // Store the result back to memory
        _mm256_storeu_ps(&ages[i], ages_vec);
    }
}

Duration: 0:00:00.000543s 5-7x faster than Numba

avx512
Coming soon

Duration: 0.000346s

Other Machines

k8s 16 core VM naive = 15.740068 numpy = 0.090909 numba = 0.003329 c = 0.001735 simd (512) = 0.000998

Codespaces (16-core)

naive = 17.416576 numpy = 0.078130 numba = 0.004040 c = 0.001324 simd = 0.000883

SIMD (sse & avx2 & avx512 with run-time instruction set detection)

https://gist.github.com/clorton/16d8ca411c83058b31c6275a4e54f352

Some Conclusions

numpy looks very fast compared to naive for-loops. But not much else.
numba is typically 20x faster than numpy on 16 cores.
numpy presumably uses simd according to documentation but is not multi-core.
naive, numba and c all involve some variation of "universally understandable" for loops, for what that's worth.
numba should probably be preferred over numpy in all cases.
compiled c (with omp) is often between 2x and 5x faster than numba and should be preferred if someone has a C compiler available.
compiled c (with omp) performance is majorly impacted by gcc flags including -ffast-math. We should check if the the corners that are cut by fast-math are acceptable in terms of accuracy.
avx2 intrinsics seem to be about 25% faster than compiled c with omp and the best gcc flags.
avx512 intrinsics seem to be about 50% faster than compiled c with omp and the best gcc flags.
We are yet to see the simd extensions to the OpenMP API provide any benefit.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly