-
Notifications
You must be signed in to change notification settings - Fork 5
Python Accelerations
Here's a whiteboard diagram we came up with which seemed to communicate our current thinking on how we are proposing doing Python performance acceleration options on large vector operations.
It's kind of "choose your own adventure". For any given vector computation, one can go as far to the right as one is capable. One can write a NumPy implementation and call it good for now. One can go further and do a Numba implementation. Or go further and do a ctypes compiled C extension. Or go a bit further and make this faster with OpenMP. Or go even further and make it faster with a SIMD implementation.
Some notes:
- At the end of the day, from NumPy to SIMD, these are all just ways of operating on contiguous arrays of numbers. All these solutions operate on the same datatypes really.
- numba works on for loops, not numpy operations.
- In theory, if you have GPU hardware, you can go from Numba to Numba+Cuda with just the cuda extension to Numba, a decorator and "a little setup code". We have yet to truly demonstrate this for ourselves though.
- Numba doesn't really start to show its worth unless you use
prange()
. - Going from NumPy to Numba or ctypes requires starting to thinking about datatypes.
- Developing a ctypes extension requires a compiler. (Numba has a built in compiler.)
- Speeding up a C implementation with OpenMP is truly simple.
- OpenMP naturally uses all the cores it finds, without recompiling, which is very useful for making sure one is leveraging all available hardware, but also means that performance gains tested & recorded will depend on what machine it is being run on.
- Note that the benefits of parallelization can be mostly eliminated if there is a critical section (e.g., updating a common counter) somewhere in the for loop being parallelized. Though this can usually be fixed by using thread-local storage and then adding an accumulator section at the end. GPT can help with this.
- SIMD acceleration requires a certain low-level programming comfort. Working code can all be provided by GPT/Copilot, but it can take some trial and error.
- We expect to provide SIMD implementations for AVX2, AVX512 and SSE. One can compile all of these together but one can only test on the hardware one has. Testing all 3 implementations requires getting a bit fancy, though presumably GHA can support this.
- We are actively working on a reference, stress test which demonstrates performance numbers on each of these implementations.
Let's look at some sample code for a function that ages everyone in our model. We have some expansion slots for the "not yet born" with negative age so we co a condition check in addition to an addition, which makes our code at least not totally trivial. We'll also show some sample wallclock times for each one, though a lot can vary depending on what hardware one runs on. For now we're use a 16-core DeepNote VM with 128GB of RAM and run with an array of 1e7 integers uniformly distributed between -8365 and +24365.
def naive_one(data: np.ndarray) -> None:
one_day = np.float32(1.0 / 365)
for i in range(data.shape[0]):
if data[i] >= 0:
data[i] += one_day
return
Duration: 15.91715s (15,917ms)
def vectorized(data: np.ndarray) -> None:
one_day = np.float32(1.0 / 365)
data[data >= 0] += one_day
return
Duration: 0:00:00.09175s (92ms)
@nb.njit((nb.float32[:],), parallel=True, nogil=True)
def numbaized(data: np.ndarray) -> None:
one_day = np.float32(1.0 / 365)
for i in nb.****prange****(data.shape[0]):
if data[i] >= 0:
data[i] += one_day
return
Duration: 0.00314s (3ms) 30x faster than NumPy!
Python part:
# 'import' the module
update_ages_lib = ctypes.CDLL('./update_ages.so')
# define the interface to the function (datatypes)
update_ages_lib.update_ages_vanilla.argtypes = [
ctypes.c_size_t, # start_idx
ctypes.c_size_t, # stop_idx
np.ctypeslib.ndpointer(dtype=np.float32, flags='C_CONTIGUOUS')
]
# Invoke
def compiled_c(data: np.ndarray) -> None:
update_ages_lib.update_ages_vanilla(
0,
len( data ),
data
)
C Part:
const float one_day = 1.0f/365.0f;
void update_ages_vanilla(unsigned long int start_idx, unsigned long int stop_idx, float *ages) {
for (size_t i = start_idx; i <= stop_idx; i++) {
if( ages[i] >= 0 )
{
ages[i] += one_day;
}
}
}
Duration: 0.028020s (28ms)
Slower than Numba without parallelization, but faster than NumPy
#include <omp.h>
const float one_day = 1.0f/365.0f;
void update_ages_vanilla(unsigned long int start_idx, unsigned long int stop_idx, float *ages) {
#pragma omp parallel for simd
for (size_t i = start_idx; i <= stop_idx; i++) {
if( ages[i] >= 0 )
{
ages[i] += one_day;
}
}
}
Duration: 0.00268s (2.7ms)
15% faster than Numba.
With -march=native
and -ffast-math
...
Duration: 0.000648s! (0.65ms)
// avxv2
void update_ages(unsigned long int start_idx, unsigned long int stop_idx, float *ages) {
#pragma omp parallel for
for (size_t i = start_idx; i <= stop_idx; i += 8) {
// Load age values into SIMD registers
__m256 ages_vec = _mm256_loadu_ps(&ages[i]);
// Create a vector with the one_day value replicated 8 times
__m256 one_day_vec = _mm256_set1_ps(one_day);
// Mask for elements greater than or equal to zero
__m256 mask = _mm256_cmp_ps(ages_vec, _mm256_setzero_ps(), _CMP_GE_OQ);
// Increment age values by one_day for elements greater than or equal to zero
ages_vec = _mm256_blendv_ps(ages_vec, _mm256_add_ps(ages_vec, one_day_vec), mask);
// Store the result back to memory
_mm256_storeu_ps(&ages[i], ages_vec);
}
}
Duration: 0:00:00.000543s (0.54ms) 5-7x faster than Numba
AVX512
Coming soon
Duration: 0.000346s (0.35ms)
k8s 16 core VM
naive = 15.740068
numpy = 0.090909
numba = 0.003329
c = 0.001735
simd (512) = 0.000998
Codespaces (16-core)
naive = 17.416576
numpy = 0.078130
numba = 0.004040
c = 0.001324
simd (512) = 0.000883
12-core Ubuntu laptop
naive = 13.695447
numpy = 0.072277
numba = 0.003344
c = 0.002402
avx2 = 0.002355
https://gist.github.com/clorton/16d8ca411c83058b31c6275a4e54f352
I started a comparison of naive/numpy/numba/c on a new function, a little more interesting than the "conditionally add float to floats". This function is from "progress_immunities" and implements:
for each person in the population:
if person is immune and their immunity timer is greater than 0:
decrement their immunity timer by 1
if their immunity timer has reached 0:
set their immunity status to False
On 12-core laptop:
naive = 0:00:28.693374
numpy = 0:00:00.057744
numba = 0:00:00.428175
c = 0:00:00.003224
The 10x outperformance of numba by numpy is a reversal of what we were seeing before. Note that this function operates on 2 vectors, not just one, and does 1 conditional math operation and 1 conditional boolean flip. So there's a little more going on.
- NumPy looks very fast compared to naive for-loops. But not much else.
- Numba is typically 20x faster than NumPy on 16 cores.
- NumPy presumably uses SIMD according to documentation but is not multi-core.
- Naive, Numba and C all involve some variation of "universally understandable" for loops, for what that's worth.
- Numba should probably be preferred over NumPy in all cases.
- Compiled C (with OpenMP) is often between 2x and 5x faster than Numba and should be preferred if someone has a C compiler available.
- Compiled C (with OpenMP) performance can be majorly impacted by gcc flags including -ffast-math. On some platforms 5x, but on others just 15%. But there are some serious concerns that the corner-cutting in fast-math are unacceptable in terms of accuracy.
- AVX2 intrinsics seem to be about 25% faster than compiled C with OpenMP and the best gcc flags.
- AVX512 intrinsics seem to be about 50% faster than compiled C with OpenMP and the best gcc flags.
- We are yet to see the SIMD extensions to the OpenMP API provide any benefit.
- All of the data so far is for just one function which is simple though not entirely trivial. We should really collect these metrics for about 4 different types of vector math functions.
- Do perf comparison on more "types" of functions, e.g., reductions, critical section updates, etc.
- Understand more about -ffast-math and if there are sub-flags in there that are responsible for the gain without unacceptable inaccuracies.
- See if there's some way we can get simd code on the cheap.