Python Accelerations

Jonathan Bloedow edited this page May 24, 2024 · 21 revisions

Here's a whiteboard diagram that captures our current thinking on the options we are proposing for accelerating large vector operations in Python.

[image: whiteboard diagram of the acceleration options]

It's kind of a "choose your own adventure". For any given vector computation, one can go as far to the right as one is capable. One can write a numpy implementation and call it good for now. One can go further and write a numba implementation. Or go further still and write a compiled C extension called through ctypes. Or speed that extension up with OpenMP. Or go even further and speed it up with a SIMD implementation.
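As a minimal sketch of the first two stops on that path, the example below implements the same in-place update twice: once with numpy and once as a numba loop. The function names (`tick_numpy`, `tick_numba`) and the "decrement every positive timer" computation are made up for illustration and are not taken from our code. The fallback that stands in for `njit` and `prange` when numba is not installed is also an assumption, added only so the sketch runs anywhere; without numba it gives no speedup.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Assumption: numba may be missing. Fall back to plain Python so the
    # sketch still runs (correct results, but no acceleration).
    prange = range

    def njit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap


def tick_numpy(timers):
    """Decrement every positive timer by one, in place, with a boolean mask."""
    timers[timers > 0] -= 1


@njit(parallel=True)
def tick_numba(timers):
    """Same update written as an explicit loop; prange spreads it over cores."""
    for i in prange(timers.shape[0]):
        if timers[i] > 0:
            timers[i] -= 1


# Hypothetical data: both versions operate on the same kind of int32 array.
a = np.array([0, 1, 5, 0, 2], dtype=np.int32)
b = a.copy()
tick_numpy(a)
tick_numba(b)
print(a.tolist(), b.tolist())
```

Both versions take the same contiguous array and update it in place. Moving to the right changes how the loop is executed, not what the data looks like.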

Some notes:

  • At the end of the day, from numpy to SIMD, these are all just ways of operating on contiguous arrays of numbers, so every one of these solutions operates on the same underlying datatypes.
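  • numba is at its best on explicit for loops rather than on numpy whole-array expressions, so a numba implementation usually means rewriting the numpy version as loops.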
  • In theory, if you have GPU hardware, you can go from numba to numba+CUDA with just numba's CUDA support, a decorator and "a little setup code". We have yet to demonstrate this for ourselves, though.
  • numba doesn't really start to show its worth until you parallelize loops with prange (which requires parallel=True on the jit decorator).
  • Going from numpy to numba or ctypes requires starting to think about datatypes.
  • Developing a ctypes extension requires a C compiler. (numba has a built-in compiler.)
  • Speeding up a C implementation with OpenMP is truly simple: often it takes little more than a pragma on the loop and a compiler flag.
  • OpenMP naturally uses all the cores it finds, without recompiling. This is very useful for making sure one is leveraging all available hardware, but it also means that any performance gains tested and recorded will depend on the machine the code was run on.
  • SIMD acceleration requires a certain comfort with low-level programming. GPT/Copilot can provide working code, but it can take some trial and error.
  • We expect to provide SIMD implementations for AVX2, AVX512 and SSE. One can compile all of these together, but one can only test on the hardware one has. Testing all three implementations requires getting a bit fancy, though presumably GitHub Actions (GHA) can support this.
  • We are actively working on a reference stress test that demonstrates performance numbers for each of these implementations.
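
The notes above about datatypes and contiguous arrays can be made concrete without compiling anything. The sketch below assumes a hypothetical C kernel with the signature `void scale(float *values, size_t n, float factor)`, which does not exist in our code. It uses `numpy.ctypeslib.ndpointer` to describe what such a kernel would accept, then checks a few arrays against that description. `FloatVector` and `is_acceptable` are illustrative names only.

```python
import ctypes
import numpy as np

# Argument type a ctypes wrapper could declare for the hypothetical kernel:
# a 1-D, C-contiguous array of 32-bit floats.
FloatVector = np.ctypeslib.ndpointer(dtype=np.float32, ndim=1, flags="C_CONTIGUOUS")


def is_acceptable(array):
    """Return True if `array` could be handed to the hypothetical kernel as-is."""
    try:
        FloatVector.from_param(array)
        return True
    except TypeError:
        return False


values = np.arange(8, dtype=np.float32)

print(is_acceptable(values))                     # right dtype, contiguous
print(is_acceptable(values.astype(np.float64)))  # wrong dtype for float*
print(is_acceptable(values[::2]))                # strided view, not contiguous

# A raw C pointer onto the same buffer: no copy is made, so a write through
# the pointer is visible in the numpy array.
ptr = values.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
ptr[3] = 42.0
print(values[3])
```

This is the practical meaning of "thinking about datatypes": a float64 array or a strided view has to be converted (for example with `np.ascontiguousarray(x, dtype=np.float32)`) before a C, OpenMP or SIMD kernel expecting `float *` can safely read it.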