This repository has been archived by the owner on Nov 27, 2024. It is now read-only.

WIP: Add a non-forking compiler option #614

Draft · wants to merge 8 commits into master

Conversation

connorjward
Collaborator

This PR:

  • Refactors compilation.py
  • Adds a new non-forking compiler class so we can compile on systems where the MPI implementation does not allow forking a subprocess

I've fiddled with the environment variables a bit too. For example, it now uses PYOP2_CC to determine the compiler instead of CC, and I've moved PYOP2_CFLAGS and PYOP2_LDFLAGS out of configuration.py because they made little sense there.
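
Roughly, the lookup now works along these lines (sketch only, not the exact code in this branch; the default compiler name is just a placeholder):

import os

# Sketch of the environment-variable handling (illustrative, not the exact code in this PR):
cc = os.environ.get("PYOP2_CC", "mpicc")                     # compiler; "mpicc" is a placeholder default
extra_cflags = os.environ.get("PYOP2_CFLAGS", "").split()    # extra compile flags, if any
extra_ldflags = os.environ.get("PYOP2_LDFLAGS", "").split()  # extra link flags, if any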

@wence-
Member

wence- commented Apr 29, 2021

Cool. There are a bunch of other places where we (possibly without knowing) fork subprocesses, e.g. the _version.py versioneer stuff. I wonder if we should expunge all of that too?

@connorjward
Collaborator Author

connorjward commented Apr 29, 2021

I've done a lot of reading about different ways we could compile the code without forking, and I've found three approaches that might work:

  • cppyy is a C++ interpreter built on top of Cling, which is a well-supported project. I tried implementing this before but ran into the issue that C++ is less permissive about passing void pointers as arguments, which is a problem because we use ctypes.c_voidp all over the place.

  • DragonFFI is a Clang-based JIT. I haven't figured out how to pass in all of the linker arguments yet, but this is a really promising project. The main issue is that it is developed by a single person, although he seems to be fairly active in maintaining it.

  • TinyCC (TCC) has a library, libtcc, that allows you to JIT code. The performance would likely not be great, but it would still be better than nothing.

Before going any further I have two questions:

  1. Is it essential that we use the MPI wrapper for compilation? If so then I'll need to figure out a way to find the flags that the wrapper compiler adds.
  2. Is ABI compatibility a concern (i.e. if PETSc is installed with GCC then would a Clang JIT even work)? If so then DragonFFI is the only valid approach and we would have to enforce that the entire toolchain is compiled with Clang.

connorjward marked this pull request as draft April 29, 2021 09:54
@wence-
Member

wence- commented Apr 29, 2021

Is it essential that we use the MPI wrapper for compilation? If so then I'll need to figure out a way to find the flags that the wrapper compiler adds.

Wrapper code that calls PETSc needs to link against MPI (and find the MPI headers) so I think yes. I think firedrake can grab them out of the petscvariables configuration like it does for Eigen include paths (see firedrake/slate/slac/compiler.py)
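
Something along these lines should work (untested sketch; the exact variable names in petscvariables would need checking against a PETSc install):

import os

def petsc_variables():
    # Untested sketch: petscvariables is a flat list of "NAME = value" lines.
    path = os.path.join(os.environ["PETSC_DIR"], os.environ.get("PETSC_ARCH", ""),
                        "lib", "petsc", "conf", "petscvariables")
    variables = {}
    with open(path) as f:
        for line in f:
            name, sep, value = line.partition("=")
            if sep:
                variables[name.strip()] = value.strip()
    return variables

# e.g. keys like "PETSC_CC_INCLUDES" should carry the MPI/PETSc include flags
# (key names to be double-checked).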

Is ABI compatibility a concern (i.e. if PETSc is installed with GCC then would a Clang JIT even work)? If so then DragonFFI is the only valid approach and we would have to enforce that the entire toolchain is compiled with Clang.

I think that GCC and Clang have the same C ABI, but maybe not.

As to the C++ void strictness, I would have assumed (but maybe I am wrong) that with these approaches we no longer use the ctypes interface to call compiled code?

@wlav

wlav commented Apr 29, 2021

Clang JIT and GCC are mostly compatible, assuming you hand them the same compiler flags (especially the math options, so your point 1 is most definitely something you will want to look into) and the same standard header files. There are corner cases (thread-local storage and typeinfo come to mind) which will fail or are major trouble when JIT-ed; and the two also have different default runtimes for OpenMP, for example.

OTOH, cppyy munches ctypes.c_voidp just fine. Try this:

import cppyy, ctypes
  
cppyy.cppdef("""\
   void f(void* p) { std::cerr << p << std::endl; }
""")

p = ctypes.c_voidp(0x1234)

cppyy.gbl.f(p)

Not b/c it's "permissive", but b/c ctypes.c_voidp is explicitly recognized internally.

The larger problem with it and MPI, though, is that many "constants" in MPI are in fact preprocessor macros and thus not available automatically through cppyy in Python-land (they're still fine to use in JIT-ed code, of course).
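
A small self-contained illustration of what I mean, using a generic macro rather than an MPI one:

import cppyy

cppyy.cppdef("""\
#define MY_MPI_LIKE_CONSTANT 42                             // macro: not reflected into Python
constexpr int my_mpi_like_constant = MY_MPI_LIKE_CONSTANT;  // wrapper constant: reflected
""")

print(cppyy.gbl.my_mpi_like_constant)   # prints 42
# cppyy.gbl.MY_MPI_LIKE_CONSTANT raises AttributeError: the macro only exists on
# the C++ side, where JIT-ed code can still use it freely.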

Is the use case for compiling programs locally when running under MPI public information? In a different context, we're advocating for more JIT-ing (and hence the need for better support) in HPC. It'd be useful for us to add another use case to the growing list.

@connorjward
Collaborator Author

@wlav thank you for joining the discussion! We definitely appreciate your expertise.

Cool. There are a bunch of other places where we (possibly without knowing) fork subprocesses. e.g the _version.py versioneer stuff. I wonder if we should expunge all of that too?

That sounds sensible. What would the versioning stuff be replaced with?

Wrapper code that calls PETSc needs to link against MPI (and find the MPI headers) so I think yes. I think firedrake can grab them out of the petscvariables configuration like it does for Eigen include paths (see firedrake/slate/slac/compiler.py)

Yep this will work.

As to the C++ void strictness, I would have assumed (but maybe I am wrong) that with these approaches we no longer use the ctypes interface to call compiled code?

We definitely still use ctypes. The .so is loaded with ctypes.CDLL and we usually set the function argtypes to ctypes.c_voidp (see here and here).

@wlav the issue is that we want to cast these void * to, say, double *. C is perfectly happy to do this implicit cast, but C++ complains. An example of the sort of function we want to be calling is:

void wrap_expression_kernel(int32_t const start, int32_t const end, double *__restrict__ dat1, double const *__restrict__ dat0, int32_t const *__restrict__ map0)

I think getting this to work is just a case of being a bit less lazy about how we track the argument types.
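
For example, we could declare the real pointer types on the ctypes side instead of c_voidp everywhere (sketch only, with a made-up library path):

import ctypes

lib = ctypes.CDLL("./wrapper.so")  # hypothetical path to the compiled wrapper
kernel = lib.wrap_expression_kernel
kernel.restype = None
kernel.argtypes = [
    ctypes.c_int32,                   # start
    ctypes.c_int32,                   # end
    ctypes.POINTER(ctypes.c_double),  # dat1
    ctypes.POINTER(ctypes.c_double),  # dat0 (constness has no ctypes equivalent)
    ctypes.POINTER(ctypes.c_int32),   # map0
]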

Is the use case for compiling programs locally when running under MPI public information? In a different context, we're advocating for more JIT-ing (and hence the need for better support) in HPC. It'd be useful for us to add another use case to the growing list.

PyOP2 is used by Firedrake so I would say that yes the use case is public information 👍.

Yesterday I spent some time playing with DragonFFI and TinyCC, and I've come to the conclusion that neither is really feature-complete. I couldn't figure out how to pass pointers into a DragonFFI function, and TinyCC doesn't support complex numbers. cppyy might well be the way to go.

@wlav

wlav commented Apr 30, 2021

I have a feeling that it may help here to think of the bindings and the JIT-ing separately: both to preserve the current ctypes code, and because it looks like all you really need is to pass a code string and some options to the Clang JIT, so your requirements on a binding to it are pretty trivial. Thus, if you keep the two concerns separate now, you can replace the current choice of JIT access later, allowing you, for example, to roll your own to directly control the selection of optimization passes you want to use.

Below is an example of what I mean: it uses cppyy for JIT access (but obviously any of the options will do), but not for the bindings. Rather, just grab the function pointer, hand it to ctypes and then let ctypes think the argument types are all void*, allowing the implicit conversion behavior you want.

import cppyy
import cppyy.ll
import ctypes

cppyy.cppdef("""\
void wrap_expression_kernel(int32_t const start, int32_t const end, double *__restrict__ dat1, double const *__restrict__ dat0, int32_t const *__restrict__ map0) {
    std::cerr << start << " " << end << " " << dat1 << " " << dat0 << " " << map0 << std::endl;
}""")

ftype = ctypes.CFUNCTYPE(None, ctypes.c_int, ctypes.c_int, ctypes.c_voidp, ctypes.c_voidp, ctypes.c_voidp)
f = ftype(cppyy.ll.cast['intptr_t'](cppyy.gbl.wrap_expression_kernel))

p = ctypes.c_voidp(0x1234)
f(0, 32, p, p, p)

@wence-
Member

wence- commented Apr 30, 2021

We're kind of willing to invest some effort to do the right thing. For example, I know that cffi has lower cross-calling overheads than ctypes (which in the limit case is not the biggest deal, but every little helps), so what's the "right" way to call stuff via cppyy in that sense?

@wlav

wlav commented May 2, 2021

The "right" way depends on use. Assuming for example that all types are known and correct, it's pretty straightforward:

import cppyy
import numpy as np

cppyy.cppdef("""\
void wrap_expression_kernel(int32_t const start, int32_t const end, double *__restrict__ dat1, double const *__restrict__ dat0, int32_t const *__restrict__ map0) {
    std::cerr << start << " " << end << " " << dat1 << " " << dat0 << " " << map0 << std::endl;
}""")

dat1 = np.array(range(32), dtype=np.float64)
dat0 = np.array(range(32), dtype=np.float64)
map0 = np.array(range(32), dtype=np.int32)
cppyy.gbl.wrap_expression_kernel(0, 32, dat1, dat0, map0)

If the buffers don't come from Python but from C++, cppyy will create LowLevelView objects of the right type. The best thing to do is to annotate the size (if not known already) and deal with ownership (e.g. by placing a reference on the array from the client code) immediately upon their entry into Python, unless of course these objects are never used other than for passing around the pointer. The LowLevelViews can be handed to numpy to create (zero-copy) views that act as fully functional arrays.
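
To make that last part concrete, an untested sketch of the zero-copy hand-off (the function name is made up for the example):

import cppyy
import numpy as np

cppyy.cppdef("""\
double* make_buffer() {
    static double buf[32];
    return buf;              // comes back to Python as a LowLevelView
}""")

view = cppyy.gbl.make_buffer()               # LowLevelView, size unknown to Python
view.reshape((32,))                          # annotate the size ...
arr = np.frombuffer(view, dtype=np.float64)  # ... then wrap it zero-copy in numpy
print(arr.shape, arr.sum())                  # behaves like a regular numpy array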

The run-time CPU overhead of cppyy is close to that of cffi (C++ has some complications, such as overloads, that one doesn't have to deal with in C), but the memory overhead is higher because of the presence of Clang/LLVM for the JIT (C++ being a much larger language).

Maybe this notebook from Matti is of value: https://github.com/mattip/c_from_python/blob/master/c_from_python.ipynb (the actual presentation is on YouTube).
