OCaml Multicore Runtime

This page gives an outline of the design of the Multicore OCaml runtime. I suspect many people have strong opinions about this sort of thing; feel free to direct them our way.

Most of this document is about garbage collection, which accounts for almost all of the work needed to build a multicore OCaml. GC in OCaml is interesting because of pervasive immutability: many objects are immutable, allowing us to play various GC games like copying objects without caring which copy future accesses touch. On the other hand, OCaml programs are expected to sustain a truly obscene rate of allocation for immutable objects.

Operations on immutable objects are very fast in OCaml: allocation is by bumping a pointer, initialising writes (the only ones) are done with no barriers, and reads require no barriers. Most of the following design is based around keeping these operations as fast as they are at the moment, with some compromises for mutable objects.
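
For concreteness, here is a hedged sketch in C of that fast path. caml_young_ptr and caml_young_limit are the existing runtime's names, and the allocation direction (downwards) matches the runtime, but the slow-path call is a placeholder:

#include <caml/mlvalues.h>
#include <caml/gc.h>

void do_minor_gc_and_retry(void);  /* placeholder for the slow path */

extern value *caml_young_ptr, *caml_young_limit;

/* Minor-heap allocation fast path, simplified: bump the pointer downwards,
   and fall into the slow path when it crosses the limit. */
static value alloc_small(mlsize_t wosize, tag_t tag)
{
  caml_young_ptr -= wosize + 1;            /* one extra word for the header */
  if (caml_young_ptr < caml_young_limit)   /* out of space, or interrupted */
    do_minor_gc_and_retry();
  caml_young_ptr[0] = Make_header(wosize, tag, Caml_black);
  return Val_hp(caml_young_ptr);           /* the caller initialises the fields */
}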

Xavier Leroy and Damien Doligez wrote a paper about efficient multicore GC for OCaml [1] (well, for Concurrent Caml Light), based around having many thread-private heaps and a single shared heap, and maintaining the invariant that there are no pointers from the shared heap to the private heaps.

This means that whenever an attempt is made to write a pointer to a private object into the shared heap, the private object and all objects reachable from it are promoted to the shared heap en masse. The problem with this is that this eagerly promotes many objects that were never really shared: just because an object is pointed to by a shared object does not mean another thread is actually going to attempt to access it.

We'd like to go for something similar but lazier along the lines of the multicore Haskell work [2], where objects are promoted to the shared heap whenever another thread actually tries to access them. This means our sharing operation is slower, since it requires synchronisation of two different threads, but we need to do a lot less of it. It also avoids some pathological cases, since storing a large temporary data structure in a shared reference won't generally require it to be copied in its entirety.

[1]: Damien Doligez and Xavier Leroy. "A concurrent, generational garbage collector for a multithreaded implementation of ML." POPL 1993.

[2]: Simon Marlow and Simon Peyton Jones. "Multicore garbage collection with local heaps." ISMM 2011.

Threads

The first problem is getting multiple threads of OCaml to play nicely in the same memory space. The main issue here is that the runtime is full of global mutable state. Some of it is actual runtime-wide state (e.g. the location of the end of the minor heap), and some of it is gratuitous (e.g. the regex engine stores state in globals for called functions to read, instead of passing it as arguments).

There's a project ongoing at OCamlPro to get rid of global state by packing it all into a context structure that gets passed around. This is quite an invasive patch (essentially adding a new argument to nearly every function, and changing every access to global state), but allows you to do things like having two independent OCamls in the same process (a caravan of camls?).

I don't particularly care about having two independent OCamls in one process, so I've gone for a simpler approach: marking global state as thread-local. This is a vastly smaller patch, and lets multiple threads use OCaml independently.

This is implemented. It amounts to some grunt work annotating variables, and some scripts to verify that the .data and .bss sections of the runtime's object files are empty (so we can be certain we've eliminated all global state). There are a couple of patches from this work that eliminate some of the more gratuitous uses of state, which might be worth sending upstream independently of any other multicore work.

Message-passing

The next part is a means for the threads to interrupt each other, to synchronise GC operations. Each thread has a multiple-writer, single-reader message queue, which it frequently polls for new messages. The polling mechanism re-uses the existing mechanism for checking Unix signals, and is implemented for the bytecode interpreter but not yet for native code.

In native code, I intend to have a thread sending a message reset the receiving thread's minor heap limit pointer, so that the receiver is interrupted at its next allocation (again, this is the same mechanism currently used for Unix signals).
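
A sketch of the intended mechanism (the structure and function names are illustrative, not the real runtime's): senders push onto a lock-free mailbox, then clobber the receiver's allocation limit so that its next allocation check fails and it drops into the slow path, where it polls the mailbox.

#include <stdatomic.h>
#include <caml/mlvalues.h>

/* A multiple-writer, single-reader mailbox (a Treiber stack), plus the
   interrupt-at-next-allocation trick. */
struct message { struct message *next; int kind; /* ... payload ... */ };

struct ocaml_thread {
  _Atomic(struct message *) mailbox;
  _Atomic(value *) young_limit;   /* checked by the allocation fast path */
  value *young_end;               /* top of this thread's minor heap */
};

void send_message(struct ocaml_thread *to, struct message *m)
{
  m->next = atomic_load(&to->mailbox);
  while (!atomic_compare_exchange_weak(&to->mailbox, &m->next, m))
    ;  /* a failed CAS reloads m->next; just retry */
  /* Since allocation proceeds downwards, raising the limit to the top of
     the minor heap makes the receiver's next allocation take the slow path. */
  atomic_store(&to->young_limit, to->young_end);
}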

It's important that a thread respond to a message in a bounded period of time, and that messages are only processed at GC safepoints. This might require adding new safepoints during function calls or loops, although OCaml programs generally allocate so frequently that we might get away without this.

During long-running C calls, a thread will not be able to handle messages. In this case, it will advertise that fact to the other threads by releasing a relatively heavyweight lock around the call, which shouldn't add much to the runtime of a slow call.
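
Sketched (names illustrative, in the spirit of the existing enter/leave blocking section hooks):

#include <pthread.h>

void handle_pending_messages(void);  /* placeholder: drain this thread's mailbox */

/* A thread holds its own lock whenever it is running OCaml code, and promises
   to poll its mailbox while holding it.  Dropping the lock around a slow C
   call tells the other threads not to wait for us. */
void enter_blocking_section(pthread_mutex_t *my_lock)
{
  handle_pending_messages();       /* go quiet with an empty mailbox */
  pthread_mutex_unlock(my_lock);   /* others may now act on our behalf */
}

void leave_blocking_section(pthread_mutex_t *my_lock)
{
  pthread_mutex_lock(my_lock);     /* reacquire before touching the heaps */
  handle_pending_messages();       /* catch up on anything we missed */
}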

Shared and private heaps

The shared heap is a large memory area available to all threads, with a non-moving concurrent GC. Immutable objects will be allocated in the private heaps (see below) and promoted to the shared heap if they survive long enough or are accessed by multiple threads. For mutable objects, there are a couple of options, discussed in a section below.

The GC will be a mostly-concurrent mark-and-sweep. Each thread maintains its own grey stack and periodically does some marking work. Marking is done by changing the GC bits in the object header, without any synchronisation between threads. This means there is a race condition whereby multiple threads may attempt to mark the same object, but since marking is idempotent this causes no problems.
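
A sketch of the unsynchronised marking step (the header macros are the runtime's; the grey-stack operation is a placeholder):

#include <caml/mlvalues.h>
#include <caml/gc.h>

void grey_stack_push(value v);  /* placeholder: defer a child for later marking */

/* Marking flips the colour bits in the header with a plain store.  If two
   threads race to mark the same object, both write the same bits, so the
   race is benign. */
static void mark_value(value v)
{
  if (Is_long(v) || Color_val(v) == Caml_black)
    return;                                        /* integer, or already marked */
  Hd_val(v) = Whitehd_hd(Hd_val(v)) | Caml_black;  /* no CAS required */
  if (Tag_val(v) < No_scan_tag)                    /* does this block hold values? */
    for (mlsize_t i = 0; i < Wosize_val(v); i++)
      grey_stack_push(Field(v, i));
}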

When a thread finishes marking, it will trigger a stop-the-world phase by interrupting every other thread via its message queue. During the stop-the-world phase, each thread will scan its roots and mark any unmarked objects it finds. Generally, this does not involve much marking: the stop-the-world phase exists so that all threads can verify that marking has completed successfully, and so it should be short.

After the stop-the-world phase completes, any unmarked objects are garbage and available to be swept.

This part of the memory system is fairly pedestrian and in no way novel. I quite like VCGC [3] as a design for the mark-and-sweep section, since it avoids having explicit phase transitions between "marking" and "sweeping" which are a traditional source of bugs, and I intend to use an allocator design based on size-segmented thread-local pages along the lines of Streamflow [4], since this seems to have very good multicore behaviour and decent fragmentation performance.

Initially, it won't support compaction. Stop-the-world compaction can be implemented without too much pain later on, but I consider incremental compaction to be outside the scope of this project (and besides, the OCaml community seems to be OK without it at the moment).

Each thread also has a private heap, which is used to allocate immutable objects, and possibly mutable objects depending on how we choose to handle those (see below). Private heaps can be collected without synchronisation, and hopefully without modifying the existing minor collection code (much).

One part of minor collection is updating intergenerational pointers (that is, pointers into the private heap whose targets have been moved during promotion). In the current single-threaded GC this can happen at any point and be correct, but in a multithreaded GC we need to ensure it happens after all of the objects are promoted and initialised, to prevent any other thread seeing a half-formed shared object.
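
In other words, promotion needs a publication barrier between initialising the shared copies and swinging any pointers to them; roughly (all names illustrative):

#include <stdatomic.h>

void copy_survivors_to_shared_heap(void);       /* placeholder */
void update_intergenerational_pointers(void);   /* placeholder */

/* The shared copies must be fully initialised and visible to other threads
   before any location they can reach is updated to point at them. */
void promote_and_publish(void)
{
  copy_survivors_to_shared_heap();            /* allocate and initialise copies */
  atomic_thread_fence(memory_order_release);  /* make the copies visible first */
  update_intergenerational_pointers();        /* only now swing the pointers */
}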

Objects that survive long enough are promoted to the shared heap. Thus, dead cycles of shared and non-shared objects will eventually be collected, once they become cycles of purely shared objects.

[3]: Lorenz Huelsbergen and Phil Winterbottom. "Very concurrent mark-&-sweep garbage collection without fine-grain synchronization." ISMM 1998.

[4]: Scott Schneider, Christos D. Antonopoulos, and Dimitrios S. Nikolopoulos. "Scalable, locality-conscious multithreaded memory allocation." ISMM 2006.

Reading and writing

The following invariants are maintained:

  • No thread ever follows a pointer to another thread's private heap

  • Immutable fields of shared objects point to shared objects

So, a private heap may contain pointers into itself and into the shared heap; the values on a thread's stack may point into that thread's private heap and into the shared heap; and shared heap pointers may point anywhere.

The second invariant is easily maintained, since objects are promoted all at once: an immutable object only points to older objects, and an immutable object becomes shared only after all of the objects older than it have become shared.

This means that reading from and matching on immutable objects will require no extra overhead in order to maintain the first invariant, since an immutable object that one thread can access can never point to another thread's private heap.

Mutable fields are more complex, and require both a write barrier and a read barrier. The write barrier (caml_modify) is just a standard write barrier maintaining the tricolor invariant for the shared GC, and maintaining a per-thread remembered set of pointers from the shared to the private heap.

The read barrier examines the value being read, and decides whether it can be used directly as follows:

  • If it is an integer, it may be used directly

  • If it is a pointer to the shared heap, it may be used directly

  • Otherwise, it is a pointer to a private heap:

    • If it points to this thread's own private heap, it may be used directly

    • Otherwise, it must be requested from the thread owning the private heap

This seems quite complex, but has a surprisingly efficient implementation (described at the end of this document).
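
Spelled out naively in C (the helper predicates are illustrative; the one-branch encoding is at the end of this document):

#include <caml/mlvalues.h>

int in_minor_region(value v);      /* illustrative: anywhere in the minor heaps */
int in_my_minor_heap(value v);     /* illustrative: this thread's private heap */
value wait_for_promotion(value v); /* illustrative: fault handling, see below */

/* The read barrier's decision procedure, written out naively. */
value read_barrier(value v)
{
  if (Is_long(v))          return v;   /* an integer: use directly */
  if (!in_minor_region(v)) return v;   /* shared heap: use directly */
  if (in_my_minor_heap(v)) return v;   /* our own private heap: fine */
  return wait_for_promotion(v);        /* read fault: ask the owning thread */
}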

The last case, a pointer to another thread's private heap, is the only one that requires any processing. In this case, which we refer to as a "read fault", a message is sent to the other thread requesting the object. Upon receipt of this message, the other thread copies the requested object, including its transitive closure, to the shared heap, and replies with the address of the object in the shared heap. The requesting thread blocks until this reply arrives.

This is expensive, but should be rare. It happens only when multiple threads are contending for the same mutable reference, which involves expensive inter-core cache line transfers in any case.

Code written in a message-passing style may hit this case quite frequently if immutable, newly-created messages are being passed to code running on a different thread. However, this can be solved at user-level in the implementation of the message-passing library: if messages are being passed to a queue which has recently been accessed by a different thread, they can be pre-emptively promoted to the shared heap so the fault is avoided.

Read faults

When a thread tries to access an object in the private heap of another thread, a read fault occurs. In this case, the thread performing the access sends a message to the thread that owns the private heap in question, asking it to perform the read instead.

Upon receipt of such a request, a thread performs the read, copies the resulting object to the shared heap, and replies with the new shared copy.
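
Sketched with illustrative names, the owner's side is essentially a minor-GC-style copy of the object's transitive closure into the shared heap, followed by a reply:

#include <caml/mlvalues.h>

struct ocaml_thread;
struct fault_request { value object; struct ocaml_thread *sender; };

value promote_closure(value v);                    /* placeholder */
void send_reply(struct ocaml_thread *t, value v);  /* placeholder */

/* promote_closure copies the object and everything reachable from it into
   the shared heap, leaving forwarding pointers behind so that the owner's
   own heap agrees with the new shared copies. */
void serve_read_fault(struct fault_request *req)
{
  value shared = promote_closure(req->object);
  send_reply(req->sender, shared);   /* unblocks the faulting thread */
}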

There is a race condition where a field can be modified to point to another thread's private heap while the read fault message is in flight. This means that the thread servicing the fault will itself fault, causing another message to be sent. This actually causes no problems: the faulting is guaranteed to terminate, since the race condition requires that another thread overwrite the field during the fault processing. In the worst case, every thread on the system gets dragged into servicing the fault, and there are no threads left to overwrite the field and cause the race.

Mutable objects

We have a number of options for dealing with mutable objects. The differences between the options are slight changes to allocation behaviour and barrier code, so it's not difficult to implement several of these options and compare them for performance.

The issue is that read faults cause copies of private objects to be made in the shared heap. For objects with only immutable fields, this is semantically fine, and programs can use the copies interchangeably. For mutable fields, however, it is important that only the shared copy is used, otherwise programs might see stale copies of already-modified objects. We can deal with this in several ways:

  • Allocate all mutable objects directly in the shared heap, so this problem doesn't arise.

  • Have an extra check in the read barrier for mutable fields to see if there's a more recent shared copy of this object.

  • Scan the private heap after every read fault to update pointers to mutable objects.

The first option would mean losing generational collection for mutable objects, although a non-moving generational scheme [5] for the shared heap would fit reasonably well. The third option may turn out to be the most efficient, if the fault rate is low enough.
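
The first option is the simplest to sketch: route anything with mutable fields straight to the shared allocator (names illustrative; alloc_small is the private-heap fast path sketched earlier):

#include <caml/mlvalues.h>

value alloc_shared(mlsize_t wosize, tag_t tag);  /* placeholder: shared heap */
value alloc_small(mlsize_t wosize, tag_t tag);   /* private-heap bump pointer */

/* With mutable blocks allocated directly in the shared heap, a read fault
   can never produce a stale private copy of a mutable object. */
value alloc_block(mlsize_t wosize, tag_t tag, int is_mutable)
{
  return is_mutable ? alloc_shared(wosize, tag)
                    : alloc_small(wosize, tag);
}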

[5]: Katsuhiro Ueno, Atsushi Ohori, and Toshiaki Otomo. "An efficient non-moving garbage collector for functional languages." ICFP 2011.

Read barrier implementation

To make code that heavily uses mutable objects run acceptably fast, it is important that the read barrier have low overhead in the uncontended (common) case. The condition we are checking for is quite complex, but happily can be implemented with a single branch, no memory references, and some bit-twiddling.

We arrange the memory space of the OCaml process so that all of the minor heaps lie within a single, large, power-of-two-aligned memory region, and each individual heap is a power-of-two-aligned region within that. This can be done quite portably on POSIX platforms using mmap(), or on Windows using VirtualAlloc().

Using hypothetical 16-bit addresses, we assume for the sake of an example that the private heaps are all allocated within 0x4200 - 0x42ff, with the individual thread's private heaps at:

  • thread 1: 0x4220 - 0x422f
  • thread 2: 0x4250 - 0x425f
  • thread 3: 0x42a0 - 0x42af
  • thread 4: 0x42b0 - 0x42bf

It's not necessary that all of the memory in the region 0x4200 - 0x42ff be allocated or even mapped, as long as no part of the shared heap lies within that region. Again, this can be done with some mildly fiddly but not especially difficult mmap() trickery.
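
The standard trick is to over-reserve and trim; a sketch, with error handling omitted:

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Reserve a size-aligned region of address space (size must be a power of
   two) by mapping twice as much and unmapping the misaligned ends.
   PROT_NONE reserves addresses without committing any memory. */
void *reserve_aligned(size_t size)
{
  char *base = mmap(NULL, 2 * size, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  char *aligned = (char *)(((uintptr_t)base + size - 1) & ~(uintptr_t)(size - 1));
  if (aligned > base)
    munmap(base, aligned - base);                   /* trim the front */
  munmap(aligned + size, base + size - aligned);    /* trim the back */
  return aligned;   /* individual heaps are carved out of this region later */
}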

So, if we have a pointer p, then it points to a private heap if and only if its high bits are 0x42, and the bits in the middle indicate which private heap. So, we can check whether it lies in the current thread's private heap by comparing its bits with a pointer to somewhere in the current thread's private heap.

We have such a pointer available: the current minor heap pointer is stored in a register in native code. We XOR this value with the pointer p, resulting in a value whose bits we write as 0xPQRS. Our various conditions for the read barrier can be read off from these bits: if the low bit of S is 1, then p is an int. Otherwise, if PQ is nonzero, then it is a pointer to the shared heap. Finally, if PQ is zero and R is zero, it is a pointer to the current thread's private heap, while if PQ is zero and R is nonzero it is a pointer to another thread's private heap.

Checking whether PQ or the low bit of S is nonzero gets us the first two conditions. Otherwise, if PQ is zero, we need to check whether R is zero. Happily, we can combine these conditions with more bit-grinding. We subtract 0x10 from 0xPQRS. If R is zero, the subtraction borrows upwards, and causes PQ to become nonzero (it will never cause PQ to become zero, since we ensure that the first page of memory is not used for the shared heap). So, all of our conditions can be checked simultaneously by checking whether (0xPQRS - 0x10) has any of its high bits or its lowest bit set. There's an instruction to test bitmasks on most architectures (x86 calls this "test", PowerPC calls it "and."), bringing us to a total of three instructions: xor, sub and test.

In pseudo-amd64 code, assuming the value to be tested is in %rax and the minor heap pointer is in %r15, we get:

xor   %r15,    %rax    # %rax := p XOR (minor heap pointer), i.e. 0xPQRS
sub   $0x10,   %rax    # a borrow makes the high bits nonzero when R is zero
test  $0xff01, %rax    # all masked bits clear only in the read-fault case
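
For reference, the same test written as C, using the 16-bit toy constants from the example above (in the real runtime the constants would be derived from the chosen region sizes):

#include <stdint.h>

/* Returns nonzero exactly in the read-fault case: p is a pointer into some
   other thread's private heap.  x is the 0xPQRS value from the text. */
static inline int is_read_fault(uintptr_t p, uintptr_t minor_heap_ptr)
{
  uintptr_t x = p ^ minor_heap_ptr;
  return ((x - 0x10) & 0xff01) == 0;   /* high byte and low bit all clear */
}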