This repository has been archived by the owner on May 27, 2021. It is now read-only.

EXPERIMENTAL: Implement a GC #419

Open

wants to merge 150 commits into master
Conversation

jonathanvdc

Hi! Here's a PR that implements a garbage collector for CUDAnative. Major additions include:

  • The garbage collector itself (gc.jl). The GC is a non-moving, semi-conservative, stop-the-world GC that uses a free list for memory allocations.
  • Numerous changes to the compiler itself to support the GC.
  • An implementation of the important bits of the low-level Julia array API. This allows us to use regular arrays from Julia kernels.
  • A flexible interrupt mechanism that implements GPU-to-CPU callbacks. The GC uses it to trigger collections, which happen on the CPU rather than the GPU.
  • Threading primitives. The GC uses them to ensure mutual exclusion for critical sections.
  • A bump allocator that can be used as a fast alternative to the GC. Unlike the GC, the bump allocator can't expand its heap when necessary. However, it is very fast, and its heap is easy to dispose of after a kernel completes. This latter point is a big win over the old memory allocator based on CUDA malloc, which leaks memory perpetually, even across kernel invocations.
  • A set of GC benchmarks for evaluating the GC's performance.
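To make the bump-allocator idea above concrete, here is a rough CPU-side sketch in plain Julia. The names (`BumpHeap`, `bump_alloc!`, `reset!`) are illustrative only, not the PR's actual API, and the single-threaded offset bump stands in for what would be an atomic add shared by all GPU threads:

```julia
# Sketch of a bump allocator: a single offset that only moves forward.
mutable struct BumpHeap
    buffer::Vector{UInt8}   # pre-allocated heap
    offset::Int             # next free byte (1-based)
end

BumpHeap(capacity::Int) = BumpHeap(Vector{UInt8}(undef, capacity), 1)

# Allocate `n` bytes, or return `nothing` when the heap is exhausted;
# unlike the GC, a bump allocator cannot grow its heap on demand.
function bump_alloc!(heap::BumpHeap, n::Int)
    start = heap.offset
    start + n - 1 > length(heap.buffer) && return nothing
    heap.offset = start + n
    return view(heap.buffer, start:start+n-1)
end

# Disposing of the whole heap after a kernel completes is a single reset.
reset!(heap::BumpHeap) = (heap.offset = 1; heap)
```

Allocation is just an offset increment with a bounds check, which is why it is so much faster than a free-list GC, and why "freeing" the heap after a kernel finishes costs nothing.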

Note: these changes depend on the 'configurable-lowering-2' branch of my fork of the julia repo (jonathanvdc/julia). The lowering scheme won't work unless that version of Julia is used.
The 'init' kwarg to '@cuda' allows users to define custom kernel initialization logic, which runs just before the kernel starts. The main use case for this kwarg right now is setting up globals.
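A hypothetical usage of this kwarg might look like the following; the kernel, the `setup_globals` helper, and the exact shape of the `init` callback are placeholders, since the real signature is defined by this PR's branch rather than by a released CUDAnative:

```julia
using CUDAnative, CuArrays

function my_kernel(a)
    a[threadIdx().x] = 1f0
    return
end

d_a = CuArray{Float32}(undef, 32)

# Hypothetical init callback: runs just before the kernel launches,
# e.g. to set up globals (such as GC or interrupt state) that the
# kernel depends on. The body here is a placeholder.
setup_globals(kernel) = nothing

@cuda threads=32 init=setup_globals my_kernel(d_a)
```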
I built these examples mostly as experiments. Their core logic ended up in 'interrupts.jl', which exposes a high-level interface. The examples deleted by this commit do not: they're low-level and kind of hacky.
@maleadt
Member

maleadt commented Jun 11, 2019

Great! Good to have this here as a PR.

Would you mind factoring out the bump allocator? I'd prefer to merge something less complex first. It's also a strict improvement, whereas the complex GC might degrade performance.

@jonathanvdc
Author

Wow, that was fast!

The bump allocator's implementation is a bit intertwined with the GC's, but I'll see what I can do.

With regard to performance, both the bump allocator and the GC are opt-in. So neither will affect the performance of existing CUDAnative kernels.

@maleadt
Member

maleadt commented Jun 11, 2019

> With regard to performance, both the bump allocator and the GC are opt-in. So neither will affect the performance of existing CUDAnative kernels.

Yeah, but I think we could reasonably try to default to the bump allocator in the short term, since the current allocator is so bad (in terms of both performance and usability). That would be easier if it didn't depend on the rest of the functionality in this PR.

@maleadt maleadt changed the title Implement a GC EXPERIMENTAL: Implement a GC Jul 12, 2019
@maleadt maleadt force-pushed the master branch 4 times, most recently from 17dfd92 to 3c9b279 on January 22, 2020 15:17
2 participants