This repository has been archived by the owner on May 27, 2021. It is now read-only.

EXPERIMENTAL: Implement a GC #419

Open

wants to merge 150 commits into master
Conversation

jonathanvdc

Hi! Here's a PR that implements a garbage collector for CUDAnative. Major additions include:

  • The garbage collector itself (gc.jl). The GC is a non-moving, semi-conservative, stop-the-world GC that uses a free list for memory allocations.
  • Numerous changes to the compiler itself to support the GC.
  • An implementation of the important bits of the low-level Julia array API. This allows us to use regular arrays from Julia kernels.
  • A flexible interrupt mechanism that implements GPU-to-CPU callbacks. The GC uses it to trigger collections, which happen on the CPU rather than the GPU.
  • Threading primitives. The GC uses them to ensure mutual exclusion for critical sections.
  • A bump allocator that can be used as a fast alternative to the GC. Unlike the GC, the bump allocator can't expand its heap when necessary. However, it is very fast, and its heap is easy to dispose of after a kernel completes. This latter point is a big win over the old memory allocator based on CUDA malloc, which leaks memory perpetually, even across kernel invocations.
  • A set of GC benchmarks for evaluating the GC's performance.
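To make the bump-allocator idea above concrete, here is a rough CPU-side sketch in plain Julia. The names (`BumpHeap`, `bump_alloc!`, `reset!`) are illustrative only, not the PR's actual API, and the single-threaded offset bump stands in for what would be an atomic add shared by all GPU threads:

```julia
# Sketch of a bump allocator: a single offset that only moves forward.
mutable struct BumpHeap
    buffer::Vector{UInt8}   # pre-allocated heap
    offset::Int             # next free byte (1-based)
end

BumpHeap(capacity::Int) = BumpHeap(Vector{UInt8}(undef, capacity), 1)

# Allocate `n` bytes, or return `nothing` when the heap is exhausted;
# unlike the GC, a bump allocator cannot grow its heap on demand.
function bump_alloc!(heap::BumpHeap, n::Int)
    start = heap.offset
    start + n - 1 > length(heap.buffer) && return nothing
    heap.offset = start + n
    return view(heap.buffer, start:start+n-1)
end

# Disposing of the whole heap after a kernel completes is a single reset.
reset!(heap::BumpHeap) = (heap.offset = 1; heap)
```

Allocation is just an offset increment with a bounds check, which is why it is so much faster than a free-list GC, and why "freeing" the heap after a kernel finishes costs nothing.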

Note: these changes depend on the 'configurable-lowering-2' branch of my fork of the julia repo (jonathanvdc/julia). The lowering scheme won't work unless that version of Julia is used.
The 'init' kwarg to '@cuda' allows users to define custom kernel initialization logic, which runs just before the kernel starts. The main use case for this kwarg right now is setting up globals.
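A hypothetical usage of this kwarg might look like the following; the kernel, the `setup_globals` helper, and the exact shape of the `init` callback are placeholders, since the real signature is defined by this PR's branch rather than by a released CUDAnative:

```julia
using CUDAnative, CuArrays

function my_kernel(a)
    a[threadIdx().x] = 1f0
    return
end

d_a = CuArray{Float32}(undef, 32)

# Hypothetical init callback: runs just before the kernel launches,
# e.g. to set up globals (such as GC or interrupt state) that the
# kernel depends on. The body here is a placeholder.
setup_globals(kernel) = nothing

@cuda threads=32 init=setup_globals my_kernel(d_a)
```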
I built these examples mostly as experiments. Their core logic ended up in 'interrupts.jl', which exposes a high-level interface. The examples deleted by this commit do not: they're low-level and kind of hacky.
@maleadt
Member

maleadt commented Jun 11, 2019

Great! Good to have this here as a PR.

Would you mind factoring out the bump allocator? I'd prefer to merge something less complex first. It's also a strict improvement, whereas the complex GC might degrade performance.

@jonathanvdc
Author

Wow, that was fast!

The bump allocator's implementation is a bit intertwined with the GC's, but I'll see what I can do.

With regard to performance, both the bump allocator and the GC are opt-in. So neither will affect the performance of existing CUDAnative kernels.

@maleadt
Member

maleadt commented Jun 11, 2019

> With regard to performance, both the bump allocator and the GC are opt-in. So neither will affect the performance of existing CUDAnative kernels.

Yeah, but I think we could reasonably try to default to the bump allocator in the short term, since the current allocator is so bad (in terms of both performance and usability). That would be easier if it didn't depend on the rest of the functionality in this PR.

@maleadt maleadt changed the title Implement a GC EXPERIMENTAL: Implement a GC Jul 12, 2019
@maleadt maleadt force-pushed the master branch 4 times, most recently from 17dfd92 to 3c9b279 on January 22, 2020 15:17
2 participants