Speed up wgpu passes/allocations #56
Conversation
I only have two comments! Otherwise LGTM
// TODO: The semantics of this seems really strange, as we might be copying before other commands
// are flushed & done. Ideally, we should pass in the actual current encoder and queue commands there
// or remove the need to copy entirely. This is not used at the moment.
I think I would remove `copy` from the storage in this PR, since it's not safe to use without `sync`, which is removed in this PR.
Oh great! I've removed it and the code that was using it. Lmk if that's too aggressive, can also just replace wgpu copy specifically with a panic.
_ => None,
};

let pass = self.start_compute_pass();
I think I would rename `start_compute_pass` and `end_compute_pass` to `add_compute_pass` and `clear_compute_pass`, otherwise it's strange that we start way more compute passes than we end.
Yeah, good point. I've actually just inlined `start_compute_pass` now; it's probably not a good idea to call that from just anywhere anyway. Renamed `end_compute_pass` to `clear_compute_pass`.
It does prevent an O(n^2) in max_tasks, but it's not worth it.
LGTM
This PR significantly reduces the CPU overhead, and slightly improves the GPU performance, of CubeCL on wgpu. It got a bit bigger than intended, as there was a cascade of things to do, so bear with me!
Outline
wgpu 22.0.0 has a new method on `ComputePass`, `leak_lifetime`, which allows using the compute pass without a complicated lifetime. It's still safe (there are runtime checks as well); it just means you have to be more careful. This allows CubeCL to batch executions into one compute pass, which is faster. However, you can't use `encoder.copy_buffer_to_buffer` while a compute pass is still open. Since `client.create()` gets called before each `execute`, this prevents the compute pass optimization from really working.
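A minimal sketch of the idea (not the actual CubeCL code): keep one lifetime-erased compute pass open across several dispatches and only close it when an encoder-level command or a submit is needed. The PR text calls the method `leak_lifetime`; the released wgpu 22 API exposes the same idea as `ComputePass::forget_lifetime`, which is what this sketch assumes. All type and function names here are illustrative.

```rust
struct PassState {
    pass: Option<wgpu::ComputePass<'static>>,
    encoder: wgpu::CommandEncoder,
}

impl PassState {
    /// Lazily open a compute pass and erase its borrow of the encoder.
    fn compute_pass(&mut self) -> &mut wgpu::ComputePass<'static> {
        if self.pass.is_none() {
            let pass = self
                .encoder
                .begin_compute_pass(&wgpu::ComputePassDescriptor {
                    label: None,
                    timestamp_writes: None,
                })
                .forget_lifetime(); // assumed wgpu 22 API; validity is still checked at runtime
            self.pass = Some(pass);
        }
        self.pass.as_mut().unwrap()
    }

    /// Record one kernel into the currently open pass (opening it if needed).
    fn dispatch(
        &mut self,
        pipeline: &wgpu::ComputePipeline,
        bind_group: &wgpu::BindGroup,
        [x, y, z]: [u32; 3],
    ) {
        let pass = self.compute_pass();
        pass.set_pipeline(pipeline);
        pass.set_bind_group(0, bind_group, &[]);
        pass.dispatch_workgroups(x, y, z);
    }

    /// Close the pass. This must happen before encoder-level commands such as
    /// `copy_buffer_to_buffer`, and before finishing/submitting the encoder.
    fn clear_compute_pass(&mut self) {
        self.pass = None;
    }
}
```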
The way `client.create()` works is sub-optimal anyway: `queue.write_buffer_with` is faster, and on most GPUs doesn't need a staging buffer & copy at all, saving CPU and GPU work. This should also be a nice saving when e.g. uploading a batch of data for training.
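A minimal sketch of what an upload through `Queue::write_buffer_with` looks like, assuming `queue`, `buffer`, `offset`, and `data` already exist; the function name is illustrative. The queue hands back a staging view to fill directly (and can skip the extra copy on unified-memory GPUs), and the write is ordered at the start of the next submission.

```rust
use std::num::NonZeroU64;

fn upload(queue: &wgpu::Queue, buffer: &wgpu::Buffer, offset: u64, data: &[u8]) {
    let size = NonZeroU64::new(data.len() as u64).expect("empty upload");
    // Returns None if the write is invalid (out of bounds, misaligned, ...).
    let mut view = queue
        .write_buffer_with(buffer, offset, size)
        .expect("invalid write_buffer_with parameters");
    view.copy_from_slice(data);
    // The data reaches GPU memory when the view is dropped, before the next queue submission.
}
```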
Using `queue.write_buffer_with` is tricky, however: the GPU memory gets written at the start of the command buffer submitted to the queue. This doesn't work with pooled memory, though! CubeCL might want an order like `kernel A -> write_buffer -> kernel B`. Since the write actually happens first, kernel A would stamp over the memory. So we need a way to avoid reusing memory that was already used earlier in the pass. I've considered some other options, like a separate memory pool for `create()` allocations, but ultimately went with the approach that also applies to this issue, where you can exclude a set of resources when allocating. For this case we would only need to avoid overlapping memory regions, not whole buffers, but that would get more complicated and probably isn't worth it.
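A hypothetical sketch of the "exclude on allocation" idea; none of these names are the real CubeCL API. `reserve` takes a set of storage ids that the returned slice may not live in, so a buffer written via `write_buffer_with` is never one that an earlier kernel in the same pass already reads or writes.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct StorageId(u64);

struct SliceHandle {
    storage: StorageId,
    offset: u64,
    size: u64,
}

trait MemoryPool {
    /// Return a slice of `size` bytes that does not live in any storage in `exclude`.
    fn reserve(&mut self, size: u64, exclude: &HashSet<StorageId>) -> SliceHandle;
}

/// During `create()`: exclude every storage already touched by kernels recorded
/// in the currently open compute pass.
fn create(pool: &mut dyn MemoryPool, used_this_pass: &HashSet<StorageId>, len: u64) -> SliceHandle {
    pool.reserve(len, used_this_pass)
}
```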
An earlier attempt at the issue above was to exclude `ChunkId`s. This is OK but a bit strange: the requirement is really to not share buffers. In the implementation one chunk == one buffer, so it could work, but that raises the question of why there are `ChunkId`s in the first place. This seemed worth simplifying, so the API now just maps `StorageId`s to chunk metadata, as sketched below. This saves some overhead and lookups as well.
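A hypothetical sketch of the simplified bookkeeping, with illustrative names only: the pool keys chunk metadata directly by storage id, since one storage buffer corresponds to one chunk, removing the extra `ChunkId` indirection.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct StorageId(u64);

struct ChunkMeta {
    size: u64,
    /// (offset, size) of the slices currently carved out of this buffer.
    slices: Vec<(u64, u64)>,
}

struct Pool {
    chunks: HashMap<StorageId, ChunkMeta>,
}

impl Pool {
    /// One lookup by storage id replaces the old StorageId -> ChunkId -> chunk chain.
    fn chunk(&self, id: StorageId) -> Option<&ChunkMeta> {
        self.chunks.get(&id)
    }
}
```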
The current pool code was generic over the chunk and slice type, and mocked these for testing. That meant I couldn't get the storage ID, however, and imho this abstraction & mocking isn't great. I've removed it and made the tests work with a "real" setup, which exercises more parts of the code. This might catch some more bugs.
The `sync()` method in `mm.alloc()` and `mm.reserve()` made all of this trickier as well, with gnarlier semantics (how to deal with writes that happen as soon as we sync?). This doesn't seem to be used atm and imho sounds like trouble in general! I've removed this for now, but lmk if I'm missing the crucial reason it should be there.
The ability of `ComputeStorage` to copy buffers is also a bit problematic, as it probably also needs to be ordered properly within the encoder. I've just left a note of this for now as it's not currently used. I'm not sure what the plans are for extending the max memory, but it would be great if it could be done without copying.
Benchmarks
Some benchmarks on my Burn gaussian splatting code, measuring:
- Inference CPU
- Inference + GPU sync
- Training step CPU
Possible future work