
Fix wgpu memory corruption for CubeCount::Dynamic #156

Merged
merged 8 commits into tracel-ai:main from fix-dynamic-mem on Oct 7, 2024

Conversation

ArthurBrussee (Contributor)

A previous PR, #56, made dealing with wgpu resources a bit trickier, and it turns out another memory corruption is possible now :) Resources used for CubeCount::Dynamic were not registered as "in use", which means a copy operation and a dispatch could be ordered incorrectly.

Another problem is that get_resource is public, which means users might enqueue operations that use a buffer themselves. Those buffers should also be tracked! The fix is to always register a buffer as "used" whenever it is handed out by get_resource.
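
For illustration, here is a minimal, standalone sketch of the tracking pattern described above. The `InFlightTracker` and `BufferId` names are illustrative stand-ins rather than the actual CubeCL/wgpu server types; only the `storage_in_flight` field name mirrors the code under review.

```rust
// Standalone sketch (assumed names): track every buffer handed out via
// get_resource so copies and dispatches can be ordered correctly.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct BufferId(u64);

#[derive(Default)]
struct InFlightTracker {
    storage_in_flight: Vec<BufferId>,
}

impl InFlightTracker {
    /// Record a buffer as "in use". Called whenever a resource is handed out,
    /// since the caller may use it in copies or dispatches the server does not
    /// see. Duplicates are fine; the list is cleared on sync.
    fn mark_in_flight(&mut self, id: BufferId) {
        self.storage_in_flight.push(id);
    }

    /// A buffer that is in flight must not be the target of a direct queue
    /// copy; pending compute work has to be flushed first.
    fn can_copy_into(&self, id: BufferId) -> bool {
        !self.storage_in_flight.contains(&id)
    }

    /// Called after the queue has been submitted and the work flushed.
    fn clear(&mut self) {
        self.storage_in_flight.clear();
    }
}
```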

This fix also seems to remove the need for the mysterious "double use for used copy buffers" workaround that was previously required. Mystery solved!

#56 has introduced two memory corruptions now, so it's clearly a bit too flaky... some simpler strategy to enable fast wgpu copying might be needed, but I don't have much inspiration at the moment!

Comment on lines 197 to 201
// Keep track of any buffer that might be used in the wgpu queue, as we cannot copy into them
// after they have any outstanding compute work. Calling get_resource repeatedly
// will add duplicates to this, but that is ok.
let handle = self.memory_management.get(binding.memory.clone());
self.storage_in_flight.push(handle.id);
nathanielsimard (Member)


We could remove duplicates by using a set? If storage_in_flight is cleared often, it's probably not worth it.

nathanielsimard (Member)


I think we should probably create a data structure with methods. We use a slice in the memory management functions, but it could be a struct with the proper methods. Then we could change the underlying data structure more easily (a Vec below a size of 10-20, a set above that).

ArthurBrussee (Contributor, Author)


Last time I checked, a set was slower, since this array isn't that big. But yeah, it's probably worth keeping the semantics simple here rather than optimizing. I've added a "MemoryLock" type now. I also changed the function to take an Option<>, as allocating an empty hash set for each call is a bit sad.
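
For reference, a rough sketch of what a MemoryLock-style wrapper could look like; the field and method names here are assumptions, and the actual type in the PR may differ.

```rust
// Sketch only (assumed names): wrap the in-flight list in a small type so the
// underlying container (Vec today, a set later) can change without touching
// callers.
type StorageId = u64; // stand-in for the real storage id type

#[derive(Default, Debug)]
struct MemoryLock {
    locked: Vec<StorageId>,
}

impl MemoryLock {
    fn add_locked(&mut self, id: StorageId) {
        self.locked.push(id);
    }

    fn is_locked(&self, id: &StorageId) -> bool {
        self.locked.contains(id)
    }

    fn clear_locked(&mut self) {
        self.locked.clear();
    }
}

// Memory management can then take Option<&MemoryLock>, so callers with nothing
// locked do not have to allocate an empty collection.
fn can_reuse(id: StorageId, lock: Option<&MemoryLock>) -> bool {
    lock.map_or(true, |l| !l.is_locked(&id))
}
```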

x.1 += 1;
x.1 < 2
});
self.storage_in_flight.clear();
nathanielsimard (Member)


Why does syncing mean that we can remove the external read-only buffers from the list?

ArthurBrussee (Contributor, Author)


Right, yes, good point! The whole get_resource API is rather unsafe, as the "resource" is tied to the original allocation. As long as the allocation is upheld, all the queue copy/write logic here is fine as well, so it's the same issue that already exists: keeping a resource past the lifetime of the original allocation is unsafe.

I've implemented the solution we came up with on Discord: keep the binding as part of the resource, or rather, wrap things in a BindingResource so all servers can do this. Let me know how this approach looks!
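
For context, here is a hedged sketch of the BindingResource idea: the wrapper holds the binding alongside the resource, so the underlying allocation stays alive for as long as the resource is in use. The field names and the stand-in Binding type are assumptions, not the exact types from the PR.

```rust
// Sketch only (assumed types): tie the resource's lifetime to its binding.
use std::sync::Arc;

// Stand-in for the real binding type, which keeps the allocation alive
// (e.g. through reference counting).
#[derive(Clone)]
struct Binding {
    memory: Arc<()>,
}

struct BindingResource<Resource> {
    // Held only to keep the allocation alive; never read directly.
    _binding: Binding,
    resource: Resource,
}

impl<Resource> BindingResource<Resource> {
    fn new(binding: Binding, resource: Resource) -> Self {
        Self { _binding: binding, resource }
    }

    fn resource(&self) -> &Resource {
        &self.resource
    }
}
```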

@nathanielsimard nathanielsimard merged commit 774d823 into tracel-ai:main Oct 7, 2024
5 checks passed
@ArthurBrussee ArthurBrussee deleted the fix-dynamic-mem branch October 29, 2024 19:27