
Fix wgpu memory corruption for CubeCount::Dynamic #156

Merged
merged 8 commits into tracel-ai:main from fix-dynamic-mem on Oct 7, 2024

Conversation

ArthurBrussee (Contributor)

A previous PR, #56, made dealing with wgpu resources a bit trickier, and it turns out another memory corruption is possible now :) Resources used for CubeCount::Dynamic were not registered as "in use", which means a copy operation and a dispatch could be ordered incorrectly.

Another problem is that get_resource is public, which means users might enqueue operations that use a buffer themselves. Those buffers should also be tracked! The fix is to always register a buffer as "used" whenever it is handed out by get_resource.
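
For illustration, here is a minimal, standalone sketch of the tracking pattern described above. The `InFlightTracker` and `BufferId` names are illustrative stand-ins rather than the actual CubeCL/wgpu server types; only the `storage_in_flight` field name mirrors the code under review.

```rust
// Standalone sketch (assumed names): track every buffer handed out via
// get_resource so copies and dispatches can be ordered correctly.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct BufferId(u64);

#[derive(Default)]
struct InFlightTracker {
    storage_in_flight: Vec<BufferId>,
}

impl InFlightTracker {
    /// Record a buffer as "in use". Called whenever a resource is handed out,
    /// since the caller may use it in copies or dispatches the server does not
    /// see. Duplicates are fine; the list is cleared on sync.
    fn mark_in_flight(&mut self, id: BufferId) {
        self.storage_in_flight.push(id);
    }

    /// A buffer that is in flight must not be the target of a direct queue
    /// copy; pending compute work has to be flushed first.
    fn can_copy_into(&self, id: BufferId) -> bool {
        !self.storage_in_flight.contains(&id)
    }

    /// Called after the queue has been submitted and the work flushed.
    fn clear(&mut self) {
        self.storage_in_flight.clear();
    }
}
```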

This fix also seems to remove the need for the mysterious "double use for used copy buffers" workaround that was previously required. Mystery solved!

#56 has introduced two memory corruptions now, so it's clearly a bit too flaky... some simpler strategy to enable fast wgpu copying might be needed, but I don't have much inspiration at the moment!

Comment on lines 197 to 201
// Keep track of any buffer that might be used in the wgpu queue, as we cannot copy into them
// after they have any outstanding compute work. Calling get_resource repeatedly
// will add duplicates to this, but that is ok.
let handle = self.memory_management.get(binding.memory.clone());
self.storage_in_flight.push(handle.id);
nathanielsimard (Member)


We could remove duplicates by using a set? If storage_in_flight is cleared often, it's probably not worth it.

nathanielsimard (Member)


I think we should probably create a data structure with methods. We use a slice in the memory management functions, but it could be a struct with the proper methods. Then we could change the underlying data structure more easily (a Vec below a size of 10-20, a set above that).

ArthurBrussee (Contributor, Author)


Last time I checked, a set was slower, since this array isn't that big. But yeah, it's probably worth keeping the semantics simple here rather than optimizing. I've added a "MemoryLock" type now. I also changed the function to take an Option<>, as allocating an empty hash set for each call is a bit sad.
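
For reference, a rough sketch of what a MemoryLock-style wrapper could look like; the field and method names here are assumptions, and the actual type in the PR may differ.

```rust
// Sketch only (assumed names): wrap the in-flight list in a small type so the
// underlying container (Vec today, a set later) can change without touching
// callers.
type StorageId = u64; // stand-in for the real storage id type

#[derive(Default, Debug)]
struct MemoryLock {
    locked: Vec<StorageId>,
}

impl MemoryLock {
    fn add_locked(&mut self, id: StorageId) {
        self.locked.push(id);
    }

    fn is_locked(&self, id: &StorageId) -> bool {
        self.locked.contains(id)
    }

    fn clear_locked(&mut self) {
        self.locked.clear();
    }
}

// Memory management can then take Option<&MemoryLock>, so callers with nothing
// locked do not have to allocate an empty collection.
fn can_reuse(id: StorageId, lock: Option<&MemoryLock>) -> bool {
    lock.map_or(true, |l| !l.is_locked(&id))
}
```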

x.1 += 1;
x.1 < 2
});
self.storage_in_flight.clear();
nathanielsimard (Member)


Why does syncing mean that we can remove the external read-only buffers from the list?

ArthurBrussee (Contributor, Author)


Right, yes, good point! The whole get_resource API is rather unsafe, as the "resource" is tied to the original allocation. As long as the allocation is upheld, all the queue copy/write logic here is fine as well, so it's the same issue that already exists: keeping a resource past the lifetime of the original allocation is unsafe.

I've implemented the solution we came up with on Discord: keep the binding as part of the resource, or rather, wrap things in a BindingResource so all servers can do this. Let me know how this approach looks!
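
For context, here is a hedged sketch of the BindingResource idea: the wrapper holds the binding alongside the resource, so the underlying allocation stays alive for as long as the resource is in use. The field names and the stand-in Binding type are assumptions, not the exact types from the PR.

```rust
// Sketch only (assumed types): tie the resource's lifetime to its binding.
use std::sync::Arc;

// Stand-in for the real binding type, which keeps the allocation alive
// (e.g. through reference counting).
#[derive(Clone)]
struct Binding {
    memory: Arc<()>,
}

struct BindingResource<Resource> {
    // Held only to keep the allocation alive; never read directly.
    _binding: Binding,
    resource: Resource,
}

impl<Resource> BindingResource<Resource> {
    fn new(binding: Binding, resource: Resource) -> Self {
        Self { _binding: binding, resource }
    }

    fn resource(&self) -> &Resource {
        &self.resource
    }
}
```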

@nathanielsimard nathanielsimard merged commit 774d823 into tracel-ai:main Oct 7, 2024
5 checks passed
@ArthurBrussee ArthurBrussee deleted the fix-dynamic-mem branch October 29, 2024 19:27