Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ann-bench multithreading #2021

Merged

Conversation

achirkin
Copy link
Contributor

@achirkin achirkin commented Nov 22, 2023

In the current state, ann-benchmarks running in the --throughput mode (multi-threaded) share ANN wrappers among CPU threads. This is not thread-safe and may result in incorrectly measured time (e.g. sharing cuda events among CPU threads) or various exceptions and segfaults (e.g. doing state-changing cublas calls from multiple CPU threads).

This PR makes the search benchmarks copy ANN wrappers in each thread. The copies of the wrappers then selectively:

  • share thread-safe resources (e.g. rmm memory pool) and large objects that are not expected to change during search (e.g. index data);
  • duplicate the resources that are not thread-safe or carry the thread-specific state (e.g. cublas handles, CUDA events and streams).

Alongside, the PR adds a few small changes, including:

  • enables ann-bench NVTX annotations for the non-common-executable mode (shows benchmark labels and iterations in nsys timeline);
  • fixes compile errors for the common-executable mode.

@achirkin achirkin added bug Something isn't working non-breaking Non-breaking change 2 - In Progress Currenty a work in progress labels Nov 23, 2023
@cjnolet cjnolet marked this pull request as ready for review November 23, 2023 16:14
@cjnolet cjnolet requested review from a team as code owners November 23, 2023 16:14
@achirkin
Copy link
Contributor Author

achirkin commented Nov 24, 2023

Progress notes:

  • At the moment, a new raft::resources handle is created for each thread, as the handle seems to be not thread-safe. This can be quickly fixed by modifying the copy-constructor of the configured_raft_resources once we're sure raft::resources is thread-safe.
  • The PR needs more testing!

@achirkin achirkin self-assigned this Nov 24, 2023
@achirkin achirkin changed the title [WIP] Fix ann-bench multithreading Fix ann-bench multithreading Nov 24, 2023
@achirkin achirkin added 3 - Ready for Review and removed 2 - In Progress Currenty a work in progress labels Nov 27, 2023
Copy link
Contributor

@wphicks wphicks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one substantive question, but this looks great. Love the design of configured_raft_resources.

*
* To avoid copying the index (database) in each thread, we make both the index and
* the gpu_resource shared.
* This means faiss GPU streams are possibly shared among the CPU threads;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have an issue tracking this?

cpp/bench/ann/src/faiss/faiss_gpu_wrapper.h Show resolved Hide resolved
@rapidsai rapidsai deleted a comment from wphicks Dec 3, 2023
@cjnolet
Copy link
Member

cjnolet commented Dec 3, 2023

@wphicks this PR isn't quite ready to be merged yet. We have some further discussions before this is ready.

Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making sure we discuss this before merging.

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cjnolet pointed me to this PR because I was asking about uses of MR comparisons. Looking at the call to is_equal used in this PR I had a few other suggestions to improve the surrounding code.

Comment on lines 63 to 76
configured_raft_resources()
: configured_raft_resources{
{[]() {
auto* mr = new rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>{
rmm::mr::get_current_device_resource(), 1024 * 1024 * 1024ull};
rmm::mr::set_current_device_resource(mr);
return mr;
}(),
[](rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>* mr) {
if (rmm::mr::get_current_device_resource()->is_equal(*mr)) {
rmm::mr::set_current_device_resource(mr->get_upstream());
}
delete mr;
}}}
Copy link
Member

@harrism harrism Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Initialize the members of the class by name, not as a struct.
suggestion: don't bury the re-setting of the previous device resource inside an unnamed deleter lambda. Just reset it in your class destructor.
question: does the mr_ really need to be shared?
suggestion: use a unique_ptr.

Suggested change
configured_raft_resources()
: configured_raft_resources{
{[]() {
auto* mr = new rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>{
rmm::mr::get_current_device_resource(), 1024 * 1024 * 1024ull};
rmm::mr::set_current_device_resource(mr);
return mr;
}(),
[](rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>* mr) {
if (rmm::mr::get_current_device_resource()->is_equal(*mr)) {
rmm::mr::set_current_device_resource(mr->get_upstream());
}
delete mr;
}}}
configured_raft_resources()
: mr_{make_unique<rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>>(
rmm::mr::get_current_device_resource(), 1024 * 1024 * 1024ull)}
{
}
~configured_raft_resources()
{
// reset the previous current_device_resource before deleting our pool.
rmm::mr::set_current_device_resource(mr_->get_upstream());
}

Note that this also removes the call to is_equal() -- is there a reason in this benchmark that you are worried about the possibility of the current resource changing during the lifetime of this class?

In the future, please use operator== on MRs, because is_equal is likely to be removed.

Copy link
Contributor Author

@achirkin achirkin Dec 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Mark, thanks for having a look at this!

First of all, yes, we must use the shared pointer here. We create a new configured_raft_resources at most once per benchmark case, and let multiple concurrent threads spawned by gbench copy it. We do this because we need some of the resources not be shared due to thread safety (the raft handle and the event). However, we must have exactly one rmm pool resource shared by everything, which is dictated by rmm current_device_resource semantics. Hence we need the pool resource be initialized with the original configured_raft_resources and destructed after the last copy of configured_raft_resources is destructed.

The same applies to setting/unsetting the rmm current_device_resource. Hence I cannot just reset it my class destructor and not bury it inside an unnamed deleter lambda. An alternative would be to create a new class with the destructor that does the same and put it in the shared pointer, but why bother?

Also, I'm just curious as I'm not that experienced with the latest C++ best practices. What are the downsides of using curly braces to initialize the members "as a struct" as I did in this code?

Copy link
Member

@harrism harrism Dec 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The downsides are that you aren't initializing them by name, so it is less readable, and if the order of the class members changes this initialization will be invalid.

My suggestions were aimed at improving the readability and maintainability of the code, which currently I find obfuscated and error prone. But really this is not a library I have to read or maintain so it's up to you and the RAFT reviewers.

The one thing that I ask (and the thing that brought me here) is that you remove the call to is_equal() for the resources, as this is probably going away. If you must compare the resources, use ==.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing out to the rmm issue, I've updated the code and added a bit more comments alongside to make the intent clear.

I'm genuinely willing to improve the readability of the code, but I'm still not sure I understand what you mean regarding the initializing-as-a-struct issue. To me the pattern seems pretty similar to your rmm::cuda_stream, which I don't find obfuscated or unreadable. Could you please adjust your suggestion to not change the semantics of the code (lifetime of the pool)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@achirkin the big difference between the rmm example you've pointed out and the example here is that the rmm initializers are still naming the members so if the names should change or he rearranged, maintenance becomes much more straightforward.

I do agree with Mark here, and to be quite honest, I find the use of the pattern here (unnamed parameter initialization through lambdas) to be quite hard to read myself. I also prefer that we at least name the parameters, especially for a list that includes multiple different members. It seems like not much additional work in proportion to the significant gain in readability.

Copy link
Contributor Author

@achirkin achirkin Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, now I really feel dumb staring at the code and not seeing what you both mean :) I'm just calling here another constructor with a single argument that takes the pool, am I not? If you're referring to the aggregate initialization, it's not even possible for this class, because it has two user-declared constructors.

Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@cjnolet
Copy link
Member

cjnolet commented Dec 13, 2023

/merge

@rapids-bot rapids-bot bot merged commit d9a7290 into rapidsai:branch-24.02 Dec 13, 2023
61 checks passed
rapids-bot bot pushed a commit that referenced this pull request Dec 13, 2023
With the changes introduced by #2021, the copied FAISS benchmark wrapper contains a cuda event that is used for synchronizing between streams during search. The lifetime of the event is the same as of the wrapper, but the event handle itself is copied between the wrappers; this leads to illegal memory accesses and crashes.
This PR fixes the bug by creating a new cuda event on each wrapper copy, so that the wrappers do not share their synchronization events.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #2062
ChristinaZ pushed a commit to ChristinaZ/raft that referenced this pull request Jan 17, 2024
In the current state, ann-benchmarks running in the `--throughput` mode (multi-threaded) share ANN wrappers among CPU threads. This is not thread-safe and may result in incorrectly measured time (e.g. sharing cuda events among CPU threads) or various exceptions and segfaults (e.g. doing state-changing cublas calls from multiple CPU threads).

This PR makes the search benchmarks copy ANN wrappers in each thread. The copies of the wrappers then selectively:
  - share thread-safe resources (e.g. rmm memory pool) and large objects that are not expected to change during search (e.g. index data);
  - duplicate the resources that are not thread-safe or carry the thread-specific state (e.g. cublas handles, CUDA events and streams).

Alongside, the PR adds a few small changes, including:
 - enables ann-bench NVTX annotations for the non-common-executable mode (shows benchmark labels and iterations in nsys timeline);
 - fixes compile errors for the common-executable mode.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)
  - William Hicks (https://github.com/wphicks)

Approvers:
  - William Hicks (https://github.com/wphicks)
  - Mark Harris (https://github.com/harrism)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#2021
ChristinaZ pushed a commit to ChristinaZ/raft that referenced this pull request Jan 17, 2024
…apidsai#2062)

With the changes introduced by rapidsai#2021, the copied FAISS benchmark wrapper contains a cuda event that is used for synchronizing between streams during search. The lifetime of the event is the same as of the wrapper, but the event handle itself is copied between the wrappers; this leads to illegal memory accesses and crashes.
This PR fixes the bug by creating a new cuda event on each wrapper copy, so that the wrappers do not share their synchronization events.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#2062
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review bug Something isn't working CMake cpp non-breaking Non-breaking change
Projects
Development

Successfully merging this pull request may close these issues.

4 participants