
Keep NMS index gathering on cuda device #8766

Open

Ghelfi wants to merge 1 commit into main from nms-unwrap-on-cuda

Conversation
@Ghelfi Ghelfi commented Nov 29, 2024

Performs the unwrap of the IoU mask directly on the CUDA device in NMS.

This prevents a device -> host -> device data transfer.

fixes #8713
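For readers unfamiliar with the kernel, here is a minimal pure-Python sketch (not the actual torchvision CUDA code) of the bitmask "unwrap" step this PR keeps on the device: the NMS kernel emits, for each box, 64-bit chunks marking which lower-scored boxes it suppresses, and the unwrap loop walks that mask to collect the kept indices. The function name and mask layout here are illustrative.

```python
def unwrap_nms_mask(mask, n_boxes, chunk_bits=64):
    """Sketch of NMS mask unwrapping.

    mask[i][c] is bitmask chunk c for box i: bit j set means box i
    suppresses box (c * chunk_bits + j). Boxes are assumed to be
    pre-sorted by descending score, as in NMS.
    """
    n_chunks = (n_boxes + chunk_bits - 1) // chunk_bits
    removed = [0] * n_chunks          # accumulated suppression bits
    keep = []
    for i in range(n_boxes):
        chunk, bit = divmod(i, chunk_bits)
        if removed[chunk] & (1 << bit):
            continue                  # box i already suppressed by a higher-scoring box
        keep.append(i)
        for c in range(n_chunks):     # mark every box that box i overlaps as removed
            removed[c] |= mask[i][c]
    return keep
```

When this loop runs on the CPU (the pre-PR behavior), `mask` must first be copied to the host and the kept indices copied back; running it on the GPU avoids both transfers.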


pytorch-bot bot commented Nov 29, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/8766

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot

Hi @Ghelfi!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@Ghelfi Ghelfi force-pushed the nms-unwrap-on-cuda branch from 2db6ab2 to 3f40bdb on December 20, 2024 12:29
@Ghelfi Ghelfi marked this pull request as ready for review December 20, 2024 13:54
@Ghelfi
Author

Ghelfi commented Dec 20, 2024

@NicolasHug for review

@CNOCycle

CNOCycle commented Jan 7, 2025

I would like to share an important finding regarding the root cause of slow performance in NMS when dealing with a large number of boxes (#8713). The issue is primarily due to the occurrence of minor page faults, as detailed in [1]. When data is transferred from the GPU to the host, the operating system must locate an appropriate memory space to store these temporary variables, and this operation can be quite costly in terms of execution time. As shown in Fig. 2 of [1], execution time curves diverge into two distinct groups as the number of objects increases, with varying results across different hardware configurations. Further analysis is provided in Table 1 and Section Performance Analysis of [1].

To summarize, the key takeaways are:

  • The slow execution time of NMS can be reproduced on both x86-64 and ARM systems, across various generations of Nvidia GPUs.
  • To reduce the execution time of NMS, we should aim to minimize the occurrence of minor page faults.

Since end users typically do not have the privilege to modify the operating system, and adjusting the lifecycle of these temporary variables may fall outside the scope of torchvision, I suggest the following approaches, which are detailed in [1]:

  1. CPU-free NMS: This method is the same as the approach proposed in this PR (Keep NMS index gathering on cuda device #8766), and its mechanism has been extensively studied in [1].
  2. Async NMS: The performance of Async NMS depends on various factors, including the version of CUDA, GPU driver, the operating system, and current memory usage. More details can be found in the Overhead of Minor Page Faults section of the Discussion in [1].
  3. Hybrid NMS: This approach is more complex, requiring meta-information about the currently used system. It is thus more suitable for advanced users.

The experimental comparison among the three approaches can be found in [1]. Personally, I highly recommend CPU-free NMS. Nevertheless, even the simplest implementation of Async NMS could provide performance benefits if the data copy is made non-blocking and a stream synchronization is added before the CPU accesses the data:

// ... nms_kernel_impl has already run on the GPU and written `mask` ...
// Start the device -> host copy without blocking the CPU.
at::Tensor mask_cpu = mask.to(at::kCPU, /*non_blocking=*/true);
// ... unrelated CPU work can overlap with the copy here ...
// Wait for the copy to finish before the CPU touches the data.
cudaStreamSynchronize(stream);
unsigned long long* mask_host = (unsigned long long*)mask_cpu.data_ptr<int64_t>();

The following are results of the default NMS and CPU-free NMS executed on a V100 with the latest docker image nvcr.io/nvidia/pytorch:24.12-py3. It can be seen that the execution time of CPU-free NMS is half that of the default NMS. In the worst-case scenario, we simulated a situation in which all objects output by the object detection model survive NMS. The best-case scenario returned only one object, while the random case randomly assigned properties to all objects. These three distinct cases let us evaluate the performance of the proposed methods under a range of circumstances and make meaningful comparisons between them.
[Figure: execution time vs. number of objects, default NMS]
[Figure: execution time vs. number of objects, CPU-free NMS]
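The worst/best/random regimes described above can be reproduced with a small box generator. This is an illustrative sketch (the coordinates and sizes are my own choices, not the paper's exact benchmark): disjoint boxes make every box survive NMS, identical boxes leave exactly one survivor, and random boxes fall somewhere in between.

```python
import random

def make_boxes(n, case):
    """Generate n boxes as (x1, y1, x2, y2) tuples for a given benchmark regime."""
    if case == "worst":    # disjoint boxes: zero IoU, so every box survives NMS
        return [(10.0 * i, 0.0, 10.0 * i + 5.0, 5.0) for i in range(n)]
    if case == "best":     # identical boxes: IoU of 1, so only one box survives
        return [(0.0, 0.0, 5.0, 5.0)] * n
    if case == "random":   # random geometry: the survival rate varies per draw
        out = []
        for _ in range(n):
            x1, y1 = random.uniform(0.0, 100.0), random.uniform(0.0, 100.0)
            out.append((x1, y1, x1 + random.uniform(1.0, 20.0),
                        y1 + random.uniform(1.0, 20.0)))
        return out
    raise ValueError(f"unknown case: {case}")
```

Feeding each regime to `torchvision.ops.nms` at increasing `n` is enough to reproduce the scaling curves above.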

I would strongly encourage the community to cite paper [1], to help users understand that the code modifications involved in this PR (#8766) are based not just on experimental results, but also on clear explanations and insights drawn from computing architecture.

References:

[1] Chen, E. C., Chen, P. Y., Chung, I., & Lee, C. R. (2024). Latency Attack Resilience in Object Detectors: Insights from Computing Architecture. In Proceedings of the Asian Conference on Computer Vision (pp. 3206-3222).

Successfully merging this pull request may close these issues.

torchvision.ops.boxes.batched_nms slow on large box numbers