fix to_scalar issue of very high latency #2240

Open
wants to merge 1 commit into base: main

Conversation

RoggeOhta

I'm hoping to resolve this issue.
The tensor that to_scalar is called on is zero-dimensional, so only a single element actually needs to leave the device. I changed the CudaStorage match arm in to_scalar to call dtoh_sync_copy on a CudaView<T> created specifically for the copy to CPU memory.
Calling dtoh_sync_copy directly on the CudaSlice<T> copies the entire underlying storage; going through a one-element CudaView avoids that full copy.

Code sample:

...
CudaStorageSlice::F32(slice) => {
    // View only the single element at `offset` instead of the whole slice.
    let sub_slice = slice.slice(offset..offset + 1);
    let device = slice.device();
    // dtoh_sync_copy on the one-element view transfers just that element to host memory.
    let cpu_storage = device.dtoh_sync_copy(&sub_slice).w()?;
    return from_cpu_storage_single(&CpuStorage::F32(cpu_storage));
}
...
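
For comparison, the problematic pattern described above looks roughly like the following sketch (an assumed reconstruction for illustration, not the exact pre-fix upstream code): passing the CudaSlice<T> itself to dtoh_sync_copy transfers the whole backing buffer just to read one element.

...
CudaStorageSlice::F32(slice) => {
    let device = slice.device();
    // Copies the ENTIRE underlying buffer to host memory (about 400 MB for
    // the 10000 x 10000 f32 tensor benchmarked below), even though only the
    // element at `offset` is needed.
    let cpu_storage = device.dtoh_sync_copy(slice).w()?;
    // The single value is then extracted from the full Vec on the CPU side.
    ...
}
...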

Fix #2239 (Tensor::to_scalar very high latency)

@RoggeOhta
Author

All tests pass with and without the cuda feature in my local environment.

@EricLBuehler
Member

@RoggeOhta do you measure a performance increase when you run with CUDA_LAUNCH_BLOCKING=1?

@RoggeOhta
Author

RoggeOhta commented Jun 3, 2024

@RoggeOhta do you measure a performance increase when you run with CUDA_LAUNCH_BLOCKING=1?

Output of the minimal reproducible example from the linked issue.
All builds use the release profile.
This is a quick test rather than a rigorous benchmark; it only demonstrates the order-of-magnitude change.
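
For context, CUDA_LAUNCH_BLOCKING=1 forces CUDA kernel launches to execute synchronously, so the host-side timers below measure completed GPU work rather than asynchronous launches. The timings come from a harness along these lines (a hypothetical sketch of the linked issue's example, not its exact code; names like cpu_t and gpu_t are my own, and the dbg! file/line in the output will differ):

use std::time::Instant;

use candle_core::{Device, IndexOp, Tensor};

fn main() -> candle_core::Result<()> {
    let device = Device::new_cuda(0)?;

    // Build a large tensor on the CPU.
    let start = Instant::now();
    let cpu_t = Tensor::rand(0f32, 1f32, (10000, 10000), &Device::Cpu)?;
    println!("Tensor on CPU: {cpu_t:?}");
    println!("{:?}", start.elapsed());

    // Move it to the GPU.
    let start = Instant::now();
    let gpu_t = cpu_t.to_device(&device)?;
    println!("Tensor on GPU: {:?}", gpu_t.layout());
    println!("{:?}", start.elapsed());

    // to_scalar requires a rank-0 tensor, so index out a single element first.
    let start = Instant::now();
    let scalar = gpu_t.i((0, 0))?.to_scalar::<f32>()?;
    println!("{:?}  # to_scalar", start.elapsed());
    dbg!(scalar);
    Ok(())
}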

Before fix with CUDA_LAUNCH_BLOCKING=1

Tensor on CPU: Tensor[dims 10000, 10000; f32]
117.986679ms
Tensor on GPU: Layout { shape: [10000, 10000], stride: [10000, 1], start_offset: 0 }
4.219µs
259.142736ms  # to_scalar
[candle-test/src/main.rs:26:5] scalar = 0.09058428

Before fix without CUDA_LAUNCH_BLOCKING=1

Tensor on CPU: Tensor[dims 10000, 10000; f32]
92.308598ms
Tensor on GPU: Layout { shape: [10000, 10000], stride: [10000, 1], start_offset: 0 }
7.003µs
292.98987ms  # to_scalar
[candle-test/src/main.rs:26:5] scalar = 0.630551

After fix with CUDA_LAUNCH_BLOCKING=1

Tensor on CPU: Tensor[dims 10000, 10000; f32]
92.458806ms
Tensor on GPU: Layout { shape: [10000, 10000], stride: [10000, 1], start_offset: 0 }
2.835µs
37.574µs # to_scalar
[candle-test/src/main.rs:26:5] scalar = 0.23715067

After fix without CUDA_LAUNCH_BLOCKING=1

Tensor on CPU: Tensor[dims 10000, 10000; f32]
159.794325ms
Tensor on GPU: Layout { shape: [10000, 10000], stride: [10000, 1], start_offset: 0 }
10.94µs
24.328µs # to_scalar
[candle-test/src/main.rs:26:5] scalar = 0.076687455
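
In both configurations, the to_scalar call drops from roughly 260-290 ms before the fix to under 40 µs after it, consistent with copying a single f32 instead of the full 10000 x 10000 buffer (about 400 MB).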
