fix to_scalar issue of very high latency #2240

Open
wants to merge 1 commit into base: main

Conversation

RoggeOhta

I'm hoping to resolve this issue.
The tensor that to_scalar is called on is zero-dimensional, so only a single element actually needs to leave the device. I changed the CudaStorage match arm in to_scalar to call dtoh_sync_copy on a CudaView<T> created specifically for the copy to CPU memory.
Calling dtoh_sync_copy directly on the CudaSlice<T> copies the entire underlying storage; going through a one-element CudaView avoids that full copy.

Code sample:

...
CudaStorageSlice::F32(slice) => {
    // View only the single element at `offset` instead of the whole slice.
    let sub_slice = slice.slice(offset..offset + 1);
    let device = slice.device();
    // dtoh_sync_copy on the one-element view transfers just that element to host memory.
    let cpu_storage = device.dtoh_sync_copy(&sub_slice).w()?;
    return from_cpu_storage_single(&CpuStorage::F32(cpu_storage));
}
...
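
For comparison, the problematic pattern described above looks roughly like the following sketch (an assumed reconstruction for illustration, not the exact pre-fix upstream code): passing the CudaSlice<T> itself to dtoh_sync_copy transfers the whole backing buffer just to read one element.

...
CudaStorageSlice::F32(slice) => {
    let device = slice.device();
    // Copies the ENTIRE underlying buffer to host memory (about 400 MB for
    // the 10000 x 10000 f32 tensor benchmarked below), even though only the
    // element at `offset` is needed.
    let cpu_storage = device.dtoh_sync_copy(slice).w()?;
    // The single value is then extracted from the full Vec on the CPU side.
    ...
}
...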

Fix #2239 (Tensor::to_scalar very high latency)

@RoggeOhta
Author

All tests pass with and without the cuda feature in my local environment.

@EricLBuehler
Member

@RoggeOhta do you measure a performance increase when you run with CUDA_LAUNCH_BLOCKING=1?

@RoggeOhta
Author

RoggeOhta commented Jun 3, 2024

@RoggeOhta do you measure a performance increase when you run with CUDA_LAUNCH_BLOCKING=1?

Output of the minimal reproducible example from the linked issue.
All builds use the release profile.
This is a quick test rather than a rigorous benchmark; it only demonstrates the order-of-magnitude change.
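
For context, CUDA_LAUNCH_BLOCKING=1 forces CUDA kernel launches to execute synchronously, so the host-side timers below measure completed GPU work rather than asynchronous launches. The timings come from a harness along these lines (a hypothetical sketch of the linked issue's example, not its exact code; names like cpu_t and gpu_t are my own, and the dbg! file/line in the output will differ):

use std::time::Instant;

use candle_core::{Device, IndexOp, Tensor};

fn main() -> candle_core::Result<()> {
    let device = Device::new_cuda(0)?;

    // Build a large tensor on the CPU.
    let start = Instant::now();
    let cpu_t = Tensor::rand(0f32, 1f32, (10000, 10000), &Device::Cpu)?;
    println!("Tensor on CPU: {cpu_t:?}");
    println!("{:?}", start.elapsed());

    // Move it to the GPU.
    let start = Instant::now();
    let gpu_t = cpu_t.to_device(&device)?;
    println!("Tensor on GPU: {:?}", gpu_t.layout());
    println!("{:?}", start.elapsed());

    // to_scalar requires a rank-0 tensor, so index out a single element first.
    let start = Instant::now();
    let scalar = gpu_t.i((0, 0))?.to_scalar::<f32>()?;
    println!("{:?}  # to_scalar", start.elapsed());
    dbg!(scalar);
    Ok(())
}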

Before fix with CUDA_LAUNCH_BLOCKING=1

Tensor on CPU: Tensor[dims 10000, 10000; f32]
117.986679ms
Tensor on GPU: Layout { shape: [10000, 10000], stride: [10000, 1], start_offset: 0 }
4.219µs
259.142736ms  # to_scalar
[candle-test/src/main.rs:26:5] scalar = 0.09058428

Before fix without CUDA_LAUNCH_BLOCKING=1

Tensor on CPU: Tensor[dims 10000, 10000; f32]
92.308598ms
Tensor on GPU: Layout { shape: [10000, 10000], stride: [10000, 1], start_offset: 0 }
7.003µs
292.98987ms  # to_scalar
[candle-test/src/main.rs:26:5] scalar = 0.630551

After fix with CUDA_LAUNCH_BLOCKING=1

Tensor on CPU: Tensor[dims 10000, 10000; f32]
92.458806ms
Tensor on GPU: Layout { shape: [10000, 10000], stride: [10000, 1], start_offset: 0 }
2.835µs
37.574µs # to_scalar
[candle-test/src/main.rs:26:5] scalar = 0.23715067

After fix without CUDA_LAUNCH_BLOCKING=1

Tensor on CPU: Tensor[dims 10000, 10000; f32]
159.794325ms
Tensor on GPU: Layout { shape: [10000, 10000], stride: [10000, 1], start_offset: 0 }
10.94µs
24.328µs # to_scalar
[candle-test/src/main.rs:26:5] scalar = 0.076687455
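
In both configurations, the to_scalar call drops from roughly 260-290 ms before the fix to under 40 µs after it, consistent with copying a single f32 instead of the full 10000 x 10000 buffer (about 400 MB).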
