Metal GPU profiling #1591

LouisChourakiSonos · 2024-12-03T13:49:24Z

Most Metal model operators currently run on the Metal GPU. However, the current profiler only shows the CPU timespan of each operation, which does not give accurate insight into chip performance because execution is asynchronous (i.e. Sync operators show up as the most time-consuming ones).

The PR enables Metal GPU profiling using the metal-rs API.
It also updates the CLI to print GPU timings (or, more generally, accelerator timings if any) next to the CPU ones.

hdlj · 2024-12-06T08:22:00Z

libcli/src/profile.rs

+            session_handler.before_plan_eval(&mut state.session_state)?;
+
+            let start = crate::time::now();
+            while iters < bench_limits.max_loops && start.elapsed() < bench_limits.max_time {


The start statement should be placed so that the session handler's before_plan_eval and after_plan_eval calls fall outside the timed region. MetalSessionHandler doesn't need to be profiled because it is a step shared between iterations.
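
One way to read this comment is as a general timing pattern: run the shared setup once, then start the clock. The sketch below is only an illustration under that reading. `BenchLimits` and `timed_loop` are hypothetical names, and `std::time::Instant` stands in for the `crate::time::now()` helper seen in the diff.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the bench limits used in the diff.
pub struct BenchLimits {
    pub max_loops: usize,
    pub max_time: Duration,
}

/// Runs `setup` once, then times only the repeated `step` calls.
/// Returns the number of iterations and the time spent inside the loop.
pub fn timed_loop<S, F>(limits: &BenchLimits, setup: S, mut step: F) -> (usize, Duration)
where
    S: FnOnce(),
    F: FnMut(),
{
    // One-off work shared by all iterations stays outside the timed region.
    setup();

    // The clock starts only after the shared setup has completed.
    let start = Instant::now();
    let mut iters = 0;
    while iters < limits.max_loops && start.elapsed() < limits.max_time {
        step();
        iters += 1;
    }
    (iters, start.elapsed())
}
```

Any teardown shared between iterations would go after the final `start.elapsed()` read, for the same reason.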

hdlj · 2024-12-06T08:25:46Z

metal/src/command_buffer.rs

+        let counter_sample_buffer =
+            device.new_counter_sample_buffer_with_descriptor(&counter_sample_buffer_desc).unwrap();
+
+        handle_compute_pass_sample_buffer_attachment(


Why have this function? The function body is short, and inlining it would keep all the necessary setup code in one place.

hdlj · 2024-12-06T08:30:38Z

metal/src/command_buffer.rs

+        if let Some(profiler) = &self.profiler {
+            let mut profiler = profiler.borrow_mut();
+
+            let destination_buffer = profiler.device.new_buffer(


Creating a new buffer inside each encode may slow down the overall profile. Could we know the number of nodes in this code? If so, we can replace the HashMap with a Vec and preallocate the destination buffers for all nodes in advance. On top of that, updating the data for each node is faster with a Vec than with a HashMap.

hdlj · 2024-12-06T08:33:32Z

metal/src/context.rs

+
+    pub fn profile<EvalCallback>(
+        &self,
+        eval: EvalCallback,


We could add a profile capacity, which is more or less the number of nodes in the graph. This would let us preallocate all destination buffers in advance and only create new ones when the capacity is reached.
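
A minimal sketch of the preallocation idea, assuming node ids are dense indices starting at 0. `NodeSamples` and its methods are hypothetical, and plain `u64` durations stand in for the Metal destination buffers that the real profiler holds.

```rust
/// Hypothetical per-node storage for profiling samples, indexed by node id.
pub struct NodeSamples {
    per_node: Vec<Vec<u64>>,
}

impl NodeSamples {
    /// Preallocates one slot per node so the hot path needs no hashing.
    pub fn with_capacity(node_count: usize) -> Self {
        Self { per_node: vec![Vec::new(); node_count] }
    }

    /// Records a duration for `node_id`, growing the storage only when the
    /// initial capacity turns out to be too small.
    pub fn record(&mut self, node_id: usize, duration_ns: u64) {
        if node_id >= self.per_node.len() {
            self.per_node.resize_with(node_id + 1, Vec::new);
        }
        self.per_node[node_id].push(duration_ns);
    }

    /// Total recorded time for a node, or 0 if the node was never seen.
    pub fn total_ns(&self, node_id: usize) -> u64 {
        self.per_node.get(node_id).map_or(0, |v| v.iter().sum::<u64>())
    }

    /// Number of node slots currently allocated.
    pub fn slots(&self) -> usize {
        self.per_node.len()
    }
}
```

Indexing by node id replaces a hash lookup with a bounds check, which is the speed-up the review comment points at. The growth branch covers the case where the capacity estimate is "more or less" right rather than exact.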

hdlj · 2024-12-06T09:02:12Z

metal/src/command_buffer.rs

+    }
+
+    pub fn add_buffer(&mut self, buffer: Buffer) {
+        let current_node_id = &self.current_node_id.unwrap();


unwrap should be avoided. We could return an error instead with TractResult:

self.current_node_id.ok_or_else(|| anyhow!("Metal profile doesn't have any current node id which is unexpected while adding the sampling buffer"))?;
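
To show the suggestion in a compilable form, here is a stripped-down sketch. The `Profiler` struct and `current_node` method are hypothetical, and a `String` error stands in for the anyhow-based error behind TractResult so that the example needs no external crate.

```rust
/// Hypothetical, stripped-down profiler holding only the current node id.
pub struct Profiler {
    pub current_node_id: Option<usize>,
}

impl Profiler {
    /// Returns the current node id, or an error instead of panicking.
    /// `String` is a stand-in for the error type used with TractResult.
    pub fn current_node(&self) -> Result<usize, String> {
        self.current_node_id.ok_or_else(|| {
            "Metal profiler has no current node id while adding the sampling buffer".to_string()
        })
    }
}
```

A caller can then propagate the failure with `?` rather than aborting the whole profiling run.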

hdlj · 2024-12-06T09:03:01Z

metal/src/command_buffer.rs

+                    buffer.contents() as *const u64,
+                    NUM_SAMPLES as usize,
+                );
+                node_duration_ns += slice[1] - slice[0];


Is the sample a cycle count or a true duration?

It is a true duration, in nanoseconds.
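
Since the samples are nanosecond timestamps, a node's GPU time is the sum of end minus start over its sample pairs. The sketch below assumes the samples arrive as consecutive (start, end) pairs in a plain slice. `total_duration_ns` is a hypothetical helper, not the PR's code, and it uses checked arithmetic where the diff uses a bare subtraction.

```rust
/// Sums `end - start` over consecutive (start, end) timestamp pairs, in ns.
/// Returns `None` if a pair is incomplete, an end precedes its start,
/// or the total overflows a `u64`.
pub fn total_duration_ns(samples: &[u64]) -> Option<u64> {
    if samples.len() % 2 != 0 {
        return None;
    }
    samples.chunks_exact(2).try_fold(0u64, |acc, pair| {
        // pair[0] is the start timestamp, pair[1] the end timestamp.
        let elapsed = pair[1].checked_sub(pair[0])?;
        acc.checked_add(elapsed)
    })
}
```

For example, the pairs (100, 250) and (300, 320) add up to 150 + 20 = 170 ns.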

hdlj · 2024-12-06T09:04:38Z

Thanks @LouisChourakiSonos for the PR. Could you also add Metal session handler support to the bench subcommand?

kali · 2024-12-10T07:54:01Z

cli/src/bench.rs

-    let mut state = SimpleState::new(plan)?;
+    let mut state = {
+        #[cfg(any(target_os = "macos", target_os = "ios"))]
+        {


Should this only happen with --metal?

Good catch! Creating a MetalSessionHandler from a non-Metal plan is not really an issue, as it will just do nothing and the model will execute as normal. But for user readability I'll add a check on the flag.

LouisChourakiSonos requested review from kali and hdlj December 3, 2024 13:49

hdlj reviewed Dec 6, 2024

LouisChourakiSonos force-pushed the metal-gpu-profiling branch 2 times, most recently from 0345b86 to bffae70 Compare December 6, 2024 13:21

LouisChourakiSonos and others added 24 commits December 9, 2024 12:08

Initial commit for Metal GPU profiling

a1f2ca7

Scratch Graph-level profiling

b8095a6

Fixed rebase on bin_ops

eea129b

Metal graph profiling

bafcf46

Removed useless imports/mut

c8089e6

code cleaning

0b14410

code refactoring

a996a4e

More refactoring

b20897f

set back profile loop condition

0802b9b

minimized diffs and clippy warnings

8032b14

Use memory pools for profiling

eb65855

Fix bad variable name

b07b552

Renamed TractCommandBuffer

4763167

moved notify_node to metal_eval

6a90391

print both CPU and accelerator profiling (if an accelerator is present)

d67c8d5

format render_summaries

50de642

Improved profile render

f75312b

fix clippy

a867ef3

Hide metal reference in profile.rs if not on macos

365ad94

Fix cfg blocks

2b1a07c

Fix CPU duration not sorted for CPU only profiling

ca8388c

Mem plan only once before profiling + Print CPU-Accelerator time diff

a85cf29

fixed potentially unintialized variable

71118e1

Bail for metal profiling on non-metal devices

86fc052

mathieupoumeyrolsonos and others added 13 commits December 9, 2024 12:08

more robust tensor extraction from command line

588f045

improve error reporting in costs

3dfa451

curly brackets around opaque facts

aeeaad7

use fake french quote marks instead of curly

7a98bf9

use magnifying glass emoji for opaque facts

a9c8bcd

Delete function with short body

bc87c24

clarified error

64086fd

updated libcli

2ca87cc

fixed ok_or_else

6b90ad8

Replace HashMap with Vec<Vec<>>

0c3a2f4

fixed wrong displayed ratio for accel ops

2c18740

do not count profiler init in total profile time

66c131e

formatting

806925c

LouisChourakiSonos force-pushed the metal-gpu-profiling branch from c714315 to 806925c Compare December 9, 2024 11:12

LouisChourakiSonos added 6 commits December 9, 2024 16:29

Put profile inside autorelease pool

7c80e47

format

5d1fd64

resolve symbols before profiling for metal

83edf43

add mem pool mgmt to bench command

c33b266

fixed non-Metal compilation

ccdd234

fixed forgotten not

fffe297

kali reviewed Dec 10, 2024

Checks for metal before mem pool optimization for bench

2edc0f5

LouisChourakiSonos requested a review from kali December 10, 2024 09:22

kali merged commit 43cc3af into main Dec 10, 2024
68 checks passed

kali deleted the metal-gpu-profiling branch December 12, 2024 08:12
