-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metal GPU profiling #1591
Metal GPU profiling #1591
Conversation
libcli/src/profile.rs
Outdated
session_handler.before_plan_eval(&mut state.session_state)?; | ||
|
||
let start = crate::time::now(); | ||
while iters < bench_limits.max_loops && start.elapsed() < bench_limits.max_time { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the start statement must be before session_handler before plan eval and after plan eval. MetalSessionHandler doesn't need to be profiled because it is a step shared between iterations.
metal/src/command_buffer.rs
Outdated
let counter_sample_buffer = | ||
device.new_counter_sample_buffer_with_descriptor(&counter_sample_buffer_desc).unwrap(); | ||
|
||
handle_compute_pass_sample_buffer_attachment( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why having this function ? the function body is short and we will have at the same place all the necessary setup code
metal/src/command_buffer.rs
Outdated
if let Some(profiler) = &self.profiler { | ||
let mut profiler = profiler.borrow_mut(); | ||
|
||
let destination_buffer = profiler.device.new_buffer( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating new buffer inside each encode maybe slow down the overall profile. Could we know in this code the number of nodes ? If it is the case, we can replace the HashMap with a Vec and preallocate all destination buffer for each node in advance. On top of it having a vec is faster than a HashMap to update the data for each node
|
||
pub fn profile<EvalCallback>( | ||
&self, | ||
eval: EvalCallback, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could had a profile capacity which is more or less the number of nodes in the graph. This will let us preallocate all destination buffers in advance and only create when the capacity is reached
metal/src/command_buffer.rs
Outdated
} | ||
|
||
pub fn add_buffer(&mut self, buffer: Buffer) { | ||
let current_node_id = &self.current_node_id.unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unwrap should be avoid. We could return an Error instead with TractResult:
self.current_node_id.ok_or_else(|| anyhow!("Metal profile doesn't have any current node id which is unexpected while adding the sampling buffer"))?;
buffer.contents() as *const u64, | ||
NUM_SAMPLES as usize, | ||
); | ||
node_duration_ns += slice[1] - slice[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the sampling is a cycle number or a true duration ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a true duration in ns
Thanks @LouisChourakiSonos for the PR. Could you add the metal session handler support in the bench subcommand line also ? |
0345b86
to
bffae70
Compare
c714315
to
806925c
Compare
cli/src/bench.rs
Outdated
let mut state = SimpleState::new(plan)?; | ||
let mut state = { | ||
#[cfg(any(target_os = "macos", target_os = "ios"))] | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this only happen with --metal
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Creating a MetalSessionHandler from a non-Metal plan is not really an issue, as it will just do nothing and the model will execute as normal. But for user readability I'll add a check on the flag
Most Metal model operators currently run on the Metal GPU. However, the current profiler only shows the CPU timespan for operations, which does not provide an accurate insight into chip performance due to the asynchronous nature of the process (i.e Sync operators are shown as the most consuming ones).
The PR enables Metal GPU profiling using metal-rs API.
It also updates the CLI to nicely print GPU performances (or more generally accelerator performances if any) next to CPU ones