Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metal GPU profiling #1591

Merged
merged 44 commits into from
Dec 10, 2024
Merged

Metal GPU profiling #1591

merged 44 commits into from
Dec 10, 2024

Conversation

LouisChourakiSonos
Copy link
Collaborator

Most Metal model operators currently run on the Metal GPU. However, the current profiler only shows the CPU timespan for operations, which does not provide an accurate insight into chip performance due to the asynchronous nature of the process (i.e Sync operators are shown as the most consuming ones).

The PR enables Metal GPU profiling using metal-rs API.
It also updates the CLI to nicely print GPU performances (or more generally accelerator performances if any) next to CPU ones

session_handler.before_plan_eval(&mut state.session_state)?;

let start = crate::time::now();
while iters < bench_limits.max_loops && start.elapsed() < bench_limits.max_time {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the start statement must be before session_handler before plan eval and after plan eval. MetalSessionHandler doesn't need to be profiled because it is a step shared between iterations.

let counter_sample_buffer =
device.new_counter_sample_buffer_with_descriptor(&counter_sample_buffer_desc).unwrap();

handle_compute_pass_sample_buffer_attachment(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why having this function ? the function body is short and we will have at the same place all the necessary setup code

if let Some(profiler) = &self.profiler {
let mut profiler = profiler.borrow_mut();

let destination_buffer = profiler.device.new_buffer(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating new buffer inside each encode maybe slow down the overall profile. Could we know in this code the number of nodes ? If it is the case, we can replace the HashMap with a Vec and preallocate all destination buffer for each node in advance. On top of it having a vec is faster than a HashMap to update the data for each node


pub fn profile<EvalCallback>(
&self,
eval: EvalCallback,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could had a profile capacity which is more or less the number of nodes in the graph. This will let us preallocate all destination buffers in advance and only create when the capacity is reached

}

pub fn add_buffer(&mut self, buffer: Buffer) {
let current_node_id = &self.current_node_id.unwrap();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unwrap should be avoid. We could return an Error instead with TractResult:

self.current_node_id.ok_or_else(|| anyhow!("Metal profile doesn't have any current node id which is unexpected while adding the sampling buffer"))?;

buffer.contents() as *const u64,
NUM_SAMPLES as usize,
);
node_duration_ns += slice[1] - slice[0];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the sampling is a cycle number or a true duration ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a true duration in ns

@hdlj
Copy link
Collaborator

hdlj commented Dec 6, 2024

Thanks @LouisChourakiSonos for the PR. Could you add the metal session handler support in the bench subcommand line also ?

@LouisChourakiSonos LouisChourakiSonos force-pushed the metal-gpu-profiling branch 2 times, most recently from 0345b86 to bffae70 Compare December 6, 2024 13:21
cli/src/bench.rs Outdated
let mut state = SimpleState::new(plan)?;
let mut state = {
#[cfg(any(target_os = "macos", target_os = "ios"))]
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this only happen with --metal ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Creating a MetalSessionHandler from a non-Metal plan is not really an issue, as it will just do nothing and the model will execute as normal. But for user readability I'll add a check on the flag

@kali kali merged commit 43cc3af into main Dec 10, 2024
68 checks passed
@kali kali deleted the metal-gpu-profiling branch December 12, 2024 08:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants