Mo/8223 fd2 dispatch core profiler support #8609

Merged
merged 1 commit into from
Jun 5, 2024


@mo-tenstorrent mo-tenstorrent commented May 17, 2024

This PR adds profiling support for dispatch cores.

Both cq_prefetch and cq_dispatch can now be profiled, including a stack of parent and child functions.
The DeviceZoneScopedND( name , nocBuffer, nocIndex ) macro is dedicated to dispatch core profiling. The NOC buffer and
index are globals in the dispatch and prefetch kernels and must be passed to the macro.
e.g.
Screenshot 2024-06-05 at 9 15 37 AM
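
For readers unfamiliar with scoped profiling zones, here is a minimal host-side sketch of the general idea: a guard object records a begin marker when it is constructed and an end marker when it goes out of scope, so nested scopes produce a parent/child stack. This is not the implementation of DeviceZoneScopedND. `ZoneLog`, `ScopedZone`, `run_one_iteration`, and the zone names are all hypothetical, and the NOC buffer and index arguments are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical host-side stand-in for the profiler's marker storage.
struct Marker {
    std::string name;
    bool is_begin;   // true = zone entered, false = zone left
    uint32_t depth;  // nesting depth the zone lives at
};

struct ZoneLog {
    std::vector<Marker> markers;
    uint32_t depth = 0;
};

// RAII guard: begin marker on construction, end marker on destruction.
class ScopedZone {
public:
    ScopedZone(ZoneLog& log, const char* name) : log_(log), name_(name) {
        log_.markers.push_back({name_, true, log_.depth});
        ++log_.depth;
    }
    ~ScopedZone() {
        assert(log_.depth > 0);
        --log_.depth;
        log_.markers.push_back({name_, false, log_.depth});
    }
    ScopedZone(const ScopedZone&) = delete;
    ScopedZone& operator=(const ScopedZone&) = delete;

private:
    ZoneLog& log_;
    std::string name_;
};

// A parent zone around one loop iteration with a child zone inside it.
// Both zone names are made up for illustration.
inline void run_one_iteration(ZoneLog& log) {
    ScopedZone parent(log, "DISPATCH-LOOP");
    {
        ScopedZone child(log, "PROCESS-CMD");
    }
}
```

Calling `run_one_iteration` once leaves four markers in the log: parent begin, child begin, child end, parent end.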

The main while loops of the prefetcher and dispatcher are instrumented with the profiling macro.

Dispatch profiling is disabled by default to avoid the overhead. To enable it, set the environment variable TT_METAL_DEVICE_PROFILER_DISPATCH=1.
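
As a rough illustration of an opt-in flag of this kind, the sketch below reads the variable on the host with `std::getenv` and treats only the exact value `1` as enabled. The function name `dispatch_profiling_enabled` is hypothetical, and how tt-metal actually parses the variable is not shown in this PR description, so the strict comparison is an assumption.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: true only when the variable is set to exactly "1".
// An unset variable, or any other value, leaves dispatch profiling off.
inline bool dispatch_profiling_enabled() {
    const char* value = std::getenv("TT_METAL_DEVICE_PROFILER_DISPATCH");
    return value != nullptr && std::strcmp(value, "1") == 0;
}
```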

Because dispatch cores have much more activity, their profiling overhead can add up and slow the entire model run down.

The dispatch kernel now runs on NCRISC, which made it necessary to support the profiler push to DRAM on NCRISC as well.

For much more efficient use of the NOC, quick_send was introduced. It pushes L1 data to DRAM when the profiler L1 buffer is full. This allows about 100 iterations of the dispatch loops to run before a costly L1 to DRAM NOC transaction is needed.
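
The flush-when-full pattern can be modelled on the host as below. This is only a sketch of the batching idea, not the quick_send code: `BufferedProfiler` is a hypothetical class, plain vectors stand in for L1 and DRAM, and the capacity of 200 markers used in the note after the block is an illustrative assumption chosen so that two markers per iteration gives one flush per 100 iterations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: a small "L1" buffer that is copied to a larger
// "DRAM" buffer only when it fills up.
class BufferedProfiler {
public:
    explicit BufferedProfiler(std::size_t l1_capacity) : l1_capacity_(l1_capacity) {
        assert(l1_capacity_ > 0);
        l1_.reserve(l1_capacity_);
    }

    // Append one marker; if L1 is already full, flush it first.
    void record(uint32_t marker) {
        if (l1_.size() == l1_capacity_) {
            flush();
        }
        l1_.push_back(marker);
    }

    // Copy everything in L1 to DRAM in one batch (the costly step).
    void flush() {
        if (l1_.empty()) {
            return;
        }
        dram_.insert(dram_.end(), l1_.begin(), l1_.end());
        l1_.clear();
        ++flush_count_;
    }

    std::size_t flush_count() const { return flush_count_; }
    const std::vector<uint32_t>& dram() const { return dram_; }

private:
    std::size_t l1_capacity_;
    std::vector<uint32_t> l1_;
    std::vector<uint32_t> dram_;
    std::size_t flush_count_ = 0;
};
```

With a capacity of 200 and two markers recorded per loop iteration, 250 iterations trigger only two flushes, and a final explicit `flush()` drains the remaining 100 markers.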

quick_send is marked in red in Tracy, as shown below:
Screenshot 2024-06-04 at 9 37 34 AM

Green CI

Post Commit: https://github.com/tenstorrent/tt-metal/actions/runs/9375214697
T3K Profiler: https://github.com/tenstorrent/tt-metal/actions/runs/9386432350
Device Perf: https://github.com/tenstorrent/tt-metal/actions/runs/9375217111
uBenchmark: https://github.com/tenstorrent/tt-metal/actions/runs/9375219316

@mo-tenstorrent mo-tenstorrent force-pushed the mo/8223_FD2_dispatch_core_profiler_support_2 branch 4 times, most recently from 49ec429 to 5a43477 Compare May 28, 2024 22:32

@TT-BrianLiu TT-BrianLiu left a comment

Minor comments.

Dispatch kernels can be profiled using the
DeviceZoneScopedND( name , nocBuffer, nocIndex ) macro. The NOC buffer and
index are globals in the dispatch and prefetch kernels.

Dispatch profiling is disabled by default to avoid the overhead. It is
enabled by env var `TT_METAL_DEVICE_PROFILER_DISPATCH=1`
@mo-tenstorrent mo-tenstorrent force-pushed the mo/8223_FD2_dispatch_core_profiler_support_2 branch from 809f5c5 to 181458d Compare June 5, 2024 16:06
@mo-tenstorrent mo-tenstorrent merged commit dbee4ba into main Jun 5, 2024
5 checks passed
Labels
tracy profiler Feature and bugs related to tracy profiler