PAPI connector: per thread metrics #94
@keichi, yeah, all the tricks I can think of would require some pretty invasive changes to Kokkos to make this work. As a workaround, you can of course use a sampling tool that has PAPI support, but I acknowledge that's an unsatisfying answer. I'm going to file an issue against the main Kokkos repo to talk about adding this support, but it is likely to be a major effort. This is a really interesting problem, thanks for bringing it to our attention.
A scheme that could potentially work, but would take a bit of effort: use gotcha to wrap pthread start, and use that wrapper to set an alarm which only gets handled on that thread. When the alarm is delivered, the handler does a PAPI read of the current counters and stores the values in a global location. When the master thread updates, it just reads the per-thread array, which holds the latest value from the last time the alarm was delivered on each thread.
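To make the per-thread array idea concrete, here is a rough sketch of just the publish/collect part, not a working connector. Everything named here is hypothetical: `read_counter_stub` stands in for the real PAPI read, the periodic `publish` call inside the loop stands in for the alarm handler, and `kMaxThreads` is an arbitrary bound. It uses plain `std::thread` so it compiles without PAPI or Kokkos.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

constexpr int kMaxThreads = 64;  // arbitrary upper bound for this sketch

// One padded slot per thread so that writers do not share a cache line.
struct alignas(64) Slot {
  std::atomic<long long> value{0};
};

std::array<Slot, kMaxThreads> g_slots;  // the "global location"

// Hypothetical stand-in for reading a hardware counter on the calling thread.
long long read_counter_stub(long long work_done) { return work_done; }

// What the per-thread alarm handler would do: store the latest reading.
void publish(int tid, long long v) {
  g_slots[tid].value.store(v, std::memory_order_relaxed);
}

// What the master thread would do: read every thread's latest value.
long long collect(int nthreads) {
  long long total = 0;
  for (int t = 0; t < nthreads; ++t)
    total += g_slots[t].value.load(std::memory_order_relaxed);
  return total;
}

long long run_demo(int nthreads, long long iters) {
  std::vector<std::thread> pool;
  for (int t = 0; t < nthreads; ++t) {
    pool.emplace_back([t, iters] {
      long long local = 0;
      for (long long i = 0; i < iters; ++i) {
        ++local;  // pretend each iteration is one counted event
        if (i % 1000 == 0) publish(t, read_counter_stub(local));  // "alarm"
      }
      publish(t, read_counter_stub(local));  // final reading before exit
    });
  }
  for (auto& th : pool) th.join();
  return collect(nthreads);
}
```

Calling `run_demo(4, 10000)` sums the final readings of four workers. In a real connector the values read between alarms would lag behind the true counters, which is the trade-off of this scheme.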
Or actually, forget the gotcha. Kokkos could add a tool initialization call on each thread that it initializes.
Thank you for looking into this.
To me, this approach seems relatively simple and noninvasive. All existing Kokkos profiling hooks are backend-agnostic, so changing that might be a problem?
Well, we wouldn't call the hooks from CUDA threads. But calling the thread initialization routine from each thread in the pthread backend when we create the pool, and from each thread in the OpenMP CPU threading backend, shouldn't be invasive at all. It's not an overhead issue either: this would be a routine that only gets called once on a thread, after Kokkos initialize. Anything more would open us up to performance degradation, and the alarm scheme would provide a relatively easy way to do it compared with how you'd have to do it otherwise (intercepting the thread creation function call).
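For illustration, the once-per-thread call could look roughly like the sketch below. This is not an existing Kokkos hook: `g_thread_init_hook`, `example_tool_thread_init` and `spawn_pool` are made-up names, and a plain `std::thread` pool stands in for the pthread/OpenMP backends.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical per-thread tool callback; null when no tool is loaded.
using ThreadInitHook = void (*)(int thread_id);
ThreadInitHook g_thread_init_hook = nullptr;

// Example tool-side callback. A real tool would set up its per-thread
// counter state here; this one only counts how often it was called.
std::atomic<int> g_init_calls{0};
void example_tool_thread_init(int /*thread_id*/) {
  g_init_calls.fetch_add(1, std::memory_order_relaxed);
}

// Backend-side sketch: each worker invokes the hook exactly once, right
// after it starts and before it would enter its work loop.
int spawn_pool(int nthreads) {
  g_init_calls.store(0, std::memory_order_relaxed);
  std::vector<std::thread> pool;
  for (int t = 0; t < nthreads; ++t) {
    pool.emplace_back([t] {
      if (g_thread_init_hook) g_thread_init_hook(t);
      // ... the worker loop of the backend would run here ...
    });
  }
  for (auto& th : pool) th.join();
  return g_init_calls.load(std::memory_order_relaxed);
}
```

With `g_thread_init_hook = example_tool_thread_init;`, `spawn_pool(8)` runs the callback once per worker. The cost is one indirect call per thread at pool creation, nothing per parallel region.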
I noticed that the PAPI connector records the events only on the master thread. This is because PAPI performance counters are thread-local and PAPI_hl_region_begin/PAPI_hl_region_end are called from the master thread only.

It would be nice if the PAPI connector could record the performance counters on all threads. However, I'm guessing this needs changes in Kokkos itself to call the profiling hooks from inside the parallel regions.
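The effect can be mimicked without PAPI. The sketch below is only an analogy: a `thread_local` variable (`t_events`, a made-up name) plays the role of a per-thread hardware counter, and the master thread reads its own copy before and after the workers run, like a begin/end pair called from the master only.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Stand-in for a thread-local hardware counter: every thread has its own copy.
thread_local long long t_events = 0;

struct Observation {
  long long seen_by_master;  // delta measured on the master thread only
  long long actual_total;    // sum of what the workers really counted
};

Observation measure_from_master(int nworkers, long long events_per_worker) {
  std::vector<long long> per_worker(nworkers, 0);
  const long long before = t_events;  // "region begin" on the master

  std::vector<std::thread> pool;
  for (int w = 0; w < nworkers; ++w) {
    pool.emplace_back([w, events_per_worker, &per_worker] {
      for (long long i = 0; i < events_per_worker; ++i) ++t_events;
      per_worker[w] = t_events;  // each worker reports its own counter
    });
  }
  for (auto& th : pool) th.join();

  const long long after = t_events;  // "region end" on the master
  long long total = 0;
  for (long long v : per_worker) total += v;
  return {after - before, total};
}
```

The master's delta stays at zero although the workers counted events, which matches the behaviour described above: the work done on other threads is invisible unless each thread reads its own counters.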