Performance degradation when running HTR #1652

Closed
Tracked by #1032
mariodirenzo opened this issue Mar 16, 2024 · 43 comments

@mariodirenzo

As mentioned in a previous Legion meeting, we are seeing a significant degradation of performance when running HTR.
We are comparing the performance of an old commit (cba415a) to the performance of a new commit (91b55ce) when weak scaling HTR.
Note that we are using the same version of HTR to perform this test.

For instance, we are reporting here the profiles obtained on 8 nodes with the old commit
and with the new commit.

In these profiles we see a minor degradation of GPU usage (which translates into roughly a 10% increase in the time per step) but, more importantly, a very large section at the beginning of the run in which nothing seems to be happening.
This "idle" time at the beginning seems to grow with some power of the number of nodes used in the calculation.

Adding @lightsighter and @seemamirch for visibility

@elliottslaughter, can you add this issue to #1032 ?
Is it too late to insert this issue in the list tracked for the March release?

@lightsighter
Contributor

Something bad is happening in the network. Please regenerate the profiles with the Rust profiler. It is probably too late for this to make the March release.
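For reference, a minimal sketch of a run that produces logs for the Rust profiler, assuming the usual -lg:prof flags; the application name is a placeholder and the exact post-processing command should be checked against legion_prof --help:

# Profile all 8 ranks; Legion expands '%' to the rank number in the log name.
srun -N 8 ./my_htr_app -lg:prof 8 -lg:prof_logfile prof_%.gz

# Afterwards, feed prof_*.gz to the Rust legion_prof binary built from the
# same Legion commit as the run (see legion_prof --help for the exact usage).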

@albovoci

@lightsighter
Contributor

These profiles were not generated with the most recent version of the profiler. You either need to regenerate them or provide the precise Legion commit hash that you used to generate them.

@albovoci

Legion commit hash from 30 January 2023: cba415a857c2586b2ad2f4848d6d1cd75de7df00
Legion commit hash from 28 January 2024: 91b55ce217933ec6cd546c0731a6652031dfc7e9

@lightsighter
Contributor

Those are both going to be too old to show the information we need. Please generate a slow profile from the most recent Legion master branch.

@lightsighter
Contributor

Also, please make sure all profiles are generated with the same version of GASNetEx.

@albovoci

I'm assuming the slow profile is generated using the Python script. I don't exactly know how to make sure the runs use the same version of GASNetEx; any hints on that? Thanks.

@lightsighter
Contributor

I'm saying you need to generate log files from a run using the most recent Legion master branch as they will have considerably more data about the timing of various operations.

I don't exactly know how to make sure they're generated using the same version of GASNetEx; any hints on that?

Build and link against this version of GASNetEx from here. I'm not sure which build system you are using, so it is up to you to integrate that into your build system.
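As a rough sketch of one way to pin both builds to the same GASNetEx, assuming a CMake build of Legion against an already-installed GASNetEx prefix; the paths are hypothetical and the option names should be double-checked against the Legion CMake documentation:

#!/bin/bash
# Hypothetical install prefix for the GASNetEx used by both runs.
GASNET_PREFIX=$HOME/sw/gasnetex

# Configure each Legion checkout (old and new) against the same
# external GASNetEx so the network layer is identical across runs.
cmake -S legion -B build \
  -DLegion_NETWORKS=gasnetex \
  -DGASNet_ROOT_DIR="$GASNET_PREFIX" \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build -j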

@elliottslaughter
Contributor

If you use setup_env.py this version will get pulled in automatically when you build the latest Legion.

@albovoci

Thanks @elliottslaughter for the tip 👍🏼
@lightsighter I've uploaded the profiles in the same directory.
Both Legion versions now use the GASNetEx you advised.
The log files corresponding to the run that takes too long to get started use the most recent Legion master branch.

@lightsighter
Contributor

@elliottslaughter What happened to the dependent partitioning channel in the profiler?

@lightsighter
Contributor

@albovoci Have you marked all your partitions disjoint/aliased and complete/incomplete?

@elliottslaughter
Contributor

@albovoci

I know the partitions are marked as disjoint/aliased; I'm not so sure about complete/incomplete.
Maybe @mariodirenzo can clarify that better.

@mariodirenzo
Author

No, we do not mark the partitions as complete/incomplete, as we do not know this attribute at compile time; it depends on the input of the calculation.

@lightsighter
Contributor

Run with -ll:force_kthreads, capture on the order of ten periodic backtraces of all the threads on all the nodes during the first minute of the run, and report them here.

@albovoci

Sorry, I'm kind of new to this :/
How do I "capture on the order of ten periodic backtraces of all the threads on all the nodes"?

@elliottslaughter
Contributor

Here's an example script for you:

#!/bin/bash

set -e

"$@" &
pid=$!

# Take ~10 snapshots, one every 6 seconds, of all threads in the process.
for i in {1..10}; do
  sleep 6
  gdb -p "$pid" -ex 'set confirm off' -ex 'set height 0' -ex 'set width 0' -ex 'thread apply all backtrace' -ex 'quit'
done

# wait for backtraces to complete and then kill the process
if [[ ${KILL_PROCESS_AFTER_BACKTRACE:-1} = 1 ]]; then
    sleep 2m
    kill $pid
    sleep 1m
    kill -9 $pid
fi

Run it like this:

srun ... ./run_with_backtraces.sh ./my_app ...

You might want to redirect the output of each rank to a separate file so the backtraces don't get interleaved. Modify as necessary for your particular system.
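For example, on a Slurm system one way to keep each rank's output separate might be a thin wrapper like the one below; the script and application names are placeholders, and SLURM_PROCID is set per rank by srun:

#!/bin/bash
# per_rank.sh: give every rank its own output file, then hand off to the
# backtrace wrapper above.
exec ./run_with_backtraces.sh "$@" > "backtraces_rank_${SLURM_PROCID:-0}.out" 2>&1

and then launch with: srun ... ./per_rank.sh ./my_app ...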

@seemamirch
Contributor

@albovoci - which version of HTR are you using to reproduce this?

@albovoci

@seemamirch It's this commit: 2198880d

@seemamirch
Contributor

Backtraces (10 per run with 1 every 10 seconds) are on sapling /scratch2/seemah/htr_issue_1652/results_O2_g
Legion commit hash from 30 January 2023: cba415a -> output_file_old_<num_nodes>.<node#>
Legion commit hash from 28 January 2024: 91b55ce -> output_file_<num_nodes>.<node#>

@lightsighter
Contributor

Can we do these runs without -lg:safe_ctrlrepl? That is known to cause huge amounts of blocking in the top-level task and is messing with the results. Unless of course the original runs were done with -lg:safe_ctrlrepl, in which case the solution is not to use it for performance runs.

@lightsighter
Contributor

If you do re-run, you just need to do the newer commit and not the older one.

@lightsighter
Contributor

@elliottslaughter Why would Regent be attaching semantic tags other than semantic tag 0 (the name tag)? For example,

#12 0x0000200000bad81c in Legion::Internal::Runtime::attach_semantic_information (this=0x194713b0, handle=..., fid=<optimized out>, tag=54321, buffer=0x2031d5824f30, size=4, is_mutable=<optimized out>) at /usr/WS1/mirchandaney1/legion_new/runtime/legion/runtime.cc:20212
#13 0x00002000006ea4fc in Legion::Runtime::attach_semantic_information (this=<optimized out>, handle=..., fid=<optimized out>, tag=<optimized out>, buffer=<optimized out>, size=<optimized out>, is_mut=<optimized out>) at /usr/WS1/mirchandaney1/legion_new/runtime/legion/legion.cc:7386
#14 0x000020000070f920 in legion_field_id_attach_semantic_information (runtime_=..., handle_=..., id=<optimized out>, tag=54321, buffer=0x2031d5824f30, size=4, is_mutable=<optimized out>) at /usr/WS1/mirchandaney1/legion_new/runtime/legion/legion_c.cc:2064
#15 0x000000001018f8b4 in $<workSingle> ()
#16 0x00000000101281b4 in $__regent_task_workSingle_primary ()

Why would there be 54321 semantic tags?

@elliottslaughter
Contributor

@seemamirch
Contributor

@lightsighter
Backtraces (10 per run with 1 every 10 seconds) without -lg:safe_ctrlrepl are on sapling /scratch2/seemah/htr_issue_1652/results_O2_g_no_safe_ctrlrepl.
The performance runs don't have this option.
The 8-node runs failed; the others completed.
I also see warnings such as:
output_file_8.1:[1 - 20406181f8b0] 4.605703 {4}{runtime}: [warning 1114] LEGION WARNING: Failed to find a refinement for KD tree with 3 dimensions and 21 rectangles. Please report your application to the Legion developers' mailing list. (from file /usr/WS1/mirchandaney1/legion_new/runtime/legion/region_tree.inl:6446)

@lightsighter
Contributor

The only major thing that the workSingle task seems to be waiting on is futures. This means that the workSingle task is waiting on at least hundreds if not thousands of futures every single run (as evidenced by all the different waits in the profile), especially at the start. If it was waiting on something else more frequently, that would show up in the backtraces, but I only see a few semantic attach waits, whereas the future wait calls are 10X more prominent. It's still possible that semantic attach operations are the issue and the samples of the backtraces just happen later in the run; it's hard to tell without timestamps. @elliottslaughter can we disable all the semantic attach operations? Can we also see why we might be waiting on lots of futures in Regent tasks?

The 8-node runs failed; the others completed.

I wouldn't worry about any crashes. Even the new branch is two months old and there have been many bug fixes since then.

output_file_8.1:[1 - 20406181f8b0] 4.605703 {4}{runtime}: [warning 1114] LEGION WARNING: Failed to find a refinement for KD tree with 3 dimensions and 21 rectangles. Please report your application to the Legion developers' mailing list. (from file /usr/WS1/mirchandaney1/legion_new/runtime/legion/region_tree.inl:6446)

Mario and I already discussed those and they are the result of an unusual aliased partition that is being made. You don't need to worry about them and they are not going to be responsible for the profiling effects we're observing. They would show up very loudly as long running meta-tasks and would be prominently visible in profiles if they were a problem.

@mariodirenzo
Author

I want to remind here that we are using the same version of HTR to test the old and new versions of Legion. The application issues the same semantic-attach and future-wait operations to Regent and the runtime regardless of which version of the runtime we are using.

can we disable all the semantic attach operations?

We need that semantic information in the mapper to perform our mapping decisions.

Can we also see why we might be waiting on lots of futures in Regent tasks?

We are waiting for some futures in the top-level task and the only workaround is to use predication. Unfortunately, predication is not compatible with tracing, so if we gain some performance on one side, we lose it on the other.

@elliottslaughter
Contributor

For what it's worth, I first added that semantic information in 8398065, June of 2022. So, in the time frame under discussion in this issue I believe the semantic information should be identical (or at least highly similar).

If you do want to run an experiment with this disabled, here are the lines to comment out:

if std.get_type_id(field_type) then
  actions:insert(
    quote
      var type_id : uint32 = [std.get_type_id(field_type)]
      c.legion_field_id_attach_semantic_information(
        [cx.runtime], [fs], field_id, [std.get_type_semantic_tag()],
        &type_id, terralib.sizeof(uint32), false)
    end)
end

@lightsighter
Contributor

The application issues the same semantic-attach and future-wait operations to Regent and the runtime regardless of which version of the runtime we are using.

I understand. I want to remove the semantic attaches to see whether a change in their implementation causes the performance degradation or whether it is something else.

We need that semantic information in the mapper to perform our mapping decisions.

That seems pretty dicey. That's not really what the semantic information interface is for. It's more for logging and debugging, not for correctness. I suppose you can use it for that, but that's not what it is designed for and I can't promise it will be fast enough for all those cases.

We are waiting for some futures in the top-level task and the only workaround is to use predication. Unfortunately, predication is not compatible with tracing so, if we gain some performance on one side, we lose it on the other.

Let's turn off tracing and get rid of the waits and see if the profile is still "empty" (I don't care if it is slow, but I want to see the runtime busy doing all sorts of things). We don't have a good sense of what is causing all the waiting in the worker task right now, so we need to start removing all the sources of blocking on the application side.

For what it's worth, I first added that semantic information in 8398065, June of 2022

I don't care about that specific one. I care about all of the semantic info attached by Regent or by users. I don't think any one piece of semantic info is the cause of the problem, but I want to start removing sources of blocking because the backtrace experiment was inconclusive.

@rohany
Contributor

rohany commented Mar 30, 2024

I told Mike already: I was able to reproduce the performance issues on 8 and 16 nodes of Lassen, but was not able to reproduce the same problems on Perlmutter, indicating that something might be going on inside the network.

@seemamirch
Contributor

@mariodirenzo HTR uses the following two options: -ll:cpu_bgwork 100 -ll:util_bgwork 100 (time-slicing the util/CPU threads).
Without those options, the first iteration completes in 2.9 seconds on 8 nodes and 14 seconds on 16 nodes for the weakScaling test (as per console.txt).
With these options, the first iteration completes in 16.045 seconds on 8 nodes and 79 seconds on 16 nodes.
Both results are for a newer version of Legion (master branch 410d812).
I'm not sure why you have set these options.

@mariodirenzo
Author

Those options used to help performance as CPU and util threads are usually idle during the execution.
Is the performance without the flags in line with the numbers achieved with the old version of Legion?

@seemamirch
Contributor

Yes, and overall the new Legion performance is better. I compared with the same two Legion versions you mentioned in this issue:
new Legion, 16 nodes, 101 iterations, new options -> 140 seconds
old Legion, 16 nodes, 101 iterations, new options -> 168.9 seconds
old Legion, 16 nodes, 101 iterations, old options -> 193 seconds

Background worker (bgwork) threads are already set to 16, so it's not helping to have those options enabled.

@seemamirch
Contributor

Profiles for an HTR test on Lassen, 1 node
-ll:cpu_bgwork 100 -ll:util_bgwork 100 (slow version)
https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seemah/accessor_gpu_128_sched/

without these options (fast version)
https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seemah/accessor_gpu_128_no_sched/

FB to FB (framebuffer-to-framebuffer) copies appear to be taking much longer (n0f2-n0f2 channel).

@elliottslaughter
Contributor

Was there a conclusion on this? Was the issue fully resolved by removing the -ll:cpu_bgwork and -ll:util_bgwork flags?

@seemamirch
Contributor

These flags have been removed from the HTR scripts; @mariodirenzo to confirm performance is now better.
It's unclear, though, why the FB to FB copies are so much slower with these options.

@mariodirenzo
Author

The issue is solved by removing those flags but, as Seema mentioned, the reason is unclear.
I am not particularly attached to those flags, so it is up to you to decide whether this is a satisfactory solution (and close the issue) or whether you want to investigate further what was going on.

@lightsighter
Contributor

@elliottslaughter Did we add some kind of detection for this in Legion Prof? I thought you might have done that but I can't remember.

@elliottslaughter
Contributor

I'm not sure we have a root cause yet, so it does not make sense to add any detection.

My understanding from the above discussion is that HTR was running with -ll:bgwork 16 (i.e., 16 background worker threads) as well as with -ll:cpu_bgwork 100 -ll:util_bgwork 100.

There are two possible hypotheses I think we can form based on this:

  1. The overhead was the result of lock contention that was exacerbated by dramatically increasing the number of threads involved in background work. (Frankly, 16 is already too many, and then if you add all the CPUs and all the Utils on top, it's just way too much.)
  2. The timeslice value of 100 for the CPU and Util processors was too small and was causing Realm overheads to be evident. If I understand correctly, the units here are microseconds. Based on Task Bench we know Realm can't really run smaller than 50 microsecond tasks. So at 100 microseconds it's possible that we were simply introducing dramatically higher overheads by forcing the micro-ops to be broken into much smaller pieces than usual.

Both of these hypotheses seem plausible to me. I suppose we could do an experiment with -ll:cpu_bgwork 500 -ll:util_bgwork 500 to see if that changes the observed behavior. At any rate, without getting to a root diagnosis I don't think there is anything to really do in terms of trying to catch this automatically.
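For concreteness, that experiment might look something like the sketch below; the launcher, node count, and application name are placeholders:

# Current configuration: dedicated background workers plus CPU/util
# processors helping with background work up to the 100 us threshold.
srun -N 16 ./my_htr_app -ll:bgwork 16 -ll:cpu_bgwork 100 -ll:util_bgwork 100

# Variant: raise the CPU/util threshold to 500 to see whether the
# observed behavior changes.
srun -N 16 ./my_htr_app -ll:bgwork 16 -ll:cpu_bgwork 500 -ll:util_bgwork 500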

One thing I think we should do is track the core binding in the profiler. I.e., we want to show a diagram similar to -ll:show_rsrv so that the user knows how Realm detected the machine and what processors were assigned to what cores. That doesn't automatically detect issues like this but would potentially let us (especially core maintainers helping users) diagnose certain kinds of issues like these ones (along with more common cases like binding all the processors to a single CPU core).

@elliottslaughter
Contributor

elliottslaughter commented Jun 7, 2024

A correction: My hypothesis (2) appears to be incorrect. Realm micro-ops do NOT get split depending on the value of -ll:cpu_bgwork or -ll:util_bgwork. Instead, these values work as a filter: CPU/Util processors simply will not run background work items estimated to be larger than the threshold.

(Having said that, I'm not sure that Realm goes to any particular effort to estimate the cost of micro-ops, so it's possible that this effectively turns into a binary flag. Either the CPU/Util runs background work items, or it does not.)

Therefore, nothing about this setting would result in overheads increasing on a per-micro-op basis as the micro-ops themselves are untouched.

So then the hypotheses that remain are either (1) that the additional threads introduce contention (as mentioned above), or (3) that having CPUs/Utils involved in background work either delays higher-priority CPU/Util tasks or else somehow interferes with the CPU/Util processors' ability to run such tasks (e.g., because more time is spent in lock contention).

@lightsighter
Contributor

I think (3) is more likely because we've turned over the responsibility of thread scheduling to the OS (probably Linux), and it's going to do its fair-share scheduling thing, which will round-robin between the background worker threads. If it has to do that for all the background worker threads, each of which is going to check queues for stuff to do, before getting back to the one thread that happens to have the interesting bit of work needed to make forward progress, that is going to cause massive slowdowns.

@elliottslaughter
Contributor

Ok. To answer @mariodirenzo's question from a couple comments back: I think this is not a problem we need to solve. Legion does not perform well when you oversubscribe the machine. But the solution to this is easy: stop allocating so many threads. The only time when this really causes problems is when you have a CPU-only machine where you are trying to squeeze maximum FLOPs out of the system. But this is not a scenario we anticipate in any major upcoming system (aside from possibly some of the Japanese supercomputers).

So overall my guidance would be:

  1. In general you don't want to blindly maximize -ll:bgwork. This value should be tuned for the machine and you should expect a local maximum at somewhere less than the theoretical max core count.
  2. It doesn't make sense to both increase -ll:bgwork and also set -ll:cpu_bgwork 100 -ll:util_bgwork 100 at the same time. The high bgwork count makes it unnecessary for the CPUs to do anything. In fact, I would just avoid -ll:cpu_bgwork 100 -ll:util_bgwork 100 because the situations where you need it are likely to be very niche (i.e., CPU-only systems where you're trying to maximize application task throughput).
  3. You might consider increasing ranks per node, as in the sketch after this list. On a system with this many cores, various overheads are likely to increase with the number of cores per rank, so running 2 or 4 ranks per node may reduce those overheads more effectively than increasing -ll:bgwork. I saw a 2× speedup with S3D running 2 ranks per node, for example. The main thing is to avoid overcrowding as you do so: you generally want to leave plenty of room for Legion to do its analysis.
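As a rough illustration of point 3, assuming a Slurm launcher; the processor counts and application name are placeholders and would need to be tuned for the actual node architecture:

# One rank per node: a single rank manages all GPUs and cores on the node.
srun -N 16 --ntasks-per-node=1 ./my_htr_app -ll:gpu 4 -ll:cpu 2 -ll:util 2 -ll:bgwork 8

# Two ranks per node: each rank manages half the GPUs and fewer cores,
# which tends to shrink per-rank runtime analysis overheads.
srun -N 16 --ntasks-per-node=2 ./my_htr_app -ll:gpu 2 -ll:cpu 1 -ll:util 2 -ll:bgwork 4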
