Performance degradation when running HTR #1652
Comments
Something bad is happening in the network. Please regenerate the profiles with the Rust profiler. It is probably too late for this to make the March release. |
@lightsighter These are the raw log files from each run. |
These profiles were not generated for the most recent version of the profiler. You either need to regenerate them or provide the precise Legion commit hash that you used to generate them. |
Legion commit hash from 30 January 2023: |
Those are both going to be too old to show the information we need. Please generate a slow profile from the most recent Legion master branch. |
Also, please make sure all profiles are generated with the same version of GASNet-EX. |
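For reference, a minimal sketch of how such profile logs are usually produced (the launcher, binary name, and node count are placeholders, not HTR's actual invocation; -lg:prof and -lg:prof_logfile are Legion's standard profiling flags):
# hypothetical 8-node run that writes one profiler log per rank
jsrun -n 8 -r 1 ./htr.exec -i config.json \
  -lg:prof 8 -lg:prof_logfile prof_%.gz
# the resulting prof_*.gz files are what the Rust legion_prof tool consumes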
I'm assuming the slow profile is generated using the Python file. I don't exactly know how to make sure they're generated using the same version of GASNet-EX; any hints on that? Thanks. |
I'm saying you need to generate log files from a run using the most recent Legion master branch as they will have considerably more data about the timing of various operations.
Build and link against this version of GASNet-EX from here. I'm not sure which build system you are using, so it is up to you to integrate that into your build system. |
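As an illustration only, with Legion's Makefile-based build this integration typically amounts to pointing the build at the GASNet-EX install; the paths below are assumptions and the variable names may differ slightly between Legion versions:
# hypothetical environment for a Makefile/Regent build of the application
export USE_GASNET=1                       # enable the GASNet-EX network module
export GASNET="$HOME/gasnetex/install"    # assumed GASNet-EX install prefix
export CONDUIT=ibv                        # conduit for your interconnect (e.g. ibv on InfiniBand)
# then rebuild the application so it links against this GASNet-EX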
If you use |
Thanks @elliottslaughter for the tip 👍🏼 |
@elliottslaughter What happened to the dependent partitioning channel in the profiler? |
@albovoci Have you marked all your partitions disjoint/aliased and complete/incomplete? |
I pushed a fix for the profiler here: https://gitlab.com/StanfordLegion/legion/-/merge_requests/1169 Updated profile here: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~eslaught/bug1652_7541cd950_gasnet_2023_9_0/ |
I know the partitions are marked as disjoint/aliased; I'm not so sure about complete/incomplete. |
No, we do not mark the regions as complete/incomplete because we do not know this attribute at compile time, and it depends on the input of the calculation. |
Run with |
Sorry, I'm kind of new to this :/ |
Here's an example script for you:
#!/bin/bash
set -e

# launch the command passed on the command line in the background and record its PID
"$@" &
pid=$!

# periodically attach gdb and dump backtraces from all threads of the process
for i in {1..10}; do
  sleep 6
  gdb -p $pid -ex 'set confirm off' -ex 'set height 0' -ex 'set width 0' -ex 'thread apply all backtrace' -ex 'quit'
done

# wait for backtraces to complete and then kill the process
if [[ ${KILL_PROCESS_AFTER_BACKTRACE:-1} = 1 ]]; then
  sleep 2m
  kill $pid
  sleep 1m
  kill -9 $pid
fi
Run it like this:
You might want to redirect the output of each rank to a separate file to make sure they stay separate. Modify as necessary for your particular system. |
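One hedged way to get that per-rank redirection is a thin wrapper around the script above; the wrapper name, the backtrace script name, and the rank environment variable are all assumptions that depend on your launcher:
#!/bin/bash
# per_rank_backtrace.sh (hypothetical): run the backtrace script and send
# each rank's output to its own file. The rank variable depends on the
# launcher (OMPI_COMM_WORLD_RANK for Open MPI/Spectrum MPI, SLURM_PROCID for srun).
rank=${OMPI_COMM_WORLD_RANK:-${SLURM_PROCID:-0}}
exec ./backtrace.sh "$@" > "backtrace_rank${rank}.log" 2>&1
Launch this wrapper (instead of the backtrace script directly) under your usual job launcher.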
@albovoci - which version of HTR are you using to reproduce this? |
@seemamirch It's this commit: 2198880d |
Can we do these runs without |
If you do re-run, you just need to do the newer commit and not the older one. |
@elliottslaughter Why would Regent be attaching semantic tags other than semantic tag 0 (the name tag)? For example,
Why would there be 54321 semantic tags? |
There are not 54321 semantic tags. It is a magic number: https://gitlab.com/StanfordLegion/legion/-/blob/master/language/src/regent/std_base.t#L497 And the top of that block explains: https://gitlab.com/StanfordLegion/legion/-/blob/master/language/src/regent/std_base.t#L480 |
@lightsighter |
The only major thing that the
I wouldn't worry about any crashes. Even the new branch is two months old and there have been many bug fixes since then.
Mario and I already discussed those and they are the result of an unusual aliased partition that is being made. You don't need to worry about them and they are not going to be responsible for the profiling effects we're observing. They would show up very loudly as long running meta-tasks and would be prominently visible in profiles if they were a problem. |
I want to remind everyone here that we are using the same version of HTR to test the old and new versions of Legion. The application issues the same semantic-attach and future-wait operations to Regent and the runtime regardless of which version of the runtime we are using.
We need that semantic information in the mapper to perform our mapping decisions.
We are waiting for some futures in the top-level task and the only workaround is to use predication. Unfortunately, predication is not compatible with tracing so, if we gain some performance on one side, we lose it on the other. |
For what it's worth, I first added that semantic information in 8398065, June of 2022. So, in the time frame under discussion in this issue I believe the semantic information should be identical (or at least highly similar). If you do want to run an experiment with this disabled, here are the lines to comment out: legion/language/src/regent/codegen.t Lines 4715 to 4722 in c610715 |
I understand. I want to remove the semantic attaches to see if a change in the implementation of them causes the performance degradation or whether it was something else.
That seems pretty dicey. That's not really what the semantic information interface is for. It's more for logging and debugging, not for correctness. I suppose you can use it for that, but that's not what it is designed for and I can't promise it will be fast enough for all those cases.
Let's turn off tracing and get rid of the waits and see if the profile is still "empty" (I don't care if it is slow, but I want to see the runtime busy doing all sorts of things). We don't have a good sense of what is causing all the waiting right now in the worker task so we need to start removing all the sources of blocking in the application side.
I don't care about that specific one. I care about all of the semantic info attached by Regent or by users. I don't think any one piece of semantic info is the cause of the problem, but I want to start removing sources of blocking because the backtrace experiment was inconclusive. |
I told Mike already, but I was able to reproduce the performance issues on 8 and 16 nodes of Lassen, but was not able to reproduce the same problems on Perlmutter, indicating that something might be going on inside the network. |
@mariodirenzo HTR uses the following 2 options - -ll:cpu_bgwork 100 -ll:util_bgwork 100 (time slicing util/cpu threads) |
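For context, a hypothetical launch line with those options in place might look like the following; the launcher, binary, and the other -ll: values are placeholders rather than HTR's actual configuration (only -ll:cpu_bgwork 100, -ll:util_bgwork 100, and the 16 bgwork threads mentioned later in this thread come from the discussion):
jsrun -n 8 -r 1 ./htr.exec -i config.json \
  -ll:cpu 1 -ll:util 2 -ll:gpu 4 -ll:bgwork 16 \
  -ll:cpu_bgwork 100 -ll:util_bgwork 100   # let CPU/util processors also time-slice background work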
Those options used to help performance as CPU and util threads are usually idle during the execution. |
Yes, and overall new Legion performance is better; I compared with the same two Legion versions you mentioned in this issue. Background worker (bgwork) threads are set to 16 already, so it's not helping to have those options enabled. |
Profiles for an HTR test on Lassen, 1 node, without these options (fast version). FB-to-FB copies appear to be taking much longer (n0f2-n0f2 channel). |
Was there a conclusion on this? Was the issue fully resolved by removing the |
These flags have been removed from the HTR scripts. @mariodirenzo to confirm performance is now better. |
The issue is solved by removing those flags but, as Seema mentioned, the reason is unclear. |
@elliottslaughter Did we add some kind of detection for this in Legion Prof? I thought you might have done that but I can't remember. |
I'm not sure we have a root cause yet, so it does not make sense to add any detection. My understanding from the above discussion is that HTR was running with -ll:cpu_bgwork 100 -ll:util_bgwork 100. There are two possible hypotheses I think we can form based on this: (1) the additional threads introduce contention; (2) Realm micro-ops get split depending on the value of this setting, so overheads increase on a per-micro-op basis.
Both of these hypotheses seem plausible to me. I suppose we could do an experiment with One thing I think we should do is track the core binding in the profiler. I.e., we want to show a diagram similar to |
A correction: My hypothesis (2) appears to be incorrect. Realm micro-ops do NOT get split depending on the value of (Having said that, I'm not sure that Realm goes to any particular effort to estimate the cost of micro-ops, so it's possible that this effectively turns into a binary flag. Either the CPU/Util runs background work items, or it does not.) Therefore, nothing about this setting would result in overheads increasing on a per-micro-op basis as the micro-ops themselves are untouched. So then the hypotheses that remain are either (1) that the additional threads introduce contention (as mentioned above), or (3) that having CPUs/Utils involved in background work either delays higher-priority CPU/Util tasks or else somehow interferes with the CPU/Util processors' ability to run such tasks (e.g., because more time is spent in lock contention). |
I think (3) is more likely because we've turned over the responsibility of thread scheduling to the OS (probably Linux), and it's going to do its fair-share scheduling thing, which will round-robin between the background worker threads. If it has to do that for all the background worker threads, each of which is going to check queues for stuff to do, before getting back to the one thread that happens to have the interesting bit of work needed to make forward progress, that is going to cause massive slowdowns. |
Ok. To answer @mariodirenzo's question from a couple of comments back: I think this is not a problem we need to solve. Legion does not perform well when you oversubscribe the machine, but the solution to this is easy: stop allocating so many threads. The only time this really causes problems is when you have a CPU-only machine where you are trying to squeeze maximum FLOPs out of the system, but that is not a scenario we anticipate on any major upcoming system (aside from possibly some of the Japanese supercomputers). So overall my guidance would be: |
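As a sketch of that guidance, the sanity check is simply that the Realm threads you request fit in the physical cores you have; the numbers below are illustrative for a hypothetical node, not a recommendation for any particular machine:
# illustrative flag set: keep
#   -ll:cpu + -ll:util + -ll:bgwork (+ DMA/GPU worker threads) <= physical cores
./htr.exec -i config.json \
  -ll:cpu 2 -ll:util 2 -ll:gpu 4 -ll:bgwork 8
# and leave -ll:cpu_bgwork / -ll:util_bgwork off, so dedicated bgwork threads
# handle background work instead of time-slicing the CPU/util processors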
As mentioned in a previous Legion meeting, we are seeing a significant degradation of performance when running HTR.
We are comparing the performance of an old commit (cba415a) to the performance of a new commit (91b55ce) when weak scaling HTR.
Note that we are using the same version of HTR to perform this test.
For instance, we are reporting here the profiles obtained on 8 nodes with the old commit
and with the new commit.
In these profiles we see a minor degradation of GPU usage (which is reflected in about a 10% increase of the time per step) but, more importantly, a very large section of the profile at the beginning of the run where it seems that nothing is happening.
This "idle" time at the beginning seems to grow with some power of the number of nodes used in the calculation.
Adding @lightsighter and @seemamirch for visibility
@elliottslaughter, can you add this issue to #1032 ?
Is it too late to insert this issue in the list tracked for the March release?