Deterministic error affecting HTR #1466
Comments
This looks like an application-level assert? Can we determine what about |
I believe it asserts when a task has received bad data. I've re-run the problem and it succeeds on Additionally, this error has come up non-deterministically for the same problem. Not sure if it's related:
backtrace:
|
@lightsighter the commit range that @cmelone is referring to looks to be the merge of the |
Not unless they are using collective copies/reductions (which I would be a bit surprised if HTR is doing that). I did fix a bug related to that branch today, but it will only matter if you have a collective reduction. |
After re-compiling Legion this morning with the new changes and running the problem again, it seems to succeed now. Thanks |
That's interesting to know that you're using collective reduction copies. |
yeah, @mariodirenzo was wondering about that too |
You can end up in this situation if you write code like:
Where Otherwise I'm not sure how you'd be impacted by this. |
Yeah, but HTR doesn't use the default mapper. |
The mapper of HTR is derived from the default mapper, so I think that @elliottslaughter is correct: we are inheriting that aspect.
We have been seeing one of our test cases deterministically fail on Lassen in the last week or so with the following error message:
I believe this is the relevant backtrace:
Legion was built with CXXFLAGS="-g -O2" in release mode. This is on 1 node, 1 rank per node with control_replication. This execution last succeeded on cb61755 and started failing on at least d1ecc4b.
This may not be relevant, but I have only been able to reproduce this on a POWER9 machine and not on an Intel cluster.
Also, could this be added to #1032? Thanks