Bump Parthenon to latest develop (2024-02-15) #84
@par-hermes format |
Actually, this is causing issues for the cluster setup, do not merge |
FYI I just bumped to current |
I'm a bit confused as to why this is merging into the Is this PR still relevant, or has it been superseded? |
Yes, the PR is still relevant and I was putting it on hold while tracing back the issue I saw on Lumi.
with the Parthenon advection example on just two ranks (not showing up on a single rank), I was not able to reproduce it locally or on Frontier. Bottom line: let me clean up this PR (tomorrow) and then we can merge with the latest Parthenon |
I think the core dump will point to that because that's where the error handler is, not necessarily because the underlying bug is there. (We've been debugging an issue in Quokka that produces that error message. We thought it was an out-of-bounds memory access, which it usually is. However, it turned out to be a compiler bug that appears for specific ROCm versions 🙃) Yes, that plan sounds good to me. |
@pgrete just want to check if this is ready to go? |
I double-checked on Friday that the (other) crashes I observed are not reproducible any more. They were also not reproducible with the binary I triggered them with in the first place, and given that those were all interconnect-related issues, I assume that the machine was unstable by chance when I ran the original tests. (Note this was a second error mode apart from the I now updated to latest Parthenon develop and fixed all interface changes fingerscrossed. Bottom line: this is ready for review and merge @BenWibking @forrestglines |
There appear to be some unrelated cluster setup changes (although they are trivial). Everything else looks good to me.
Yes, I added those (to be used for the histograms). |
@forrestglines Did the cluster test runs with this PR work? |
Yes -- a Perseus test cluster did not cause any issues, though I think it did before. I think this is safe to merge in. |
@pgrete do you want to merge this now? |
Update Parthenon to PR930