Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return DriverStatus::timeout when appropriate #1135

Merged
merged 9 commits into from
Aug 9, 2024
Merged

Conversation

adamdempsey90
Copy link
Collaborator

@adamdempsey90 adamdempsey90 commented Jul 7, 2024

PR Summary

I don't think DriverStatus::timeout was ever returned -- only failed and complete were. This simply checks if the driver loop exited and if tm.KeepGoing() returns true, then return DriverStatus::timeout instead of DriverStatus::complete.

Useful if downstream codes want to do something different for a timeout.

Breaking behavior

If a downstream code passes the driver status as final exit status, then this new version will result in a failed application run even though the code exited gracefully (following the walltime limit).

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
    • Short summary API changes at the top of the PR (plus optionally with an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

Copy link
Collaborator

@Yurlungur Yurlungur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Collaborator

@brryan brryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@pgrete pgrete left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Given that it's changing behavior (though, in fact, now doing what it was supposed to do),I think added a short entry to the Changelog would be good before merging.

@pgrete
Copy link
Collaborator

pgrete commented Jul 10, 2024

It seems the macos test consistently fails -- whereas it passed for #1047 that just went in yesterday.
Can anyone with macOS check what's going on/whether this is reproducible in a non-ci environment?

@acreyes
Copy link
Contributor

acreyes commented Jul 10, 2024

It seems the macos test consistently fails -- whereas it passed for #1047 that just went in yesterday. Can anyone with macOS check what's going on/whether this is reproducible in a non-ci environment?

The sparse-advection example returns 1 when DriverStatus::timeout. I guess that it should fail outside of macos as well

@adamdempsey90
Copy link
Collaborator Author

It seems the macos test consistently fails -- whereas it passed for #1047 that just went in yesterday. Can anyone with macOS check what's going on/whether this is reproducible in a non-ci environment?

The sparse-advection example returns 1 when DriverStatus::timeout. I guess that it should fail outside of macos as well

I can change it to return 0 so it doesn't raise the subprocess.CalledProcessError in python?

@acreyes
Copy link
Contributor

acreyes commented Jul 10, 2024

It seems the macos test consistently fails -- whereas it passed for #1047 that just went in yesterday. Can anyone with macOS check what's going on/whether this is reproducible in a non-ci environment?

The sparse-advection example returns 1 when DriverStatus::timeout. I guess that it should fail outside of macos as well

I can change it to return 0 so it doesn't raise the subprocess.CalledProcessError in python?

Another option would be to add a an error handler to the TestCaseAbs class that default returns False to do something like

       except subprocess.CalledProcessError as err:
            handled = self.test_case.HandleError(self.parameters, err)
            if handled:
                print(err.stdout.decode())
                self.parameters.stdouts.append(err.stdout)
            else:
...

Then the restart test could check the error code to see that the wall-time limit is triggered instead of parsing the stdout

@Yurlungur
Copy link
Collaborator

It seems the macos test consistently fails -- whereas it passed for #1047 that just went in yesterday. Can anyone with macOS check what's going on/whether this is reproducible in a non-ci environment?

The sparse-advection example returns 1 when DriverStatus::timeout. I guess that it should fail outside of macos as well

I can change it to return 0 so it doesn't raise the subprocess.CalledProcessError in python?

Another option would be to add a an error handler to the TestCaseAbs class that default returns False to do something like

       except subprocess.CalledProcessError as err:
            handled = self.test_case.HandleError(self.parameters, err)
            if handled:
                print(err.stdout.decode())
                self.parameters.stdouts.append(err.stdout)
            else:
...

Then the restart test could check the error code to see that the wall-time limit is triggered instead of parsing the stdout

I think this is a good idea.

@adamdempsey90
Copy link
Collaborator Author

It seems the macos test consistently fails -- whereas it passed for #1047 that just went in yesterday. Can anyone with macOS check what's going on/whether this is reproducible in a non-ci environment?

The sparse-advection example returns 1 when DriverStatus::timeout. I guess that it should fail outside of macos as well

I can change it to return 0 so it doesn't raise the subprocess.CalledProcessError in python?

Another option would be to add a an error handler to the TestCaseAbs class that default returns False to do something like

       except subprocess.CalledProcessError as err:
            handled = self.test_case.HandleError(self.parameters, err)
            if handled:
                print(err.stdout.decode())
                self.parameters.stdouts.append(err.stdout)
            else:
...

Then the restart test could check the error code to see that the wall-time limit is triggered instead of parsing the stdout

I think this is a good idea.

I added an error handler to TestCaseAbs. I needed to swap the return codes for Fail and Timeout in the sparse_advection example, though, since generic crashes at startup have a return code of 1.

@@ -152,7 +152,7 @@ DriverStatus EvolutionDriver::Execute() {
pmesh->UserWorkAfterLoop(pmesh, pinput, tm);
}

DriverStatus status = DriverStatus::complete;
DriverStatus status = tm.KeepGoing() ? DriverStatus::timeout : DriverStatus::complete;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused now.
How can we ever set this to timeout here?

Above if (status != TaskListStatus::complete) { means we're already returning with failed.
Otherwise, there's just one break in the while loop (checking with final), and with final is a clean exit and not a timeout or am I missing something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CheckSignalFlags returns OutputSignal::final if any of the signalflags are non-zero, which I think catches the wall time alarm.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. My bad. I misinterpreted timeout as a failure mode (e.g., a communication timeout when we hit the max number of receives tried).

In AthenaPK, we're currently not even passing the result of the driver (which I should probably update).

I'll push a minor update to the changelog momentarily (indicating that this is potentially breaking downstream if the driver status is passed as final return code, which now will result in a failed run despite the run having exited cleanly) and then enable automerge.

@pgrete pgrete enabled auto-merge (squash) August 9, 2024 13:18
@pgrete pgrete merged commit 5e42a4f into develop Aug 9, 2024
52 of 53 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants