Failed workflow should be available for analysis even after successful retry #13839
Comments
This would be a fairly radical departure from how manual retries are implemented. Perhaps a way to satisfy the needs would be to do something with archiving the failed workflow before retry, but that also is not simple as the |
See also #12324 (comment) / #9141 (comment). The current way manual retries are implemented is itself different from the rest of workflow operations, which try to record things on the For example, manual re-runs on GitHub Actions work similarly. It creates a new state without deleting the old state and removes the state that is to be re-run. Logs and pods (including labels) etc. would be partially linked between the two, which is a bit confusing and may create some race conditions, but is possibly solvable |
Agreed. But copying and restarting the failed workflow is too difficult: it would lose the state. I think we need to find a way to preserve the failed nodes/steps and start the new nodes/steps, for example by injecting a retry flag or something similar. |
I believe another UID row is stored in public.argo_archived_workflows, or is that just for resubmit? |
Correct me if I'm wrong, but I would think you could just copy the whole completed |
We need to validate it. As far as I know, all node ID, pod, artifact, and parameter names are tied to the workflow name. We need to test how the controller behaves when the workflow name changes but the status is carried over. |
Resubmit restarts the whole workflow from the beginning. |
Summary
If a workflow failed and then finished successfully after a retry, the failed workflow should remain available for analysis rather than simply being replaced by the successful one.
Use Cases
Imagine a complex DAG workflow with dozens of steps. At some point within the DAG, the workflow fails and you're under time pressure. So you just use the "Retry" button, one of the cool ArgoWF features. But afterwards it's impossible to hand the issue over to a developer to analyze the error, because the failed workflow is no longer available within the UI.
So it should be possible to have all workflows available for further investigation, especially if there are failed ones.
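As a rough illustration of the requested behaviour, the sketch below snapshots the node phases of a failed attempt before resetting the failed nodes for a retry. It is a plain-Python model only: `WorkflowRecord`, `attempts`, and `retry_preserving_history` are hypothetical names, not part of Argo Workflows or its API, and the real controller would also have to deal with the pod, artifact, and naming concerns raised in the comments.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class WorkflowRecord:
    """Hypothetical stand-in for a workflow's status; not an Argo type."""
    name: str
    nodes: dict  # node name -> phase, e.g. "Succeeded" or "Failed"
    attempts: list = field(default_factory=list)  # snapshots of earlier attempts


def retry_preserving_history(wf):
    """Snapshot the current node phases, then reset only the failed nodes."""
    # Keep an independent copy so later changes do not alter the snapshot.
    wf.attempts.append(copy.deepcopy(wf.nodes))
    for node, phase in wf.nodes.items():
        if phase == "Failed":
            wf.nodes[node] = "Pending"
    return wf


wf = WorkflowRecord(name="example-dag", nodes={"build": "Succeeded", "test": "Failed"})
retry_preserving_history(wf)
```

After the call, `wf.attempts[0]` still shows `test` as `Failed`, so the failed attempt can be inspected even once the retried node succeeds.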
Message from the maintainers:
Love this feature request? Give it a 👍. We prioritise the proposals with the most 👍.