Failed workflow should be available for analysis even after successful retry #13839

Open
starwarsfan opened this issue Oct 30, 2024 · 7 comments
Labels
area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries type/feature Feature request

Comments

@starwarsfan

Summary

If a workflow failed and then finished successfully after a retry, the failed run should remain available for analysis rather than simply being replaced by the successful one.

Use Cases

Imagine a complex DAG workflow with dozens of steps. At some point within the DAG, the workflow fails and you're under time pressure, so you just use the "Retry" button, one of the cool ArgoWF features. But afterwards it's impossible to hand the issue over to a developer to analyze the error, because the failed workflow is no longer available within the UI.

So it should be possible to keep all workflows available for further investigation, especially failed ones.


Message from the maintainers:

Love this feature request? Give it a 👍. We prioritise the proposals with the most 👍.

@starwarsfan starwarsfan added the type/feature Feature request label Oct 30, 2024
@Joibel Joibel added the area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries label Oct 30, 2024
@Joibel
Member

Joibel commented Oct 30, 2024

This would be a fairly radical departure from how manual retries are implemented. Perhaps one way to satisfy the need would be to archive the failed workflow before the retry, but that is not simple either, as the key is the name of the workflow, which doesn't change during a manual retry.

@agilgur5
Contributor

See also #12324 (comment) / #9141 (comment).

The current way manual retries are implemented is itself different from the rest of workflow operations, which try to record things on the Workflow resource more immutably / append-only (neither of those terms is quite accurate, but conceptually similar). Perhaps it should be done more like a resubmit that retains partial state.

For example, manual re-runs on GitHub Actions work similarly. Each re-run creates a new state without deleting the old state, and removes the state that is to be re-run.

Logs and pods (including labels) etc. would be partially linked between the two, which is a bit confusing and may create some race conditions, but is possibly solvable.
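
As a rough illustration of the append-only idea described above, here is a minimal sketch. It is not how Argo Workflows or GitHub Actions implement re-runs; `Attempt`, `RunHistory`, `record`, and `start_rerun` are hypothetical names, and node state is reduced to a plain name-to-phase mapping.

```python
from dataclasses import dataclass, field


@dataclass
class Attempt:
    number: int
    nodes: dict  # node name -> phase, e.g. "Succeeded" or "Failed"


@dataclass
class RunHistory:
    # Hypothetical container: attempts are only ever appended.
    attempts: list = field(default_factory=list)

    def record(self, nodes: dict) -> Attempt:
        """Append an attempt; earlier attempts are never modified."""
        attempt = Attempt(number=len(self.attempts) + 1, nodes=dict(nodes))
        self.attempts.append(attempt)
        return attempt

    def start_rerun(self) -> Attempt:
        """Open a new attempt that carries over only the succeeded nodes."""
        if not self.attempts:
            raise ValueError("nothing to re-run")
        previous = self.attempts[-1]
        kept = {name: phase for name, phase in previous.nodes.items()
                if phase == "Succeeded"}
        return self.record(kept)


history = RunHistory()
history.record({"build": "Succeeded", "test": "Failed"})
rerun = history.start_rerun()
```

After `start_rerun()`, the first attempt still lists `test` as `Failed`, so it stays available for analysis, while the new attempt starts from the succeeded nodes only.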

@sarabala1979
Member

agreed. But it is too tough to copy and restart the failed workflow. It will loss the state. I think we need to find a way to preserve the failed nodes/steps and start the new nodes/steps, for example by injecting a retry flag or something similar.

@tooptoop4
Contributor

i believe another uid row is stored in public.argo_archived_workflows or is that just for resubmit

@agilgur5
Contributor

agreed. But it is too tough to copy and restart the failed workflow. It will loss the state. [sic]

Correct me if I'm wrong, but I would think you could just copy the whole completed Workflow (note that a workflow must be completed before you can retry it), give it a new uid + name, then run the existing retry logic on the new copy.
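
The copy-then-retry idea above could be sketched roughly as follows. This is only an illustration on a plain dict shaped like a Workflow manifest, not the controller's code: `clone_for_retry` and `reset_failed_nodes` are hypothetical helpers, the locally generated uid stands in for one that the API server would assign, and `reset_failed_nodes` is a heavily simplified stand-in for the existing retry logic. Node IDs inside `status.nodes` are copied unchanged; whether the controller accepts that under a new workflow name is an open question.

```python
import copy
import uuid

COMPLETED_PHASES = {"Succeeded", "Failed", "Error"}
FAILED_PHASES = {"Failed", "Error"}


def clone_for_retry(workflow: dict, suffix: str) -> dict:
    """Return a deep copy of a completed workflow under a new name and uid."""
    phase = workflow.get("status", {}).get("phase")
    if phase not in COMPLETED_PHASES:
        raise ValueError("a workflow must be completed before it can be retried")
    clone = copy.deepcopy(workflow)  # the original stays untouched for analysis
    meta = clone["metadata"]
    meta["name"] = f"{meta['name']}-{suffix}"
    meta["uid"] = str(uuid.uuid4())  # assumption: stands in for a server-assigned uid
    meta.pop("resourceVersion", None)  # server-managed, must not be carried over
    return clone


def reset_failed_nodes(workflow: dict) -> dict:
    """Simplified stand-in for retry: drop failed nodes and mark as running."""
    status = workflow["status"]
    status["nodes"] = {
        node_id: node
        for node_id, node in status.get("nodes", {}).items()
        if node.get("phase") not in FAILED_PHASES
    }
    status["phase"] = "Running"
    return workflow
```

Calling `reset_failed_nodes(clone_for_retry(wf, "retry-1"))` leaves `wf` as it was, including its failed nodes, and returns the copy that would be retried.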

@sarabala1979
Member

We need to validate it. But to my knowledge, all node IDs and pod/artifact/parameter names are tied to the workflow name. We need to test with the controller how it behaves if the workflow name changes while the status is carried over.

@sarabala1979
Member

sarabala1979 commented Oct 30, 2024

i believe another uid row is stored in public.argo_archived_workflows or is that just for resubmit

Resubmit will start the whole workflow from the beginning.
