Incorrect resources duration for failed pods #13709

Open
4 tasks done
AntoineDao opened this issue Oct 5, 2024 · 0 comments · May be fixed by #13710
Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

When running Argo Workflows we occasionally see that our workflows report surprisingly high CPU and Memory resource durations.

We have traced this back to an issue with containerd, which sometimes fails a Pod and, for some reason, sets the startedAt timestamp to "the epoch" (i.e. 1970-01-01T00:00:00Z).

...
"initContainerStatuses": [
  {
    "containerID": "containerd://b55ff9eb8687235ae0fecd4216522b75b6f7a0fab5eca18de8c06365bd2636fc",
    "image": "quay.io/argoproject/argoexec:v3.5.11",
    "imageID": "quay.io/argoproject/argoexec@sha256:bb3938480cbe7a7c0f053eb77a5d3edb22c868bf57407bf066fd105961f26c72",
    "lastState": {},
    "name": "init",
    "ready": false,
    "restartCount": 0,
    "started": false,
    "state": {
      "terminated": {
        "containerID": "containerd://b55ff9eb8687235ae0fecd4216522b75b6f7a0fab5eca18de8c06365bd2636fc",
        "exitCode": 128,
        "finishedAt": "2024-09-13T13:28:03Z",
        "message": "failed to create containerd task: failed to create shim task: context deadline exceeded: unknown",
        "reason": "StartError",
        "startedAt": "1970-01-01T00:00:00Z" <---- here
      }
    }
  }
],
...

This results in an incorrect duration because the function below subtracts startedAt from finishedAt without checking whether startedAt is a plausible timestamp, so a startedAt of 1970 is treated as valid.

func (s Summary) age() time.Duration {
	if s.ContainerState.Terminated != nil {
		return s.ContainerState.Terminated.FinishedAt.Time.Sub(s.ContainerState.Terminated.StartedAt.Time)
	} else {
		return 0
	}
}

I appreciate this feels like more of a containerd bug than an argo-workflows bug; however, I think it is worth adding some logic to handle this case more gracefully.

For additional context, we are running on top of managed GKE when we see this bug. I am not attaching a reproducible workflow because... well... this bug is challenging to reproduce!

I will propose a fix very soon and will let the maintainers decide whether it is worth patching. From our perspective it's worth it because we use the Argo Workflows reported CPU/Memory usage to bill downstream customers... 😨

Version(s)

v3.5.11

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

N/A

Logs from the workflow controller

N/A

Logs from your workflow's wait container

N/A
AntoineDao added a commit to AntoineDao/argo-workflows that referenced this issue Oct 5, 2024
…poch

We have observed that containerd sometimes fails a pod on GKE and sets the `startedAt` value to
"1970-01-01T00:00:00Z", which causes Argo Workflows to calculate an invalid resources duration.

fix argoproj#13709