Deployment Failures Leading to Double Resource Consumption and Risk of Node Eviction Due to NodeHasDiskPressure
#138
I've tried the light image of mine. @anilmurty pointed out this log line, which would suggest that the size of the image is causing the issue; I'm now more inclined to believe that's the main reason:
as the image Foundry tried is enormous -
Couldn't repro the node disk pressure when running; however, I could repro it when I used the default entrypoint of that image. I guess it is doing something there that triggers the issue; going to investigate deeper.
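(For context, an entrypoint override in an Akash SDL service looks roughly like the sketch below; the image name and command are illustrative placeholders, not the ones used in this test.)

```yaml
# Hypothetical SDL fragment: override the image's default entrypoint so the
# container just idles, to compare against the default-entrypoint behaviour.
services:
  app:
    image: <image-under-test>   # placeholder
    command:
      - "sleep"
    args:
      - "infinity"
```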
The behavior is quite different, depending on whether the image was cached or being pulled.
Running image
Re-running image
Additionally, running image
The difference from the original behavior is that it is supposed to restart that pod. It is possible that it spawns another replica/pod which gets indefinitely stuck in "Pending"; this would be the case if the Foundry provider didn't implement it. The SDL is the original with slight modifications, primarily:
After which:
Before the deployment:
Running image
So the whole problem is the large image / low ephemeral storage space (nodefs, imagefs). There are certain thresholds which can be tweaked:
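For reference, a sketch of the kubelet settings typically involved (the values shown are the upstream defaults, not necessarily what this provider runs):

```yaml
# /var/lib/kubelet/config.yaml (fragment) -- illustrative values only
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  nodefs.available: "10%"    # evict pods when node filesystem free space drops below this
  imagefs.available: "15%"   # evict pods when image filesystem free space drops below this
imageGCHighThresholdPercent: 85   # image garbage collection kicks in above this disk usage
imageGCLowThresholdPercent: 80    # and frees images until usage is back under this
```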
The pods have been running well on that node; i.e. the node itself wasn't getting evicted, just the deployment. The node was disappearing from the akash-provider report (8443/status) for a short time while it was collecting the garbage. I guess we are good then. The pods have been running on
It appears the provider is running a chaperone utility which kills certain deployments. And since I've been using the sshd-based one, it was getting killed:
Going to re-test the 4th scenario.
The pod
and the node
Provider closed the lease. FWIW: the provider wasn't accessible initially because it was missing the haproxy rule to redirect 8443 (akash-provider) to the worker node it has been running on (see the sketch below).
Logs
More logs
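Regarding the missing haproxy rule, a TCP passthrough along these lines would forward 8443 to the worker node running akash-provider (addresses are hypothetical):

```
# haproxy.cfg fragment -- hypothetical addresses
frontend akash_provider
    bind *:8443
    mode tcp
    default_backend akash_provider_worker

backend akash_provider_worker
    mode tcp
    server worker1 10.0.0.12:8443 check   # worker node where akash-provider runs
```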
Looks like the provider increased the disk space, so the SDL can be retried there again.
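If so, the new allocatable ephemeral storage can be confirmed with something like this (node name is a placeholder):

```sh
# Capacity/Allocatable includes ephemeral-storage
kubectl describe node <node-name> | grep -A7 "Allocatable:"
```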
Foundry encountered an issue where a node with 88Gi of available ephemeral disk space was experiencing NodeHasDiskPressure. This is evident from the events logged by the kubelet:

Correction: the node isn't getting fully evicted, but it is being reported as in poor condition and the pods there are being evicted:
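For reference, the node condition and the resulting evictions can be inspected with something like the following (node name is a placeholder):

```sh
# DiskPressure shows up under the node's Conditions, with recent kubelet events below
kubectl describe node <node-name> | grep -A8 "Conditions:"

# Pods evicted because of it
kubectl get events -A --field-selector reason=Evicted
```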
The issue was linked to a particular deployment, the contents of which can be found at this link. Due to continuous failures, Kubernetes kept trying to restart the deployment:
To recreate the issue, a simple SDL was used:
The available disk space decreased by twice the requested amount every 10 seconds, with the pod status alternating between Error and Pending. This indicated that Kubernetes was continually trying to restart the pod.

Upon checking the resource consumption before and after submitting the SDL, it was found that CPU, memory, and storage were all being consumed at twice the rate requested by the deployment. Before the SDL submission, the resource status was:
After the SDL submission:
A few seconds later:
When the deployment started to crash and redeploy:
This resulted in unexpected resource consumption, with the node allocating nearly double the resources it was supposed to:
Logs from Foundry:
And the provider goes offline too (since the akash-provider pod gets evicted from that node as well).
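For reference, the doubled requests described above show up in the node's allocation summary, e.g. (node name is a placeholder):

```sh
# Per-node totals of CPU / memory / ephemeral-storage requests vs. allocatable
kubectl describe node <node-name> | sed -n '/Allocated resources:/,/Events:/p'
```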