akash network: sandbox-01
akash network version: v0.34.0 (binary v0.34.1)
akash provider version: 0.6.1

Description
provider.provider-02.sandbox-01.aksh.pw has encountered NodeHasDiskPressure. The node ran out of available disk space, causing Kubernetes to evict pods to reclaim disk space.

Relevant events are as follows:

$ kubectl get events -A --sort-by='.lastTimestamp'
...
akash-services 3m43s Warning Evicted pod/akash-provider-0 The node was low on resource: ephemeral-storage. Threshold quantity: 31189488855, available: 29362700Ki. Container provider was using 26060Ki, request is 0, has larger consumption of ephemeral-storage.
default 3m35s Normal NodeHasDiskPressure node/node1 Node node1 status is now: NodeHasDiskPressure

More detailed events & logs can be found here.
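
For anyone reproducing this, the DiskPressure condition can also be confirmed on the node object directly (node name taken from the events above; adjust as needed):

$ kubectl get node node1 -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")]}{"\n"}'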

This issue can arise with any deployment that writes a significant amount of data, leading to excessive disk usage. This causes the node to exceed the nodefs threshold, triggering Kubernetes to start the eviction process in an attempt to reclaim ephemeral storage:

$ kubectl get events -A --sort-by='.lastTimestamp' | grep reclaim
default 16m Warning EvictionThresholdMet node/node1 Attempting to reclaim ephemeral-storage

Additionally, it appears that the image size is not taken into account when determining available space. For instance, a worker node might have 150GB of free disk space, allowing a tenant to claim all of it for their deployment. However, if the image itself is large (e.g., 12GB), pulling it can push the node over the eviction threshold.
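
To make the shared-disk point concrete: on a stock single-partition setup, image layers (imagefs) and pod writes (nodefs) compete for the same space, so a large pull alone can move the node toward the threshold. A quick way to see both on the worker node (paths assume the default containerd/kubelet locations):

$ df -Ph /var/lib/containerd /var/lib/kubelet
$ crictl images   # image sizes as seen by the container runtime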

Disk usage on node1 is rapidly increasing, indicating a potential risk for further evictions, despite having resource limits in place.

The thresholds in play are those for memory (RAM), the nodefs (/var/lib/kubelet, aka ephemeral-storage), and the imagefs (/var/lib/containerd). Refer to the Kubernetes Node Pressure Eviction documentation for more details.
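
For reference, if no explicit thresholds are configured, the kubelet falls back to its documented hard-eviction defaults. One way to check on the worker node, assuming a kubeadm-style config path:

$ grep -i -A6 eviction /var/lib/kubelet/config.yaml || echo "nothing set - kubelet defaults apply"
# Documented hard-eviction defaults (Linux):
#   memory.available  < 100Mi
#   nodefs.available  < 10%
#   nodefs.inodesFree < 5%
#   imagefs.available < 15%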

Reproducer
1. Deploy a large image, say 5GiB in size, requesting the maximum available disk space for the worker node. The available disk space can be obtained from the 8443/status provider endpoint, ensuring the deployment lands on the intended node.
2. Once deployed, SSH into the node (or use lease-shell) and start writing data. Use real data instead of zeroes to accurately impact disk space usage (df -Ph / can be checked from the Pod or the worker host directly); see the sketch after this list.
3. Once all disk space is utilized, let the system sit for 5-10 minutes for the kubelet to notice the issue (NodeHasDiskPressure event) and start evicting pods to reclaim ephemeral-storage space (EvictionThresholdMet and Evicted events).
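
A minimal sketch of the data-writing step, assuming a writable path such as /data inside the pod (path and size are illustrative); /dev/urandom produces non-compressible, non-sparse data, so the usage is fully reflected in df:

# via lease-shell, or SSH on the worker node
$ dd if=/dev/urandom of=/data/filler.bin bs=1M count=102400 status=progress
$ df -Ph /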

The issue
The problem is that the K8s eviction manager starts evicting other pods, not just the culprit pod, likely because it cannot tell which pod is the culprit.
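
One way to see which pods are actually consuming ephemeral-storage on the affected node is the kubelet stats summary exposed through the API server proxy (requires permission to proxy to the node; jq assumed to be available):

$ kubectl get --raw "/api/v1/nodes/node1/proxy/stats/summary" \
    | jq -r '.pods[] | [.podRef.namespace, .podRef.name, (.["ephemeral-storage"].usedBytes // 0)] | @tsv' \
    | sort -t$'\t' -k3 -nr | head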

Potential solution
The long-term solution could be to put the imagefs (/var/lib/containerd) on a separate partition from the nodefs (/var/lib/kubelet); a rough sketch follows.
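
A rough sketch of what that could look like on an existing worker (the spare device name /dev/sdb is an assumption; the node should be drained first):

$ kubectl drain node1 --ignore-daemonsets --delete-emptydir-data
# then, on the worker node as root:
$ systemctl stop kubelet containerd
$ mkfs.ext4 /dev/sdb
$ mount /dev/sdb /mnt && cp -a /var/lib/containerd/. /mnt/ && umount /mnt
$ echo '/dev/sdb /var/lib/containerd ext4 defaults 0 2' >> /etc/fstab
$ mount /var/lib/containerd
$ systemctl start containerd kubelet
$ kubectl uncordon node1

With the imagefs on its own partition, image pulls no longer eat into the space the kubelet accounts as nodefs/ephemeral-storage.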

Action required
Please investigate and implement measures to account for image size when determining available space, and to manage disk usage more effectively to prevent future evictions.

Additional context
I saw this issue some time ago, when tenants attempted to deploy heavy container images (>10GiB in size) and the worker node did not have enough free space.
Potentially related: #138