
NodeHasDiskPressure Causing Pod Evictions Due to Excessive Disk Usage #226

Open
andy108369 opened this issue May 17, 2024 · 2 comments
andy108369 commented May 17, 2024

akash network: sandbox-01
akash network version: v0.34.0 (binary v0.34.1)
akash provider version: 0.6.1

Description

provider.provider-02.sandbox-01.aksh.pw has encountered NodeHasDiskPressure. The node ran out of available disk space, causing Kubernetes to evict pods to reclaim disk space.


Relevant events are as follows:

$ kubectl get events -A --sort-by='.lastTimestamp'
...
akash-services                                  3m43s       Warning   Evicted                 pod/akash-provider-0                                                        The node was low on resource: ephemeral-storage. Threshold quantity: 31189488855, available: 29362700Ki. Container provider was using 26060Ki, request is 0, has larger consumption of ephemeral-storage.
default                                         3m35s       Normal    NodeHasDiskPressure     node/node1                                                                  Node node1 status is now: NodeHasDiskPressure

More detailed events & logs can be found here.

This issue can arise with any deployment that writes a significant amount of data. The resulting disk usage pushes the node past the nodefs eviction threshold, at which point Kubernetes starts evicting pods in an attempt to reclaim ephemeral storage:

$ kubectl get events -A --sort-by='.lastTimestamp' | grep reclaim
default                                         16m         Warning   EvictionThresholdMet    node/node1                                                                  Attempting to reclaim ephemeral-storage
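
A quick way to check how close a node is to the eviction threshold is to look at the DiskPressure condition and the ephemeral-storage figures directly (node name node1 taken from the events above); a sketch:

$ kubectl get node node1 -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}'
$ kubectl describe node node1 | grep -E 'DiskPressure|ephemeral-storage'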

Additionally, it appears that the image size is not taken into account when determining available space. For instance, a worker node might have 150 GB of free disk space, allowing a tenant to claim all of it for their deployment. If the image itself is also large (e.g., 12 GB), pulling it consumes space on top of the claimed amount and can push the node over the eviction threshold:

root@node1:~# crictl images | grep llama
docker.io/yuravorobei/llama-2                           0.6                 5122212d50e6a       12.1GB
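
To see how much of the node's disk the pulled images themselves occupy (as opposed to the pods' writable layers), the image filesystem usage can be checked on the node; a quick sketch, assuming containerd as the runtime:

root@node1:~# crictl imagefsinfo           # usedBytes of the image filesystem
root@node1:~# du -sh /var/lib/containerd   # total containerd usage (images + snapshots)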

Disk usage on node1 is rapidly increasing, indicating a potential risk for further evictions:

root@node1:~# iotop
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                                                                                                                                                                        
2506135 be/4 root        0.00 B/s  266.15 M/s  ?unavailable?  python /usr/local/bin/uvicorn main:app --host 0.0.0.0 --port 7860
...

root@node1:~# df -Ph /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       194G  134G   60G  70% /
root@node1:~# df -Ph /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       194G  143G   52G  74% /

Despite having resource limits in place:

$ kubectl -n $ns get deployment/app -o yaml
...
        resources:
          limits:
            cpu: "4"
            ephemeral-storage: "161061273600"
            memory: "16106127360"
            nvidia.com/gpu: "1"
          requests:
            cpu: "4"
            ephemeral-storage: "161061273600"
            memory: "16106127360"
            nvidia.com/gpu: "1"
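
The ~150 GiB (161061273600-byte) ephemeral-storage request fits within the node's allocatable capacity, but the ~12 GB image pull is not counted against it. A quick way to compare the two (a sketch; $ns is the lease namespace as above):

$ kubectl get node node1 -o jsonpath='{.status.allocatable.ephemeral-storage}{"\n"}'
$ kubectl -n $ns get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources.requests.ephemeral-storage}{"\n"}{end}'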

The following are the kubelet's default hard eviction thresholds for memory (RAM), the nodefs (/var/lib/kubelet, i.e. ephemeral-storage), and the imagefs (/var/lib/containerd):

memory.available<100Mi
nodefs.available<10%
imagefs.available<15%
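
If a provider wanted to tune these (for example to evict sooner and keep more headroom for recovery), they can be set explicitly in the KubeletConfiguration. A minimal sketch; the values below are illustrative, not a recommendation:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
evictionMinimumReclaim:
  nodefs.available: "5Gi"
  imagefs.available: "10Gi"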

Refer to the Kubernetes Node-pressure Eviction documentation (https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/) for more details.

Reproducer

  1. Deploy a large image (say 5 GiB in size) requesting the maximum available disk space on the worker node. The available disk space can be obtained from the provider's 8443/status endpoint; use it to make sure the deployment lands on the intended node.
  2. Once deployed, SSH into the node (or use lease-shell) and start writing data. Use real data rather than zeroes so the writes actually consume disk space (df -Ph / can be checked from the Pod or directly on the worker host); see the sketch after this list.
  3. Once all disk space is consumed, let the system sit for 5-10 minutes so the kubelet notices the condition (NodeHasDiskPressure event) and starts evicting pods to reclaim ephemeral-storage (EvictionThresholdMet and Evicted events).
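
A rough sketch of steps 1 and 2 (the provider hostname is the sandbox provider from this issue; the jq path into the status JSON and the /data path inside the container are assumptions and may differ between provider versions and deployments):

# step 1: available ephemeral storage as reported by the provider
$ curl -sk https://provider.provider-02.sandbox-01.aksh.pw:8443/status | jq '.cluster.inventory.available'

# step 2: from inside the lease (lease-shell), fill the disk with non-compressible data
$ while dd if=/dev/urandom of=/data/fill.$RANDOM bs=1M count=1024; do df -Ph /; done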

The issue
The problem is that the K8s eviction manager evicts other pods as well, not just the offending one, most likely because it cannot tell which pod is the culprit.

Potential solution
A long-term solution could be to put the imagefs (/var/lib/containerd) on a separate partition from the nodefs (/var/lib/kubelet).
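
A rough sketch of what that migration could look like on a worker node with a spare disk (the device name /dev/vdb is an assumption; the node should be drained and the runtime stopped first):

$ kubectl drain node1 --ignore-daemonsets --delete-emptydir-data
root@node1:~# systemctl stop kubelet containerd
root@node1:~# mkfs.ext4 /dev/vdb
root@node1:~# mount /dev/vdb /mnt && rsync -aHAX /var/lib/containerd/ /mnt/ && umount /mnt
root@node1:~# echo '/dev/vdb /var/lib/containerd ext4 defaults 0 2' >> /etc/fstab
root@node1:~# mount /var/lib/containerd
root@node1:~# systemctl start containerd kubelet
$ kubectl uncordon node1

With a dedicated imagefs, image pulls no longer eat into the nodefs signal, and imagefs pressure is handled by image garbage collection rather than pod eviction.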

Action required

Please investigate and implement measures that account for image size when determining available space, and manage disk usage more effectively to prevent future evictions.

Additional context

I've seen this issue before, when tenants attempted to deploy heavy container images (>10 GiB in size) and the worker node did not have enough free space.

Potentially related,
#138

@andy108369 andy108369 added repo/provider Akash provider-services repo issues awaiting-triage labels May 17, 2024
@chainzero chainzero removed repo/provider Akash provider-services repo issues awaiting-triage labels Jun 5, 2024
@anilmurty

@devalpatel67 - please drop any ideas or suggestions for estimating image size before downloading it to the provider
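
One possible approach (just a sketch, not implying this is the agreed direction): the compressed image size can be estimated before pulling by summing the layer sizes from the registry manifest, e.g. with docker manifest inspect (skopeo or a direct registry API call would work similarly). Using the image from this issue:

$ docker manifest inspect docker.io/yuravorobei/llama-2:0.6 | jq '[.layers[].size] | add'

Note this yields the compressed size; the unpacked on-disk size (12.1 GB above) is typically larger, and for multi-arch images the manifest list has to be resolved to the per-architecture manifest first.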


andy108369 commented Sep 30, 2024

Hey @devalpatel67, any updates please?

FWIW, for k3s #217 (comment)
