Deployment Failures Leading to Double Resource Consumption and Risk of Node Eviction Due to NodeHasDiskPressure
#138
I've tried the light image of mine. @anilmurty pointed out this log line, which would suggest that the size of the image is causing the issue; I'm now more inclined to believe that's the main reason:
as the image Foundry tried is enormous -
Couldn't repro the node disk pressure when running; however, I could repro it when I used the default entrypoint of that image. I guess it is doing something there that triggers the issue; going to investigate deeper.
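(For context, an entrypoint override in an Akash SDL service looks roughly like the sketch below; the image name and command are illustrative placeholders, not the ones used in this test.)

```yaml
# Hypothetical SDL fragment: override the image's default entrypoint so the
# container just idles, to compare against the default-entrypoint behaviour.
services:
  app:
    image: <image-under-test>   # placeholder
    command:
      - "sleep"
    args:
      - "infinity"
```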
The behavior is quite different, depending on whether the image was cached or being pulled.
Running image
Re-running image
Additionally, running image
The difference from the original behavior is that it is supposed to restart that pod. It is possible that it spawns another replica/pod which gets indefinitely stuck in "Pending"; this would be the case if the Foundry provider didn't implement it. The SDL is the original with slight modifications, primarily:
After which:
Before the deployment:
Running image
So the whole problem is the large image / low ephemeral storage space (nodefs, imagefs). There are certain thresholds which can be tweaked:
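For reference, a sketch of the kubelet settings typically involved (the values shown are the upstream defaults, not necessarily what this provider runs):

```yaml
# /var/lib/kubelet/config.yaml (fragment) -- illustrative values only
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  nodefs.available: "10%"    # evict pods when node filesystem free space drops below this
  imagefs.available: "15%"   # evict pods when image filesystem free space drops below this
imageGCHighThresholdPercent: 85   # image garbage collection kicks in above this disk usage
imageGCLowThresholdPercent: 80    # and frees images until usage is back under this
```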
The pods have been running well on that node; i.e. the node itself wasn't getting evicted, just the deployment. The node was disappearing from the akash-provider report (8443/status) for a short time while it was collecting the garbage. I guess we are good then. The pods have been running on
It appears the provider is running a chaperone utility which kills certain deployments. And since I've been using the sshd-based one, it was getting killed:
Going to re-test the 4th scenario.
The pod
and the node
Provider closed the lease. FWIW: the provider wasn't accessible initially because it was missing the haproxy rule to redirect 8443 (akash-provider) to the worker node it has been running on (see the sketch below).
Logs
More logs
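Regarding the missing haproxy rule, a TCP passthrough along these lines would forward 8443 to the worker node running akash-provider (addresses are hypothetical):

```
# haproxy.cfg fragment -- hypothetical addresses
frontend akash_provider
    bind *:8443
    mode tcp
    default_backend akash_provider_worker

backend akash_provider_worker
    mode tcp
    server worker1 10.0.0.12:8443 check   # worker node where akash-provider runs
```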
Looks like the provider increased the disk space, so the SDL can be retried there again.
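If so, the new allocatable ephemeral storage can be confirmed with something like this (node name is a placeholder):

```sh
# Capacity/Allocatable includes ephemeral-storage
kubectl describe node <node-name> | grep -A7 "Allocatable:"
```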
Foundry encountered an issue where a node with 88Gi of available ephemeral disk space was experiencing NodeHasDiskPressure. This is evident from the events logged by the kubelet:

Correction: the node isn't getting fully evicted, but it is being reported as in poor condition and the pods there are being evicted:
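For reference, the node condition and the resulting evictions can be inspected with something like the following (node name is a placeholder):

```sh
# DiskPressure shows up under the node's Conditions, with recent kubelet events below
kubectl describe node <node-name> | grep -A8 "Conditions:"

# Pods evicted because of it
kubectl get events -A --field-selector reason=Evicted
```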
The issue was linked to a particular deployment, the contents of which can be found at this link. Due to continuous failures, Kubernetes kept trying to restart the deployment:
To recreate the issue, a simple SDL was used:
The available disk space decreased by twice the requested amount every 10 seconds, with the pod status alternating between Error and Pending. This indicated that Kubernetes was continually trying to restart the pod.

Upon checking the resource consumption before and after submitting the SDL, it was found that CPU, memory, and storage were all being consumed at twice the rate requested by the deployment. Before the SDL submission, the resource status was:
After the SDL submission:
A few seconds later:
When the deployment started to crash and redeploy:
This resulted in unexpected resource consumption, with the node allocating nearly double the resources it was supposed to:
Logs from Foundry:
And the provider goes offline too (since the akash-provider pod gets evicted from that node as well).
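For reference, the doubled requests described above show up in the node's allocation summary, e.g. (node name is a placeholder):

```sh
# Per-node totals of CPU / memory / ephemeral-storage requests vs. allocatable
kubectl describe node <node-name> | sed -n '/Allocated resources:/,/Events:/p'
```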