Using limits with the shim makes the pod fail. #194

Open

mikkelhegn opened this issue Dec 13, 2023 · 3 comments

@mikkelhegn
The livenessProbe reports failure continuously. I'm not sure whether the pod is being restarted because of that, whether the application is actually running, or what the underlying problem is.

Repro using k3d:

k3d cluster create wasm-cluster \
  --image ghcr.io/deislabs/containerd-wasm-shims/examples/k3d:v0.10.0 \
  -p "8081:80@loadbalancer" \
  --agents 0

kubectl apply -f https://raw.githubusercontent.com/deislabs/containerd-wasm-shims/main/deployments/workloads/runtime.yaml

Then apply the following workloads for comparison:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fails
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fails
  template:
    metadata:
      labels:
        app: fails
    spec:
      runtimeClassName: wasmtime-spin
      containers:
        - name: fails
          image: ghcr.io/deislabs/containerd-wasm-shims/examples/spin-rust-hello:latest
          command: ["/"]
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
            requests:
              cpu: 100m
              memory: 128Mi
          livenessProbe:
            httpGet:
              path: .well-known/spin/health
              port: 80
            initialDelaySeconds: 3
            periodSeconds: 3
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: works
spec:
  replicas: 1
  selector:
    matchLabels:
      app: works
  template:
    metadata:
      labels:
        app: works
    spec:
      runtimeClassName: wasmtime-spin
      containers:
        - name: works
          image: ghcr.io/deislabs/containerd-wasm-shims/examples/spin-rust-hello:latest
          command: ["/"]
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
          livenessProbe:
            httpGet:
              path: .well-known/spin/health
              port: 80
            initialDelaySeconds: 3
            periodSeconds: 3
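
With both Deployments applied, the difference shows up directly in pod status and events. The exact pod names will vary, so the commands below are not part of the original report, just one way to watch it:

kubectl get pods -l 'app in (fails, works)' -w
kubectl describe pod -l app=fails
kubectl get events --field-selector reason=Unhealthy
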
mikkelhegn changed the title from "Using limit with the shim makes the pod fail." to "Using limits with the shim makes the pod fail." on Dec 13, 2023

@jpflueger

Just wanted to add my findings here as well. It looks like there is a CPU spike during startup that gets throttled by the resource limits. This may not be specific to the shim but rather a general consequence of CPU limits in Kubernetes. I used the following two deployments to measure how long it takes for Spin's port to open: with higher (or no) limits on the pod, the port opens in a relatively short time.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spin-slow-start
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spin-slow-start
  template:
    metadata:
      labels:
        app: spin-slow-start
    spec:
      runtimeClassName: wasmtime-spin
      containers:
        - name: spin-hello
          image: ghcr.io/deislabs/containerd-wasm-shims/examples/spin-rust-hello:v0.10.0
          command: ["/"]
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
            requests:
              cpu: 100m
              memory: 128Mi
        - image: alpine:latest
          name: debug-alpine
          command: ["/bin/sh", "-c"]
          args:
            - |
              TARGET_HOST='127.0.0.1'

              echo "START: waiting for $TARGET_HOST:80"
              timeout 60 sh -c 'until nc -z $0 $1; do sleep 1; done' $TARGET_HOST 80
              echo "END: waiting for $TARGET_HOST:80"

              sleep 100000000
          resources: {}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spin-faster-start
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spin-faster-start
  template:
    metadata:
      labels:
        app: spin-faster-start
    spec:
      runtimeClassName: wasmtime-spin
      containers:
        - name: spin-hello
          image: ghcr.io/deislabs/containerd-wasm-shims/examples/spin-rust-hello:v0.10.0
          command: ["/"]
          resources:
            limits:
              cpu: 400m
              memory: 512Mi
            requests:
              cpu: 100m
              memory: 128Mi
        - image: alpine:latest
          name: debug-alpine
          command: ["/bin/sh", "-c"]
          args:
            - |
              TARGET_HOST='127.0.0.1'

              echo "START: waiting for $TARGET_HOST:80"
              timeout 60 sh -c 'until nc -z $0 $1; do sleep 1; done' $TARGET_HOST 80
              echo "END: waiting for $TARGET_HOST:80"

              sleep 100000000
          resources: {}
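
To read the timing out of the sidecar, something along these lines works (not part of the original comment; the container name matches the debug-alpine sidecar above). The timestamps on the START/END lines show how long the port stayed closed:

kubectl logs deploy/spin-slow-start -c debug-alpine --timestamps
kubectl logs deploy/spin-faster-start -c debug-alpine --timestamps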

Maybe the fix here is just to remove the limits from the example deployments, or bump them up? We could also evaluate adding pod overhead (overhead.podFixed.cpu) to the runtime class so the limits are tolerant of spikes, though that might impact the ability to schedule the pods on smaller nodes.
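
For reference, a rough sketch of what pod overhead on the RuntimeClass could look like. This is not from the shim's runtime.yaml: the handler name is assumed to be the one the wasmtime-spin class already uses, and the quantities are placeholders:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: wasmtime-spin
handler: spin          # assumed; keep whatever handler runtime.yaml registers
overhead:
  podFixed:
    cpu: 100m          # placeholder headroom for the startup spike
    memory: 32Mi

Overhead is added to the pod's requests at scheduling time and to the pod-level cgroup, which is why it eats into capacity on smaller nodes, as noted above.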

@jsturtevant
Contributor

@mikkelhegn could you check to see if higher limits help?

You might also play with the livenessProbe settings. Delaying the first probe by a few more seconds, past the CPU spike of the initial boot, might help too:

initialDelaySeconds: 10
periodSeconds: 3
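
Another option, not mentioned above but sketched here for comparison: keep the livenessProbe tight and add a startupProbe that tolerates the initial spike. Liveness checks don't start until the startup probe succeeds, so the steady-state probe settings stay unchanged:

livenessProbe:
  httpGet:
    path: .well-known/spin/health
    port: 80
  periodSeconds: 3
startupProbe:
  httpGet:
    path: .well-known/spin/health
    port: 80
  failureThreshold: 20   # allows up to ~60s (20 x 3s) for the first success
  periodSeconds: 3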

@mikkelhegn
Author

mikkelhegn commented Dec 15, 2023

I had to bump initialDelaySeconds to 45 seconds to keep the livenessProbe from failing.
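
For reference, that corresponds to a probe block along these lines (reconstructed from the comment above; the other fields are unchanged from the original deployment):

livenessProbe:
  httpGet:
    path: .well-known/spin/health
    port: 80
  initialDelaySeconds: 45
  periodSeconds: 3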
