
Liveness and Readiness failures for Kong Pods #11698

Closed
anup-krai opened this issue Oct 5, 2023 Discussed in #11693 · 13 comments
Labels
area/ingress-controller: Issues where Kong is running as a Kubernetes Ingress Controller
area/kubernetes: Issues where Kong is running on top of Kubernetes
pending author feedback: Waiting for the issue author to get back to a maintainer with findings, more details, etc.

Comments

@anup-krai

Discussed in #11693

Originally posted by anup-krai October 4, 2023
Hi All,
We are using Kong (3.0.1) with KIC (2.8.2) and a Postgres DB. For the last few days we have been getting the liveness and readiness failures below, which are resulting in proxy container restarts:

Events:
  Type     Reason     Age                  From     Message
  ----     ------     ----                 ----     -------
  Warning  Unhealthy  25m (x36 over 5h5m)  kubelet  Liveness probe failed: Get "http://X.X.X.38:8100/status": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Normal   Killing    25m                  kubelet  Container proxy failed liveness probe, will be restarted
  Warning  Unhealthy  25m (x38 over 5h5m)  kubelet  Readiness probe failed: Get "http://X.X.X.38:8100/status": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Normal   Pulling    24m (x2 over 5h25m)  kubelet  Pulling image "kong:3.0.1"
  Warning  Unhealthy  24m                  kubelet  Readiness probe failed: Get "http://X.X.X.38:8100/status": read tcp X.X.X.107:58132->X.X.X.38:8100: read: connection reset by peer
  Warning  Unhealthy  24m                  kubelet  Readiness probe failed: Get "http://X.X.X.38:8100/status": dial tcp X.X.X.38:8100: connect: connection refused

What could be causing this error? Below are a few logs:

2023/10/03 12:21:32 [alert] 1110#0: *1050641 open socket #426 left in connection 2911
2023/10/03 12:21:32 [alert] 1110#0: aborting
2023/10/03 12:21:35 [alert] 1#0: worker process 1109 exited on signal 9
@anup-krai
Copy link
Author

After enabling debug-level logs I see the following:

2023/10/05 10:23:08 [debug] 1108#0: *10990 [lua] targets.lua:240: executing requery for: ****
2023/10/05 10:23:08 [debug] 1108#0: *10990 [lua] targets.lua:439: queryDns(): querying dns for ***
2023/10/05 10:23:08 [debug] 1108#0: *10990 [lua] targets.lua:293: f(): no dns changes detected for ****
2023/10/05 10:23:08 [notice] 1#0: signal 3 (SIGQUIT) received from 1121, shutting down
2023/10/05 10:23:08 [notice] 1108#0: gracefully shutting down
2023/10/05 10:23:08 [debug] 1108#0: *10999 [lua] events.lua:211: do_event_json(): worker-events: handling event; source=resty-worker-events, event=stopping, pid=1108, data=nil
2023/10/05 10:23:08 [crit] 1108#0: *11001 [lua] targets.lua:248: could not reschedule DNS resolver timer: process exiting, context: ngx.timer
2023/10/05 10:23:08 [notice] 1108#0: exiting
2023/10/05 10:23:08 [notice] 1108#0: exit
2023/10/05 10:23:08 [notice] 1#0: signal 17 (SIGCHLD) received from 1108
2023/10/05 10:23:08 [notice] 1#0: worker process 1108 exited with code 0
2023/10/05 10:23:08 [notice] 1#0: signal 29 (SIGIO) received
2023/10/05 10:23:18 [notice] 1#0: signal 15 (SIGTERM) received from 1225, exiting
2023/10/05 10:23:18 [notice] 1#0: signal 14 (SIGALRM) received
2023/10/05 10:23:18 [notice] 1#0: signal 3 (SIGQUIT) received, shutting down
2023/10/05 10:23:18 [notice] 1#0: signal 14 (SIGALRM) received
2023/10/05 10:23:18 [notice] 1#0: signal 14 (SIGALRM) received
2023/10/05 10:23:19 [notice] 1#0: signal 14 (SIGALRM) received
2023/10/05 10:23:19 [notice] 1#0: signal 14 (SIGALRM) received
2023/10/05 10:23:21 [notice] 1#0: signal 14 (SIGALRM) received
2023/10/05 10:23:21 [notice] 1#0: signal 17 (SIGCHLD) received from 1109
2023/10/05 10:23:21 [alert] 1#0: worker process 1109 exited on signal 9
2023/10/05 10:23:21 [notice] 1#0: exit

We have reverted the Kong version from 3.0.1 to 2.8.3, and we see a similar issue there as well.

@nowNick
Contributor

nowNick commented Oct 6, 2023

Hey @anup-krai

Thank you for reporting this issue. Looking at the logs, it seems like something is killing the master process:
2023/10/05 10:23:08 [notice] 1#0: signal 3 (SIGQUIT) received from 1121, shutting down
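
Since the workers also end up exiting on signal 9 (SIGKILL), it may be worth checking whether the container is being force-killed by the kubelet after the failed probes or by the kernel OOM killer. A rough check, with the pod name and namespace as placeholders (the container name "proxy" comes from your events):

# Inspect the proxy container's last termination state and recent pod events
kubectl get pod <kong-pod> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[?(@.name=="proxy")].lastState}'
kubectl get events -n <namespace> --field-selector involvedObject.name=<kong-pod>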

Could you please share your liveness/readiness probe YAMLs?

Do you think it's possible that there's a networking issue inside your k8s cluster? You could try disabling the liveness/readiness probes and verifying whether, and how quickly, Kong responds with something like:

time curl -v http://X.X.X.38:8100/status
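
If reaching the pod IP directly is awkward, the same check can be run from inside the proxy container or through a port-forward. A rough sketch, with the deployment name and namespace as placeholders (it assumes curl is available in the image):

# Let curl itself report the status code and total response time from inside the proxy container
kubectl exec -n <namespace> deploy/<kong-deployment> -c proxy -- \
  curl -s -o /dev/null -w 'status=%{http_code} total=%{time_total}s\n' \
  http://localhost:8100/status

# Or from a workstation via a port-forward
kubectl port-forward -n <namespace> deploy/<kong-deployment> 8100:8100 &
time curl -v http://localhost:8100/status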

@nowNick nowNick added area/kubernetes Issues where Kong is running on top of Kubernetes area/ingress-controller Issues where Kong is running as a Kubernetes Ingress Controller pending author feedback Waiting for the issue author to get back to a maintainer with findings, more details, etc... labels Oct 6, 2023
@anup-krai
Author

anup-krai commented Oct 6, 2023

Hi @nowNick ,

Thanks for your response. Below are the liveness/readiness probe YAMLs:

  readinessProbe:
    httpGet:
      path: "/status"
      port: status
      scheme: HTTP
    initialDelaySeconds: 5
    timeoutSeconds: 5
    periodSeconds: 10
    successThreshold: 1
    failureThreshold: 3

  # livenessProbe for Kong pods
  livenessProbe:
    httpGet:
      path: "/status"
      port: status
      scheme: HTTP
    initialDelaySeconds: 5
    timeoutSeconds: 5
    periodSeconds: 10
    successThreshold: 1
    failureThreshold: 3

I increased timeoutSeconds to 5 hours and consumers are still seeing failures. Status calls were also failing even though the containers were not restarting.
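
For reference: if these pods come from the kong/kong Helm chart, the probe settings above can usually be overridden through the chart's readinessProbe/livenessProbe values rather than by editing manifests. A rough sketch, with the release name, namespace, and numbers as placeholders (check your chart's values.yaml first):

# Example probe-timing override via Helm values, applied on top of the existing release values
helm upgrade <release> kong/kong -n <namespace> --reuse-values \
  --set readinessProbe.timeoutSeconds=10 \
  --set livenessProbe.timeoutSeconds=10 \
  --set livenessProbe.failureThreshold=6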

@nowNick
Contributor

nowNick commented Oct 6, 2023

Thank you! The probe YAMLs look OK.
Now let's make sure they can actually reach Kong: have you tried disabling the probes and manually reaching the Kong pods with curl?
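
For completeness, one rough way to drop both probes temporarily without editing the chart, with the deployment name and namespace as placeholders (the container index must point at the proxy container):

# Remove both probes from the proxy container via a JSON patch
kubectl patch deployment <kong-deployment> -n <namespace> --type=json -p='[
  {"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"},
  {"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}
]'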

@anup-krai
Author

Yes, we have changed the probe endpoints and also tried disabling them; sometimes the curls were timing out or had high response times.

@anup-krai
Author

Team, can you please advise on this? It looks like a similar issue has been raised in #11710.

@nowNick
Contributor

nowNick commented Oct 9, 2023

Hey @anup-krai! Could you tell us how large your configuration is (number of routes/services/consumers), roughly?
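
If it helps, a rough way to get those counts is to page through the Admin API; the sketch below assumes the Admin API is reachable on localhost:8001 and that jq is available:

# Count routes/services/consumers/plugins by following Admin API pagination
for entity in routes services consumers plugins; do
  count=0
  next="/${entity}?size=1000"
  while [ -n "$next" ] && [ "$next" != "null" ]; do
    page=$(curl -s "http://localhost:8001${next}")
    count=$((count + $(echo "$page" | jq '.data | length')))
    next=$(echo "$page" | jq -r '.next')
  done
  echo "${entity}: ${count}"
done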

@anup-krai
Author

Hi @nowNick, below are the details:
Routes - 900+
Services - 900+
Consumers - 100+
Plugins - 2900+

@nowNick nowNick removed the pending author feedback Waiting for the issue author to get back to a maintainer with findings, more details, etc... label Oct 10, 2023
@anup-krai
Author

Team, can you please advise on this?

@nowNick
Contributor

nowNick commented Oct 16, 2023

Hey @anup-krai

It seems like a similar issue to yours has been closed with this comment:
#11710 (comment)

Do you think the resolution might also be similar?

@nowNick nowNick added the pending author feedback Waiting for the issue author to get back to a maintainer with findings, more details, etc... label Oct 16, 2023
@anup-krai
Author

Yes, we were able to identify the issue and it is fixed now. Thanks for your response, @nowNick.

@alexandresavicki

@anup-krai and @nowNick, I think I'm suffering from the same issue here. Can you please share how you identified the offending plugin causing this problem?

@regnaio

regnaio commented Dec 29, 2024

Seeing the same issue.
