Kong proxy goes unresponsive with normal traffic - #8152 issue - kubelet restarts the container since liveness probe fails #11710

kranthirachakonda · 2023-10-08T03:34:19Z

Discussed in #11709

^{Originally posted by kranthirachakonda October 7, 2023}
Looks like we are running into Kong #8152 issue with 2.8.3 and also 3.0.1. We had new route/service added recently and there is increase in traffic but not huge. May ~15% increase. Only proxy container inside Kong pods gets restart due to /status timing out. We tried changing the timeout/period intervals but that didnt help much. We dont see our containers going above the cpu/memory requests.

Behavior is identical, where kong proxy container goes into unresponsive mode for about 4-5 mins and processes again. We also observed that while I gave memory requests of 4G, proxy container never goes above 512Mi. Not sure if there are default limits within LuaVM or threads etc. Can you please help me in debugging further.

db-less

kong:2.8.3

    Limits:
      cpu:     2500m
      memory:  6Gi
    Requests:
      cpu:      1
      memory:   4Gi
    Liveness:   http-get http://:status/status delay=5s timeout=5s period=10s #success=1 #failure=3

kubernetes-ingress-controller:2.8.2

    Limits:
      cpu:     2
      memory:  2Gi
    Requests:
      cpu:      200m
      memory:   1Gi
    Liveness:   http-get http://:10254/healthz delay=5s timeout=5s period=10s #success=1 #failure=3

Events:

  Type     Reason     Age                    From     Message
  ----     ------     ----                   ----     -------
  Warning  Unhealthy  4m7s (x94 over 20h)    kubelet  Liveness probe failed: Get "http://192.168.10.79:8100/status": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Normal   Killing    4m7s (x27 over 20h)    kubelet  Container proxy failed liveness probe, will be restarted

These are few errors I see while proxy containers goes into a condition where status page fails to respond within 5s, causing kubelet to restart when liveness probe fails.

[lua] targets.lua:248: could not reschedule DNS resolver timer: process exiting, context: ngx.timer
[lua] init.lua:409: [cluster_events] failed to start recurring polling timer: process exiting, context: ngx.timer

Admin page

{
    "configuration": {
        "nginx_kong_conf": "/kong_prefix/nginx-kong.conf",
        "nginx_kong_stream_conf": "/kong_prefix/nginx-kong-stream.conf",
        "dns_error_ttl": 1,
        "kong_env": "/kong_prefix/.kong_env",
        "dns_not_found_ttl": 30,
        "dns_stale_ttl": 4,
        "dns_cache_size": 10000,
        "dns_order": [
            "LAST",
            "SRV",
            "A",
            "CNAME"
        ],
        "dns_no_sync": false,
        "upstream_keepalive_pool_size": 60,
        "client_ssl_cert_default": "/kong_prefix/ssl/kong-default.crt",
        "client_ssl_cert_key_default": "/kong_prefix/ssl/kong-default.key",
        "admin_ssl_cert_default": "/kong_prefix/ssl/admin-kong-default.crt",
        "admin_ssl_cert_key_default": "/kong_prefix/ssl/admin-kong-default.key",
        "admin_ssl_cert_default_ecdsa": "/kong_prefix/ssl/admin-kong-default-ecdsa.crt",
        "ssl_cipher_suite": "intermediate",
        "admin_ssl_cert_key_default_ecdsa": "/kong_prefix/ssl/admin-kong-default-ecdsa.key",
        "ssl_ciphers": "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384",
        "status_ssl_cert_default": "/kong_prefix/ssl/status-kong-default.crt",
        "status_ssl_cert_key_default": "/kong_prefix/ssl/status-kong-default.key",
        "nginx_main_worker_rlimit_nofile": "auto",
        "upstream_keepalive_idle_timeout": 60,
        "nginx_events_worker_connections": "auto",
        "nginx_events_multi_accept": "on",
        "lua_package_cpath": "",
        "port_maps": [
            "443:8443"
        ],
        "ssl_cert": [
            "/etc/secrets/proxy-tls/tls.crt"
        ],
        "proxy_listen": [
            "0.0.0.0:8443 http2 ssl"
        ],
        "admin_listen": [
            "0.0.0.0:8444 ssl"
        ],
        "status_listen": [
            "0.0.0.0:8100",
            "0.0.0.0:8543 ssl"
        ],
        "stream_listen": [
            "off"
        ],
        "cluster_listen": [
            "off"
        ],
        "admin_ssl_cert": [
            "/kong_prefix/ssl/admin-kong-default.crt",
            "/kong_prefix/ssl/admin-kong-default-ecdsa.crt"
        ],
        "nginx_admin_client_max_body_size": "10m",
        "status_ssl_cert": [
            "/kong_prefix/ssl/status-kong-default.crt",
            "/kong_prefix/ssl/status-kong-default-ecdsa.crt"
        ],
        "status_ssl_cert_key": [
            "/kong_prefix/ssl/status-kong-default.key",
            "/kong_prefix/ssl/status-kong-default-ecdsa.key"
        ],
        "nginx_admin_client_body_buffer_size": "10m",
        "nginx_user": "kong kong",
        "nginx_main_user": "kong kong",
        "nginx_daemon": "off",
        "nginx_main_daemon": "off",
        "nginx_worker_processes": "2",
        "nginx_main_worker_processes": "2",
        "trusted_ips": {},
        "kic": true,
        "nginx_proxy_real_ip_header": "X-Real-IP",
        "real_ip_recursive": "off",
        "nginx_proxy_real_ip_recursive": "off",
        "client_max_body_size": "0",
        "nginx_http_client_max_body_size": "0",
        "client_body_buffer_size": "8k",
        "nginx_http_client_body_buffer_size": "8k",
        "pg_port": 50001,
        "pg_password": "******",
        "pg_ssl": true,
        "pg_ssl_verify": false,
        "pg_max_concurrent_queries": 0,
        "pg_semaphore_timeout": 60000,
        "pg_ro_port": 50002,
        "pg_user": "user1",
        "pg_ro_password": "******",
        "pg_ro_ssl": true,
        "enabled_headers": {
            "server_tokens": true,
            "latency_tokens": true,
            "Via": true,
            "Server": true,
            "X-Kong-Proxy-Latency": true,
            "X-Kong-Response-Latency": true,
            "X-Kong-Admin-Latency": true,
            "X-Kong-Upstream-Status": true,
            "X-Kong-Upstream-Latency": true
        },
        "cassandra_contact_points": [
            "127.0.0.1"
        ],
        "cassandra_port": 9042,
        "pg_ro_host": "hawkeye-haproxy-svc.hawkeye-haproxy-ns.svc.cluster.local",
        "cassandra_ssl": false,
        "cassandra_ssl_verify": false,
        "cassandra_write_consistency": "ONE",
        "pg_ro_user": "user1",
        "cassandra_read_consistency": "ONE",
        "cassandra_lb_policy": "RequestRoundRobin",
        "cassandra_keyspace": "kong",
        "cassandra_refresh_frequency": 60,
        "error_default_type": "text/plain",
        "cassandra_username": "kong",
        "cassandra_repl_factor": 1,
        "cassandra_data_centers": [
            "dc1:2",
            "dc2:3"
        ],
        "cassandra_schema_consensus_timeout": 10000,
        "ssl_protocols": "TLSv1.1 TLSv1.2 TLSv1.3",
        "nginx_http_ssl_protocols": "TLSv1.2 TLSv1.3",
        "nginx_stream_ssl_protocols": "TLSv1.2 TLSv1.3",
        "worker_consistency": "strict",
        "nginx_http_ssl_prefer_server_ciphers": "off",
        "nginx_stream_ssl_prefer_server_ciphers": "off",
        "ssl_dhparam": "ffdhe2048",
        "pluginserver_names": {},
        "nginx_stream_ssl_dhparam": "ffdhe2048",
        "ssl_session_tickets": "on",
        "nginx_http_ssl_session_tickets": "on",
        "nginx_stream_ssl_session_tickets": "on",
        "stream_proxy_ssl_enabled": false,
        "go_plugins_dir": "off",
        "prefix": "/kong_prefix",
        "go_pluginserver_exe": "/usr/local/bin/go-pluginserver",
        "lua_package_path": "/opt/?.lua;/opt/?/init.lua;;",
        "db_cache_warmup_entities": [
            "services"
        ],
        "status_listeners": [
            {
                "ssl": false,
                "port": 8100,
                "listener": "0.0.0.0:8100",
                "ip": "0.0.0.0"
            },
            {
                "ssl": true,
                "port": 8543,
                "listener": "0.0.0.0:8543 ssl",
                "ip": "0.0.0.0"
            }
        ],
        "proxy_ssl_enabled": true,
        "pg_timeout": 5000,
        "role": "traditional",
        "ssl_cert_key_default": "/kong_prefix/ssl/kong-default.key",
        "headers": [
            "server_tokens",
            "latency_tokens",
            "X-Kong-Upstream-Status"
        ],
        "proxy_stream_error_log": "logs/error.log",
        "ssl_cert_default": "/kong_prefix/ssl/kong-default.crt",
        "nginx_http_lua_shared_dict": "prometheus_metrics 100m",
        "ssl_cert_key_default_ecdsa": "/kong_prefix/ssl/kong-default-ecdsa.key",
        "loaded_vaults": {},
        "plugins": [
            "bundled",
            "json-threat-protection",
            "upstream-oauth2",
            "upstream-basic-auth",
            "payload-logging",
            "custom-log",
            "upstream-jwt",
            "upstream-xdt-oauth2",
            "set-target-host",
            "hmac-url-auth",
            "inmar-hmac-auth",
            "oauth-consumer",
            "signnow-hmac-auth",
            "jwt-auth",
            "batch-rate-limiting"
        ],
        "admin_error_log": "/dev/stderr",
        "loaded_plugins": {
            "response-ratelimiting": true,
            "syslog": true,
            "loggly": true,
            "datadog": true,
            "ldap-auth": true,
            "statsd": true,
            "bot-detection": true,
            "aws-lambda": true,
            "request-termination": true,
            "prometheus": true,
            "proxy-cache": true,
            "session": true,
            "acme": true,
            "grpc-gateway": true,
            "grpc-web": true,
            "pre-function": true,
            "post-function": true,
            "azure-functions": true,
            "zipkin": true,
            "signnow-hmac-auth": true,
            "jwt-auth": true,
            "batch-rate-limiting": true,
            "upstream-oauth2": true,
            "upstream-basic-auth": true,
            "payload-logging": true,
            "custom-log": true,
            "upstream-jwt": true,
            "upstream-xdt-oauth2": true,
            "set-target-host": true,
            "hmac-url-auth": true,
            "inmar-hmac-auth": true,
            "oauth-consumer": true,
            "json-threat-protection": true,
            "jwt": true,
            "acl": true,
            "correlation-id": true,
            "cors": true,
            "oauth2": true,
            "tcp-log": true,
            "udp-log": true,
            "file-log": true,
            "http-log": true,
            "key-auth": true,
            "hmac-auth": true,
            "basic-auth": true,
            "ip-restriction": true,
            "request-transformer": true,
            "response-transformer": true,
            "request-size-limiting": true,
            "rate-limiting": true
        },
        "proxy_error_log": "/dev/stderr",
        "host_ports": {
            "8443": 443,
            "8443": 443
        },
        "anonymous_reports": false,
        "real_ip_header": "X-Real-IP",
        "database": "postgres",
        "ssl_prefer_server_ciphers": "on",
        "status_ssl_cert_key_default_ecdsa": "/kong_prefix/ssl/status-kong-default-ecdsa.key",
        "db_resurrect_ttl": 30,
        "ssl_session_timeout": "1d",
        "nginx_http_ssl_session_timeout": "1d",
        "nginx_stream_ssl_session_timeout": "1d",
        "proxy_access_log": "/dev/stdout",
        "worker_state_update_frequency": 5,
        "proxy_stream_access_log": "logs/access.log basic",
        "dns_resolver": {},
        "admin_access_log": "/dev/stdout",
        "dns_hostsfile": "/etc/hosts",
        "status_access_log": "off",
        "status_error_log": "logs/status_error.log",
        "log_level": "error",
        "cluster_listeners": {},
        "status_ssl_enabled": true,
        "vaults": [
            "off"
        ],
        "nginx_optimizations": true,
        "lua_ssl_trusted_certificate": {},
        "lua_ssl_verify_depth": 1,
        "lua_ssl_protocols": "TLSv1.1 TLSv1.2 TLSv1.3",
        "nginx_http_lua_ssl_protocols": "TLSv1.1 TLSv1.2 TLSv1.3",
        "nginx_stream_lua_ssl_protocols": "TLSv1.1 TLSv1.2 TLSv1.3",
        "lua_socket_pool_size": 30,
        "db_update_frequency": 5,
        "cluster_control_plane": "127.0.0.1:8005",
        "cassandra_repl_strategy": "SimpleStrategy",
        "mem_cache_size": "3000m",
        "cluster_mtls": "shared",
        "ssl_cert_key": [
            "/etc/secrets/proxy-tls/tls.key"
        ],
        "db_cache_ttl": 0,
        "nginx_main_directives": [
            {
                "value": "off",
                "name": "daemon"
            },
            {
                "value": "KONG_REDIS_HOST;env KONG_REDIS_PORT;env KONG_REDIS_PASSWORD",
                "name": "env"
            },
            {
                "value": "kong kong",
                "name": "user"
            },
            {
                "value": "2",
                "name": "worker_processes"
            },
            {
                "value": "auto",
                "name": "worker_rlimit_nofile"
            }
        ],
        "cluster_data_plane_purge_delay": 1209600,
        "cluster_ocsp": "off",
        "cluster_max_payload": 4194304,
        "untrusted_lua": "sandbox",
        "nginx_http_directives": [
            {
                "value": "8k",
                "name": "client_body_buffer_size"
            },
            {
                "value": "0",
                "name": "client_max_body_size"
            },
            {
                "value": "100000",
                "name": "lua_regex_match_limit"
            },
            {
                "value": "prometheus_metrics 100m",
                "name": "lua_shared_dict"
            },
            {
                "value": "TLSv1.1 TLSv1.2 TLSv1.3",
                "name": "lua_ssl_protocols"
            },
            {
                "value": "/kong_prefix/ssl/ffdhe2048.pem",
                "name": "ssl_dhparam"
            },
            {
                "value": "off",
                "name": "ssl_prefer_server_ciphers"
            },
            {
                "value": "TLSv1.2 TLSv1.3",
                "name": "ssl_protocols"
            },
            {
                "value": "on",
                "name": "ssl_session_tickets"
            },
            {
                "value": "1d",
                "name": "ssl_session_timeout"
            }
        ],
        "untrusted_lua_sandbox_requires": {},
        "untrusted_lua_sandbox_environment": {},
        "client_ssl": false,
        "upstream_keepalive_max_requests": 100,
        "nginx_upstream_directives": {},
        "pg_ro_database": "hawkeye",
        "nginx_proxy_directives": [
            {
                "value": "X-Real-IP",
                "name": "real_ip_header"
            },
            {
                "value": "off",
                "name": "real_ip_recursive"
            }
        ],
        "pg_database": "hawkeye",
        "nginx_status_directives": {},
        "admin_ssl_cert_key": [
            "/kong_prefix/ssl/admin-kong-default.key",
            "/kong_prefix/ssl/admin-kong-default-ecdsa.key"
        ],
        "nginx_admin_directives": [
            {
                "value": "10m",
                "name": "client_body_buffer_size"
            },
            {
                "value": "10m",
                "name": "client_max_body_size"
            }
        ],
        "nginx_events_directives": [
            {
                "value": "on",
                "name": "multi_accept"
            },
            {
                "value": "auto",
                "name": "worker_connections"
            }
        ],
        "nginx_stream_directives": [
            {
                "value": "stream_prometheus_metrics 5m",
                "name": "lua_shared_dict"
            },
            {
                "value": "TLSv1.1 TLSv1.2 TLSv1.3",
                "name": "lua_ssl_protocols"
            },
            {
                "value": "/kong_prefix/ssl/ffdhe2048.pem",
                "name": "ssl_dhparam"
            },
            {
                "value": "off",
                "name": "ssl_prefer_server_ciphers"
            },
            {
                "value": "TLSv1.2 TLSv1.3",
                "name": "ssl_protocols"
            },
            {
                "value": "on",
                "name": "ssl_session_tickets"
            },
            {
                "value": "1d",
                "name": "ssl_session_timeout"
            }
        ],
        "pg_host": "hawkeye-haproxy-svc.hawkeye-haproxy-ns.svc.cluster.local",
        "nginx_supstream_directives": {},
        "db_update_propagation": 0,
        "nginx_sproxy_directives": {},
        "proxy_listeners": [
            {
                "ssl": true,
                "port": 8443,
                "bind": false,
                "listener": "0.0.0.0:8443 ssl http2",
                "deferred": false,
                "reuseport": false,
                "backlog=%d+": false,
                "proxy_protocol": false,
                "http2": true,
                "ip": "0.0.0.0"
            }
        ],
        "stream_listeners": {},
        "admin_listeners": [
            {
                "ssl": true,
                "port": 8444,
                "bind": false,
                "listener": "0.0.0.0:8444 ssl",
                "deferred": false,
                "reuseport": false,
                "backlog=%d+": false,
                "proxy_protocol": false,
                "http2": false,
                "ip": "0.0.0.0"
            }
        ],
        "admin_ssl_enabled": true,
        "nginx_http_upstream_directives": {},
        "nginx_http_status_directives": {},
        "nginx_http_ssl_dhparam": "ffdhe2048",
        "nginx_pid": "/kong_prefix/pids/nginx.pid",
        "nginx_http_lua_regex_match_limit": "100000",
        "status_ssl_cert_default_ecdsa": "/kong_prefix/ssl/status-kong-default-ecdsa.crt",
        "nginx_err_logs": "/kong_prefix/logs/error.log",
        "nginx_main_env": "KONG_REDIS_HOST;env KONG_REDIS_PORT;env KONG_REDIS_PASSWORD",
        "cassandra_timeout": 5000,
        "nginx_acc_logs": "/kong_prefix/logs/access.log",
        "pg_ro_ssl_verify": false,
        "admin_acc_logs": "/kong_prefix/logs/admin_access.log",
        "ssl_cert_csr_default": "/kong_prefix/ssl/kong-default.csr",
        "nginx_conf": "/kong_prefix/nginx.conf",
        "ssl_cert_default_ecdsa": "/kong_prefix/ssl/kong-default-ecdsa.crt"
    },
    "version": "2.8.3",
    "tagline": "Welcome to kong",
    "lua_version": "LuaJIT 2.1.0-beta3",
    "pids": {
        "workers": [
            1108,
            1109
        ],
        "master": 1
    },
    "hostname": "hawkeye-kong-c75b565dc-gkkkn",
    "node_id": "4517d36f-f53a-4514-b269-5046b3d13f83",
    "plugins": {
        "available_on_server": {
            "response-ratelimiting": true,
            "syslog": true,
            "loggly": true,
            "datadog": true,
            "ldap-auth": true,
            "statsd": true,
            "bot-detection": true,
            "aws-lambda": true,
            "request-termination": true,
            "prometheus": true,
            "proxy-cache": true,
            "session": true,
            "acme": true,
            "grpc-gateway": true,
            "grpc-web": true,
            "pre-function": true,
            "post-function": true,
            "azure-functions": true,
            "zipkin": true,
            "signnow-hmac-auth": true,
            "jwt-auth": true,
            "batch-rate-limiting": true,
            "oauth-consumer": true,
            "inmar-hmac-auth": true,
            "hmac-url-auth": true,
            "set-target-host": true,
            "upstream-xdt-oauth2": true,
            "upstream-jwt": true,
            "custom-log": true,
            "payload-logging": true,
            "upstream-basic-auth": true,
            "upstream-oauth2": true,
            "json-threat-protection": true,
            "jwt": true,
            "acl": true,
            "correlation-id": true,
            "cors": true,
            "oauth2": true,
            "tcp-log": true,
            "udp-log": true,
            "file-log": true,
            "http-log": true,
            "key-auth": true,
            "hmac-auth": true,
            "basic-auth": true,
            "ip-restriction": true,
            "request-transformer": true,
            "response-transformer": true,
            "request-size-limiting": true,
            "rate-limiting": true
        },
        "enabled_in_cluster": [
            "jwt-auth",
            "acl",
            "batch-rate-limiting",
            "upstream-basic-auth",
            "request-termination",
            "json-threat-protection",
            "request-transformer",
            "upstream-oauth2",
            "set-target-host",
            "jwt",
            "upstream-jwt",
            "signnow-hmac-auth",
            "request-size-limiting",
            "upstream-xdt-oauth2",
            "cors",
            "oauth2",
            "inmar-hmac-auth",
            "payload-logging",
            "custom-log",
            "basic-auth",
            "prometheus",
            "hmac-url-auth"
        ]
    },
    "timers": {
        "running": 0,
        "pending": 8
    }
}

Status page

{
    "database": {
        "reachable": true
    },
    "server": {
        "connections_writing": 2,
        "total_requests": 10658,
        "connections_waiting": 10,
        "connections_handled": 1828,
        "connections_accepted": 1828,
        "connections_active": 12,
        "connections_reading": 0
    },
    "memory": {
        "workers_lua_vms": [
            {
                "pid": 1108,
                "http_allocated_gc": "115.50 MiB"
            },
            {
                "pid": 1109,
                "http_allocated_gc": "108.54 MiB"
            }
        ],
        "lua_shared_dicts": {
            "kong": {
                "capacity": "5.00 MiB",
                "allocated_slabs": "0.04 MiB"
            },
            "kong_locks": {
                "capacity": "8.00 MiB",
                "allocated_slabs": "0.06 MiB"
            },
            "kong_healthchecks": {
                "capacity": "5.00 MiB",
                "allocated_slabs": "0.04 MiB"
            },
            "kong_process_events": {
                "capacity": "5.00 MiB",
                "allocated_slabs": "0.05 MiB"
            },
            "kong_cluster_events": {
                "capacity": "5.00 MiB",
                "allocated_slabs": "0.04 MiB"
            },
            "kong_rate_limiting_counters": {
                "capacity": "12.00 MiB",
                "allocated_slabs": "0.09 MiB"
            },
            "kong_core_db_cache": {
                "capacity": "3000.00 MiB",
                "allocated_slabs": "24.14 MiB"
            },
            "kong_core_db_cache_miss": {
                "capacity": "12.00 MiB",
                "allocated_slabs": "0.09 MiB"
            },
            "kong_db_cache": {
                "capacity": "3000.00 MiB",
                "allocated_slabs": "17.58 MiB"
            },
            "kong_db_cache_miss": {
                "capacity": "12.00 MiB",
                "allocated_slabs": "0.08 MiB"
            },
            "prometheus_metrics": {
                "capacity": "100.00 MiB",
                "allocated_slabs": "5.20 MiB"
            }
        }
    }
}

The text was updated successfully, but these errors were encountered:

hanshuebner · 2023-10-09T07:18:54Z

@kranthirachakonda How large is your configuration (number of routes/services/consumers), roughly?

kranthirachakonda · 2023-10-09T17:03:03Z

@kranthirachakonda How large is your configuration (number of routes/services/consumers), roughly?

@hanshuebner approx
Routes - 900
Services - 900
Consumers - 100
Plugins - 2900

hanshuebner · 2023-10-10T14:02:39Z

@kranthirachakonda The number of Kong entities seems to be moderate, so it is not a sizing issue. Are you able to monitor DNS traffic made by the Kong pod?

kranthirachakonda · 2023-10-10T16:45:47Z

@kranthirachakonda The number of Kong entities seems to be moderate, so it is not a sizing issue. Are you able to monitor DNS traffic made by the Kong pod?

@hanshuebner @nowNick - Yes, dns traffic increased and we tried below and none helped in stopping from kong proxy going into unresponsive mode.

Changed ndots from 5 to 2

dns_valid_ttl: 53

dns_valid_ttl: 30
dns_no_sync: on

dns_valid_ttl: 53
dns_no_sync: on

Reverted dns_valid_ttl and dns_no_sync
dns_order: LAST,A,CNAME

Reverted dns_order
dns_cache_size: 100000

Reverted dns_cache_size
dns_stale_ttl: 120

REverted dns_stale_ttl
dns_stale_ttl: 127

Reverted all DNS changes

upstream_keepalive_max_requests: 200
upstream_keepalive_pool_size: 120
upstream_keepalive_idle_timeout: 30
lua_socket_pool_size: 60

dns_not_found_ttl: 300

Reverted dns_not_found_ttl
dns_error_ttl: 30

Reverted dns_error_ttl
lua_socket_pool_size: 127

hanshuebner · 2023-10-10T16:48:44Z

@kranthirachakonda Can you provide us with some information regarding the DNS traffic that you see? Are the requests that are sent all different, or are there multiple requests for the same name? How many requests do you see?

kranthirachakonda · 2023-10-11T01:30:42Z

@kranthirachakonda Can you provide us with some information regarding the DNS traffic that you see? Are the requests that are sent all different, or are there multiple requests for the same name? How many requests do you see?

We are able to reproduce the issue in our non-prod environment also , with couple of service endpoints with mix of internal and external hostnames (fqdn).

I feel when one/few of the problematic route/service is invoked worker processes or either running out of timers or some capacity which is making it to unresponsive. Meaning at that scenario /status page or any service doesnt respond for ~45 secs.
To test our theory we increased liveness probes period to very long so that kubelet doesnt restart and saw that after 4-5 mins worker processes recovered on its own. So can you please help me figure out if I am running into some resource constraints and what those could be?

Based on our grafana chart and real-time top I dont see high CPU/Memory - CPU max is 500m and Memory max is 512Mi
Not sure why it doesnt grow beyond those values.

The DNS traffic and traffic onto the kong proxy is normal e.g.

Kong-Nginx timers - I see lots of pending state

hanshuebner · 2023-10-11T05:40:33Z

Kong-Nginx timers - I see lots of pending state

Where do you see those?

When Kong hangs, do you see a lot of open network connections?

kranthirachakonda · 2023-10-12T02:58:01Z

Kong-Nginx timers - I see lots of pending state

Where do you see those?
Nginx timers in grafana dashboard

When Kong hangs, do you see a lot of open network connections?

Ya about 200 time_waits

I see http latency about 100s for few calls, and some times 100% cpu usage for the kong proxy container alone. I am able to reproduce same issue even in 3.2.2 version also. Any help on how i can debug further.

kranthirachakonda · 2023-10-16T11:41:26Z

We found the issue with one of the external api timeouts and our custom plugin’s which caused worker processes to go into cpu 100%. Updating those fixed our issue.

alexandresavicki · 2024-04-08T23:41:51Z

@kranthirachakonda Can you please share how do you identify the root cause? I think i'm facing the same issue hare, i'm newer with kong so it will be great if can you help me with some tricks to identify.

kranthirachakonda changed the title ~~Kong proxy goes unresponsive with normal traffic - kubelet restarts the container since liveness probe fails~~ Kong proxy goes unresponsive with normal traffic - #8152 issue - kubelet restarts the container since liveness probe fails Oct 9, 2023

hanshuebner added area/ingress-controller Issues where Kong is running as a Kubernetes Ingress Controller pending author feedback Waiting for the issue author to get back to a maintainer with findings, more details, etc... labels Oct 9, 2023

anup-krai mentioned this issue Oct 9, 2023

Liveness and Readiness failures for Kong Pods #11698

Closed

nowNick removed the pending author feedback Waiting for the issue author to get back to a maintainer with findings, more details, etc... label Oct 10, 2023

hanshuebner added the pending author feedback Waiting for the issue author to get back to a maintainer with findings, more details, etc... label Oct 10, 2023

kranthirachakonda closed this as completed Oct 16, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kong proxy goes unresponsive with normal traffic - #8152 issue - kubelet restarts the container since liveness probe fails #11710

Kong proxy goes unresponsive with normal traffic - #8152 issue - kubelet restarts the container since liveness probe fails #11710

kranthirachakonda commented Oct 8, 2023 •

edited by hanshuebner

Loading

hanshuebner commented Oct 9, 2023

kranthirachakonda commented Oct 9, 2023

hanshuebner commented Oct 10, 2023

kranthirachakonda commented Oct 10, 2023 •

edited

Loading

hanshuebner commented Oct 10, 2023

kranthirachakonda commented Oct 11, 2023 •

edited

Loading

hanshuebner commented Oct 11, 2023

kranthirachakonda commented Oct 12, 2023

kranthirachakonda commented Oct 16, 2023 •

edited

Loading

alexandresavicki commented Apr 8, 2024

Kong proxy goes unresponsive with normal traffic - #8152 issue - kubelet restarts the container since liveness probe fails #11710

Kong proxy goes unresponsive with normal traffic - #8152 issue - kubelet restarts the container since liveness probe fails #11710

Comments

kranthirachakonda commented Oct 8, 2023 • edited by hanshuebner Loading

Discussed in #11709

hanshuebner commented Oct 9, 2023

kranthirachakonda commented Oct 9, 2023

hanshuebner commented Oct 10, 2023

kranthirachakonda commented Oct 10, 2023 • edited Loading

Changed ndots from 5 to 2

dns_valid_ttl: 53

dns_valid_ttl: 30 dns_no_sync: on

dns_valid_ttl: 53 dns_no_sync: on

Reverted dns_valid_ttl and dns_no_sync dns_order: LAST,A,CNAME

Reverted dns_order dns_cache_size: 100000

Reverted dns_cache_size dns_stale_ttl: 120

REverted dns_stale_ttl dns_stale_ttl: 127

Reverted all DNS changes

upstream_keepalive_max_requests: 200 upstream_keepalive_pool_size: 120 upstream_keepalive_idle_timeout: 30 lua_socket_pool_size: 60

dns_not_found_ttl: 300

Reverted dns_not_found_ttl dns_error_ttl: 30

hanshuebner commented Oct 10, 2023

kranthirachakonda commented Oct 11, 2023 • edited Loading

hanshuebner commented Oct 11, 2023

kranthirachakonda commented Oct 12, 2023

kranthirachakonda commented Oct 16, 2023 • edited Loading

alexandresavicki commented Apr 8, 2024

kranthirachakonda commented Oct 8, 2023 •

edited by hanshuebner

Loading

kranthirachakonda commented Oct 10, 2023 •

edited

Loading

dns_valid_ttl: 30
dns_no_sync: on

dns_valid_ttl: 53
dns_no_sync: on

Reverted dns_valid_ttl and dns_no_sync
dns_order: LAST,A,CNAME

Reverted dns_order
dns_cache_size: 100000

Reverted dns_cache_size
dns_stale_ttl: 120

REverted dns_stale_ttl
dns_stale_ttl: 127

upstream_keepalive_max_requests: 200
upstream_keepalive_pool_size: 120
upstream_keepalive_idle_timeout: 30
lua_socket_pool_size: 60

Reverted dns_not_found_ttl
dns_error_ttl: 30

kranthirachakonda commented Oct 11, 2023 •

edited

Loading

kranthirachakonda commented Oct 16, 2023 •

edited

Loading