Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kong proxy goes unresponsive with normal traffic - #8152 issue - kubelet restarts the container since liveness probe fails #11710

Closed
kranthirachakonda opened this issue Oct 8, 2023 Discussed in #11709 · 10 comments
Labels
area/ingress-controller Issues where Kong is running as a Kubernetes Ingress Controller pending author feedback Waiting for the issue author to get back to a maintainer with findings, more details, etc...

Comments

@kranthirachakonda
Copy link

kranthirachakonda commented Oct 8, 2023

Discussed in #11709

Originally posted by kranthirachakonda October 7, 2023
Looks like we are running into Kong #8152 issue with 2.8.3 and also 3.0.1. We had new route/service added recently and there is increase in traffic but not huge. May ~15% increase. Only proxy container inside Kong pods gets restart due to /status timing out. We tried changing the timeout/period intervals but that didnt help much. We dont see our containers going above the cpu/memory requests.

Behavior is identical, where kong proxy container goes into unresponsive mode for about 4-5 mins and processes again. We also observed that while I gave memory requests of 4G, proxy container never goes above 512Mi. Not sure if there are default limits within LuaVM or threads etc. Can you please help me in debugging further.

db-less

kong:2.8.3

    Limits:
      cpu:     2500m
      memory:  6Gi
    Requests:
      cpu:      1
      memory:   4Gi
    Liveness:   http-get http://:status/status delay=5s timeout=5s period=10s #success=1 #failure=3

kubernetes-ingress-controller:2.8.2

    Limits:
      cpu:     2
      memory:  2Gi
    Requests:
      cpu:      200m
      memory:   1Gi
    Liveness:   http-get http://:10254/healthz delay=5s timeout=5s period=10s #success=1 #failure=3

Events:

  Type     Reason     Age                    From     Message
  ----     ------     ----                   ----     -------
  Warning  Unhealthy  4m7s (x94 over 20h)    kubelet  Liveness probe failed: Get "http://192.168.10.79:8100/status": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Normal   Killing    4m7s (x27 over 20h)    kubelet  Container proxy failed liveness probe, will be restarted

These are few errors I see while proxy containers goes into a condition where status page fails to respond within 5s, causing kubelet to restart when liveness probe fails.

[lua] targets.lua:248: could not reschedule DNS resolver timer: process exiting, context: ngx.timer
[lua] init.lua:409: [cluster_events] failed to start recurring polling timer: process exiting, context: ngx.timer

Admin page

{
    "configuration": {
        "nginx_kong_conf": "/kong_prefix/nginx-kong.conf",
        "nginx_kong_stream_conf": "/kong_prefix/nginx-kong-stream.conf",
        "dns_error_ttl": 1,
        "kong_env": "/kong_prefix/.kong_env",
        "dns_not_found_ttl": 30,
        "dns_stale_ttl": 4,
        "dns_cache_size": 10000,
        "dns_order": [
            "LAST",
            "SRV",
            "A",
            "CNAME"
        ],
        "dns_no_sync": false,
        "upstream_keepalive_pool_size": 60,
        "client_ssl_cert_default": "/kong_prefix/ssl/kong-default.crt",
        "client_ssl_cert_key_default": "/kong_prefix/ssl/kong-default.key",
        "admin_ssl_cert_default": "/kong_prefix/ssl/admin-kong-default.crt",
        "admin_ssl_cert_key_default": "/kong_prefix/ssl/admin-kong-default.key",
        "admin_ssl_cert_default_ecdsa": "/kong_prefix/ssl/admin-kong-default-ecdsa.crt",
        "ssl_cipher_suite": "intermediate",
        "admin_ssl_cert_key_default_ecdsa": "/kong_prefix/ssl/admin-kong-default-ecdsa.key",
        "ssl_ciphers": "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384",
        "status_ssl_cert_default": "/kong_prefix/ssl/status-kong-default.crt",
        "status_ssl_cert_key_default": "/kong_prefix/ssl/status-kong-default.key",
        "nginx_main_worker_rlimit_nofile": "auto",
        "upstream_keepalive_idle_timeout": 60,
        "nginx_events_worker_connections": "auto",
        "nginx_events_multi_accept": "on",
        "lua_package_cpath": "",
        "port_maps": [
            "443:8443"
        ],
        "ssl_cert": [
            "/etc/secrets/proxy-tls/tls.crt"
        ],
        "proxy_listen": [
            "0.0.0.0:8443 http2 ssl"
        ],
        "admin_listen": [
            "0.0.0.0:8444 ssl"
        ],
        "status_listen": [
            "0.0.0.0:8100",
            "0.0.0.0:8543 ssl"
        ],
        "stream_listen": [
            "off"
        ],
        "cluster_listen": [
            "off"
        ],
        "admin_ssl_cert": [
            "/kong_prefix/ssl/admin-kong-default.crt",
            "/kong_prefix/ssl/admin-kong-default-ecdsa.crt"
        ],
        "nginx_admin_client_max_body_size": "10m",
        "status_ssl_cert": [
            "/kong_prefix/ssl/status-kong-default.crt",
            "/kong_prefix/ssl/status-kong-default-ecdsa.crt"
        ],
        "status_ssl_cert_key": [
            "/kong_prefix/ssl/status-kong-default.key",
            "/kong_prefix/ssl/status-kong-default-ecdsa.key"
        ],
        "nginx_admin_client_body_buffer_size": "10m",
        "nginx_user": "kong kong",
        "nginx_main_user": "kong kong",
        "nginx_daemon": "off",
        "nginx_main_daemon": "off",
        "nginx_worker_processes": "2",
        "nginx_main_worker_processes": "2",
        "trusted_ips": {},
        "kic": true,
        "nginx_proxy_real_ip_header": "X-Real-IP",
        "real_ip_recursive": "off",
        "nginx_proxy_real_ip_recursive": "off",
        "client_max_body_size": "0",
        "nginx_http_client_max_body_size": "0",
        "client_body_buffer_size": "8k",
        "nginx_http_client_body_buffer_size": "8k",
        "pg_port": 50001,
        "pg_password": "******",
        "pg_ssl": true,
        "pg_ssl_verify": false,
        "pg_max_concurrent_queries": 0,
        "pg_semaphore_timeout": 60000,
        "pg_ro_port": 50002,
        "pg_user": "user1",
        "pg_ro_password": "******",
        "pg_ro_ssl": true,
        "enabled_headers": {
            "server_tokens": true,
            "latency_tokens": true,
            "Via": true,
            "Server": true,
            "X-Kong-Proxy-Latency": true,
            "X-Kong-Response-Latency": true,
            "X-Kong-Admin-Latency": true,
            "X-Kong-Upstream-Status": true,
            "X-Kong-Upstream-Latency": true
        },
        "cassandra_contact_points": [
            "127.0.0.1"
        ],
        "cassandra_port": 9042,
        "pg_ro_host": "hawkeye-haproxy-svc.hawkeye-haproxy-ns.svc.cluster.local",
        "cassandra_ssl": false,
        "cassandra_ssl_verify": false,
        "cassandra_write_consistency": "ONE",
        "pg_ro_user": "user1",
        "cassandra_read_consistency": "ONE",
        "cassandra_lb_policy": "RequestRoundRobin",
        "cassandra_keyspace": "kong",
        "cassandra_refresh_frequency": 60,
        "error_default_type": "text/plain",
        "cassandra_username": "kong",
        "cassandra_repl_factor": 1,
        "cassandra_data_centers": [
            "dc1:2",
            "dc2:3"
        ],
        "cassandra_schema_consensus_timeout": 10000,
        "ssl_protocols": "TLSv1.1 TLSv1.2 TLSv1.3",
        "nginx_http_ssl_protocols": "TLSv1.2 TLSv1.3",
        "nginx_stream_ssl_protocols": "TLSv1.2 TLSv1.3",
        "worker_consistency": "strict",
        "nginx_http_ssl_prefer_server_ciphers": "off",
        "nginx_stream_ssl_prefer_server_ciphers": "off",
        "ssl_dhparam": "ffdhe2048",
        "pluginserver_names": {},
        "nginx_stream_ssl_dhparam": "ffdhe2048",
        "ssl_session_tickets": "on",
        "nginx_http_ssl_session_tickets": "on",
        "nginx_stream_ssl_session_tickets": "on",
        "stream_proxy_ssl_enabled": false,
        "go_plugins_dir": "off",
        "prefix": "/kong_prefix",
        "go_pluginserver_exe": "/usr/local/bin/go-pluginserver",
        "lua_package_path": "/opt/?.lua;/opt/?/init.lua;;",
        "db_cache_warmup_entities": [
            "services"
        ],
        "status_listeners": [
            {
                "ssl": false,
                "port": 8100,
                "listener": "0.0.0.0:8100",
                "ip": "0.0.0.0"
            },
            {
                "ssl": true,
                "port": 8543,
                "listener": "0.0.0.0:8543 ssl",
                "ip": "0.0.0.0"
            }
        ],
        "proxy_ssl_enabled": true,
        "pg_timeout": 5000,
        "role": "traditional",
        "ssl_cert_key_default": "/kong_prefix/ssl/kong-default.key",
        "headers": [
            "server_tokens",
            "latency_tokens",
            "X-Kong-Upstream-Status"
        ],
        "proxy_stream_error_log": "logs/error.log",
        "ssl_cert_default": "/kong_prefix/ssl/kong-default.crt",
        "nginx_http_lua_shared_dict": "prometheus_metrics 100m",
        "ssl_cert_key_default_ecdsa": "/kong_prefix/ssl/kong-default-ecdsa.key",
        "loaded_vaults": {},
        "plugins": [
            "bundled",
            "json-threat-protection",
            "upstream-oauth2",
            "upstream-basic-auth",
            "payload-logging",
            "custom-log",
            "upstream-jwt",
            "upstream-xdt-oauth2",
            "set-target-host",
            "hmac-url-auth",
            "inmar-hmac-auth",
            "oauth-consumer",
            "signnow-hmac-auth",
            "jwt-auth",
            "batch-rate-limiting"
        ],
        "admin_error_log": "/dev/stderr",
        "loaded_plugins": {
            "response-ratelimiting": true,
            "syslog": true,
            "loggly": true,
            "datadog": true,
            "ldap-auth": true,
            "statsd": true,
            "bot-detection": true,
            "aws-lambda": true,
            "request-termination": true,
            "prometheus": true,
            "proxy-cache": true,
            "session": true,
            "acme": true,
            "grpc-gateway": true,
            "grpc-web": true,
            "pre-function": true,
            "post-function": true,
            "azure-functions": true,
            "zipkin": true,
            "signnow-hmac-auth": true,
            "jwt-auth": true,
            "batch-rate-limiting": true,
            "upstream-oauth2": true,
            "upstream-basic-auth": true,
            "payload-logging": true,
            "custom-log": true,
            "upstream-jwt": true,
            "upstream-xdt-oauth2": true,
            "set-target-host": true,
            "hmac-url-auth": true,
            "inmar-hmac-auth": true,
            "oauth-consumer": true,
            "json-threat-protection": true,
            "jwt": true,
            "acl": true,
            "correlation-id": true,
            "cors": true,
            "oauth2": true,
            "tcp-log": true,
            "udp-log": true,
            "file-log": true,
            "http-log": true,
            "key-auth": true,
            "hmac-auth": true,
            "basic-auth": true,
            "ip-restriction": true,
            "request-transformer": true,
            "response-transformer": true,
            "request-size-limiting": true,
            "rate-limiting": true
        },
        "proxy_error_log": "/dev/stderr",
        "host_ports": {
            "8443": 443,
            "8443": 443
        },
        "anonymous_reports": false,
        "real_ip_header": "X-Real-IP",
        "database": "postgres",
        "ssl_prefer_server_ciphers": "on",
        "status_ssl_cert_key_default_ecdsa": "/kong_prefix/ssl/status-kong-default-ecdsa.key",
        "db_resurrect_ttl": 30,
        "ssl_session_timeout": "1d",
        "nginx_http_ssl_session_timeout": "1d",
        "nginx_stream_ssl_session_timeout": "1d",
        "proxy_access_log": "/dev/stdout",
        "worker_state_update_frequency": 5,
        "proxy_stream_access_log": "logs/access.log basic",
        "dns_resolver": {},
        "admin_access_log": "/dev/stdout",
        "dns_hostsfile": "/etc/hosts",
        "status_access_log": "off",
        "status_error_log": "logs/status_error.log",
        "log_level": "error",
        "cluster_listeners": {},
        "status_ssl_enabled": true,
        "vaults": [
            "off"
        ],
        "nginx_optimizations": true,
        "lua_ssl_trusted_certificate": {},
        "lua_ssl_verify_depth": 1,
        "lua_ssl_protocols": "TLSv1.1 TLSv1.2 TLSv1.3",
        "nginx_http_lua_ssl_protocols": "TLSv1.1 TLSv1.2 TLSv1.3",
        "nginx_stream_lua_ssl_protocols": "TLSv1.1 TLSv1.2 TLSv1.3",
        "lua_socket_pool_size": 30,
        "db_update_frequency": 5,
        "cluster_control_plane": "127.0.0.1:8005",
        "cassandra_repl_strategy": "SimpleStrategy",
        "mem_cache_size": "3000m",
        "cluster_mtls": "shared",
        "ssl_cert_key": [
            "/etc/secrets/proxy-tls/tls.key"
        ],
        "db_cache_ttl": 0,
        "nginx_main_directives": [
            {
                "value": "off",
                "name": "daemon"
            },
            {
                "value": "KONG_REDIS_HOST;env KONG_REDIS_PORT;env KONG_REDIS_PASSWORD",
                "name": "env"
            },
            {
                "value": "kong kong",
                "name": "user"
            },
            {
                "value": "2",
                "name": "worker_processes"
            },
            {
                "value": "auto",
                "name": "worker_rlimit_nofile"
            }
        ],
        "cluster_data_plane_purge_delay": 1209600,
        "cluster_ocsp": "off",
        "cluster_max_payload": 4194304,
        "untrusted_lua": "sandbox",
        "nginx_http_directives": [
            {
                "value": "8k",
                "name": "client_body_buffer_size"
            },
            {
                "value": "0",
                "name": "client_max_body_size"
            },
            {
                "value": "100000",
                "name": "lua_regex_match_limit"
            },
            {
                "value": "prometheus_metrics 100m",
                "name": "lua_shared_dict"
            },
            {
                "value": "TLSv1.1 TLSv1.2 TLSv1.3",
                "name": "lua_ssl_protocols"
            },
            {
                "value": "/kong_prefix/ssl/ffdhe2048.pem",
                "name": "ssl_dhparam"
            },
            {
                "value": "off",
                "name": "ssl_prefer_server_ciphers"
            },
            {
                "value": "TLSv1.2 TLSv1.3",
                "name": "ssl_protocols"
            },
            {
                "value": "on",
                "name": "ssl_session_tickets"
            },
            {
                "value": "1d",
                "name": "ssl_session_timeout"
            }
        ],
        "untrusted_lua_sandbox_requires": {},
        "untrusted_lua_sandbox_environment": {},
        "client_ssl": false,
        "upstream_keepalive_max_requests": 100,
        "nginx_upstream_directives": {},
        "pg_ro_database": "hawkeye",
        "nginx_proxy_directives": [
            {
                "value": "X-Real-IP",
                "name": "real_ip_header"
            },
            {
                "value": "off",
                "name": "real_ip_recursive"
            }
        ],
        "pg_database": "hawkeye",
        "nginx_status_directives": {},
        "admin_ssl_cert_key": [
            "/kong_prefix/ssl/admin-kong-default.key",
            "/kong_prefix/ssl/admin-kong-default-ecdsa.key"
        ],
        "nginx_admin_directives": [
            {
                "value": "10m",
                "name": "client_body_buffer_size"
            },
            {
                "value": "10m",
                "name": "client_max_body_size"
            }
        ],
        "nginx_events_directives": [
            {
                "value": "on",
                "name": "multi_accept"
            },
            {
                "value": "auto",
                "name": "worker_connections"
            }
        ],
        "nginx_stream_directives": [
            {
                "value": "stream_prometheus_metrics 5m",
                "name": "lua_shared_dict"
            },
            {
                "value": "TLSv1.1 TLSv1.2 TLSv1.3",
                "name": "lua_ssl_protocols"
            },
            {
                "value": "/kong_prefix/ssl/ffdhe2048.pem",
                "name": "ssl_dhparam"
            },
            {
                "value": "off",
                "name": "ssl_prefer_server_ciphers"
            },
            {
                "value": "TLSv1.2 TLSv1.3",
                "name": "ssl_protocols"
            },
            {
                "value": "on",
                "name": "ssl_session_tickets"
            },
            {
                "value": "1d",
                "name": "ssl_session_timeout"
            }
        ],
        "pg_host": "hawkeye-haproxy-svc.hawkeye-haproxy-ns.svc.cluster.local",
        "nginx_supstream_directives": {},
        "db_update_propagation": 0,
        "nginx_sproxy_directives": {},
        "proxy_listeners": [
            {
                "ssl": true,
                "port": 8443,
                "bind": false,
                "listener": "0.0.0.0:8443 ssl http2",
                "deferred": false,
                "reuseport": false,
                "backlog=%d+": false,
                "proxy_protocol": false,
                "http2": true,
                "ip": "0.0.0.0"
            }
        ],
        "stream_listeners": {},
        "admin_listeners": [
            {
                "ssl": true,
                "port": 8444,
                "bind": false,
                "listener": "0.0.0.0:8444 ssl",
                "deferred": false,
                "reuseport": false,
                "backlog=%d+": false,
                "proxy_protocol": false,
                "http2": false,
                "ip": "0.0.0.0"
            }
        ],
        "admin_ssl_enabled": true,
        "nginx_http_upstream_directives": {},
        "nginx_http_status_directives": {},
        "nginx_http_ssl_dhparam": "ffdhe2048",
        "nginx_pid": "/kong_prefix/pids/nginx.pid",
        "nginx_http_lua_regex_match_limit": "100000",
        "status_ssl_cert_default_ecdsa": "/kong_prefix/ssl/status-kong-default-ecdsa.crt",
        "nginx_err_logs": "/kong_prefix/logs/error.log",
        "nginx_main_env": "KONG_REDIS_HOST;env KONG_REDIS_PORT;env KONG_REDIS_PASSWORD",
        "cassandra_timeout": 5000,
        "nginx_acc_logs": "/kong_prefix/logs/access.log",
        "pg_ro_ssl_verify": false,
        "admin_acc_logs": "/kong_prefix/logs/admin_access.log",
        "ssl_cert_csr_default": "/kong_prefix/ssl/kong-default.csr",
        "nginx_conf": "/kong_prefix/nginx.conf",
        "ssl_cert_default_ecdsa": "/kong_prefix/ssl/kong-default-ecdsa.crt"
    },
    "version": "2.8.3",
    "tagline": "Welcome to kong",
    "lua_version": "LuaJIT 2.1.0-beta3",
    "pids": {
        "workers": [
            1108,
            1109
        ],
        "master": 1
    },
    "hostname": "hawkeye-kong-c75b565dc-gkkkn",
    "node_id": "4517d36f-f53a-4514-b269-5046b3d13f83",
    "plugins": {
        "available_on_server": {
            "response-ratelimiting": true,
            "syslog": true,
            "loggly": true,
            "datadog": true,
            "ldap-auth": true,
            "statsd": true,
            "bot-detection": true,
            "aws-lambda": true,
            "request-termination": true,
            "prometheus": true,
            "proxy-cache": true,
            "session": true,
            "acme": true,
            "grpc-gateway": true,
            "grpc-web": true,
            "pre-function": true,
            "post-function": true,
            "azure-functions": true,
            "zipkin": true,
            "signnow-hmac-auth": true,
            "jwt-auth": true,
            "batch-rate-limiting": true,
            "oauth-consumer": true,
            "inmar-hmac-auth": true,
            "hmac-url-auth": true,
            "set-target-host": true,
            "upstream-xdt-oauth2": true,
            "upstream-jwt": true,
            "custom-log": true,
            "payload-logging": true,
            "upstream-basic-auth": true,
            "upstream-oauth2": true,
            "json-threat-protection": true,
            "jwt": true,
            "acl": true,
            "correlation-id": true,
            "cors": true,
            "oauth2": true,
            "tcp-log": true,
            "udp-log": true,
            "file-log": true,
            "http-log": true,
            "key-auth": true,
            "hmac-auth": true,
            "basic-auth": true,
            "ip-restriction": true,
            "request-transformer": true,
            "response-transformer": true,
            "request-size-limiting": true,
            "rate-limiting": true
        },
        "enabled_in_cluster": [
            "jwt-auth",
            "acl",
            "batch-rate-limiting",
            "upstream-basic-auth",
            "request-termination",
            "json-threat-protection",
            "request-transformer",
            "upstream-oauth2",
            "set-target-host",
            "jwt",
            "upstream-jwt",
            "signnow-hmac-auth",
            "request-size-limiting",
            "upstream-xdt-oauth2",
            "cors",
            "oauth2",
            "inmar-hmac-auth",
            "payload-logging",
            "custom-log",
            "basic-auth",
            "prometheus",
            "hmac-url-auth"
        ]
    },
    "timers": {
        "running": 0,
        "pending": 8
    }
}

Status page

{
    "database": {
        "reachable": true
    },
    "server": {
        "connections_writing": 2,
        "total_requests": 10658,
        "connections_waiting": 10,
        "connections_handled": 1828,
        "connections_accepted": 1828,
        "connections_active": 12,
        "connections_reading": 0
    },
    "memory": {
        "workers_lua_vms": [
            {
                "pid": 1108,
                "http_allocated_gc": "115.50 MiB"
            },
            {
                "pid": 1109,
                "http_allocated_gc": "108.54 MiB"
            }
        ],
        "lua_shared_dicts": {
            "kong": {
                "capacity": "5.00 MiB",
                "allocated_slabs": "0.04 MiB"
            },
            "kong_locks": {
                "capacity": "8.00 MiB",
                "allocated_slabs": "0.06 MiB"
            },
            "kong_healthchecks": {
                "capacity": "5.00 MiB",
                "allocated_slabs": "0.04 MiB"
            },
            "kong_process_events": {
                "capacity": "5.00 MiB",
                "allocated_slabs": "0.05 MiB"
            },
            "kong_cluster_events": {
                "capacity": "5.00 MiB",
                "allocated_slabs": "0.04 MiB"
            },
            "kong_rate_limiting_counters": {
                "capacity": "12.00 MiB",
                "allocated_slabs": "0.09 MiB"
            },
            "kong_core_db_cache": {
                "capacity": "3000.00 MiB",
                "allocated_slabs": "24.14 MiB"
            },
            "kong_core_db_cache_miss": {
                "capacity": "12.00 MiB",
                "allocated_slabs": "0.09 MiB"
            },
            "kong_db_cache": {
                "capacity": "3000.00 MiB",
                "allocated_slabs": "17.58 MiB"
            },
            "kong_db_cache_miss": {
                "capacity": "12.00 MiB",
                "allocated_slabs": "0.08 MiB"
            },
            "prometheus_metrics": {
                "capacity": "100.00 MiB",
                "allocated_slabs": "5.20 MiB"
            }
        }
    }
}
@kranthirachakonda kranthirachakonda changed the title Kong proxy goes unresponsive with normal traffic - kubelet restarts the container since liveness probe fails Kong proxy goes unresponsive with normal traffic - #8152 issue - kubelet restarts the container since liveness probe fails Oct 9, 2023
@hanshuebner hanshuebner added area/ingress-controller Issues where Kong is running as a Kubernetes Ingress Controller pending author feedback Waiting for the issue author to get back to a maintainer with findings, more details, etc... labels Oct 9, 2023
@hanshuebner
Copy link
Contributor

@kranthirachakonda How large is your configuration (number of routes/services/consumers), roughly?

@kranthirachakonda
Copy link
Author

@kranthirachakonda How large is your configuration (number of routes/services/consumers), roughly?

@hanshuebner approx
Routes - 900
Services - 900
Consumers - 100
Plugins - 2900

@nowNick nowNick removed the pending author feedback Waiting for the issue author to get back to a maintainer with findings, more details, etc... label Oct 10, 2023
@hanshuebner
Copy link
Contributor

@kranthirachakonda The number of Kong entities seems to be moderate, so it is not a sizing issue. Are you able to monitor DNS traffic made by the Kong pod?

@hanshuebner hanshuebner added the pending author feedback Waiting for the issue author to get back to a maintainer with findings, more details, etc... label Oct 10, 2023
@kranthirachakonda
Copy link
Author

kranthirachakonda commented Oct 10, 2023

@kranthirachakonda The number of Kong entities seems to be moderate, so it is not a sizing issue. Are you able to monitor DNS traffic made by the Kong pod?

@hanshuebner @nowNick - Yes, dns traffic increased and we tried below and none helped in stopping from kong proxy going into unresponsive mode.

Changed ndots from 5 to 2

dns_valid_ttl: 53

dns_valid_ttl: 30
dns_no_sync: on

dns_valid_ttl: 53
dns_no_sync: on

Reverted dns_valid_ttl and dns_no_sync
dns_order: LAST,A,CNAME

Reverted dns_order
dns_cache_size: 100000

Reverted dns_cache_size
dns_stale_ttl: 120

REverted dns_stale_ttl
dns_stale_ttl: 127

Reverted all DNS changes

upstream_keepalive_max_requests: 200
upstream_keepalive_pool_size: 120
upstream_keepalive_idle_timeout: 30
lua_socket_pool_size: 60

dns_not_found_ttl: 300

Reverted dns_not_found_ttl
dns_error_ttl: 30

Reverted dns_error_ttl
lua_socket_pool_size: 127

@hanshuebner
Copy link
Contributor

@kranthirachakonda Can you provide us with some information regarding the DNS traffic that you see? Are the requests that are sent all different, or are there multiple requests for the same name? How many requests do you see?

@kranthirachakonda
Copy link
Author

kranthirachakonda commented Oct 11, 2023

@kranthirachakonda Can you provide us with some information regarding the DNS traffic that you see? Are the requests that are sent all different, or are there multiple requests for the same name? How many requests do you see?

We are able to reproduce the issue in our non-prod environment also , with couple of service endpoints with mix of internal and external hostnames (fqdn).

I feel when one/few of the problematic route/service is invoked worker processes or either running out of timers or some capacity which is making it to unresponsive. Meaning at that scenario /status page or any service doesnt respond for ~45 secs.
To test our theory we increased liveness probes period to very long so that kubelet doesnt restart and saw that after 4-5 mins worker processes recovered on its own. So can you please help me figure out if I am running into some resource constraints and what those could be?

Based on our grafana chart and real-time top I dont see high CPU/Memory - CPU max is 500m and Memory max is 512Mi
Not sure why it doesnt grow beyond those values.

The DNS traffic and traffic onto the kong proxy is normal e.g.

image

Kong-Nginx timers - I see lots of pending state
image

@hanshuebner
Copy link
Contributor

Kong-Nginx timers - I see lots of pending state

Where do you see those?

When Kong hangs, do you see a lot of open network connections?

@kranthirachakonda
Copy link
Author

Kong-Nginx timers - I see lots of pending state

Where do you see those?
Nginx timers in grafana dashboard

When Kong hangs, do you see a lot of open network connections?

Ya about 200 time_waits

I see http latency about 100s for few calls, and some times 100% cpu usage for the kong proxy container alone. I am able to reproduce same issue even in 3.2.2 version also. Any help on how i can debug further.

@kranthirachakonda
Copy link
Author

kranthirachakonda commented Oct 16, 2023

We found the issue with one of the external api timeouts and our custom plugin’s which caused worker processes to go into cpu 100%. Updating those fixed our issue.

@alexandresavicki
Copy link

@kranthirachakonda Can you please share how do you identify the root cause? I think i'm facing the same issue hare, i'm newer with kong so it will be great if can you help me with some tricks to identify.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ingress-controller Issues where Kong is running as a Kubernetes Ingress Controller pending author feedback Waiting for the issue author to get back to a maintainer with findings, more details, etc...
Projects
None yet
Development

No branches or pull requests

4 participants