uwsgi worker dying at high request load with error "taking too much time to die...NO MERCY !!!" #2611

Open
naveenchandpandey opened this issue Feb 21, 2024 · 3 comments

naveenchandpandey commented Feb 21, 2024

Hi,

We are running a uWSGI service as the primary WSGI server for a Django-based project, with the following configuration:


[uwsgi]
chdir           = /var/www/rest-api-python/
module          = etc.wsgi:application
home            = /var/www/rest-api-python/env
master          = true
processes       = 4
# Processes should be equal to the number of vCPUs of the machine
threads         = 15
socket          = /var/run/python/django_server.sock
chmod-socket    = 750
#vacuum          = true
uid	        = www-data
gid	        = www-data
touch-reload 	= /var/run/python/reload
master-fifo     = /var/www/pipes/rest-api-python-fifo
lazy-apps       = true

env = DJANGO_SETTINGS_MODULE=etc.settings
safe-pidfile = /var/run/python/django-server.pid
harakiri = 1000
#limit-as = 128
max-requests = 30000
route-uri = ^/python/(.*) rewrite:/$1
enable-threads = true
disable-logging = true
log-4xx = true
log-5xx = true

env = prometheus_multiproc_dir=/var/run/python/prometheus/
listen = 250
thunder-lock = true
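
For context, with this configuration each of the 4 workers runs 15 threads, i.e. 4 × 15 = 60 concurrent request slots, which matches the "mapped ... for 60 cores" line in the startup log further below. A minimal sketch of how an ini like this is typically loaded (the ini path here is an assumption for illustration, not the actual deployment command):

# start uWSGI against the ini shown above (path is assumed)
uwsgi --ini /var/www/rest-api-python/uwsgi.ini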

What we are observing is that, at high request loads, all the workers die at around the same time, making the service unavailable for further requests. The workers eventually get respawned, but no new requests are served during this period. Here are the uWSGI logs:

[deadlock-detector] a process holding a robust mutex died. recovering...
[deadlock-detector] a process holding a robust mutex died. recovering...
Fri Feb 16 07:30:38 2024 - worker 2 (pid: 2214) is taking too much time to die...NO MERCY !!!
DAMN ! worker 2 (pid: 2214) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 2 (new pid: 67074)
Fri Feb 16 07:30:40 2024 - worker 1 (pid: 2213) is taking too much time to die...NO MERCY !!!
Fri Feb 16 07:30:40 2024 - worker 3 (pid: 2215) is taking too much time to die...NO MERCY !!!
Fri Feb 16 07:30:40 2024 - worker 4 (pid: 2216) is taking too much time to die...NO MERCY !!!
WSGI app 0 (mountpoint='') ready in 2 seconds on interpreter 0x55936c12fb00 pid: 67074 (default app)
DAMN ! worker 1 (pid: 2213) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 1 (new pid: 67108)
DAMN ! worker 3 (pid: 2215) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 3 (new pid: 67109)
DAMN ! worker 4 (pid: 2216) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 4 (new pid: 67110)

Sometimes we also see the following error just after the deadlock message:

corrupted double-linked list
worker 2 killed successfully (pid: 133287)
Respawned uWSGI worker 2 (new pid: 201887)

Can you please let us know if we have any configuration-level issues here or if it's something completely unrelated?

System config:
Python: 3.6.9
OS: Ubuntu "22.04.3 LTS (Jammy Jellyfish)"
Django: 2.2.4
uWSGI: 2.0.18

Our machine config is as follows:

vCPUs | 8
Memory (GiB) | 16.0
Memory per vCPU (GiB) | 2.0
Physical Processor | AMD EPYC 7R13 Processor
Clock Speed (GHz) | 3.6
CPU Architecture | x86_64

Ulimits of our machine:

real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 62311
max locked memory           (kbytes, -l) 2000660
max memory size             (kbytes, -m) unlimited
open files                          (-n) 65536
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 62311
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

Furthermore, I am also attaching the uWSGI startup logs; I have a feeling they might help:

*** Starting uWSGI 2.0.18 (64bit) on [Wed Feb 21 06:21:02 2024] ***
compiled with version: 7.5.0 on 14 July 2020 07:19:16
os: Linux-6.2.0-1012-aws #12~22.04.1-Ubuntu SMP Thu Sep  7 14:01:24 UTC 2023
nodename: ip-10-61-33-114
machine: x86_64
clock source: unix
pcre jit disabled
detected number of CPU cores: 8
current working directory: /
detected binary path: /var/www/rest-api-python/env/bin/uwsgi
*** dumping internal routing table ***
[rule: 0] subject: request_uri regexp: ^/python/(.*) action: rewrite:/$1
*** end of the internal routing table ***
chdir() to /var/www/rest-api-python/
your processes number limit is 62311
your memory page size is 4096 bytes
 *** WARNING: you have enabled harakiri without post buffering. Slow upload could be rejected on post-unbuffered webservers ***
detected max file descriptor number: 1024
lock engine: pthread robust mutexes
thunder lock: enabled
uwsgi socket 0 bound to UNIX address /var/run/python/django_server.sock fd 4
Python version: 3.6.9 (default, May  5 2020, 05:01:21)  [GCC 7.5.0]
PEP 405 virtualenv detected: /var/www/rest-api-python/env
Set PythonHome to /var/www/rest-api-python/env
Python main interpreter initialized at 0x55f181a66b00
python threads support enabled
your server socket listen backlog is limited to 250 connections
your mercy for graceful operations on workers is 60 seconds
mapped 1096520 bytes (1070 KB) for 60 cores
*** Operational MODE: preforking+threaded ***
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 789)
spawned uWSGI worker 1 (pid: 972, cores: 15)
writing pidfile to /var/run/python/django-server.pid
spawned uWSGI worker 2 (pid: 973, cores: 15)
writing pidfile to /var/run/python/django-server.pid
spawned uWSGI worker 3 (pid: 974, cores: 15)
writing pidfile to /var/run/python/django-server.pid
spawned uWSGI worker 4 (pid: 975, cores: 15)
writing pidfile to /var/run/python/django-server.pid
writing pidfile to /var/run/python/django-server.pid
unable to stat() /var/run/python/reload, events will be triggered as soon as the file is created
WSGI app 0 (mountpoint='') ready in 7 seconds on interpreter 0x55f181a66b00 pid: 975 (default app)
WSGI app 0 (mountpoint='') ready in 7 seconds on interpreter 0x55f181a66b00 pid: 974 (default app)
WSGI app 0 (mountpoint='') ready in 7 seconds on interpreter 0x55f181a66b00 pid: 973 (default app)
WSGI app 0 (mountpoint='') ready in 7 seconds on interpreter 0x55f181a66b00 pid: 972 (default app)

hb-akhilesh commented Feb 21, 2024

malloc_consolidate(): unaligned fastbin chunk detected
corrupted double-linked list

These are the different kinds of messages we see, on different occasions, alongside the deadlock detection.

@hb-akhilesh

Increasing processes to 8 and reducing threads to 4 made some improvement.
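
For reference, a minimal sketch of what that change looks like in the [uwsgi] section, given the 8-vCPU machine described above (only the changed keys are shown):

[uwsgi]
# one worker per vCPU on the 8-vCPU machine, fewer threads per worker
processes = 8
threads   = 4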

methane (Contributor) commented Mar 19, 2024

Try #2615 (and maybe #2619).
