pyuwsgi + enable-threads + python3.12: workers do not recover after a reload #2659

Closed
asottile-sentry opened this issue Aug 12, 2024 · 11 comments

asottile-sentry commented Aug 12, 2024

reproduction:

python3.12 -m venv venv
. venv/bin/activate
pip install pyuwsgi
bash t.sh

# wsgi.py
def application(environ, start_response):
    print('.', end='', flush=True)

    status = '200 OK'
    output = b'!'

    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)

    return [output]

# t.sh
UWSGI_MASTER=true \
UWSGI_BINARY_PATH=$PWD/venv/bin/python3 \
UWSGI_VIRTUALENV=$PWD/venv \
UWSGI_WORKERS=2 \
UWSGI_ENABLE_THREADS=true \
UWSGI_MAX_REQUESTS=5 \
UWSGI_HTTP_SOCKET=localhost:9001 \
UWSGI_WSGI_FILE=wsgi.py \
UWSGI_DISABLE_LOGGING=1 \
    venv/bin/uwsgi

# some other tab / window
yes | head -300 | xargs --replace curl localhost:9001/health-check/; echo
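for reference, uWSGI reads UWSGI_*-prefixed environment variables as options, so the invocation in t.sh should be roughly equivalent to this command line (flag spellings are assumed from that convention, not taken from the report):

venv/bin/uwsgi --master --binary-path "$PWD/venv/bin/python3" \
    --virtualenv "$PWD/venv" --workers 2 --enable-threads \
    --max-requests 5 --http-socket localhost:9001 \
    --wsgi-file wsgi.py --disable-logging

--max-requests 5 is the important knob: each worker exits after a handful of requests, so the 300-request loop forces the master through the worker-respawn path where the hang appears.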

output:

$ bash t.sh 
*** Starting uWSGI 2.0.26 (64bit) on [Mon Aug 12 16:57:55 2024] ***
compiled with version: Apple LLVM 15.0.0 (clang-1500.3.9.4) on 12 August 2024 17:24:49
os: Darwin-23.5.0 Darwin Kernel Version 23.5.0: Wed May  1 20:12:58 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6000
nodename: FJJ4YYCWYX.attlocal.net
machine: arm64
clock source: unix
pcre jit disabled
detected number of CPU cores: 10
current working directory: /private/tmp/y
detected binary path: /tmp/y/venv/bin/python3
your processes number limit is 5333
your memory page size is 16384 bytes
detected max file descriptor number: 2560
lock engine: OSX spinlocks
thunder lock: disabled (you can enable it with --thunder-lock)
uwsgi socket 0 bound to TCP address localhost:9001 fd 3
Python version: 3.12.2 (main, Feb  6 2024, 20:19:44) [Clang 15.0.0 (clang-1500.1.0.2.5)]
--- Python VM already initialized ---
Python main interpreter initialized at 0x10154c4f0
python threads support enabled
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 218688 bytes (213 KB) for 2 cores
*** Operational MODE: preforking ***
WSGI app 0 (mountpoint='') ready in 0 seconds on interpreter 0x10154c4f0 pid: 63841 (default app)
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 63841)
spawned uWSGI worker 1 (pid: 63842, cores: 1)
spawned uWSGI worker 2 (pid: 63843, cores: 1)
................................................................................................................................................................................................................................................................................................................................................................................................................................................The work of process 63843 is done. Seeya!
....The work of process 63842 is done. Seeya!
worker 2 killed successfully (pid: 63843)
Respawned uWSGI worker 2 (new pid: 64286)
worker 1 killed successfully (pid: 63842)
Respawned uWSGI worker 1 (new pid: 64287)

(and then no additional output) -- the uwsgi processes also become effectively unkillable (not responding to ^C / ^\, SIGTERM, etc.)
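(since the stuck processes never run their signal handlers, SIGKILL -- which the kernel delivers without the process's cooperation -- is the one signal that still works; a cleanup sketch, the process-name pattern is an assumption:)

kill -9 <master-pid> <worker-pid>
# or match by name:
pkill -9 -f uwsgi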

the other terminal output:

$ yes | head -300 | xargs --replace curl localhost:9001/health-check/; echo
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
$ yes | head -300 | xargs --replace curl localhost:9001/health-check/; echo
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

this does not seem to be an issue with python 3.11, or without pyuwsgi (i.e. with a regular uwsgi build)
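(a quick way to cross-check that -- the same repro rebuilt on 3.11; assumes python3.11 is on PATH:)

rm -rf venv
python3.11 -m venv venv
. venv/bin/activate
pip install pyuwsgi
bash t.sh   # expected per the above: workers respawn and keep serving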

@asottile-sentry (Author)

cc @methane since you seem to have been poking around in --enable-threads and termination recently and maybe have an idea what's going on

methane commented Aug 13, 2024

Try #2654

@asottile (Contributor)

tried that patch out but it doesn't seem to fix it -- still getting hangs unfortunately

@asottile (Contributor)

if it's helpful -- an strace of the child processes while stuck shows:

FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0xb2bd50, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1576, tv_nsec=257230947}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0xb2bd50, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1576, tv_nsec=262893986}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0xb2bd50, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1576, tv_nsec=268841818}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0xb2bd50, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1576, tv_nsec=274623638}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)

the parent process:

wait4(-1, 0x7ffc745d4098, WNOHANG, NULL) = 0
epoll_wait(10, [], 1, 1000)             = 0
getsockopt(3, SOL_TCP, TCP_INFO, "\n\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0d\0\0\0"..., [104]) = 0
wait4(-1, 0x7ffc745d4098, WNOHANG, NULL) = 0
epoll_wait(10, [], 1, 1000)             = 0
getsockopt(3, SOL_TCP, TCP_INFO, "\n\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0d\0\0\0"..., [104]) = 0
wait4(-1, 0x7ffc745d4098, WNOHANG, NULL) = 0
epoll_wait(10, [], 1, 1000)             = 0
getsockopt(3, SOL_TCP, TCP_INFO, "\n\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0d\0\0\0"..., [104]) = 0
wait4(-1, 0x7ffc745d4098, WNOHANG, NULL) = 0
epoll_wait(10, [], 1, 1000)             = 0
getsockopt(3, SOL_TCP, TCP_INFO, "\n\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0d\0\0\0"..., [104]) = 0

will see if I can figure out more
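(traces like the above can be captured by attaching strace to the stuck pids -- the exact flags are an assumption, not necessarily what was used here:)

strace -f -p <parent-pid>   # the master; -f follows threads
strace -f -p <child-pid>    # a stuck worker, in another terminal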

asottile commented Aug 14, 2024

ah this is definitely more useful info -- here's the stack of the stuck master process (blocked in take_gil while respawning a worker):

#0  __futex_abstimed_wait_common64 (private=-1862589240, cancel=true, 
    abstime=0x7ffc90fb2510, op=137, expected=0, 
    futex_word=0xc68c30 <_PyRuntime+93552>) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=-1862589240, 
    abstime=0x7ffc90fb2510, clockid=1913253632, expected=0, 
    futex_word=0xc68c30 <_PyRuntime+93552>) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (
    futex_word=futex_word@entry=0xc68c30 <_PyRuntime+93552>, 
    expected=expected@entry=0, clockid=clockid@entry=1, 
    abstime=abstime@entry=0x7ffc90fb2510, private=private@entry=0)
    at ./nptl/futex-internal.c:139
#3  0x000078b1bc893e9b in __pthread_cond_wait_common (abstime=0x7ffc90fb2510, 
    clockid=1, mutex=0xc68c38 <_PyRuntime+93560>, 
    cond=0xc68c08 <_PyRuntime+93512>) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_timedwait64 (cond=cond@entry=0xc68c08 <_PyRuntime+93512>, 
    mutex=mutex@entry=0xc68c38 <_PyRuntime+93560>, 
    abstime=abstime@entry=0x7ffc90fb2510) at ./nptl/pthread_cond_wait.c:652
#5  0x0000000000654a1e in PyCOND_TIMEDWAIT (us=<optimized out>, 
    mut=0xc68c38 <_PyRuntime+93560>, cond=0xc68c08 <_PyRuntime+93512>)
    at ../Python/condvar.h:73
#6  take_gil (tstate=tstate@entry=0xcc6108 <_PyRuntime+475720>)
    at ../Python/ceval_gil.c:376
#7  0x0000000000654e80 in PyEval_RestoreThread (
    tstate=0xcc6108 <_PyRuntime+475720>) at ../Python/ceval_gil.c:708
#8  0x000078b1bbe6046c in uwsgi_python_master_fixup (step=<optimized out>)
    at plugins/python/python_plugin.c:1320
#9  0x000078b1bbe14866 in uwsgi_respawn_worker (wid=wid@entry=2)
    at core/master_utils.c:757
#10 0x000078b1bbe12a32 in master_loop (argv=0x2489670, environ=<optimized out>)
    at core/master.c:1084
#11 0x000078b1bbe54290 in uwsgi_run () at core/uwsgi.c:3305
#12 0x000078b1bbe5f0b7 in pyuwsgi_run (self=<optimized out>, 
    args=args@entry=(), kwds=kwds@entry=0x0) at plugins/pyuwsgi/pyuwsgi.c:159
#13 0x0000000000534e56 in cfunction_call (
    func=func@entry=<built-in method run of module object at remote 0x78b1bbf15b50>, args=args@entry=(), kwargs=kwargs@entry=0x0)
    at ../Objects/methodobject.c:537
#14 0x00000000004ce432 in _PyObject_MakeTpCall (
    tstate=tstate@entry=0xcc6108 <_PyRuntime+475720>, 
    callable=callable@entry=<built-in method run of module object at remote 0x78b1bbf15b50>, args=args@entry=0x78b1bcc66088, nargs=<optimized out>, 

(edited with debug line numbers against 3.12.4 + 2.0.26 + #2654)
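(a backtrace like this can be taken by attaching gdb to the live master -- a sketch, assuming the interpreter's debug symbols are installed:)

gdb -p <master-pid> -batch -ex bt   # C-level stack
# with CPython's python-gdb.py helpers loaded, the py-bt command
# additionally shows the Python-level stack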

@asottile-sentry (Author)

still an issue. I'm going to try to bisect cpython today and see if I can find what changed and where the bug lies

@asottile (Contributor)

alrighty -- here's the result of the bisection!

92d8bfffbf377e91d8b92666525cb8700bb1d5e8 is the first bad commit

python/cpython@92d8bff
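(for the record, roughly the bisection workflow -- the good/bad endpoints below are hypothetical, not the exact refs used:)

git clone https://github.com/python/cpython && cd cpython
git bisect start
git bisect bad v3.12.0    # hypothetical: a tag that hangs
git bisect good v3.11.0   # hypothetical: a tag that works
# at each step: build the checked-out revision, reinstall pyuwsgi
# against it, run the repro above, then mark the result:
./configure && make -j
git bisect good   # or: git bisect bad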

@asottile (Contributor)

created python/cpython#123079 with a potential fix

@asottile (Contributor)

btw I have a fix for this in #2660 -- if you're using pyuwsgi from PyPI, the fix is cherry-picked and available in 2.0.27a1
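(pip skips pre-releases by default, so to pick that version up from PyPI:)

pip install pyuwsgi==2.0.27a1   # an exact pin selects a pre-release
# or:
pip install --pre pyuwsgi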

xrmx commented Sep 8, 2024

Fixed in #2670

xrmx closed this as completed Sep 8, 2024