graceful stop worker when max_requests/realod_on_* #2615

methane · 2024-03-08T15:58:24Z

worker stops when reached max_requests or reload_on_*.

Lines 1216 to 1251 in 39f3ade

    
           	if (uwsgi.max_requests > 0 && uwsgi.workers[uwsgi.mywid].delta_requests >= (uwsgi.max_requests + ((uwsgi.mywid-1) * uwsgi.max_requests_delta)) 
        
           	    && (end_of_request - (uwsgi.workers[uwsgi.mywid].last_spawn * 1000000) >= uwsgi.min_worker_lifetime * 1000000)) { 
        
           		goodbye_cruel_world("max requests reached (%llu >= %llu)", 
        
           			(unsigned long long) uwsgi.workers[uwsgi.mywid].delta_requests, 
        
           			(unsigned long long) (uwsgi.max_requests + ((uwsgi.mywid-1) * uwsgi.max_requests_delta)) 
        
           		); 
        
           	} 
        
           	if (uwsgi.reload_on_as && (rlim_t) vsz >= uwsgi.reload_on_as && (end_of_request - (uwsgi.workers[uwsgi.mywid].last_spawn * 1000000) >= uwsgi.min_worker_lifetime * 1000000)) { 
        
           		goodbye_cruel_world("reload-on-as limit reached (%llu >= %llu)", 
        
           			(unsigned long long) (rlim_t) vsz, 
        
           			(unsigned long long) uwsgi.reload_on_as 
        
           		); 
        
           	} 
        
           	if (uwsgi.reload_on_rss && (rlim_t) rss >= uwsgi.reload_on_rss && (end_of_request - (uwsgi.workers[uwsgi.mywid].last_spawn * 1000000) >= uwsgi.min_worker_lifetime * 1000000)) { 
        
           		goodbye_cruel_world("reload-on-rss limit reached (%llu >= %llu)", 
        
           			(unsigned long long) (rlim_t) rss, 
        
           			(unsigned long long) uwsgi.reload_on_rss 
        
           		); 
        
           	} 
        
           #ifdef __linux__ 
        
           	if (uwsgi.reload_on_uss && (rlim_t) uss >= uwsgi.reload_on_uss && (end_of_request - (uwsgi.workers[uwsgi.mywid].last_spawn * 1000000) >= uwsgi.min_worker_lifetime * 1000000)) { 
        
           		goodbye_cruel_world("reload-on-uss limit reached (%llu >= %llu)", 
        
           			(unsigned long long) (rlim_t) uss, 
        
           			(unsigned long long) uwsgi.reload_on_uss 
        
           		); 
        
           	} 
        
           	if (uwsgi.reload_on_pss && (rlim_t) pss >= uwsgi.reload_on_pss && (end_of_request - (uwsgi.workers[uwsgi.mywid].last_spawn * 1000000) >= uwsgi.min_worker_lifetime * 1000000)) { 
        
           		goodbye_cruel_world("reload-on-pss limit reached (%llu >= %llu)", 
        
           			(unsigned long long) (rlim_t) pss, 
        
           			(unsigned long long) uwsgi.reload_on_pss 
        
           		); 
        
           	}

goodbye_cruel_world() is not graceful. It caused atexit not called.
If atexit stops daemon threads, worker won't stop until killed from master.

Reproduce

# mt_sample.py
import atexit, time, sys
import uwsgi

@atexit.register
def on_exit():
    print(f"on_exit: {uwsgi.worker_id()=}")
    sys.stdout.flush()

def application(env, start_response):
    time.sleep(1)
    start_response('200 OK', [('Content-Type','text/html')])
    return [b"Hello World"]

./uwsgi --wsgi-file=mt_sample.py --master --http-socket=:8000 --workers=1 --threads=80 --max-requests=20 --min-worker-lifetime=10 -L

I used hey to request. But any other load testing tools is OK.

./hey -c 80 -z 1m  'http://127.0.0.1:8000/'

master branch

*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 93920)
spawned uWSGI worker 1 (pid: 93921, cores: 80)
...The work of process 93921 is done (max requests reached (641 >= 20)). Seeya!
worker 1 killed successfully (pid: 93921)
Respawned uWSGI worker 1 (new pid: 94019)
...The work of process 94019 is done (max requests reached (721 >= 20)). Seeya!
worker 1 killed successfully (pid: 94019)
Respawned uWSGI worker 1 (new pid: 94099)
...The work of process 94099 is done (max requests reached (721 >= 20)). Seeya!
worker 1 killed successfully (pid: 94099)
Respawned uWSGI worker 1 (new pid: 94179)
...The work of process 94179 is done (max requests reached (721 >= 20)). Seeya!
worker 1 killed successfully (pid: 94179)
Respawned uWSGI worker 1 (new pid: 94260)
...The work of process 94260 is done (max requests reached (721 >= 20)). Seeya!
worker 1 killed successfully (pid: 94260)
Respawned uWSGI worker 1 (new pid: 94340)

atexit is not called.

After this patch

*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 94781)
spawned uWSGI worker 1 (pid: 94782, cores: 80)
...The work of process 94782 is done (max requests reached (402 >= 20)). Seeya!
on_exit: uwsgi.worker_id()=1
worker 1 killed successfully (pid: 94782)
Respawned uWSGI worker 1 (new pid: 94880)
...The work of process 94880 is done (max requests reached (721 >= 20)). Seeya!
on_exit: uwsgi.worker_id()=1
worker 1 killed successfully (pid: 94880)
Respawned uWSGI worker 1 (new pid: 94960)
...The work of process 94960 is done (max requests reached (721 >= 20)). Seeya!
on_exit: uwsgi.worker_id()=1
worker 1 killed successfully (pid: 94960)
Respawned uWSGI worker 1 (new pid: 95040)
...The work of process 95040 is done (max requests reached (721 >= 20)). Seeya!
on_exit: uwsgi.worker_id()=1
worker 1 killed successfully (pid: 95040)
Respawned uWSGI worker 1 (new pid: 95120)                                                                                                                                   ...The work of process 95120 is done (max requests reached (721 >= 20)). Seeya!
on_exit: uwsgi.worker_id()=1
worker 1 killed successfully (pid: 95120)
Respawned uWSGI worker 1 (new pid: 95200)

atexit is called

Relating issues

open-telemetry/opentelemetry-python#3640

methane · 2024-03-08T16:31:19Z

This PR includes #2484.

methane · 2024-03-10T10:31:39Z

opentelemetry uses atexit not only stopping daemon thread, but also flushing buffered metrics/logs/traces.
Please consider this.

xrmx · 2024-03-13T14:54:12Z

@methane thanks for PR. Could you please add the reproducer to a file in tests/ and adding both the uwsgi line to call it and the hey invocation as comments at the top?

core/loop.c

methane · 2024-03-19T09:25:15Z

Although this PR fixes worker hangs up, hey still reports EOF error.
I believe pthread_cancel() is really bad API and uwsgi must not use it.
May I remove pthread_cancel() in wait_for_threads()?
This is the only caller of wait_for_threads().

If thread takes too long to be finish, master process sends KILL signal anyway.

methane · 2024-03-19T14:59:25Z

For the record, atexit is not related at all. I can see many issues without atexit.
The root cause was pthread_cancel() from subthreads.

svanoort · 2024-03-29T15:12:41Z

@xrmx I know you're probably quite busy, but may I humbly request a re-review of this PR & merge/release if it looks good to you?

Unfortunately for us the issue this PR fixes causes a hard failure in production when our services using uWSGI are upgraded to Python versions containing https://bugs.python.org/issue44434 -- so the upgrade from 3.9.7 to 3.9.8 is a hard breaking change (and potentially 3.10 +upgrades) and we've been holding them back to 3.9.7 or lower. We're at the point where that upgrade is needed. I imagine we're probably not the only ones facing that situation since several others have reported the threading issue as well.

And kudos to @methane for doing the deep investigation to provide this fix!

Thank you so much, to both of you!

xrmx · 2024-04-06T20:15:43Z

Squashed and merged in #2626. Thanks!

When working to reproduce unbit#2615 I saw many strange "defunct" (zombie) workers. The master called waitpid(-1, ...) but it return 0 even there are some zombies. Finally, master sends KILL signal (MERCY) and worker is restarted. I believe this strange zombie was born from pthread_cancel. Subthreads calls pthread_cancel() for main thread and it cause strange process. pthread_cancel() is very hard to use and debug. I can not even attach the strange zombie with gdb --pid. I think it is not maintainable. In the end we can remove six_feet_under_lock and make wait_for_threads() static.

methane added 2 commits March 8, 2024 15:28

graceful stop worker when max_requests/realod_on_*

7686040

fixup

80917e0

xrmx reviewed Mar 13, 2024

View reviewed changes

core/loop.c Outdated Show resolved Hide resolved

methane added 2 commits March 18, 2024 12:44

fix indent and cancelled thread status

21ab71d

add reproducer

b4b8d5a

methane requested a review from xrmx March 19, 2024 06:29

add another sample for mutithread worker reload

cafb7df

This was referenced Mar 19, 2024

stop using pthread_cancel() #2619

Closed

uwsgi worker dying at high request load with error "taking too much time to die...NO MERCY !!!" #2611

Open

methane mentioned this pull request Mar 21, 2024

uwsgi + flask, got Respawned uWSGI worker, very frequently open-telemetry/opentelemetry-python#3640

Closed

fix code style

1cb0ec7

xrmx closed this Apr 6, 2024

xrmx mentioned this pull request Apr 7, 2024

core/uwsgi: stop using pthread_cancel() #2627

Merged

methane deleted the worker-graceful-stop branch July 8, 2024 08:06

methane mentioned this pull request Jul 8, 2024

Graceful reload|shutdown doesn't work when no requests coming #2653

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

graceful stop worker when max_requests/realod_on_* #2615

graceful stop worker when max_requests/realod_on_* #2615

methane commented Mar 8, 2024 •

edited

Loading

methane commented Mar 8, 2024

methane commented Mar 10, 2024

xrmx commented Mar 13, 2024

methane commented Mar 19, 2024

methane commented Mar 19, 2024

svanoort commented Mar 29, 2024

xrmx commented Apr 6, 2024

	if (uwsgi.max_requests > 0 && uwsgi.workers[uwsgi.mywid].delta_requests >= (uwsgi.max_requests + ((uwsgi.mywid-1) * uwsgi.max_requests_delta))
	&& (end_of_request - (uwsgi.workers[uwsgi.mywid].last_spawn * 1000000) >= uwsgi.min_worker_lifetime * 1000000)) {
	goodbye_cruel_world("max requests reached (%llu >= %llu)",
	(unsigned long long) uwsgi.workers[uwsgi.mywid].delta_requests,
	(unsigned long long) (uwsgi.max_requests + ((uwsgi.mywid-1) * uwsgi.max_requests_delta))
	);
	}

	if (uwsgi.reload_on_as && (rlim_t) vsz >= uwsgi.reload_on_as && (end_of_request - (uwsgi.workers[uwsgi.mywid].last_spawn * 1000000) >= uwsgi.min_worker_lifetime * 1000000)) {
	goodbye_cruel_world("reload-on-as limit reached (%llu >= %llu)",
	(unsigned long long) (rlim_t) vsz,
	(unsigned long long) uwsgi.reload_on_as
	);
	}

	if (uwsgi.reload_on_rss && (rlim_t) rss >= uwsgi.reload_on_rss && (end_of_request - (uwsgi.workers[uwsgi.mywid].last_spawn * 1000000) >= uwsgi.min_worker_lifetime * 1000000)) {
	goodbye_cruel_world("reload-on-rss limit reached (%llu >= %llu)",
	(unsigned long long) (rlim_t) rss,
	(unsigned long long) uwsgi.reload_on_rss
	);
	}

	#ifdef __linux__
	if (uwsgi.reload_on_uss && (rlim_t) uss >= uwsgi.reload_on_uss && (end_of_request - (uwsgi.workers[uwsgi.mywid].last_spawn * 1000000) >= uwsgi.min_worker_lifetime * 1000000)) {
	goodbye_cruel_world("reload-on-uss limit reached (%llu >= %llu)",
	(unsigned long long) (rlim_t) uss,
	(unsigned long long) uwsgi.reload_on_uss
	);
	}

	if (uwsgi.reload_on_pss && (rlim_t) pss >= uwsgi.reload_on_pss && (end_of_request - (uwsgi.workers[uwsgi.mywid].last_spawn * 1000000) >= uwsgi.min_worker_lifetime * 1000000)) {
	goodbye_cruel_world("reload-on-pss limit reached (%llu >= %llu)",
	(unsigned long long) (rlim_t) pss,
	(unsigned long long) uwsgi.reload_on_pss
	);
	}

graceful stop worker when max_requests/realod_on_* #2615

graceful stop worker when max_requests/realod_on_* #2615

Conversation

methane commented Mar 8, 2024 • edited Loading

Reproduce

master branch

After this patch

Relating issues

methane commented Mar 8, 2024

methane commented Mar 10, 2024

xrmx commented Mar 13, 2024

methane commented Mar 19, 2024

methane commented Mar 19, 2024

svanoort commented Mar 29, 2024

xrmx commented Apr 6, 2024

methane commented Mar 8, 2024 •

edited

Loading