
WQ: worker does not seem to respect --idle-timeout #4012

Open
svandenhaute opened this issue Dec 20, 2024 · 4 comments

@svandenhaute

A worker was started using the following launch command:

work_queue_worker --parent-death --wall-time=252000 --cores=12 --gpus=1 --idle-timeout=20 node5019.dodrio.os 1026

Based on the --help output, I would expect this worker to quit after 20 seconds of not being given any task. However, the debug log of the WQ process that accepts the incoming worker connections looks like this:

...
2024/12/12 10:14:14.22 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50664): alive
2024/12/12 10:14:14.22 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50654): alive
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu529.dodrio.os (10.141.40.9:39582): alive
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu510.dodrio.os (10.141.35.10:42408): alive
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50668): alive
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50664): info tasks_running 0
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50654): info tasks_running 0
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu529.dodrio.os (10.141.40.9:39582): info tasks_running 1
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu510.dodrio.os (10.141.35.10:42408): info tasks_running 0
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50668): info tasks_running 0
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu529.dodrio.os (10.141.40.9:39568): alive
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu510.dodrio.os (10.141.35.10:42398): alive
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50672): alive
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu529.dodrio.os (10.141.40.9:39568): info tasks_running 0
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu510.dodrio.os (10.141.35.10:42398): info tasks_running 0
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50672): info tasks_running 0
2024/12/12 10:16:14.41 work_queue_python[1869534] wq: tx to gpu507.dodrio.os (10.141.35.7:50664): check
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: Sent keepalive check to worker gpu507.dodrio.os (10.141.35.7:50664)
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: tx to gpu507.dodrio.os (10.141.35.7:50654): check
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: Sent keepalive check to worker gpu507.dodrio.os (10.141.35.7:50654)
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: tx to gpu529.dodrio.os (10.141.40.9:39582): check
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: Sent keepalive check to worker gpu529.dodrio.os (10.141.40.9:39582)
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: tx to gpu510.dodrio.os (10.141.35.10:42408): check
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: Sent keepalive check to worker gpu510.dodrio.os (10.141.35.10:42408)
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: tx to gpu507.dodrio.os (10.141.35.7:50668): check
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: Sent keepalive check to worker gpu507.dodrio.os (10.141.35.7:50668)
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: tx to gpu529.dodrio.os (10.141.40.9:39568): check
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: Sent keepalive check to worker gpu529.dodrio.os (10.141.40.9:39568)
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: tx to gpu510.dodrio.os (10.141.35.10:42398): check
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: Sent keepalive check to worker gpu510.dodrio.os (10.141.35.10:42398)
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: tx to gpu507.dodrio.os (10.141.35.7:50672): check
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: Sent keepalive check to worker gpu507.dodrio.os (10.141.35.7:50672)
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50664): alive
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50654): alive
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu529.dodrio.os (10.141.40.9:39582): alive
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu510.dodrio.os (10.141.35.10:42408): alive
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50668): alive
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu529.dodrio.os (10.141.40.9:39568): alive
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu510.dodrio.os (10.141.35.10:42398): alive
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50664): info tasks_running 0
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50654): info tasks_running 0
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu529.dodrio.os (10.141.40.9:39582): info tasks_running 1
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu510.dodrio.os (10.141.35.10:42408): info tasks_running 0
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50668): info tasks_running 0
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu529.dodrio.os (10.141.40.9:39568): info tasks_running 0
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu510.dodrio.os (10.141.35.10:42398): info tasks_running 0
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50672): alive
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50672): info tasks_running 0

In other words, only one task is running, yet all the other workers remain alive and never disconnect due to the idle timeout. Is this intentional behavior? I would expect that if only one task is running, the idle workers would shut themselves down after some time.

This is cctools==7.11.0, used in combination with a WorkQueueExecutor from Parsl. I quickly checked with @benclifford about this, but he didn't spot an obvious mistake on my end.

@dthain (Member) commented Dec 20, 2024

Hello Sander - We use the --timeout feature quite regularly in testing, and what you are asking for ought to work. When a worker doesn't get a task for N seconds, it should disconnect. Something else must be going on here.

Two things to check:
1 - Can you send the entire debug log for us to look at?
2 - How exactly are you starting the workers? Are you running that command line manually, using the Parsl provider, using the Work Queue factory, or something else?

@svandenhaute (Author) commented Dec 20, 2024

debug.log

Workers are started using the Parsl provider, and the work_queue_worker command quoted above is taken directly from the (SLURM) submission script that the provider generates and submits, so we do know that the --idle-timeout argument was actually passed to the worker correctly.
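
For context, a minimal sketch of how such a setup typically passes worker flags through Parsl; the provider settings, partition name, and resource values below are illustrative placeholders, not the actual configuration used here:

```python
# Illustrative Parsl configuration; partition, block sizes, and port are
# placeholders, not the real settings from this report.
from parsl.config import Config
from parsl.executors import WorkQueueExecutor
from parsl.providers import SlurmProvider

config = Config(
    executors=[
        WorkQueueExecutor(
            label="wq",
            port=1026,
            # Extra flags appended to the work_queue_worker command line
            # in the generated SLURM submit script.
            worker_options="--wall-time=252000 --cores=12 --gpus=1 --idle-timeout=20",
            provider=SlurmProvider(
                partition="dodrio",      # placeholder partition name
                nodes_per_block=1,
                init_blocks=1,
            ),
        )
    ]
)
```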

@dthain (Member) commented Jan 9, 2025

Getting back to this after the holiday break...

I have been doing some testing here, and from my end the idle-timeout feature works as expected. It is baked right into the main loop of the program, and it doesn't interact with any other features. When it triggers, the worker should send an idle-disconnecting message to the manager (which then appears in the manager log) and also drop something like this in its own log:

2025/01/09 15:17:40.18 work_queue_worker[270215] notice: disconnecting from 127.0.0.1:1024 because I did not receive any task in 5 seconds (--idle-timeout).
2025/01/09 15:17:40.18 work_queue_worker[270215] wq: tx: info idle-disconnecting 5
2025/01/09 15:17:40.18 work_queue_worker[270215] wq: disconnecting from manager 127.0.0.1:1024
2025/01/09 15:17:40.18 work_queue_worker[270215] tcp: disconnected from 127.0.0.1 port 1024
2025/01/09 15:17:40.18 work_queue_worker[270215] wq: killing all outstanding tasks
2025/01/09 15:17:40.18 work_queue_worker[270215] wq: all data structures are clean
disconnected from manager localhost:1024
2025/01/09 15:17:45.18 work_queue_worker[270215] wq: cleaning workspace /tmp/worker-102744-270215
2025/01/09 15:17:45.18 work_queue_worker[270215] wq: trashing file cache to /tmp/worker-102744-270215/trash/iusxinl.1736453865186466
2025/01/09 15:17:45.18 work_queue_worker[270215] notice: stopping: no other managers available

The only actions that reset the idle timer are:
1 - The manager sends a task or file to the worker.
2 - The worker has a process waiting, running, or complete.

So I am a bit puzzled as to what is happening.
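
As a rough illustration only (not the actual cctools code), the rule above amounts to the following idle-timer logic; the event names and the toy event list are invented for the example:

```python
# Conceptual sketch of the idle-timeout rule described above -- NOT the
# cctools implementation. The timer resets whenever the manager sends work
# or the worker still has a process waiting/running/complete; otherwise the
# worker disconnects once idle_timeout seconds have passed.

def run_worker_loop(events, idle_timeout=20):
    """events: iterable of (timestamp, kind), where kind is one of
    'task_or_file_received', 'process_active', or 'nothing'."""
    last_activity = None
    for timestamp, kind in events:
        if last_activity is None:
            last_activity = timestamp
        if kind in ("task_or_file_received", "process_active"):
            last_activity = timestamp                     # timer resets
        elif timestamp - last_activity > idle_timeout:
            return f"idle-disconnecting after {timestamp - last_activity:.0f}s"
    return "still connected"

# A worker that gets a task at t=5 and then nothing should disconnect
# once more than 20 idle seconds have elapsed.
print(run_worker_loop([(0, "nothing"), (5, "task_or_file_received"),
                       (10, "nothing"), (26, "nothing")]))
```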

@dthain (Member) commented Jan 9, 2025

How about this:

Run your application again, and this time start a work_queue_worker manually from the command line with the arguments above, adding -d all -o worker.log to generate a detailed worker log.

If that times out as expected, then our problem likely lies in passing the command line correctly through all the layers of Parsl, Slurm, etc.

If it does not time out as expected, then the problem likely lies in WQ, and the debug log will give us some insight into that.
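
For a quick local check along these lines, one possible setup (a sketch, assuming the cctools work_queue Python bindings are installed; the port, the sleep task, and the 90-second observation window are arbitrary choices) is to run a manager with a single long task and point several workers with a short --idle-timeout at it:

```python
# Sketch of a local idle-timeout check, assuming the cctools Python bindings
# ("import work_queue") are available. Port 9123, the sleep task, and the
# 90 s window are arbitrary illustration values.
import time
import work_queue as wq

q = wq.WorkQueue(port=9123)
print(f"manager listening on port {q.port}")

# One long-running task keeps the manager busy, so any additional workers
# should sit idle, mirroring the situation in the report above.
t = wq.Task("sleep 300")
t.specify_cores(1)
q.submit(t)

# In other shells, start a few workers by hand, e.g.:
#   work_queue_worker -d all -o worker.log --idle-timeout=20 localhost 9123
# Only one worker should receive the task; the idle ones are expected to
# disconnect after roughly 20 s and log an "idle-disconnecting" message.

start = time.time()
while time.time() - start < 90:
    q.wait(5)  # services worker connections; returns when a task finishes or the timeout passes
    s = q.stats
    print(f"{time.time() - start:5.1f}s: {s.workers_connected} worker(s) connected, "
          f"{s.tasks_running} task(s) running")
```

If the idle workers disconnect in a standalone test like this but not under Parsl, that would point at the submit-script plumbing rather than the worker itself.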

@dthain dthain self-assigned this Jan 17, 2025