
WQ: worker does not seem to respect --idle-timeout #4012

Open
svandenhaute opened this issue Dec 20, 2024 · 4 comments

@svandenhaute

A worker was started using the following launch command:

work_queue_worker --parent-death --wall-time=252000 --cores=12 --gpus=1 --idle-timeout=20 node5019.dodrio.os 1026

Based on the --help output, I would expect this worker to quit after 20 seconds of not being given any task. However, the debug log of the WQ process that accepts the incoming worker connections looks like this:

...
2024/12/12 10:14:14.22 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50664): alive
2024/12/12 10:14:14.22 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50654): alive
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu529.dodrio.os (10.141.40.9:39582): alive
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu510.dodrio.os (10.141.35.10:42408): alive
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50668): alive
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50664): info tasks_running 0
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50654): info tasks_running 0
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu529.dodrio.os (10.141.40.9:39582): info tasks_running 1
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu510.dodrio.os (10.141.35.10:42408): info tasks_running 0
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50668): info tasks_running 0
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu529.dodrio.os (10.141.40.9:39568): alive
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu510.dodrio.os (10.141.35.10:42398): alive
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50672): alive
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu529.dodrio.os (10.141.40.9:39568): info tasks_running 0
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu510.dodrio.os (10.141.35.10:42398): info tasks_running 0
2024/12/12 10:14:14.23 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50672): info tasks_running 0
2024/12/12 10:16:14.41 work_queue_python[1869534] wq: tx to gpu507.dodrio.os (10.141.35.7:50664): check
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: Sent keepalive check to worker gpu507.dodrio.os (10.141.35.7:50664)
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: tx to gpu507.dodrio.os (10.141.35.7:50654): check
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: Sent keepalive check to worker gpu507.dodrio.os (10.141.35.7:50654)
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: tx to gpu529.dodrio.os (10.141.40.9:39582): check
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: Sent keepalive check to worker gpu529.dodrio.os (10.141.40.9:39582)
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: tx to gpu510.dodrio.os (10.141.35.10:42408): check
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: Sent keepalive check to worker gpu510.dodrio.os (10.141.35.10:42408)
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: tx to gpu507.dodrio.os (10.141.35.7:50668): check
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: Sent keepalive check to worker gpu507.dodrio.os (10.141.35.7:50668)
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: tx to gpu529.dodrio.os (10.141.40.9:39568): check
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: Sent keepalive check to worker gpu529.dodrio.os (10.141.40.9:39568)
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: tx to gpu510.dodrio.os (10.141.35.10:42398): check
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: Sent keepalive check to worker gpu510.dodrio.os (10.141.35.10:42398)
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: tx to gpu507.dodrio.os (10.141.35.7:50672): check
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: Sent keepalive check to worker gpu507.dodrio.os (10.141.35.7:50672)
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50664): alive
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50654): alive
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu529.dodrio.os (10.141.40.9:39582): alive
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu510.dodrio.os (10.141.35.10:42408): alive
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50668): alive
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu529.dodrio.os (10.141.40.9:39568): alive
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu510.dodrio.os (10.141.35.10:42398): alive
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50664): info tasks_running 0
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50654): info tasks_running 0
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu529.dodrio.os (10.141.40.9:39582): info tasks_running 1
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu510.dodrio.os (10.141.35.10:42408): info tasks_running 0
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50668): info tasks_running 0
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu529.dodrio.os (10.141.40.9:39568): info tasks_running 0
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu510.dodrio.os (10.141.35.10:42398): info tasks_running 0
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50672): alive
2024/12/12 10:16:14.43 work_queue_python[1869534] wq: rx from gpu507.dodrio.os (10.141.35.7:50672): info tasks_running 0

In other words, only one task is running, yet all the other workers remain alive and never disconnect due to the idle timeout. Is this intentional behavior? I would expect that if only one task is running, the idle workers would shut themselves down after some time.

This is cctools==7.11.0, used in combination with a WorkQueueExecutor from Parsl. I quickly checked with @benclifford about this, but he didn't spot an obvious mistake on my end.

@dthain (Member) commented Dec 20, 2024

Hello Sander - We use the --timeout feature quite regularly in testing, and what you are asking for ought to work. When a worker doesn't get a task for N seconds, it should disconnect. Something else must be going on here.

Two things to check:
1 - Can you send the entire debug log for us to look at?
2 - How exactly are you starting the workers? Are you running that command line manually, using the Parsl provider, using the Work Queue factory, or something else?

@svandenhaute (Author) commented Dec 20, 2024

debug.log

Workers are started using the Parsl provider, and the work_queue_worker command quoted above is taken directly from the (SLURM) submission script that the provider generates and submits, so we do know that the --idle-timeout argument was actually passed to the worker correctly.
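
For context, a minimal sketch of how such a setup typically passes worker flags through Parsl; the provider settings, partition name, and resource values below are illustrative placeholders, not the actual configuration used here:

```python
# Illustrative Parsl configuration; partition, block sizes, and port are
# placeholders, not the real settings from this report.
from parsl.config import Config
from parsl.executors import WorkQueueExecutor
from parsl.providers import SlurmProvider

config = Config(
    executors=[
        WorkQueueExecutor(
            label="wq",
            port=1026,
            # Extra flags appended to the work_queue_worker command line
            # in the generated SLURM submit script.
            worker_options="--wall-time=252000 --cores=12 --gpus=1 --idle-timeout=20",
            provider=SlurmProvider(
                partition="dodrio",      # placeholder partition name
                nodes_per_block=1,
                init_blocks=1,
            ),
        )
    ]
)
```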

@dthain (Member) commented Jan 9, 2025

Getting back to this after the holiday break...

I have been doing some testing here, and from my end the idle-timeout feature works as expected. It is baked right into the main loop of the program, and it doesn't interact with any other features. When it triggers, the worker should send an idle-disconnecting message to the manager (which then appears in the manager log) and also drop something like this in its own log:

2025/01/09 15:17:40.18 work_queue_worker[270215] notice: disconnecting from 127.0.0.1:1024 because I did not receive any task in 5 seconds (--idle-timeout).
2025/01/09 15:17:40.18 work_queue_worker[270215] wq: tx: info idle-disconnecting 5
2025/01/09 15:17:40.18 work_queue_worker[270215] wq: disconnecting from manager 127.0.0.1:1024
2025/01/09 15:17:40.18 work_queue_worker[270215] tcp: disconnected from 127.0.0.1 port 1024
2025/01/09 15:17:40.18 work_queue_worker[270215] wq: killing all outstanding tasks
2025/01/09 15:17:40.18 work_queue_worker[270215] wq: all data structures are clean
disconnected from manager localhost:1024
2025/01/09 15:17:45.18 work_queue_worker[270215] wq: cleaning workspace /tmp/worker-102744-270215
2025/01/09 15:17:45.18 work_queue_worker[270215] wq: trashing file cache to /tmp/worker-102744-270215/trash/iusxinl.1736453865186466
2025/01/09 15:17:45.18 work_queue_worker[270215] notice: stopping: no other managers available

The only actions that reset the idle timer are:
1 - The manager sends a task or file to the worker.
2 - The worker has a process waiting, running, or complete.

So I am a bit puzzled as to what is happening.
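
As a rough illustration only (not the actual cctools code), the rule above amounts to the following idle-timer logic; the event names and the toy event list are invented for the example:

```python
# Conceptual sketch of the idle-timeout rule described above -- NOT the
# cctools implementation. The timer resets whenever the manager sends work
# or the worker still has a process waiting/running/complete; otherwise the
# worker disconnects once idle_timeout seconds have passed.

def run_worker_loop(events, idle_timeout=20):
    """events: iterable of (timestamp, kind), where kind is one of
    'task_or_file_received', 'process_active', or 'nothing'."""
    last_activity = None
    for timestamp, kind in events:
        if last_activity is None:
            last_activity = timestamp
        if kind in ("task_or_file_received", "process_active"):
            last_activity = timestamp                     # timer resets
        elif timestamp - last_activity > idle_timeout:
            return f"idle-disconnecting after {timestamp - last_activity:.0f}s"
    return "still connected"

# A worker that gets a task at t=5 and then nothing should disconnect
# once more than 20 idle seconds have elapsed.
print(run_worker_loop([(0, "nothing"), (5, "task_or_file_received"),
                       (10, "nothing"), (26, "nothing")]))
```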

@dthain (Member) commented Jan 9, 2025

How about this:

Run your application again, and this time start a work_queue_worker manually from the command line with the arguments above, adding -d all -o worker.log to generate a detailed worker log.

If that times out as expected, then our problem likely lies in passing the command line correctly through all the layers of Parsl, Slurm, etc.

If it does not time out as expected, then the problem likely lies in WQ, and the debug log will give us some insight into that.
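
For a quick local check along these lines, one possible setup (a sketch, assuming the cctools work_queue Python bindings are installed; the port, the sleep task, and the 90-second observation window are arbitrary choices) is to run a manager with a single long task and point several workers with a short --idle-timeout at it:

```python
# Sketch of a local idle-timeout check, assuming the cctools Python bindings
# ("import work_queue") are available. Port 9123, the sleep task, and the
# 90 s window are arbitrary illustration values.
import time
import work_queue as wq

q = wq.WorkQueue(port=9123)
print(f"manager listening on port {q.port}")

# One long-running task keeps the manager busy, so any additional workers
# should sit idle, mirroring the situation in the report above.
t = wq.Task("sleep 300")
t.specify_cores(1)
q.submit(t)

# In other shells, start a few workers by hand, e.g.:
#   work_queue_worker -d all -o worker.log --idle-timeout=20 localhost 9123
# Only one worker should receive the task; the idle ones are expected to
# disconnect after roughly 20 s and log an "idle-disconnecting" message.

start = time.time()
while time.time() - start < 90:
    q.wait(5)  # services worker connections; returns when a task finishes or the timeout passes
    s = q.stats
    print(f"{time.time() - start:5.1f}s: {s.workers_connected} worker(s) connected, "
          f"{s.tasks_running} task(s) running")
```

If the idle workers disconnect in a standalone test like this but not under Parsl, that would point at the submit-script plumbing rather than the worker itself.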

@dthain dthain self-assigned this Jan 17, 2025