-
Hi, this sounds mildly concerning. The two most probable causes are networking issues, or the worker being so overloaded that it cannot even send the heartbeat messages. I hope the connection doesn't time out, as the default heartbeat is quite short (8s). You can try to increase it to see if it helps. It might also be useful to run the worker and the server with debug logging.
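A minimal sketch of the suggestion above, assuming `hq worker start` accepts a duration for `--heartbeat` (the `30s` value is an illustrative guess, not a tuned recommendation):

```shell
# Start a worker with a longer heartbeat interval than the 8s default,
# so transient network/filesystem hiccups are less likely to kill it.
hq worker start --heartbeat 30s
```

Check `hq worker start --help` for the exact duration syntax your HyperQueue version expects.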
-
I've noticed that quite a large fraction of my workers stop running in the `HEARTBEAT LOST` or `CONNECTION LOST` states (mostly the former). This typically happens quite some time before the scheduler walltime is reached (I also see that some do end in the `TIME LIMIT REACHED` state, but those are a minority). Is there some way to figure out what went wrong, and how to avoid these workers stopping? The file system can be a bit finicky on the cluster where I'm running (LUMI), so it's possible the connection times out and stops the worker when it shouldn't. I suppose increasing the heartbeat with `hq worker start --heartbeat` might help? I see there is also the `--on-server-lost` option; would this help to avoid `CONNECTION LOST` states? The server has not stopped running at any point, I'm sure.
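A sketch combining the two flags mentioned above; this is an assumption about the CLI, not verified against your HyperQueue version, so confirm the accepted values with `hq worker start --help`:

```shell
# Longer heartbeat to ride out filesystem/network stalls, plus a policy
# for what the worker does if it loses the server connection.
# "finish-running" is assumed here to mean: finish current tasks instead
# of stopping immediately.
hq worker start --heartbeat 30s --on-server-lost finish-running
```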