-
Hi, this sounds mildly concerning. The two most probable causes are networking issues, or the worker being so overloaded that it cannot even send the heartbeat messages. I hope the connection doesn't time out, as the default heartbeat is quite short (8s). You can try to increase it to see if it helps. It might also be useful to run the worker and the server with debug logging.
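A minimal sketch of the suggestion above, assuming `hq worker start` accepts a duration for `--heartbeat` (the `30s` value is an illustrative guess, not a tuned recommendation):

```shell
# Start a worker with a longer heartbeat interval than the 8s default,
# so transient network/filesystem hiccups are less likely to kill it.
hq worker start --heartbeat 30s
```

Check `hq worker start --help` for the exact duration syntax your HyperQueue version expects.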
-
I've noticed that quite a large fraction of my workers stop running in the `HEARTBEAT LOST` or `CONNECTION LOST` states (mostly the former). This typically happens quite some time before the scheduler walltime is reached (I also see that some do end in the `TIME LIMIT REACHED` state, but those are a minority). Is there some way to figure out what went wrong, and how to avoid these workers stopping? The file system can be a bit finicky on the cluster where I'm running (LUMI), so it's possible the connection times out and stops the worker when it shouldn't. I suppose increasing the heartbeat with `hq worker start --heartbeat` might help? I see there is also the `--on-server-lost` option; would this help to avoid `CONNECTION LOST` states? The server has not stopped running at any point, I'm sure.
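A sketch combining the two flags mentioned above; this is an assumption about the CLI, not verified against your HyperQueue version, so confirm the accepted values with `hq worker start --help`:

```shell
# Longer heartbeat to ride out filesystem/network stalls, plus a policy
# for what the worker does if it loses the server connection.
# "finish-running" is assumed here to mean: finish current tasks instead
# of stopping immediately.
hq worker start --heartbeat 30s --on-server-lost finish-running
```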