Kubernetes Worker does not release sockets after Job completes #102
Comments
I will add the issue I have faced with Kubernetes and async S3 download tasks (using presigned URLs); not sure if it is relevant. I have also mentioned this in Slack. I have a task that downloads an S3 file using an async function. It runs perfectly fine with a Prefect agent running in a normal Ubuntu container in Rancher, but recently I have been trying out the Kubernetes worker with it, and the behaviour is strange. I did not see this issue in dev (fewer files), but it happened in prod, where the number of files is much larger. Of the 1400 files to be downloaded via the async task, about 30+ completed successfully at the beginning; the rest of the tasks just kept running perpetually until the timeout killed them. This also happens when I run a non-async .map() task that maps the list of files to the non-async S3 download function. If I loop through the files instead, it does not happen. It seems to happen during concurrent runs; a rough sketch of the pattern is below.
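For context, a minimal sketch of the mapped-download pattern described above, assuming Prefect 2.x and httpx; the task, flow, and URL names are illustrative, not the actual production code:

```python
# Minimal sketch of the mapped S3-download pattern (hypothetical names/URLs).
# Assumes Prefect 2.x and httpx; not the actual production code.
import httpx
from prefect import flow, task


@task
def download_via_presigned_url(url: str) -> bytes:
    # Each mapped task run downloads one file from its presigned URL.
    response = httpx.get(url, timeout=60)
    response.raise_for_status()
    return response.content


@flow
def download_all(presigned_urls: list[str]):
    # .map() submits one task run per URL; these run concurrently,
    # which is where the hangs were observed.
    return download_via_presigned_url.map(presigned_urls)


if __name__ == "__main__":
    # Hypothetical presigned URL for illustration only.
    download_all(["https://example-bucket.s3.amazonaws.com/file-1"])
```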
Has anyone been able to work around this besides periodically restarting the worker?
I have also hit a similar issue and am trying to find a workaround. I will share it here if I find a solution.
+1 on this - we see that the worker stops responding with "Failed to establish a new connection: [Errno 99] Cannot assign requested address". Notably, this seems to occur in high-volume cases where a large number of flow runs are executed daily.
@prabhatkgupta If you need a workaround, the simple solution is to restart the workers on a schedule before they reach the connection limit. One way to do this is sketched below.
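A sketch of one way to schedule such restarts, assuming the worker runs as a Deployment named prefect-worker in a prefect namespace (both hypothetical) and the Python Kubernetes client is available; patching the pod template annotation triggers a rolling restart, the same mechanism kubectl rollout restart uses:

```python
# Sketch of a scheduled-restart workaround (run from a CronJob or external scheduler).
# Deployment and namespace names are hypothetical placeholders.
from datetime import datetime, timezone

from kubernetes import client, config


def rollout_restart(deployment: str = "prefect-worker", namespace: str = "prefect") -> None:
    # Use load_kube_config() instead if running outside the cluster.
    config.load_incluster_config()
    apps = client.AppsV1Api()
    # Updating this annotation changes the pod template, so Kubernetes
    # performs a rolling restart of the worker pods.
    patch = {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()
                    }
                }
            }
        }
    }
    apps.patch_namespaced_deployment(name=deployment, namespace=namespace, body=patch)


if __name__ == "__main__":
    rollout_restart()
```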
Expectation / Proposal
After a Job responsible for a flow run completes, TCP connections on the worker pod should close and release their sockets. Instead, one TCP connection per flow run persists in state CLOSE_WAIT. Eventually, the worker pod runs out of sockets and flow runs begin to fail during calls to create_namespaced_job. Observed in prefect-kubernetes==0.3.1.
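For illustration only (this is not the prefect-kubernetes source), a sketch of the kind of leak the linked client issues describe: a kubernetes ApiClient whose connection pool is never closed keeps its socket around after create_namespaced_job returns, whereas closing the client releases it:

```python
# Illustrative sketch, not prefect-kubernetes code: an ApiClient that is never
# closed keeps its urllib3 connection pool (and sockets) alive after the call.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod


def create_job_leaky(job_manifest: dict, namespace: str = "default"):
    # BatchV1Api() creates its own ApiClient implicitly and it is never closed,
    # so the underlying TCP connection can linger (e.g. in CLOSE_WAIT).
    batch = client.BatchV1Api()
    return batch.create_namespaced_job(namespace=namespace, body=job_manifest)


def create_job_closed(job_manifest: dict, namespace: str = "default"):
    # Using ApiClient as a context manager closes its connection pool when done,
    # releasing the socket.
    with client.ApiClient() as api_client:
        batch = client.BatchV1Api(api_client)
        return batch.create_namespaced_job(namespace=namespace, body=job_manifest)
```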
.Traceback / Example
This can be reproduced by starting a Kubernetes worker using the Helm chart with all default configs and then running some flows. After the flows complete, the output of cat /proc/net/tcp | wc -l on the worker pod will eventually show an increase of exactly the number of flow runs. Running cat /proc/net/tcp shows these connections in state 08; from tcp_states.h, 08 is CLOSE_WAIT.
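The same check can be done programmatically; a small sketch that counts /proc/net/tcp entries in state 08 (CLOSE_WAIT), to be run inside the worker pod:

```python
# Counts sockets in CLOSE_WAIT (state 08) by parsing /proc/net/tcp,
# mirroring the manual check above. Run inside the worker pod (e.g. kubectl exec).
def count_close_wait(path: str = "/proc/net/tcp") -> int:
    count = 0
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            # Field layout: sl local_address rem_address st ...; st == "08" is CLOSE_WAIT.
            if len(fields) > 3 and fields[3] == "08":
                count += 1
    return count


if __name__ == "__main__":
    print(count_close_wait())
```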
Here are some issues reporting the same behavior for async and multi-threaded applications that use the Python Kubernetes client: