This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Kubernetes Worker does not release sockets after Job completes #102

Closed
kevingrismore opened this issue Dec 5, 2023 · 5 comments · Fixed by #111

Comments

@kevingrismore
Contributor

kevingrismore commented Dec 5, 2023

Expectation / Proposal

After a Job responsible for a flow run completes, TCP connections on the worker pod should close and release sockets. Instead, one TCP connection per flow run persists in state CLOSE_WAIT. Eventually, the worker pod will run out of sockets and flow runs will begin to fail during calls to create_namespaced_job.

Observed in prefect-kubernetes==0.3.1.

Traceback / Example

Failed to submit flow run 'c34aed0d-0396-424c-bc88-b8497b79ba63' to infrastructure.

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.11/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/usr/local/lib/python3.11/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
OSError: [Errno 99] Cannot assign requested address

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 404, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 1058, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 363, in connect
    self.sock = conn = self._new_conn()
  File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f8e2acb28d0>: Failed to establish a new connection: [Errno 99] Cannot assign requested address

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/workers/base.py", line 896, in _submit_run_and_capture_errors
    result = await self.run(
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 567, in run
    job = await run_sync_in_worker_thread(
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.11/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 763, in _create_job
    job = batch_client.create_namespaced_job(
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/api/batch_v1_api.py", line 210, in create_namespaced_job
    return self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/api/batch_v1_api.py", line 309, in create_namespaced_job_with_http_info
    return self.api_client.call_api(
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/rest.py", line 279, in POST
    return self.request("POST", url,
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/rest.py", line 172, in request
    r = self.pool_manager.request(
  File "/usr/local/lib/python3.11/site-packages/urllib3/request.py", line 81, in request
    return self.request_encode_body(
  File "/usr/local/lib/python3.11/site-packages/urllib3/request.py", line 173, in request_encode_body
    return self.urlopen(method, url, **extra_kw)
  File "/usr/local/lib/python3.11/site-packages/urllib3/poolmanager.py", line 376, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 827, in urlopen
    return self.urlopen(
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 827, in urlopen
    return self.urlopen(
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 827, in urlopen
    return self.urlopen(
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 799, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.11/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='172.20.0.1', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/prefect2-flows/jobs (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f8e2acb28d0>: Failed to establish a new connection: [Errno 99] Cannot assign requested address'))

This can be reproduced by starting a Kubernetes worker from the Helm chart with all default configs and then running some flows. After the flows complete, the output of cat /proc/net/tcp | wc -l on the worker pod eventually shows an increase of exactly the number of flow runs. Running cat /proc/net/tcp shows these connections in state 08 (a small script for counting them is sketched after the sample output):

23: 08013C0A:EBD2 0100400A:01BB 08 00000000:00000019 00:00000000 00000000  1001        0 137269 1 0000000000000000 20 4 12 10 -1                   
24: 08013C0A:9874 0100400A:01BB 08 00000000:00000019 00:00000000 00000000  1001        0 156620 1 0000000000000000 20 4 12 10 -1                   
25: 08013C0A:98D6 0100400A:01BB 08 00000000:00000019 00:00000000 00000000  1001        0 171986 1 0000000000000000 20 4 12 10 -1                   
26: 08013C0A:98EA 0100400A:01BB 08 00000000:00000019 00:00000000 00000000  1001        0 172205 1 0000000000000000 20 4 12 10 -1
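
The same check can be scripted; here is a minimal sketch (not part of the worker) that parses /proc/net/tcp and counts sockets in state 08, so the leak can be watched over time:

# Sketch only: count sockets in CLOSE_WAIT (state 08) by parsing /proc/net/tcp.
# Run it inside the worker pod, e.g. via kubectl exec.
CLOSE_WAIT = "08"

def count_close_wait(path: str = "/proc/net/tcp") -> int:
    count = 0
    with open(path) as tcp_table:
        next(tcp_table)  # skip the header row
        for line in tcp_table:
            fields = line.split()
            # fields: sl, local_address, rem_address, st, ...
            if len(fields) > 3 and fields[3] == CLOSE_WAIT:
                count += 1
    return count

if __name__ == "__main__":
    print(count_close_wait())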

from tcp_states.h:

enum {
    TCP_ESTABLISHED = 1,
    TCP_SYN_SENT,
    TCP_SYN_RECV,
    TCP_FIN_WAIT1,
    TCP_FIN_WAIT2,
    TCP_TIME_WAIT,
    TCP_CLOSE,
    TCP_CLOSE_WAIT,
    TCP_LAST_ACK,
    TCP_LISTEN,
    TCP_CLOSING,    /* Now a valid state */
    TCP_NEW_SYN_RECV,

    TCP_MAX_STATES  /* Leave at the end! */
};

08 is CLOSE_WAIT

The CLOSE_WAIT state indicates that the remote end of the connection has finished transmitting data and that the remote application has issued a close(2) or shutdown(2) call. The local TCP stack is now waiting for the local application that owns the socket to close(2) the local socket as well.
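
In other words, the worker process is the side that still has to call close(2). One plausible mitigation, sketched below, is to make sure the kubernetes ApiClient used for job submission is closed and its urllib3 connection pool cleared once the Job has been created. This is not the worker's actual code; the client construction and the rest_client.pool_manager access are assumptions about the kubernetes/urllib3 internals:

# Sketch only: assumes a kubernetes client version where ApiClient supports the
# context-manager protocol, and assumes clearing the urllib3 PoolManager is
# enough to close the lingering half-open sockets.
from kubernetes import client, config

def create_job(job_manifest: dict, namespace: str = "prefect2-flows"):
    config.load_incluster_config()  # the worker runs inside the cluster
    with client.ApiClient() as api_client:
        batch_client = client.BatchV1Api(api_client)
        job = batch_client.create_namespaced_job(namespace=namespace, body=job_manifest)
        # Drop every pooled connection so nothing is left half-open
        # after the API server closes its end.
        api_client.rest_client.pool_manager.clear()
    return job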

Here are some issues reporting the same behavior for async and multi-threaded applications that use the Python Kubernetes client:

@ytl-liva

ytl-liva commented Dec 8, 2023

I'll add an issue I've faced with K8s and S3 download async tasks (using presigned URLs). Not sure if it is relevant. I have also mentioned this in Slack.

I have a task that downloads an S3 file using an async function. It runs perfectly fine under a Prefect agent running in a normal Ubuntu container in Rancher. Recently I have been trying out the Kubernetes worker with it. The behavior is strange: I did not see this issue in dev (fewer files), but it happened in prod, where the number of files is much larger. Of the 1400 files to be downloaded via the async task, about 30+ completed successfully at the beginning; then the rest of the tasks just continued running perpetually until the timeout killed them.

This also happens when I try to run a non-async .map() task, where I map the list of files to the non-async S3 download function. If I loop through it, it does not happen. It seems to happen during concurrent runs.

@czechdave

Has anyone been able to work around this besides periodically restarting the worker?

@prabhatkgupta

I have also hit a similar issue and am trying to find a workaround. I will share it here if I find one.

@prefectcboyd

+1 on this. We see the worker stop responding with "Failed to establish a new connection: [Errno 99] Cannot assign requested address".

Notably, this seems to occur in high-volume cases where a large number of flow runs are executed daily.

@brandonwissmann

@prabhatkgupta If you need a workaround, the simple solution is to restart the workers on a schedule before they reach the connection limit.
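
For anyone scripting that, a rough sketch of a scheduled restart with the Python client follows; the deployment name, namespace, and the rollout-restart-style annotation patch are assumptions, not something from this thread:

# Sketch only: emulate "kubectl rollout restart" for the worker deployment.
# Run it from a CronJob before the worker approaches the socket limit.
from datetime import datetime, timezone

from kubernetes import client, config

def restart_worker(deployment: str = "prefect-worker", namespace: str = "prefect2"):
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    patch = {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        # Same annotation kubectl uses to trigger a rolling restart.
                        "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()
                    }
                }
            }
        }
    }
    client.AppsV1Api().patch_namespaced_deployment(deployment, namespace, patch)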
