
Cluster creation constantly failing because of existing scheduler in "Terminating" status #846

Closed
dbalabka opened this issue Dec 20, 2023 · 3 comments

Comments

@dbalabka
Contributor

dbalabka commented Dec 20, 2023

Cluster creation fails when there is another scheduler with the same name in "Terminating" status:
(screenshot: the existing scheduler pod shown stuck in "Terminating" status)

Stack trace:

File ~/.../.venv/lib/python3.10/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py:284, in KubeCluster.__init__(self, name, namespace, image, n_workers, resources, env, worker_command, auth, port_forward_cluster_ip, create_mode, shutdown_on_close, idle_timeout, resource_timeout, scheduler_service_type, custom_cluster_spec, scheduler_forward_port, jupyter, loop, asynchronous, **kwargs)
    282 if not called_from_running_loop:
    283     self._loop_runner.start()
--> 284     self.sync(self._start)

File ~/.../.venv/lib/python3.10/site-packages/distributed/utils.py:358, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    356     return future
    357 else:
--> 358     return sync(
    359         self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    360     )

File ~/.../.venv/lib/python3.10/site-packages/distributed/utils.py:434, in sync(loop, func, callback_timeout, *args, **kwargs)
    431         wait(10)
    433 if error is not None:
--> 434     raise error
    435 else:
    436     return result

File ~/.../.venv/lib/python3.10/site-packages/distributed/utils.py:408, in sync.<locals>.f()
    406         awaitable = wait_for(awaitable, timeout)
    407     future = asyncio.ensure_future(awaitable)
--> 408     result = yield future
    409 except Exception as exception:
    410     error = exception

File ~/.../.venv/lib/python3.10/site-packages/tornado/gen.py:767, in Runner.run(self)
    765 try:
    766     try:
--> 767         value = future.result()
    768     except Exception as e:
    769         # Save the exception for later. It's important that
    770         # gen.throw() not be called inside this try/except block
    771         # because that makes sys.exc_info behave unexpectedly.
    772         exc: Optional[Exception] = e

File ~/.../.venv/lib/python3.10/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py:322, in KubeCluster._start(self)
    320 else:
    321     self._log("Creating cluster")
--> 322     await self._create_cluster()
    324 await super()._start()
    325 self._log(f"Ready, dashboard available at {self.dashboard_link}")

File ~/.../.venv/lib/python3.10/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py:380, in KubeCluster._create_cluster(self)
    378 try:
    379     self._log("Waiting for scheduler pod")
--> 380     await wait_for_scheduler(
    381         self.name,
    382         self.namespace,
    383         timeout=self._resource_timeout,
    384     )
    385 except CrashLoopBackOffError as e:
    386     scheduler_pod = await Pod.get(
    387         namespace=self.namespace,
    388         label_selector=f"dask.org/component=scheduler,dask.org/cluster-name={self.name}",
    389     )

File ~/.../.venv/lib/python3.10/site-packages/dask_kubernetes/common/networking.py:205, in wait_for_scheduler(cluster_name, namespace, timeout)
    203 while True:
    204     try:
--> 205         pod = await Pod.get(
    206             label_selector=f"dask.org/component=scheduler,dask.org/cluster-name={cluster_name}",
    207             namespace=namespace,
    208         )
    209     except kr8s.NotFoundError:
    210         await asyncio.sleep(0.25)

File ~/.../.venv/lib/python3.10/site-packages/kr8s/_objects.py:199, in APIObject.get(cls, name, namespace, api, label_selector, field_selector, timeout, **kwargs)
    197         continue
    198     if len(resources) > 1:
--> 199         raise ValueError(
    200             f"Expected exactly one {cls.kind} object. Use selectors to narrow down the search."
    201         )
    202     return resources[0]
    203 raise NotFoundError(
    204     f"Could not find {cls.kind} {name} in namespace {namespace}."
    205 )

ValueError: Expected exactly one Pod object. Use selectors to narrow down the search.

Minimal Complete Verifiable Example:

It is complicated to reproduce because it requires a scheduler pod that hangs in the "Terminating" status. The easiest way is to:

  1. Create a cluster via a Python script.
  2. Kill the cluster:
kubectl delete daskcluster <name>
  3. Quickly try to recreate it via the Python script with the same name (see the sketch below).
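A minimal sketch of these steps, assuming the operator's KubeCluster API; the cluster name and namespace are illustrative:

# Step 1: create a cluster via the operator's KubeCluster.
from dask_kubernetes.operator import KubeCluster

cluster = KubeCluster(name="my-cluster", namespace="default", n_workers=1)

# Step 2: delete the DaskCluster resource out-of-band, e.g. from a shell:
#   kubectl delete daskcluster my-cluster
# The old scheduler pod may linger in "Terminating" for a short while.

# Step 3: immediately recreate a cluster with the same name. While the old
# scheduler pod is still terminating, the scheduler label selector matches
# two pods and Pod.get() raises the ValueError shown in the traceback above.
cluster = KubeCluster(name="my-cluster", namespace="default", n_workers=1)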

Anything else we need to know?:

Environment:

  • Dask version: 2023.11.0
  • Python version: 3.10
  • Operating System: Linux
  • Install method (conda, pip, source): poetry
dbalabka added a commit to dbalabka/dask-kubernetes that referenced this issue Dec 20, 2023
@dbalabka
Contributor Author

The proposed PR provides a possible solution. However, I don't believe it is the complete fix: it still does not protect against multiple running schedulers.

jacobtomlinson added a commit that referenced this issue Jan 5, 2024
@jacobtomlinson
Member

Thanks @dbalabka. I think having multiple running schedulers is an edge case that folks shouldn't run into. If you have any interest in making another PR to protect against that edge case then please feel free, but I think #847 should resolve this issue for most people.

@dbalabka
Contributor Author

dbalabka commented Jan 10, 2024

@jacobtomlinson, it would be great if dask-kubernetes could handle the exception and provide a better explanation of what is happening. The error "ValueError: Expected exactly one Pod object. Use selectors to narrow down the search." does not help to resolve the issue. I can add proper exception handling in another PR.
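For illustration, a rough sketch of what that handling might look like (not the merged fix): a hypothetical helper that performs the scheduler-pod lookup done in wait_for_scheduler and translates kr8s's generic ValueError into an actionable message. The helper name and the kr8s import path are assumptions; the label selector, NotFoundError retry, and sleep interval come from the traceback above.

import asyncio

import kr8s
from kr8s.asyncio.objects import Pod  # assumed import path for the async Pod object


async def get_scheduler_pod(cluster_name: str, namespace: str) -> Pod:
    # Hypothetical helper: look up the scheduler pod, retrying while it does
    # not exist yet, and explain the "duplicate scheduler" case clearly.
    selector = f"dask.org/component=scheduler,dask.org/cluster-name={cluster_name}"
    while True:
        try:
            return await Pod.get(label_selector=selector, namespace=namespace)
        except kr8s.NotFoundError:
            # Scheduler pod not created yet; keep polling.
            await asyncio.sleep(0.25)
        except ValueError as e:
            # More than one pod matched the selector, most likely because an
            # old scheduler with the same cluster name is still "Terminating".
            raise RuntimeError(
                f"Found multiple scheduler pods for cluster {cluster_name!r} in "
                f"namespace {namespace!r}. A previous scheduler may still be "
                "terminating; wait for it to be removed or pick another name."
            ) from e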
