Skip to content

Commit

Permalink
Skypilot with Kubernetes (#3033)
Browse files Browse the repository at this point in the history
* Add Skypilot Kubernetes integration for remote orchestration

This commit adds the Skypilot Kubernetes integration to enable remote orchestration of ZenML pipelines on VMs. The integration includes the necessary initialization, configuration, and flavor classes. It also adds the Skypilot Kubernetes orchestrator and its settings. This integration provides an alternative to the local orchestrator for running pipelines on Kubernetes.

* Add Skypilot Kubernetes integration for remote orchestration

* Refactor SkypilotKubernetesOrchestrator prepare_environment_variable method

* Refactor SkypilotKubernetesOrchestrator prepare_environment_variable method

* Refactor SkypilotKubernetesOrchestrator prepare_environment_variable method

* Refactor SkypilotBaseOrchestrator setup_credentials method

* Refactor SkypilotBaseOrchestrator setup_credentials method and SkypilotKubernetesOrchestrator prepare_environment_variable method

* Refactor SkypilotBaseOrchestrator setup_credentials method and SkypilotKubernetesOrchestrator prepare_environment_variable method

* Update src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm_orchestrator.py

Co-authored-by: Stefan Nica <[email protected]>

* Refactor SkypilotKubernetesOrchestrator to handle missing service connector

* Refactor SkypilotBaseOrchestrator to use the correct virtual environment path for Kubernetes

* specll check

* fix run command

* Refactor SkypilotKubernetesVMOrchestrator to remove unnecessary code

* Refactor SkypilotKubernetesVMOrchestrator to remove unnecessary code

---------

Co-authored-by: Stefan Nica <[email protected]>
  • Loading branch information
safoinme and stefannica authored Sep 25, 2024
1 parent 3eb3d0d commit b711963
Show file tree
Hide file tree
Showing 9 changed files with 419 additions and 26 deletions.
77 changes: 77 additions & 0 deletions docs/book/component-guide/orchestrators/skypilot-vm.md
Original file line number Diff line number Diff line change
Expand Up @@ -242,6 +242,55 @@ The Lambda Labs orchestrator does not support some of the features like `job_rec
While testing the orchestrator, we noticed that the Lambda Labs orchestrator does not support the `down` flag. This means the orchestrator will not automatically tear down the cluster after all jobs finish. We recommend manually tearing down the cluster after all jobs finish to avoid unnecessary costs.
{% endhint %}
{% endtab %}
{% tab title="Kubernetes" %}
We need first to install the SkyPilot integration for Kubernetes, using the following two commands:
```shell
zenml integration install skypilot_kubernetes
```
To provision skypilot on kubernetes cluster, your orchestrator stack components needs to be configured to authenticate with a
[Service Connector](../../how-to/auth-management/service-connectors-guide.md). To configure the Service Connector, you need to register a new service connector configured with the appropriate credentials and permissions to access the K8s cluster. You can then use the service connector to configure your registered the Orchestrator stack component using the following command:
First, check that the Kubernetes service connector type is available using the following command:
```shell
zenml service-connector list-types --type kubernetes
```
```shell
┏━━━━━━━━━━━━┯━━━━━━━━━━━━┯━━━━━━━━━━━━┯━━━━━━━━━━━┯━━━━━━━┯━━━━━━━━┓
┃ │ │ RESOURCE │ AUTH │ │ ┃
┃ NAME │ TYPE │ TYPES │ METHODS │ LOCAL │ REMOTE ┃
┠────────────┼────────────┼────────────┼───────────┼───────┼────────┨
┃ Kubernetes │ 🌀 │ 🌀 │ password │ ✅ │ ✅ ┃
┃ Service │ kubernetes │ kubernetes │ token │ │ ┃
┃ Connector │ │ -cluster │ │ │ ┃
┗━━━━━━━━━━━━┷━━━━━━━━━━━━┷━━━━━━━━━━━━┷━━━━━━━━━━━┷━━━━━━━┷━━━━━━━━┛
```
Next, configure a service connector using the CLI or the dashboard with the AWS credentials. For example, the following command uses the local AWS CLI credentials to auto-configure the service connector:
```shell
zenml service-connector register kubernetes-skypilot --type kubernetes -i
```
This will automatically configure the service connector with the appropriate credentials and permissions to provision VMs on AWS. You can then use the service connector to configure your registered VM Orchestrator stack component using the following command:
```shell
# Register the orchestrator
zenml orchestrator register <ORCHESTRATOR_NAME> --flavor sky_kubernetes
# Connect the orchestrator to the service connector
zenml orchestrator connect <ORCHESTRATOR_NAME> --connector kubernetes-skypilot
# Register and activate a stack with the new orchestrator
zenml stack register <STACK_NAME> -o <ORCHESTRATOR_NAME> ... --set
```
{% hint style="warning" %}
Some of the features like `job_recovery`, `disk_tier`, `image_id`, `zone`, `idle_minutes_to_autostop`, `disk_size`, `use_spot` are not supported by the Kubernetes orchestrator. It is recommended not to use these features with the Kubernetes orchestrator and not to use [step-specific settings](skypilot-vm.md#configuring-step-specific-resources).
{% endhint %}
{% endtab %}
{% endtabs %}
#### Additional Configuration
Expand Down Expand Up @@ -392,6 +441,34 @@ skypilot_settings = SkypilotLambdaOrchestratorSettings(
)
@pipeline(
settings={
"orchestrator": skypilot_settings
}
)
```
{% endtab %}

{% tab title="Kubernetes" %}

**Code Example:**

```python
from zenml.integrations.skypilot_kubernetes.flavors.skypilot_orchestrator_kubernetes_vm_flavor import SkypilotKubernetesOrchestratorSettings
skypilot_settings = SkypilotKubernetesOrchestratorSettings(
cpus="2",
memory="16",
accelerators="V100:2",
image_id="ami-1234567890abcdef0",
disk_size=100,
cluster_name="my_cluster",
retry_until_up=True,
stream_logs=True
docker_run_args=["--gpus=all"]
)
@pipeline(
settings={
"orchestrator": skypilot_settings
Expand Down
1 change: 1 addition & 0 deletions src/zenml/integrations/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -69,6 +69,7 @@
from zenml.integrations.skypilot_gcp import SkypilotGCPIntegration # noqa
from zenml.integrations.skypilot_azure import SkypilotAzureIntegration # noqa
from zenml.integrations.skypilot_lambda import SkypilotLambdaIntegration # noqa
from zenml.integrations.skypilot_kubernetes import SkypilotKubernetesIntegration # noqa
from zenml.integrations.slack import SlackIntegration # noqa
from zenml.integrations.spark import SparkIntegration # noqa
from zenml.integrations.tekton import TektonIntegration # noqa
Expand Down
1 change: 1 addition & 0 deletions src/zenml/integrations/constants.py
Original file line number Diff line number Diff line change
Expand Up @@ -64,6 +64,7 @@
SKYPILOT_GCP = "skypilot_gcp"
SKYPILOT_AZURE = "skypilot_azure"
SKYPILOT_LAMBDA = "skypilot_lambda"
SKYPILOT_KUBERNETES = "skypilot_kubernetes"
SLACK = "slack"
SPARK = "spark"
TEKTON = "tekton"
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -250,6 +250,7 @@ def prepare_or_run_pipeline(
entrypoint_str = " ".join(command)
arguments_str = " ".join(args)

task_envs = environment
docker_environment_str = " ".join(
f"-e {k}={v}" for k, v in environment.items()
)
Expand All @@ -271,29 +272,33 @@ def prepare_or_run_pipeline(
f"sudo docker login --username $DOCKER_USERNAME --password "
f"$DOCKER_PASSWORD {stack.container_registry.config.uri}"
)
task_envs = {
"DOCKER_USERNAME": docker_username,
"DOCKER_PASSWORD": docker_password,
}
task_envs["DOCKER_USERNAME"] = docker_username
task_envs["DOCKER_PASSWORD"] = docker_password
else:
setup = None
task_envs = None

# Run the entire pipeline

# Set the service connector AWS profile ENV variable
self.prepare_environment_variable(set=True)

try:
if isinstance(self.cloud, sky.clouds.Kubernetes):
run_command = f"${{VIRTUAL_ENV:+$VIRTUAL_ENV/bin/}}{entrypoint_str} {arguments_str}"
setup = None
down = False
idle_minutes_to_autostop = None
else:
run_command = f"sudo docker run --rm {custom_run_args}{docker_environment_str} {image} {entrypoint_str} {arguments_str}"
down = settings.down
idle_minutes_to_autostop = settings.idle_minutes_to_autostop
task = sky.Task(
run=f"sudo docker run --rm {custom_run_args}{docker_environment_str} {image} {entrypoint_str} {arguments_str}",
run=run_command,
setup=setup,
envs=task_envs,
)
logger.debug(
f"Running run: sudo docker run --rm {custom_run_args}{docker_environment_str} {image} {entrypoint_str} {arguments_str}"
)
logger.debug(f"Running run: {setup}")
logger.debug(f"Running run: {run_command}")

task = task.set_resources(
sky.Resources(
cloud=self.cloud,
Expand All @@ -306,15 +311,24 @@ def prepare_or_run_pipeline(
job_recovery=settings.job_recovery,
region=settings.region,
zone=settings.zone,
image_id=settings.image_id,
image_id=image
if isinstance(self.cloud, sky.clouds.Kubernetes)
else settings.image_id,
disk_size=settings.disk_size,
disk_tier=settings.disk_tier,
)
)

# Set the cluster name
cluster_name = settings.cluster_name
if cluster_name is None:
if settings.cluster_name:
sky.exec(
task,
settings.cluster_name,
down=down,
stream_logs=settings.stream_logs,
backend=None,
detach_run=True,
)
else:
# Find existing cluster
for i in sky.status(refresh=True):
if isinstance(
Expand All @@ -324,21 +338,19 @@ def prepare_or_run_pipeline(
logger.info(
f"Found existing cluster {cluster_name}. Reusing..."
)
if cluster_name is None:
cluster_name = self.sanitize_cluster_name(
f"{orchestrator_run_name}"
)

# Launch the cluster
sky.launch(
task,
cluster_name,
retry_until_up=settings.retry_until_up,
idle_minutes_to_autostop=settings.idle_minutes_to_autostop,
down=settings.down,
stream_logs=settings.stream_logs,
detach_setup=True,
)
# Launch the cluster
sky.launch(
task,
cluster_name,
retry_until_up=settings.retry_until_up,
idle_minutes_to_autostop=idle_minutes_to_autostop,
down=down,
stream_logs=settings.stream_logs,
detach_setup=True,
)

except Exception as e:
logger.error(f"Pipeline run failed: {e}")
Expand Down
52 changes: 52 additions & 0 deletions src/zenml/integrations/skypilot_kubernetes/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
# Copyright (c) ZenML GmbH 2024. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at:
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied. See the License for the specific language governing
# permissions and limitations under the License.
"""Initialization of the Skypilot Kubernetes integration for ZenML.
The Skypilot integration sub-module powers an alternative to the local
orchestrator for a remote orchestration of ZenML pipelines on VMs.
"""
from typing import List, Type

from zenml.integrations.constants import (
SKYPILOT_KUBERNETES,
)
from zenml.integrations.integration import Integration
from zenml.stack import Flavor

SKYPILOT_KUBERNETES_ORCHESTRATOR_FLAVOR = "vm_kubernetes"


class SkypilotKubernetesIntegration(Integration):
"""Definition of Skypilot Kubernetes Integration for ZenML."""

NAME = SKYPILOT_KUBERNETES
# all 0.6.x versions of skypilot[kubernetes] are compatible
REQUIREMENTS = ["skypilot[kubernetes]~=0.6.1"]
APT_PACKAGES = ["openssh-client", "rsync"]

@classmethod
def flavors(cls) -> List[Type[Flavor]]:
"""Declare the stack component flavors for the Skypilot Kubernetes integration.
Returns:
List of stack component flavors for this integration.
"""
from zenml.integrations.skypilot_kubernetes.flavors import (
SkypilotKubernetesOrchestratorFlavor,
)

return [SkypilotKubernetesOrchestratorFlavor]


SkypilotKubernetesIntegration.check_installation()
26 changes: 26 additions & 0 deletions src/zenml/integrations/skypilot_kubernetes/flavors/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
# Copyright (c) ZenML GmbH 2024. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at:
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied. See the License for the specific language governing
# permissions and limitations under the License.
"""Skypilot integration flavor for Skypilot Kubernetes orchestrator."""

from zenml.integrations.skypilot_kubernetes.flavors.skypilot_orchestrator_kubernetes_vm_flavor import (
SkypilotKubernetesOrchestratorConfig,
SkypilotKubernetesOrchestratorFlavor,
SkypilotKubernetesOrchestratorSettings,
)

__all__ = [
"SkypilotKubernetesOrchestratorConfig",
"SkypilotKubernetesOrchestratorFlavor",
"SkypilotKubernetesOrchestratorSettings",
]
Loading

0 comments on commit b711963

Please sign in to comment.