This repository contains a Terraform template for running Ray on Google Kubernetes Engine. We've also included some example notebooks, including one that serves a GPT-J-6B model with Ray AIR (see here for the original notebook).
The solution is split into platform and user resources.
Platform resources (deployed once):
- GKE Cluster
- Nvidia GPU drivers
- Kuberay operator and CRDs
User resources (deployed once per user):
- User namespace
- Kubernetes service accounts
- Kuberay cluster
- Prometheus monitoring
- Logging container
- Jupyter notebook
Note: Terraform keeps state metadata in a local file called `terraform.tfstate`. If you need to reinstall any resources, make sure to delete this file as well.
- `cd platform`
- Edit `variables.tf` with your GCP settings.
- Run `terraform init`.
- Run `terraform apply`.
- `cd user`
- Edit `variables.tf` with your GCP settings.
- Run `terraform init`.
- Get the GKE cluster name and location/region from `platform/variables.tf`, then run `gcloud container clusters get-credentials %gke_cluster_name% --location=%location%` (see the Configuring `gcloud` instructions).
- Run `terraform apply`.
- Run `kubectl get services -n <namespace>`.
- Copy the external IP for the notebook.
- Open the external IP in a browser and log in. The default user names and passwords can be found in the Jupyter settings file.
- The Ray cluster is available at `ray://example-cluster-kuberay-head-svc:10001` (see the connection sketch after this list). To access the cluster, you can open one of the sample notebooks under `example_notebooks` (via `File` -> `Open from URL` in the Jupyter notebook window, using the raw file URL from GitHub) and run through the example. Example URL: https://raw.githubusercontent.com/richardsliu/ray-on-gke/main/example_notebooks/gpt-j-online.ipynb
- To use the Ray dashboard, run the following command to port-forward:
  `kubectl port-forward -n ray service/example-cluster-kuberay-head-svc 8265:8265`
  and then open the dashboard using the following URL: http://localhost:8265
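If you prefer to connect to the cluster programmatically from a notebook cell, a minimal Ray Client sketch looks like the following (this assumes Ray is installed in the notebook environment and that the notebook runs in the same namespace as the Kuberay cluster):

```python
import ray

# Connect to the remote Ray cluster through the in-cluster head service.
ray.init("ray://example-cluster-kuberay-head-svc:10001")

# A trivial remote task to confirm the connection works.
@ray.remote
def hello() -> str:
    return "Hello from the Ray cluster!"

print(ray.get(hello.remote()))
```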
- To connect to the remote GKE cluster with the Ray API, set up the Ray dashboard. Run the following command to port-forward:
  `kubectl port-forward -n ray service/example-cluster-kuberay-head-svc 8265:8265`
  and then open the dashboard using the following URL: http://localhost:8265
- Set the `RAY_ADDRESS` environment variable: `export RAY_ADDRESS="http://127.0.0.1:8265"`
- Create a working directory with a job file `ray_job.py` (a minimal sketch follows this list).
- Submit the job: `ray job submit --working-dir %your_working_directory% -- python ray_job.py`
- Note the job submission ID from the output, e.g.: `Job 'raysubmit_inB2ViQuE29aZRJ5' succeeded`
- See the Ray docs for more info.
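A minimal `ray_job.py` for testing the submission flow might look like this (a sketch only; any Python script that calls `ray.init()` will work):

```python
import ray

# When launched via `ray job submit`, ray.init() attaches to the
# cluster that is executing the job.
ray.init()

@ray.remote
def square(x: int) -> int:
    return x * x

if __name__ == "__main__":
    results = ray.get([square.remote(i) for i in range(5)])
    print("Squares computed on the Ray cluster:", results)
```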
For demo purposes, this repo creates a public IP for the Jupyter notebook with basic dummy authentication. To secure your cluster, it is strongly recommended to replace this with your own secure endpoints.
For more information, please take a look at the following links:
- https://cloud.google.com/iap/docs/enabling-kubernetes-howto
- https://cloud.google.com/endpoints/docs/openapi/get-started-kubernetes-engine
- https://jupyterhub.readthedocs.io/en/stable/tutorial/getting-started/authenticators-users-basics.html
This example is adapted from Ray AIR's examples here.
- Open the `gpt-j-online.ipynb` notebook.
- Open a terminal in the Jupyter session and install Ray AIR: `pip install ray[air]`
- Run through the notebook cells. You can change the prompt in the last cell (see the example after this list):
prompt = (
## Input your own prompt here
)
- This should output a generated text response.
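For example, the last cell could be filled in like this (the prompt text below is only an illustration; use whatever prompt you like):

```python
prompt = (
    "Below is a short story about a robot that learns to paint. "
    "Continue the story in two or three sentences."
)
```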
This repository comes with out-of-the-box integrations with Google Cloud Logging and Managed Prometheus for monitoring. To see your Ray cluster logs:
- Open Cloud Console and open Logging
- If using a Jupyter notebook for job submission, use the following query parameters:
resource.type="k8s_container"
resource.labels.cluster_name=%CLUSTER_NAME%
resource.labels.pod_name=%RAY_HEAD_POD_NAME%
resource.labels.container_name="fluentbit"
- If using the Ray Jobs API:
  (a) Note the job ID returned by the `ray job submit` API, e.g.:
      Job submission: `ray job submit --working-dir /Users/imreddy/ray_working_directory -- python script.py`
      Job submission ID: `Job 'raysubmit_kFWB6VkfyqK1CbEV' submitted successfully`
  (b) Get the namespace name from `user/variables.tf` or `kubectl get namespaces`.
  (c) Use the following query to search for the job logs:
      resource.labels.namespace_name=%NAMESPACE_NAME%
      jsonpayload.job_id=%RAY_JOB_ID%
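The same job logs can also be fetched programmatically. The sketch below uses the `google-cloud-logging` Python client; the project ID, namespace, and job ID are placeholders you must substitute, and note that the LogEntry field is spelled `jsonPayload` in the API:

```python
import itertools
from google.cloud import logging  # pip install google-cloud-logging

client = logging.Client(project="YOUR_PROJECT_ID")

# Mirrors the Logs Explorer query above.
log_filter = (
    'resource.labels.namespace_name="YOUR_NAMESPACE" '
    'jsonPayload.job_id="raysubmit_XXXXXXXXXXXX"'
)

# Print the 20 most recent matching entries.
entries = client.list_entries(filter_=log_filter, order_by=logging.DESCENDING)
for entry in itertools.islice(entries, 20):
    print(entry.payload)
```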
To see monitoring metrics:
- Open Cloud Console and open Metrics Explorer
- In "Target", select "Prometheus Target" and then "Ray".
- Select the metric you want to view, and then click "Apply".