Nebius package for Ray cluster

Description

Ray is an open-source framework for distributed computing, built to deploy and orchestrate large-scale AI workloads. Ray Cluster provides the infrastructure for training complex machine learning models and running reinforcement learning algorithms at scale. Because it runs on Kubernetes, Ray Cluster simplifies deployment and lets you allocate resources and manage workloads across the cluster. Distributed, parallel execution improves resource utilization and shortens model training time, which speeds up iteration and experimentation in AI research and development.

{% note warning %}

Before installing Ray Cluster, you must install NVIDIA® GPU Operator on the cluster.

{% endnote %}

Short description

Ray simplifies scalable AI workload deployment with Kubernetes orchestration.

Tutorial

Before installing this product:

  1. Create a node group with GPUs in it. The product supports the following VM platforms with GPUs:

    • NVIDIA® H100 NVLink with Intel Sapphire Rapids

    {% note info %}

    It is strongly recommended that each node has at least 4 vCPUs and 8 GB of RAM.

    {% endnote %}
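As a rough way to confirm the prerequisites above, the following sketch counts the GPUs that each node advertises to Kubernetes. It assumes that `kubectl` is configured for the target cluster and that the GPUs are exposed under the `nvidia.com/gpu` resource name, which is what the NVIDIA device plugin normally registers once the GPU Operator is running. The function names are illustrative and not part of this product.

```python
import json
import subprocess

# Assumption: resource name registered by the NVIDIA device plugin.
GPU_RESOURCE = "nvidia.com/gpu"


def gpus_per_node(node_list: dict) -> dict:
    """Map each node name to the number of GPUs listed in its capacity."""
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        capacity = node.get("status", {}).get("capacity", {})
        # Nodes without the GPU resource are reported with a count of 0.
        counts[name] = int(capacity.get(GPU_RESOURCE, "0"))
    return counts


def load_nodes() -> dict:
    """Fetch the node list from the current kubectl context as parsed JSON."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    try:
        counts = gpus_per_node(load_nodes())
    except (OSError, subprocess.CalledProcessError) as error:
        print(f"Could not query the cluster: {error}")
    else:
        for name, count in sorted(counts.items()):
            print(f"{name}: {count} GPU(s)")
```

If a node from the GPU node group shows a count of 0, the GPU Operator is most likely not installed or not yet ready on that node.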

To install the product:

  1. Configure the application:

    {% note info %}

    It is strongly recommended to keep the default values for the head pod and for worker pods without GPUs, so that each of these pods takes up an entire node. For more details, see the Ray documentation.

    {% endnote %}

  2. Click Install.

  3. Wait for the application to change its status to Deployed.

Usage

To check that Ray is working:

  1. Set up port forwarding:

    kubectl -n <namespace> port-forward \
      services/<application_name>-kuberay-head-svc 8265:8265

  2. Go to http://localhost:8265/ in your web browser.
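The browser check above can also be scripted. The following is a minimal sketch that sends an HTTP GET to the forwarded address and reports whether it answers with status 200. It assumes the port forwarding from step 1 is still running. The function name `dashboard_is_up` is hypothetical, and the sketch uses only the Python standard library, not the Ray API.

```python
import http.client
import urllib.request


def dashboard_is_up(url: str = "http://localhost:8265/", timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET to the given URL answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, HTTP error statuses,
        # and malformed replies.
        return False


if __name__ == "__main__":
    print("Ray dashboard reachable:", dashboard_is_up())
```

A `False` result usually means that the port forwarding is not active or that the head pod is not ready yet.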

Use cases

  • Reinforcement learning research and development.
  • Distributed model training for deep learning applications.
  • High-performance computing for scientific simulations and data analysis.
  • Large-scale data processing and analytics.
  • Experimentation with parallel algorithms and distributed systems.
  • Development and deployment of AI-powered applications in production environments.

Links

Terms of service

Legal

By using the application, you agree to the terms and conditions of the following: the helm-chart and KubeRay.