Merge branch 'master' into shared-long-poll-client
JoshKarpel authored Nov 27, 2024
2 parents cfc7d19 + 3bd3a02 commit c673bec
Showing 194 changed files with 5,078 additions and 2,810 deletions.
46 changes: 39 additions & 7 deletions BUILD.bazel
@@ -1627,7 +1627,7 @@ ray_cc_test(
deps = [
":gcs_server_lib",
":gcs_test_util_lib",
"@com_google_googletest//:gtest_main",
"@com_google_googletest//:gtest",
],
)

@@ -1649,7 +1649,7 @@ ray_cc_test(
deps = [
":gcs_server_lib",
":gcs_test_util_lib",
"@com_google_googletest//:gtest_main",
"@com_google_googletest//:gtest",
],
)

@@ -1883,7 +1883,7 @@ ray_cc_test(
":gcs_table_storage_test_lib",
":gcs_test_util_lib",
":store_client_test_lib",
"@com_google_googletest//:gtest_main",
"@com_google_googletest//:gtest",
],
)

@@ -2403,11 +2403,43 @@ ray_cc_test(
)

ray_cc_test(
name = "gcs_export_event_test",
name = "gcs_job_manager_export_event_test",
size = "small",
srcs = glob([
"src/ray/gcs/gcs_server/test/export_api/*.cc",
]),
srcs = ["src/ray/gcs/gcs_server/test/export_api/gcs_job_manager_export_event_test.cc"],
tags = [
"no_windows",
"team:core"
],
deps = [
":gcs_server_lib",
":gcs_server_test_util",
":gcs_test_util_lib",
":ray_mock",
"@com_google_googletest//:gtest_main",
],
)

ray_cc_test(
name = "gcs_actor_manager_export_event_test",
size = "small",
srcs = ["src/ray/gcs/gcs_server/test/export_api/gcs_actor_manager_export_event_test.cc"],
tags = [
"no_windows",
"team:core"
],
deps = [
":gcs_server_lib",
":gcs_server_test_util",
":gcs_test_util_lib",
":ray_mock",
"@com_google_googletest//:gtest_main",
],
)

ray_cc_test(
name = "gcs_node_manager_export_event_test",
size = "small",
srcs = ["src/ray/gcs/gcs_server/test/export_api/gcs_node_manager_export_event_test.cc"],
tags = [
"no_windows",
"team:core"
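
The split-out export-event test targets should be runnable individually with Bazel; a minimal sketch, assuming the labels live in the root BUILD.bazel shown above:

```shell
# Hypothetical invocation from the repository root; target names taken from the diff above.
bazel test //:gcs_job_manager_export_event_test //:gcs_actor_manager_export_event_test
```
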
2 changes: 1 addition & 1 deletion doc/source/cluster/configure-manage-dashboard.md
@@ -5,7 +5,7 @@
Dashboard configurations may differ depending on how you launch Ray Clusters (e.g., local Ray Cluster v.s. KubeRay). Integrations with Prometheus and Grafana are optional for enhanced Dashboard experience.

:::{note}
Ray Dashboard is only intended for interactive development and debugging because the Dashboard UI and the underlying data are not accessible after Clusters are terminated. For production monitoring and debugging, users should rely on [persisted logs](../cluster/kubernetes/user-guides/logging.md), [persisted metrics](./metrics.md), [persisted Ray states](../ray-observability/user-guides/cli-sdk.rst), and other observability tools.
Ray Dashboard is useful for interactive development and debugging because when clusters terminate, the dashboard UI and the underlying data are no longer accessible. For production monitoring and debugging, you should rely on [persisted logs](../cluster/kubernetes/user-guides/persist-kuberay-custom-resource-logs.md), [persisted metrics](./metrics.md), [persisted Ray states](../ray-observability/user-guides/cli-sdk.rst), and other observability tools.
:::

## Changing the Ray Dashboard port
46 changes: 46 additions & 0 deletions doc/source/cluster/kubernetes/configs/loki.log.yaml
@@ -0,0 +1,46 @@
# Fluent Bit Config
config:
inputs: |
[INPUT]
Name tail
Path /var/log/containers/*.log
multiline.parser docker, cri
Tag kube.*
Mem_Buf_Limit 5MB
Skip_Long_Lines On
filters: |
[FILTER]
Name kubernetes
Match kube.*
Merge_Log On
Keep_Log Off
K8S-Logging.Parser On
K8S-Logging.Exclude On
outputs: |
[OUTPUT]
Name loki
Match *
Host loki-gateway
Port 80
Labels job=fluent-bit,namespace=$kubernetes['namespace_name'],pod=$kubernetes['pod_name'],container=$kubernetes['container_name']
Auto_Kubernetes_Labels Off
tenant_id test
---
# Grafana Datasource Config
datasources:
datasources.yaml:
apiVersion: 1
datasources:
- name: Loki
type: loki
access: proxy
editable: true
url: http://loki-gateway.default
jsonData:
timeout: 60
maxLines: 1000
httpHeaderName1: "X-Scope-OrgID"
secureJsonData:
httpHeaderValue1: "test"
2 changes: 1 addition & 1 deletion doc/source/cluster/kubernetes/configs/ray-cluster.gpu.yaml
@@ -12,7 +12,7 @@ spec:
######################headGroupSpec#################################
# head group template and specs, (perhaps 'group' is not needed in the name)
headGroupSpec:
# logical group name, for this called head-group, also can be functional
# logical group name, for this called headgroup, also can be functional
# pod type head or worker
# rayNodeType: head # Not needed since it is under the headgroup
# the following params are used to complete the ray start: ray start --head --block ...
@@ -35,11 +35,12 @@ kubectl get pods
# kuberay-operator-7fbdbf8c89-pt8bk 1/1 Running 0 27s
```

KubeRay offers multiple options for operator installations, such as Helm, Kustomize, and a single-namespaced operator. For further information, please refer to [the installation instructions in the KubeRay documentation](https://ray-project.github.io/kuberay/deploy/installation/).
KubeRay offers multiple options for operator installations, such as Helm, Kustomize, and a single-namespaced operator. For further information, see [the installation instructions in the KubeRay documentation](https://ray-project.github.io/kuberay/deploy/installation/).

(raycluster-deploy)=
## Step 3: Deploy a RayCluster custom resource

Once the KubeRay operator is running, we are ready to deploy a RayCluster. To do so, we create a RayCluster Custom Resource (CR) in the `default` namespace.
Once the KubeRay operator is running, you're ready to deploy a RayCluster. Create a RayCluster Custom Resource (CR) in the `default` namespace.
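
Once the CR is applied (the tabs below show the available methods), a hedged way to confirm that the resource exists:

```shell
# Assumes the KubeRay CRDs are installed; the RayCluster name depends on the method chosen below.
kubectl get rayclusters --namespace default
```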

::::{tab-set}

6 changes: 4 additions & 2 deletions doc/source/cluster/kubernetes/user-guides.md
@@ -15,7 +15,8 @@ user-guides/config
user-guides/configuring-autoscaling
user-guides/kuberay-gcs-ft
user-guides/gke-gcs-bucket
user-guides/logging
user-guides/persist-kuberay-custom-resource-logs
user-guides/persist-kuberay-operator-logs
user-guides/gpu
user-guides/tpu
user-guides/rayserve-dev-doc
@@ -45,7 +46,8 @@ at the {ref}`introductory guide <kuberay-quickstart>` first.
* {ref}`kuberay-gpu`
* {ref}`kuberay-tpu`
* {ref}`kuberay-gcs-ft`
* {ref}`kuberay-logging`
* {ref}`persist-kuberay-custom-resource-logs`
* {ref}`persist-kuberay-operator-logs`
* {ref}`kuberay-dev-serve`
* {ref}`kuberay-pod-command`
* {ref}`kuberay-pod-security`
2 changes: 1 addition & 1 deletion doc/source/cluster/kubernetes/user-guides/config.md
@@ -126,7 +126,7 @@ Here are some of the subfields of the pod `template` to pay attention to:
#### containers
A Ray pod template specifies at minimum one container, namely the container
that runs the Ray processes. A Ray pod template may also specify additional sidecar
containers, for purposes such as {ref}`log processing <kuberay-logging>`. However, the KubeRay operator assumes that
containers, for purposes such as {ref}`log processing <persist-kuberay-custom-resource-logs>`. However, the KubeRay operator assumes that
the first container in the containers list is the main Ray container.
Therefore, make sure to specify any sidecar containers
**after** the main Ray container. In other words, the Ray container should be the **first**
@@ -1,6 +1,6 @@
(kuberay-logging)=
(persist-kuberay-custom-resource-logs)=

# Log Persistence
# Persist KubeRay custom resource logs

Logs (both system and application logs) are useful for troubleshooting Ray applications and Clusters. For example, you may want to access system logs if a node terminates unexpectedly.

@@ -0,0 +1,116 @@
(persist-kuberay-operator-logs)=

# Persist KubeRay Operator Logs

The KubeRay Operator plays a vital role in managing Ray clusters on Kubernetes. Persisting its logs is essential for effective troubleshooting and monitoring. This guide describes methods to set up centralized logging for KubeRay Operator logs.

## Grafana Loki

[Grafana Loki][GrafanaLoki] is a log aggregation system optimized for Kubernetes, providing efficient log storage and querying. The following steps set up [Fluent Bit][FluentBit] as a DaemonSet to collect logs from Kubernetes containers and send them to Loki for centralized storage and analysis.

### Deploy Loki monolithic mode

Loki’s Helm chart supports three deployment methods to fit different scalability and performance needs: Monolithic, Simple Scalable, and Microservices. This guide demonstrates the monolithic method. For details on each deployment mode, see the [Loki deployment](https://grafana.com/docs/loki/latest/get-started/deployment-modes/) modes documentation.

Deploy the Loki deployment with the [Helm chart repository](https://github.com/grafana/loki/tree/main/production/helm/loki).

```shell
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Loki with single replica mode
helm install loki grafana/loki --version 6.21.0 -f https://raw.githubusercontent.com/grafana/loki/refs/heads/main/production/helm/loki/single-binary-values.yaml
```
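
A hedged sanity check that the single-binary Loki pod is up; the label selector is an assumption based on the chart's defaults:

```shell
# Adjust the namespace if you installed Loki somewhere other than `default`.
kubectl get pods --namespace default -l "app.kubernetes.io/name=loki"
```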

### Configure log processing

Create a `fluent-bit-config.yaml` file, which configures Fluent Bit to:

* Tail log files from Kubernetes containers.
* Parse multi-line logs for Docker and Container Runtime Interface (CRI) formats.
* Enrich logs with Kubernetes metadata such as namespace, pod, and container names.
* Send the logs to Loki for centralized storage and querying.
```{literalinclude} ../configs/loki.log.yaml
:language: yaml
:start-after: Fluent Bit Config
:end-before: ---
```

A few notes on the above config:

* Inputs: The `tail` input reads log files from `/var/log/containers/*.log`, with `multiline.parser` to handle complex log messages across multiple lines.
* Filters: The `kubernetes` filter adds metadata like namespace, pod, and container names to each log, enabling more efficient log management and querying in Loki.
* Outputs: The `loki` output block specifies Loki as the target. The `Host` and `Port` define the Loki service endpoint, and `Labels` adds metadata for easier querying in Grafana. Additionally, `tenant_id` allows for multi-tenancy if required by the Loki setup.

Deploy the Fluent Bit deployment with the [Helm chart repository](https://github.com/fluent/helm-charts/tree/main/charts/fluent-bit).

```shell
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update

helm install fluent-bit fluent/fluent-bit --version 0.48.2 -f fluent-bit-config.yaml
```
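
Similarly, you can check that the Fluent Bit DaemonSet rolled out; the release name `fluent-bit` is assumed from the install command above:

```shell
# A DaemonSet should report one ready pod per node.
kubectl get daemonset fluent-bit --namespace default
```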

### Install the KubeRay Operator

Follow [Deploy a KubeRay operator](kuberay-operator-deploy) to install the KubeRay operator.


### Deploy a RayCluster

Follow [Deploy a RayCluster custom resource](raycluster-deploy) to deploy a RayCluster.


### Deploy Grafana

Create a `datasource-config.yaml` file with the following configuration to set up Grafana's Loki datasource:
```{literalinclude} ../configs/loki.log.yaml
:language: yaml
:start-after: Grafana Datasource Config
```

Deploy the Grafana deployment with the [Helm chart repository](https://github.com/grafana/helm-charts/tree/main/charts/grafana).

```shell
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm install grafana grafana/grafana --version 8.6.2 -f datasource-config.yaml
```

### Check the Grafana Dashboard

```shell
# Verify that the Grafana pod is running in the `default` namespace.
kubectl get pods --namespace default -l "app.kubernetes.io/name=grafana"
# NAME READY STATUS RESTARTS AGE
# grafana-54d5d747fd-5fldc 1/1 Running 0 8m21s
```

To access Grafana from your local machine, set up port forwarding by running:
```shell
export POD_NAME=$(kubectl get pods --namespace default -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=grafana" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace default port-forward $POD_NAME 3000
```

This command makes Grafana available locally at `http://localhost:3000`.

* Username: "admin"
* Password: Get the password using the following command:

```shell
kubectl get secret --namespace default grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
```

Finally, use a LogQL query to view logs for a specific pod, such as the KubeRay Operator, and filter logs by the `RayCluster_name`:

```
{pod="kuberay-operator-xxxxxxxx-xxxxx"} | json | RayCluster_name = `raycluster-kuberay`
```

![Loki Logs](images/loki-logs.png)

You can use LogQL's JSON syntax to filter logs based on specific fields, such as `RayCluster_name`. See [Log query language doc](https://grafana.com/docs/loki/latest/query/) for more information about LogQL filtering.

[GrafanaLoki]: https://grafana.com/oss/loki/
[FluentBit]: https://docs.fluentbit.io/manual
12 changes: 11 additions & 1 deletion doc/source/cluster/metrics.md
@@ -67,7 +67,17 @@ ray_dashboard_api_requests_count_requests_total

You can then see the number of requests to the Ray Dashboard API over time.

To stop Prometheus, run `kill <PID>` where `<PID>` is the PID of the Prometheus process that was printed out when you ran the command. To find the PID, you can also run `ps aux | grep prometheus`.
To stop Prometheus, run the following commands:

```sh
# case 1: Ray > 2.40
ray metrics shutdown-prometheus

# case 2: Otherwise
# Run `ps aux | grep prometheus` to find the PID of the Prometheus process. Then, kill the process.
kill <PID>
```
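
For the second case, a hedged one-liner that avoids copying the PID by hand (assumes a single local Prometheus process):

```sh
kill "$(pgrep -f prometheus | head -n 1)"
```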


### [Optional] Manual: Running Prometheus locally

1 change: 1 addition & 0 deletions doc/source/custom_directives.py
@@ -481,6 +481,7 @@ def key(cls: type) -> str:
class Framework(ExampleEnum):
"""Framework type for example metadata."""

AWSNEURON = "AWS Neuron"
PYTORCH = "PyTorch"
LIGHTNING = "Lightning"
TRANSFORMERS = "Transformers"
5 changes: 4 additions & 1 deletion doc/source/ray-more-libs/dask-on-ray.rst
@@ -31,7 +31,10 @@ workload. Using the Dask-on-Ray scheduler, the entire Dask ecosystem can be exec

* - Ray Version
- Dask Version
* - ``2.8.0`` or above
* - ``2.34.0`` or above
- | ``2022.10.1 (Python version < 3.12)``
| ``2024.6.0 (Python version >= 3.12)``
* - ``2.8.0`` to ``2.33.x``
- ``2022.10.1``
* - ``2.5.0`` to ``2.7.x``
- | ``2022.2.0 (Python version < 3.8)``
12 changes: 6 additions & 6 deletions doc/source/ray-observability/user-guides/configure-logging.md
@@ -28,7 +28,7 @@ A new Ray session creates a new folder to the temp directory. The latest session

Usually, temp directories are cleared up whenever the machines reboot. As a result, log files may get lost whenever your cluster or some of the nodes are stopped or terminated.

If you need to inspect logs after the clusters are stopped or terminated, you need to store and persist the logs. View the instructions for how to process and export logs for {ref}`clusters on VMs <vm-logging>` and {ref}`KubeRay Clusters <kuberay-logging>`.
If you need to inspect logs after the clusters stop or terminate, you need to store and persist the logs. See the instructions for how to process and export logs for {ref}`Log persistence <vm-logging>` and {ref}`KubeRay Clusters <persist-kuberay-custom-resource-logs>`.

(logging-directory-structure)=
## Log files in logging directory
@@ -131,12 +131,12 @@ ray.get([task.remote() for _ in range(100)])
The output is as follows:

```bash
2023-03-27 15:08:34,195 INFO worker.py:1603 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
2023-03-27 15:08:34,195 INFO worker.py:1603 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
(task pid=534172) Hello there, I am a task 0.20583517821231412
(task pid=534174) Hello there, I am a task 0.17536720316370757 [repeated 99x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication)
```

This feature is useful when importing libraries such as `tensorflow` or `numpy`, which may emit many verbose warning messages when you import them.
This feature is useful when importing libraries such as `tensorflow` or `numpy`, which may emit many verbose warning messages when you import them.
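
For example, a minimal sketch of turning deduplication off for a single run (the script name here is hypothetical):

```bash
# Disables log deduplication for this run only; my_ray_script.py is a placeholder.
RAY_DEDUP_LOGS=0 python my_ray_script.py
```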

Configure the following environment variables on the driver process **before importing Ray** to customize log deduplication:

@@ -247,8 +247,8 @@ ray_tune_logger.addHandler(logging.FileHandler("extra_ray_tune_log.log"))
Implement structured logging to enable downstream users and applications to consume the logs efficiently.

### Application logs
A Ray applications include both driver and worker processes. For Python applications, use Python loggers to format and structure your logs.
As a result, Python loggers need to be set up for both driver and worker processes.
A Ray app includes both driver and worker processes. For Python apps, use Python loggers to format and structure your logs.
As a result, you need to set up Python loggers for both driver and worker processes.

::::{tab-set}

@@ -472,4 +472,4 @@ The max size of a log file, including its backup, is `RAY_ROTATION_MAX_BYTES * R

## Log persistence

To process and export logs to external stroage or management systems, view {ref}`log persistence on Kubernetes <kuberay-logging>` and {ref}`log persistence on VMs <vm-logging>` for more details.
To process and export logs to external storage or management systems, see {ref}`log persistence on Kubernetes <persist-kuberay-custom-resource-logs>` and {ref}`log persistence on VMs <vm-logging>` for more details.
2 changes: 1 addition & 1 deletion doc/source/serve/production-guide/kubernetes.md
@@ -238,7 +238,7 @@ Monitor your Serve application using the Ray Dashboard.
- Learn more about how to configure and manage Dashboard [here](observability-configure-manage-dashboard).
- Learn about the Ray Serve Dashboard [here](serve-monitoring).
- Learn how to set up [Prometheus](prometheus-setup) and [Grafana](grafana) for Dashboard.
- Learn about the [Ray Serve logs](serve-logging) and how to [persistent logs](kuberay-logging) on Kubernetes.
- Learn about the [Ray Serve logs](serve-logging) and how to [persist logs](persist-kuberay-custom-resource-logs) on Kubernetes.

:::{note}
- To troubleshoot application deployment failures in Serve, you can check the KubeRay operator logs by running `kubectl logs -f <kuberay-operator-pod-name>` (e.g., `kubectl logs -f kuberay-operator-7447d85d58-lv7pf`). The KubeRay operator logs contain information about the Serve application deployment event and Serve application health checks.