Search before asking
I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
I need to expose application-level metrics on a Ray Serve application, to be consumed by Prometheus.
I tried to use the Gauge from the ray.serve.metrics module. Please find the reference code below:
from ray import serve
from ray.serve.metrics import Gauge
from starlette.responses import JSONResponse
import psutil


@serve.deployment
class MyDeployment:
    def __init__(self):
        self.num_requests = 0
        self.my_gauge = Gauge(
            "memory_usage_bytes",
            description="Memory usage of the current process in bytes.",
            tag_keys=("model",),
        )
        self.my_gauge.set_default_tags({"model": "123"})

    async def __call__(self, request):
        # Update the request count
        self.num_requests += 1
        # Get current memory usage
        process = psutil.Process()
        memory_usage = process.memory_info().rss
        # Update the gauge metric
        self.my_gauge.set(memory_usage)
        # Return a response
        return JSONResponse({
            "message": "Metrics updated!",
            "memory_usage_bytes": memory_usage,
            "total_requests": self.num_requests,
        })


app = MyDeployment.bind()
This is the sample code provided by the Ray documentation.
When this code is run locally using serve run as follows:
serve run deploy:app
2024-11-18 22:47:15,041 INFO scripts.py:499 -- Running import path: 'deploy:app'.
2024-11-18 22:47:15,054 INFO worker.py:1568 -- Connecting to existing Ray cluster at address: 192.168.225.219:6379...
2024-11-18 22:47:15,060 INFO worker.py:1744 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
2024-11-18 22:47:16,419 INFO handle.py:126 -- Created DeploymentHandle '9y048115' for Deployment(name='MyDeployment', app='default').
2024-11-18 22:47:16,419 INFO handle.py:126 -- Created DeploymentHandle 'pu69dec0' for Deployment(name='MyDeployment', app='default').
(ServeController pid=77471) INFO 2024-11-18 22:47:16,454 controller 77471 deployment_state.py:1598 - Deploying new version of Deployment(name='MyDeployment', app='default') (initial target replicas: 1).
(ProxyActor pid=77474) INFO 2024-11-18 22:47:16,397 proxy 192.168.225.219 proxy.py:1165 - Proxy starting on node 9c1fd0028a2d1265ec47f7e6105d318b0176767ca6800b6754419452 (HTTP port: 8000).
(ServeController pid=77471) INFO 2024-11-18 22:47:16,556 controller 77471 deployment_state.py:1844 - Adding 1 replica to Deployment(name='MyDeployment', app='default').
2024-11-18 22:47:17,428 INFO handle.py:126 -- Created DeploymentHandle 'b8lhc4lw' for Deployment(name='MyDeployment', app='default').
2024-11-18 22:47:17,429 INFO api.py:584 -- Deployed app 'default' successfully.
(ServeReplica:default:MyDeployment pid=77479) INFO 2024-11-18 22:47:20,498 default_MyDeployment 9goiuke5 8e990763-875e-42d8-a014-1b8047f9a9c1 /GenericModelApp1/GM1v1 replica.py:373 - __CALL__ OK 1.7ms
^C2024-11-18 22:48:23,304 WARNING api.py:592 -- Got KeyboardInterrupt, exiting...
2024-11-18 22:48:23,305 INFO scripts.py:585 -- Got KeyboardInterrupt, shutting down...
(ServeController pid=77471) INFO 2024-11-18 22:48:23,351 controller 77471 deployment_state.py:1860 - Removing 1 replica from Deployment(name='MyDeployment', app='default').
(ServeController pid=77471) INFO 2024-11-18 22:48:25,388 controller 77471 deployment_state.py:2182 - Replica(id='9goiuke5', deployment='MyDeployment', app='default') is stopped.
the custom metric ray_memory_usage_bytes is available at http://127.0.0.1:8080/; please refer to serverun.txt.
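For completeness, this is roughly how I verified the metric locally (a minimal sketch; 8080 is Ray's default metrics-export-port, and Ray adds the ray_ prefix to custom metric names):

# Query the local metrics endpoint and filter for the custom gauge
curl -s http://127.0.0.1:8080/metrics | grep ray_memory_usage_bytes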
But when the same source file is containerised and deployed using the following RayService.yaml:
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rayservice-customer1
  namespace: customer1
spec:
  serviceUnhealthySecondThreshold: 300 # Config for the health check threshold for Ray Serve applications. Default value is 900.
  deploymentUnhealthySecondThreshold: 300 # Config for the health check threshold for Ray dashboard agent. Default value is 300.
  serveConfigV2: |
    applications:
      - name: DeployApp
        import_path: deploy:app
        route_prefix: /app
        runtime_env: {}
        deployments:
          - name: MyDeployment
            max_concurrent_queries: 100
            autoscaling_config:
              metrics_interval_s: 0.1
              min_replicas: 1
              max_replicas: 5
              upscale_delay_s: 1
              downscale_delay_s: 2
              look_back_period_s: 2
              target_num_ongoing_requests_per_replica: 5
            ray_actor_options:
              num_cpus: 0.1
  rayClusterConfig:
    rayVersion: '2.32.0' # should match the Ray version in the image of the containers
    ## raycluster autoscaling config
    enableInTreeAutoscaling: true
    autoscalerOptions:
      upscalingMode: Default
      resources:
        limits:
          cpu: 1
          memory: "1000Mi"
        requests:
          cpu: 1
          memory: "1000Mi"
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
        num-cpus: "0"
        # Include the dashboard
        include-dashboard: "true"
        # Set the metrics export port
        metrics-export-port: "9080"
      # Pod template
      template:
        metadata:
          annotations:
            prometheus.io/scrape: "true"
            prometheus.io/port: "9080"
            prometheus.io/path: "/metrics"
        spec:
          imagePullSecrets:
            - name: test-docker
          containers:
            - name: ray-head
              image: ray-base-image:0.0.2-SNAPSHOT-a18d3e13
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: 1
                  memory: 2Gi
                requests:
                  cpu: 1
                  memory: 2Gi
              env:
                - name: RAY_memory_usage_threshold
                  value: "0.90" # Adjust threshold as needed
                - name: RAY_memory_monitor_refresh_ms
                  value: "0" # Disable memory monitoring
                - name: RAY_GRAFANA_IFRAME_HOST
                  value: http://127.0.0.1:3000
                - name: RAY_GRAFANA_HOST
                  value: http://prometheus-grafana.prometheus-system.svc:80
                - name: RAY_PROMETHEUS_HOST
                  value: http://prometheus-kube-prometheus-prometheus.prometheus-system.svc:9090
                - name: RAY_LOG_LEVEL
                  value: debug
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
                - containerPort: 9080
                  name: metrics
    workerGroupSpecs:
      # The pod replicas in this group are typed "worker".
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # Logical group name; can also be functional.
        groupName: worker
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams:
          metrics-export-port: "9080"
        # Pod template
        template:
          metadata:
            annotations:
              prometheus.io/scrape: "true"
              prometheus.io/port: "9080"
              prometheus.io/path: "/metrics"
          spec:
            volumes:
              - name: data
                emptyDir: {}
            imagePullSecrets:
              - name: test-docker
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: ray-base-image:0.0.2-SNAPSHOT-a18d3e13
                imagePullPolicy: IfNotPresent
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  limits:
                    cpu: 2
                    memory: 3Gi
                  requests:
                    cpu: 2
                    memory: 3Gi
                env:
                  - name: RAY_LOG_LEVEL
                    value: debug
                volumeMounts:
                  - name: data
                    mountPath: /data
                ports:
                  - containerPort: 9080
                    name: metrics
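The manifest was applied in the usual way (a sketch; the filename is assumed):

# Create the RayService custom resource in the customer1 namespace
kubectl apply -f RayService.yaml -n customer1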
The description of the created custom resource is attached as rayservicedesc.txt. After port-forwarding port 9080 of the rayservice-customer1-head-svc service in the customer1 namespace, the custom metric is not available, although the Ray and system metrics are; please find the output attached as rayservice.txt.
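A rough sketch of that check (service name and namespace as above):

# Forward the head service's metrics port to localhost
kubectl port-forward svc/rayservice-customer1-head-svc -n customer1 9080:9080
# In a second shell: Ray and system metrics appear, but the custom gauge does not
curl -s http://127.0.0.1:9080/metrics | grep ray_memory_usage_bytes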
I am not sure what is missing here.
Initially I tried with the default port 8080, and later changed to port 9080 to check whether metrics-export-port is functional.
Please provide your inputs to debug further.
Anything else
I tried multiple times.
Are you willing to submit a PR?
Yes I am willing to submit a PR!
Please find further analysis of this issue below.
I am able to find my custom metric on the worker pod at localhost:9080/metrics (verified by running curl against http://127.0.0.1:9080/metrics).
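A sketch of that check (the worker pod name is a placeholder, and curl is assumed to be available inside the container):

# List the Ray worker pods, then curl the metrics agent inside one of them
kubectl get pods -n customer1 -l ray.io/node-type=worker
kubectl exec -n customer1 <worker-pod-name> -- curl -s http://127.0.0.1:9080/metrics | grep ray_memory_usage_bytes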
I tried to explore the services created by the RayService:

kubectl get svc -n customer1 -owide

NAME                                             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                                         AGE     SELECTOR
rayservice-customer1-head-svc                    ClusterIP   10.0.253.179   <none>        10001/TCP,8265/TCP,6379/TCP,9080/TCP,8000/TCP   8m33s   app.kubernetes.io/created-by=kuberay-operator,app.kubernetes.io/name=kuberay,ray.io/cluster=rayservice-customer1-raycluster-mj72p,ray.io/identifier=rayservice-customer1-raycluster-mj72p-head,ray.io/node-type=head
rayservice-customer1-raycluster-mj72p-head-svc   ClusterIP   10.0.234.131   <none>        10001/TCP,8265/TCP,6379/TCP,9080/TCP,8000/TCP   9m13s   app.kubernetes.io/created-by=kuberay-operator,app.kubernetes.io/name=kuberay,ray.io/cluster=rayservice-customer1-raycluster-mj72p,ray.io/identifier=rayservice-customer1-raycluster-mj72p-head,ray.io/node-type=head
rayservice-customer1-serve-svc                   ClusterIP   10.0.17.9      <none>        8000/TCP                                        8m33s   ray.io/cluster=rayservice-customer1-raycluster-mj72p,ray.io/serve=true
When I tried to verify against the service rayservice-customer1-raycluster-mj72p-head-svc on port 9080, I couldn't find the metric. I tried the other service too.
Are both of these services tied to the head, since they both have the selector ray.io/node-type=head?
Is my RayService configuration correct? Can you please review it?
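To compare what each service actually selects, something like this could help (a sketch using the service names from the output above):

# Print the label selector of each head service
kubectl get svc rayservice-customer1-head-svc -n customer1 -o jsonpath='{.spec.selector}'
kubectl get svc rayservice-customer1-raycluster-mj72p-head-svc -n customer1 -o jsonpath='{.spec.selector}'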