[Bug] Application-Level Metrics works locally but fails when deployed as RayService #2553

rajendra-avesha opened this issue Nov 18, 2024 · 2 comments

rajendra-avesha commented Nov 18, 2024

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

I need to expose application-level metrics on a Ray Serve application, to be consumed by Prometheus.
I tried to use the Gauge from the ray.serve.metrics module. Please find the reference code below:

from ray import serve
from ray.serve.metrics import Gauge

from starlette.responses import JSONResponse
import psutil

@serve.deployment
class MyDeployment:
    def __init__(self):
        self.num_requests = 0
        self.my_gauge = Gauge(
            "memory_usage_bytes",
            description="Memory usage of the current process in bytes.",
            tag_keys=("model",),
        )
        self.my_gauge.set_default_tags({"model": "123"})

    async def __call__(self, request):
        # Update the request count
        self.num_requests += 1

        # Get current memory usage
        process = psutil.Process()
        memory_usage = process.memory_info().rss

        # Update the gauge metric
        self.my_gauge.set(memory_usage)

        # Return a response
        return JSONResponse({
            "message": "Metrics updated!",
            "memory_usage_bytes": memory_usage,
            "total_requests": self.num_requests,
        })


app = MyDeployment.bind()

This is the sample code provided by the Ray documentation.
When this code is run locally using serve run, it works as expected:

serve run deploy:app
2024-11-18 22:47:15,041 INFO scripts.py:499 -- Running import path: 'deploy:app'.
2024-11-18 22:47:15,054 INFO worker.py:1568 -- Connecting to existing Ray cluster at address: 192.168.225.219:6379...
2024-11-18 22:47:15,060 INFO worker.py:1744 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 
2024-11-18 22:47:16,419 INFO handle.py:126 -- Created DeploymentHandle '9y048115' for Deployment(name='MyDeployment', app='default').
2024-11-18 22:47:16,419 INFO handle.py:126 -- Created DeploymentHandle 'pu69dec0' for Deployment(name='MyDeployment', app='default').
(ServeController pid=77471) INFO 2024-11-18 22:47:16,454 controller 77471 deployment_state.py:1598 - Deploying new version of Deployment(name='MyDeployment', app='default') (initial target replicas: 1).
(ProxyActor pid=77474) INFO 2024-11-18 22:47:16,397 proxy 192.168.225.219 proxy.py:1165 - Proxy starting on node 9c1fd0028a2d1265ec47f7e6105d318b0176767ca6800b6754419452 (HTTP port: 8000).
(ServeController pid=77471) INFO 2024-11-18 22:47:16,556 controller 77471 deployment_state.py:1844 - Adding 1 replica to Deployment(name='MyDeployment', app='default').
2024-11-18 22:47:17,428 INFO handle.py:126 -- Created DeploymentHandle 'b8lhc4lw' for Deployment(name='MyDeployment', app='default').
2024-11-18 22:47:17,429 INFO api.py:584 -- Deployed app 'default' successfully.
(ServeReplica:default:MyDeployment pid=77479) INFO 2024-11-18 22:47:20,498 default_MyDeployment 9goiuke5 8e990763-875e-42d8-a014-1b8047f9a9c1 /GenericModelApp1/GM1v1 replica.py:373 - __CALL__ OK 1.7ms
^C2024-11-18 22:48:23,304       WARNING api.py:592 -- Got KeyboardInterrupt, exiting...
2024-11-18 22:48:23,305 INFO scripts.py:585 -- Got KeyboardInterrupt, shutting down...
(ServeController pid=77471) INFO 2024-11-18 22:48:23,351 controller 77471 deployment_state.py:1860 - Removing 1 replica from Deployment(name='MyDeployment', app='default').
(ServeController pid=77471) INFO 2024-11-18 22:48:25,388 controller 77471 deployment_state.py:2182 - Replica(id='9goiuke5', deployment='MyDeployment', app='default') is stopped.

The custom metric ray_memory_usage_bytes is available at http://127.0.0.1:8080/ (please refer to the attached serverun.txt).
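As a quick local sanity check, the metric can be confirmed by scraping the metrics endpoint directly (a sketch, assuming the default metrics export port 8080):

curl -s http://127.0.0.1:8080/metrics | grep memory_usage_bytes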
However, when the same source file is containerised and deployed using the following RayService.yaml, the custom metric is not exposed:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rayservice-customer1
  namespace: customer1
spec:
  serviceUnhealthySecondThreshold: 300 # Config for the health check threshold for Ray Serve applications. Default value is 900.
  deploymentUnhealthySecondThreshold: 300 # Config for the health check threshold for Ray dashboard agent. Default value is 300.
  serveConfigV2: |
      applications:
      - name: DeployApp
        import_path: deploy:app
        route_prefix: /app
        runtime_env: {}
        deployments:
        - name: MyDeployment
          max_concurrent_queries: 100
          autoscaling_config:
            metrics_interval_s: 0.1
            min_replicas: 1
            max_replicas: 5
            upscale_delay_s: 1
            downscale_delay_s: 2
            look_back_period_s: 2
            target_num_ongoing_requests_per_replica: 5
          ray_actor_options:
            num_cpus: 0.1

  rayClusterConfig:
    rayVersion: '2.32.0' # should match the Ray version in the image of the containers
    ## raycluster autoscaling config
    enableInTreeAutoscaling: true
    autoscalerOptions:
      upscalingMode: Default
      resources:
        limits:
          cpu: 1
          memory: "1000Mi"
        requests:
          cpu: 1
          memory: "1000Mi"
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
        num-cpus: "0"
        # Include the dashboard
        include-dashboard: "true"
        # Set the metrics export port
        metrics-export-port: "9080"
      #pod template
      template:
        metadata:
          annotations:
            prometheus.io/scrape: "true"
            prometheus.io/port: "9080"
            prometheus.io/path: "/metrics"
        spec:
          imagePullSecrets:
            - name: test-docker
          containers:
            - name: ray-head
              image: ray-base-image:0.0.2-SNAPSHOT-a18d3e13
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: 1
                  memory: 2Gi
                requests:
                  cpu: 1
                  memory: 2Gi
              env:
                - name: RAY_memory_usage_threshold
                  value: "0.90"  # Adjust threshold as needed
                - name: RAY_memory_monitor_refresh_ms
                  value: "0"  # Disable memory monitoring
                - name: RAY_GRAFANA_IFRAME_HOST
                  value: http://127.0.0.1:3000
                - name: RAY_GRAFANA_HOST
                  value: http://prometheus-grafana.prometheus-system.svc:80
                - name: RAY_PROMETHEUS_HOST
                  value: http://prometheus-kube-prometheus-prometheus.prometheus-system.svc:9090
                - name: RAY_LOG_LEVEL
                  value: debug
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
                - containerPort: 9080
                  name: metrics
    workerGroupSpecs:
      # The number of pod replicas in this worker group.
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # Logical group name; it can also be functional.
        groupName: worker
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams:
          metrics-export-port: "9080"
        #pod template
        template:
          metadata:
            annotations:
              prometheus.io/scrape: "true"
              prometheus.io/port: "9080"
              prometheus.io/path: "/metrics"
          spec:
            volumes:
              - name: data
                emptyDir: {}
            imagePullSecrets:
              - name: test-docker
            containers:
              - name: ray-worker # must consist of lowercase alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: ray-base-image:0.0.2-SNAPSHOT-a18d3e13
                imagePullPolicy: IfNotPresent
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh","-c","ray stop"]
                resources:
                  limits:
                    cpu: 2
                    memory: 3Gi
                  requests:
                    cpu: 2
                    memory: 3Gi
                env:
                  - name: RAY_LOG_LEVEL
                    value: debug
                volumeMounts:
                  - name: data
                    mountPath: /data
                ports:
                  - containerPort: 9080
                    name: metrics
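
For reference, the Prometheus in this cluster is the kube-prometheus-stack (RAY_PROMETHEUS_HOST above points at prometheus-kube-prometheus-prometheus.prometheus-system.svc). The operator-managed Prometheus typically does not honour prometheus.io/scrape pod annotations out of the box, so a PodMonitor (or ServiceMonitor) may be needed in addition to the annotations. Below is a rough sketch of such a PodMonitor; the release label and the ray.io/node-type selector are assumptions that have to be adapted to the actual setup:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: ray-workers-monitor
  namespace: prometheus-system
  labels:
    release: prometheus   # assumption: the label the Prometheus operator selects PodMonitors by
spec:
  namespaceSelector:
    matchNames:
      - customer1
  selector:
    matchLabels:
      ray.io/node-type: worker   # assumption: KubeRay labels worker pods this way
  podMetricsEndpoints:
    - port: metrics   # matches the containerPort named "metrics" (9080) in the worker spec above

A similar monitor (or a ServiceMonitor on the head service) would be needed if the head pod's metrics should be scraped as well.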

serverun.txt

Reproduction script

The created custom resource is attached as rayservicedesc.txt.

When port-forwarding port 9080 of the rayservice-customer1-head-svc service in the customer1 namespace, the custom metric is not available, although the Ray and system metrics are. Please find attached rayservice.txt:
Rayservice.txt

I am not sure what is missing here.
Initially I tried the default port 8080 and later changed to 9080 to check whether metrics-export-port is functional.
Please provide your inputs to debug further.

Anything else

I tried multiple times.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
rajendra-avesha (Author) commented:

Please find further analysis of this issue.
I am able to find my custom metric on the worker pod at localhost:9080/metrics (verified by running curl against http://127.0.0.1:9080/metrics).
I then explored the services created by the RayService:

kubectl get svc -n customer1 -owide
NAME                                             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                                         AGE     SELECTOR
rayservice-customer1-head-svc                    ClusterIP   10.0.253.179   <none>        10001/TCP,8265/TCP,6379/TCP,9080/TCP,8000/TCP   8m33s   app.kubernetes.io/created-by=kuberay-operator,app.kubernetes.io/name=kuberay,ray.io/cluster=rayservice-customer1-raycluster-mj72p,ray.io/identifier=rayservice-customer1-raycluster-mj72p-head,ray.io/node-type=head
rayservice-customer1-raycluster-mj72p-head-svc   ClusterIP   10.0.234.131   <none>        10001/TCP,8265/TCP,6379/TCP,9080/TCP,8000/TCP   9m13s   app.kubernetes.io/created-by=kuberay-operator,app.kubernetes.io/name=kuberay,ray.io/cluster=rayservice-customer1-raycluster-mj72p,ray.io/identifier=rayservice-customer1-raycluster-mj72p-head,ray.io/node-type=head
rayservice-customer1-serve-svc                   ClusterIP   10.0.17.9      <none>        8000/TCP                                        8m33s   ray.io/cluster=rayservice-customer1-raycluster-mj72p,ray.io/serve=true

When I tried to verify on the rayservice-customer1-raycluster-mj72p-head-svc service on port 9080, I couldn't find the metric. I tried the other head service too.
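For reference, this is roughly how the worker-pod metrics could be checked from outside the pod (a sketch; the pod name is a placeholder, and the ray.io/node-type=worker label is assumed to be set by KubeRay on worker pods):

kubectl get pods -n customer1 -l ray.io/node-type=worker
kubectl port-forward -n customer1 pod/<worker-pod-name> 9080:9080
curl -s http://127.0.0.1:9080/metrics | grep memory_usage_bytes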

Are both of these services tied to the head node, since they have the selector ray.io/node-type=head?

Is my RayService configuration correct? Can you please review it?

kevin85421 (Member) commented:

Hi @rajendra-avesha, this thread https://ray.slack.com/archives/CNCKBBRJL/p1730741501573559 might be useful. If you still have the issue, feel free to reach out to us on the KubeRay Slack or reply to this issue.
