feat: Report histogram metrics to Triton metrics server #56
Conversation
Force-pushed from 4c74307 to 20acebe.
Could you possibly target the initial metrics branch?
self.histogram_time_to_first_token = (
    self.histogram_time_to_first_token_family.Metric(
        labels=labels,
        buckets=[
The buckets here are just an example taken from the vLLM repo's metrics.py. I think we want to let the user define the interval buckets. That would also help the unit tests, since the observed values are pretty small when the prompts are simple. What is the best practice for allowing customizable buckets?
@oandreeva-nv See the explanation in the comment above.
I think that if we ship utils/metrics.py as part of a supported backend, we need to ship a defined set of buckets anyway, at least as a default. Since we ship this as a Python script, users can always adjust it on their side. And since these values correspond to the vLLM side of things, I think it is worth adding a comment about it with a permalink, so that we can easily refer to the original source and adjust.
Updated
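For context, here is a minimal sketch of what a shipped default plus user-adjustable buckets could look like in utils/metrics.py, assuming the python_backend metrics API from the related PRs (pb_utils.MetricFamily with a HISTOGRAM kind, and Metric(labels=..., buckets=...)). The metric name, description, and bucket values below are illustrative, not the PR's actual defaults.

import triton_python_backend_utils as pb_utils

# Illustrative default buckets; the shipped defaults mirror vLLM's metrics.py
# and should carry a permalink to that file for easy reference.
DEFAULT_TTFT_BUCKETS = [0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 2.5, 5.0, 10.0]


class TritonMetrics:
    def __init__(self, labels, ttft_buckets=None):
        # Histogram family registered with the Triton metrics server through
        # the python_backend custom-metrics API (HISTOGRAM kind assumed from
        # the related python_backend/core PRs).
        self.histogram_time_to_first_token_family = pb_utils.MetricFamily(
            name="vllm:time_to_first_token_seconds",
            description="Histogram of time to first token in seconds.",
            kind=pb_utils.MetricFamily.HISTOGRAM,
        )
        # Since utils/metrics.py ships as a plain Python script, users can
        # edit this list (or pass their own) to customize the buckets.
        self.histogram_time_to_first_token = (
            self.histogram_time_to_first_token_family.Metric(
                labels=labels,
                buckets=ttft_buckets or DEFAULT_TTFT_BUCKETS,
            )
        )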
class VLLMTritonMetricsTest(TestResultCollector):
    def setUp(self):
        self.triton_client = grpcclient.InferenceServerClient(url="localhost:8001")
        self.tritonserver_ipaddr = os.environ.get("TRITONSERVER_IPADDR", "localhost")
Why is an additional env var needed when you also have a hard-coded localhost for the inference client?
This env var is needed for the Windows tests as well, because localhost doesn't currently work in our Windows tests (cc @fpetrini15), but I agree the hard-coded cases should be swapped to use the shared variable.
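Not part of the diff, but a small sketch of the suggested cleanup, assuming the test keeps TRITONSERVER_IPADDR, that TestResultCollector comes from the existing test utilities, and that the default Triton ports (8001 for gRPC, 8002 for metrics) are used; the get_metrics helper is hypothetical.

import os

import requests
import tritonclient.grpc as grpcclient

from test_util import TestResultCollector  # existing test harness (assumed import path)


class VLLMTritonMetricsTest(TestResultCollector):
    def setUp(self):
        # Resolve the host once so Windows CI (where "localhost" is not usable)
        # and Linux CI share the same code path.
        self.tritonserver_ipaddr = os.environ.get("TRITONSERVER_IPADDR", "localhost")
        self.triton_client = grpcclient.InferenceServerClient(
            url=f"{self.tritonserver_ipaddr}:8001"
        )

    def get_metrics(self):
        # The metrics scrape reuses the same shared host variable.
        response = requests.get(f"http://{self.tritonserver_ipaddr}:8002/metrics")
        response.raise_for_status()
        return response.text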
@oandreeva-nv Target?
@yinggeh , i.e. instead of merging this branch into main, set the target merge branch as
@oandreeva-nv Is it better now?
@yinggeh I think you would need to resolve conflicts as well. Basically, it is still better to rebase this branch on top of
@oandreeva-nv Do I need to resolve conflicts for
Depending on the commit.
Force-pushed from 20acebe to 4b91f8c.
Rebased.
Force-pushed from 4b91f8c to 2810d3f.
Force-pushed from b24fef2 to 20184c3.
Force-pushed from 993f0c7 to ce0cf0f.
Force-pushed from ce0cf0f to 9534298.
README.md (outdated diff)
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.05"} 15 | ||
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.075"} 15 | ||
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.1"} 15 | ||
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.15"} 15 | ||
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.2"} 15 | ||
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.3"} 15 | ||
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.4"} 15 | ||
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.5"} 15 | ||
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.75"} 15 | ||
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="1"} 15 |
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.05"} 15 | |
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.075"} 15 | |
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.1"} 15 | |
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.15"} 15 | |
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.2"} 15 | |
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.3"} 15 | |
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.4"} 15 | |
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.5"} 15 | |
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.75"} 15 | |
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="1"} 15 | |
... |
Updated
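For readers of the README sample above, note that the le labels are cumulative upper bounds. The following is a hypothetical snippet (not part of this PR) that scrapes the default metrics endpoint on port 8002 and collects the bucket counts; the helper name, host, and port are assumptions.

import re

import requests


def read_histogram_buckets(metric_name, host="localhost", port=8002):
    # Fetch the Prometheus text exposition from Triton's metrics endpoint.
    text = requests.get(f"http://{host}:{port}/metrics").text
    # Bucket samples look like:
    #   vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.1"} 15
    pattern = re.compile(
        rf'^{re.escape(metric_name)}_bucket\{{.*le="([^"]+)"\}} (\d+)$',
        re.MULTILINE,
    )
    # "le" is an inclusive upper bound, so counts are cumulative: every bucket
    # also contains the observations from all smaller buckets.
    return {le: int(count) for le, count in pattern.findall(text)}


# Example: the le="+Inf" bucket equals the histogram's _count sample.
buckets = read_histogram_buckets("vllm:time_per_output_token_seconds")
print(buckets.get("+Inf"))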
@@ -125,6 +125,15 @@
# vllm:generation_tokens_total
self.assertEqual(metrics_dict["vllm:generation_tokens_total"], 48)

# vllm:time_to_first_token_seconds
self.assertEqual(metrics_dict["vllm:time_to_first_token_seconds_count"], 3)
self.assertTrue(metrics_dict["vllm:time_to_first_token_seconds_sum"] > 0)
Code scanning notice (CodeQL): Imprecise assert.
self.assertEqual(metrics_dict["vllm:time_to_first_token_seconds_bucket"], 3) | ||
# vllm:time_per_output_token_seconds | ||
self.assertEqual(metrics_dict["vllm:time_per_output_token_seconds_count"], 45) | ||
self.assertTrue(metrics_dict["vllm:time_per_output_token_seconds_sum"] > 0) |
Code scanning notice (CodeQL): Imprecise assert.
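Regarding the two CodeQL notices: assertTrue(x > 0) hides the observed value when it fails. A possible follow-up (not shown in the diff) is the comparison-specific assert:

# assertGreater prints the observed value on failure, which addresses
# CodeQL's "Imprecise assert" note.
self.assertGreater(metrics_dict["vllm:time_to_first_token_seconds_sum"], 0)
self.assertGreater(metrics_dict["vllm:time_per_output_token_seconds_sum"], 0)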
Note: new commits should dismiss approvals, but it looks like that setting wasn't applied in this repo -- will update the settings, but please keep that in mind 🙏
@rmccorm4 Thanks. Btw I merged to the wrong branch... It was initially set to merge to
Sample histogram output
What does the PR do?
Support the histogram metric type and add tests.
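As a rough sketch of the reporting path this adds: vLLM periodically hands aggregated stats to a logger object, and each latency sample becomes one histogram observation on the Triton metrics server. The class and field names below follow vLLM's StatLoggerBase/Stats interface, and the observe() call is assumed from the related python_backend histogram PR; they are illustrations, not the PR's exact code.

class VllmStatLogger:
    # vLLM calls log(stats) with aggregated engine statistics; each
    # per-request latency sample is forwarded as one histogram observation.
    def __init__(self, metrics):
        self.metrics = metrics

    def log(self, stats) -> None:
        for ttft in stats.time_to_first_tokens_iter:
            self.metrics.histogram_time_to_first_token.observe(ttft)
        for tpot in stats.time_per_output_tokens_iter:
            self.metrics.histogram_time_per_output_token.observe(tpot)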
Checklist
<commit_type>: <Title>
Commit Type:
Check the conventional commit type box here and add the label to the GitHub PR.
Related PRs:
triton-inference-server/python_backend#374
triton-inference-server/core#386
triton-inference-server/server#7525
Where should the reviewer start?
n/a
Test plan:
n/a
17487728
Caveats:
n/a
Background
Customer requested histogram metrics from vLLM.
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
n/a