feat: Add vLLM counter metrics access through Triton #53
Conversation
Force-pushed from e867687 to 0686a7c (Compare)
test?
Force-pushed from acc216d to 21e2356 (Compare)
src/model.py
Outdated
"version": self.args["model_version"], | ||
} | ||
logger = VllmStatLogger(labels=labels) | ||
self.llm_engine.add_logger("triton", logger) |
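For context, here is a minimal sketch of what such a stat logger could look like. It assumes vLLM's `StatLoggerBase`/`Stats` interface from `vllm.engine.metrics` and a `TritonMetrics` helper along the lines sketched further down; the `Stats` field names (`num_prompt_tokens_iter`, `num_generation_tokens_iter`) are taken from vLLM's metrics module and may differ across versions.

```python
from vllm.engine.metrics import StatLoggerBase, Stats


class VllmStatLogger(StatLoggerBase):
    """Forwards vLLM iteration stats to Triton custom metrics (sketch)."""

    def __init__(self, labels: dict) -> None:
        # TritonMetrics is defined alongside this class in model.py (see sketch below);
        # it wraps pb_utils.MetricFamily counters.
        self.metrics = TritonMetrics(labels=labels)

    def info(self, type: str, obj) -> None:
        # Only counters are reported in this sketch; info-type metrics are ignored.
        pass

    def log(self, stats: Stats) -> None:
        # Called by the vLLM engine; the discussion below covers how often.
        self.metrics.counter_prompt_tokens.increment(stats.num_prompt_tokens_iter)
        self.metrics.counter_generation_tokens.increment(stats.num_generation_tokens_iter)
```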
What is the cadence at which the logger gets called? CC @kthui, as this will involve round trips with core, similar to your investigation into request cancellation frequency.
Can you elaborate on this?
How often will the metrics get updated? Every request, every token, every full response, etc.? In other words, how often will the vLLM engine call this attached Triton stats logger?
every iteration
That will probably significantly affect total throughput then, if the core round-trip communication interrupts generation at every iteration, based on Jacky and Iman's recent findings. We probably want this feature either way - just calling out that we'll likely need optimizations for this feature similar to the ones @kthui is working on right now. Please work together to align on the best path forward for the metrics feature + parity with vLLM performance.
@kthui will run benchmarks.
The current path forward is to allow metrics to be turned off. There is still room to improve in the future, i.e. perform the core round-trip communication on a side branch.
At this point, the impact of having metrics (counter and gauge) on performance with the `--disable-log-stats` flag set is negligible for FastAPI completion vs. Triton generate_stream. The delta between FastAPI completion and Triton generate_stream without any metrics functionality added is approximately the same as the delta with metrics added and the `--disable-log-stats` flag set.
Do we have a corresponding PR on the server side? Right now during the build the container only copies ...
I would like to have a look at this PR as well before it gets merged.
Force-pushed from 4f6e9f7 to 321faa0 (Compare)
Added
class TritonMetrics:
    def __init__(self, labels):
        # Initialize metric families
        # Iteration stats
Can you elaborate on the meaning of "Iteration stats"?
That's one of the vLLM metric categories. See https://github.com/vllm-project/vllm/blob/fc93e5614374688bddc432279244ba7fbf8169c2/vllm/engine/metrics.py#L68
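For illustration, here is a minimal sketch of what counter families for those iteration stats could look like via the python_backend custom-metrics API (`pb_utils.MetricFamily`). The metric names and descriptions are assumptions for the sketch, not necessarily the ones this PR registers.

```python
import triton_python_backend_utils as pb_utils


class TritonMetrics:
    def __init__(self, labels):
        # Initialize metric families
        # Iteration stats: counters that grow with every vLLM engine step.
        self.counter_prompt_tokens_family = pb_utils.MetricFamily(
            name="vllm:prompt_tokens_total",  # assumed name
            description="Number of prefill tokens processed.",
            kind=pb_utils.MetricFamily.COUNTER,
        )
        self.counter_generation_tokens_family = pb_utils.MetricFamily(
            name="vllm:generation_tokens_total",  # assumed name
            description="Number of generation tokens processed.",
            kind=pb_utils.MetricFamily.COUNTER,
        )
        # One labeled Metric per family, e.g. labels={"model": ..., "version": ...}.
        self.counter_prompt_tokens = self.counter_prompt_tokens_family.Metric(labels=labels)
        self.counter_generation_tokens = self.counter_generation_tokens_family.Metric(labels=labels)
```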
…TRICS" in config.pbtxt.
following lines to its config.pbtxt.
```bash
parameters: {
  key: "REPORT_CUSTOM_METRICS"
  value: { string_value: "yes" }
}
```
nit: not sure if it should be in caps
To be consistent with the only `parameters` example found in our code:
parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {
    string_value: "no"
  }
}
It's one example; here's a lowercase one as well: https://github.com/triton-inference-server/server/blob/53200091b84f08a5e4921f5073137784570283e9/docs/user_guide/optimization.md#onnx-with-tensorrt-optimization-ort-trt
I am more inclined to upper case for boolean keys.
Can use `key_value.upper()` before the comparison:
>>> "nO".upper()
'NO'
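For illustration, a short sketch of a case-insensitive check along those lines; the parameter access pattern follows the snippet above, and `_setup_metrics()` is a hypothetical helper:

```python
params = self.model_config.get("parameters", {})
value = params.get("REPORT_CUSTOM_METRICS", {}).get("string_value", "no")

# Normalize case so "yes", "Yes", and "YES" are all accepted.
if value.upper() == "YES":
    self._setup_metrics()  # hypothetical helper that builds and attaches the stat logger
```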
I'd rather make all parameters either case-insensitive at the time config.pbtxt is loaded, or all case-sensitive.
"REPORT_CUSTOM_METRICS" in self.model_config["parameters"] | ||
and self.model_config["parameters"]["REPORT_CUSTOM_METRICS"]["string_value"] | ||
== "yes" | ||
): |
nit: potentially we can also check if `disable_log_stats` is true
Nice catch. If `FORCE_CPU_ONLY_INPUT_TENSORS = true` but `disable_log_stats=true`, `add_logger()` throws an exception. Test added.
You mean `REPORT_CUSTOM_METRICS`, not `FORCE_CPU_ONLY_INPUT_TENSORS`?
Sorry. Yes, I meant `REPORT_CUSTOM_METRICS`.
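To make the resolution concrete, a hedged sketch of the guard this exchange points at, reusing the snippet from above; `engine_args.disable_log_stats` and the `model_name` key are assumed names for illustration:

```python
# Attach the Triton stat logger only when the model opts in via config.pbtxt
# AND vLLM stat logging is enabled; calling add_logger() on an engine created
# with disable_log_stats=True raises an exception (per the discussion above).
report_metrics = (
    "REPORT_CUSTOM_METRICS" in self.model_config["parameters"]
    and self.model_config["parameters"]["REPORT_CUSTOM_METRICS"]["string_value"]
    == "yes"
)
if report_metrics and not engine_args.disable_log_stats:  # attribute name assumed
    labels = {
        "model": self.args["model_name"],  # key assumed
        "version": self.args["model_version"],
    }
    logger = VllmStatLogger(labels=labels)
    self.llm_engine.add_logger("triton", logger)
```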
LGTM, please make sure @oandreeva-nv's comments are addressed. Thanks Yingge!
LGTM! Thanks for this work!
LGTM, nice work!
Report vLLM counter metrics through Triton server
Co-authored-by: Yingge He <[email protected]>
Sample endpoint output
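The actual sample output is not reproduced here; purely as an illustration, counter output scraped from Triton's Prometheus endpoint (default metrics port 8002) could look roughly like this, with metric names and label values being assumptions:

```bash
# Illustrative only: metric and label names below are assumptions.
curl localhost:8002/metrics

# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model="vllm_model",version="1"} 10
# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model="vllm_model",version="1"} 16
```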
What does the PR do?
Add vLLM counter metrics access through python_backend custom metrics.
Checklist
<commit_type>: <Title>
Commit Type: Check the conventional commit type box here and add the label to the GitHub PR.
Related PRs:
triton-inference-server/server#7493
Where should the reviewer start?
n/a
Test plan:
L0_backend_vllm/metrics_test
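For context, a hedged sketch of the kind of check such a metrics test might perform, assuming the server exposes Prometheus metrics on the default port 8002 and registers a counter named `vllm:prompt_tokens_total` (both assumptions for illustration):

```python
import re

import requests  # assumed to be available in the test environment


def test_vllm_counter_metrics_exposed():
    # Scrape Triton's Prometheus endpoint after at least one inference has run.
    metrics = requests.get("http://localhost:8002/metrics").text

    # The metric name is an assumption; adjust to whatever the backend registers.
    match = re.search(r"vllm:prompt_tokens_total\{[^}]*\} (\d+)", metrics)
    assert match is not None, "counter metric not found in /metrics output"
    assert int(match.group(1)) > 0
```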
17372863
Caveats:
Background
Customers requested.
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
n/a