
feat: Add vLLM counter metrics access through Triton #53

Merged: 27 commits merged into main on Aug 16, 2024

Conversation

yinggeh (Contributor) commented Jul 29, 2024

Sample endpoint output

```
# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model="vllm_metrics",version="1"} 60
# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model="vllm_metrics",version="1"} 96
```
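
For anyone trying this locally, the endpoint is a standard Prometheus scrape target. A minimal sketch, assuming Triton's default metrics port of 8002:

```python
import requests  # third-party HTTP client, assumed installed

# Triton serves Prometheus-format metrics at /metrics on port 8002 by default.
text = requests.get("http://localhost:8002/metrics").text
for line in text.splitlines():
    if line.startswith("vllm:"):
        print(line)
```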

What does the PR do?

Adds vLLM counter metrics access through the python_backend custom metrics API.
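
For context, the python_backend custom metrics API registers metric families and labeled metrics from inside a model. Below is a minimal sketch of a counter like the ones in the sample output above; it is illustrative only, not the PR's exact code. In practice this runs inside the model's initialize() and the increments come from vLLM's reported stats rather than a hard-coded value:

```python
import triton_python_backend_utils as pb_utils

# Register a counter family once, e.g. in TritonPythonModel.initialize().
prompt_tokens_family = pb_utils.MetricFamily(
    name="vllm:prompt_tokens_total",
    description="Number of prefill tokens processed.",
    kind=pb_utils.MetricFamily.COUNTER,
)
# Create a labeled counter and bump it as new stats arrive.
prompt_tokens = prompt_tokens_family.Metric(
    labels={"model": "vllm_metrics", "version": "1"}
)
prompt_tokens.increment(60)
```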

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated the GitHub labels field.
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added a succinct git squash message before merging (ref).
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

  • feat

Related PRs:

triton-inference-server/server#7493

Where should the reviewer start?

n/a

Test plan:

L0_backend_vllm/metrics_test

  • CI Pipeline ID:
    17372863
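
Not the actual L0_backend_vllm/metrics_test, but a hedged sketch of the kind of check such a test might perform, assuming the default metrics port and the counter names from the sample output:

```python
import requests

def counter_value(name: str) -> float:
    # Scrape Triton's metrics endpoint and pull out one counter's value.
    text = requests.get("http://localhost:8002/metrics").text
    for line in text.splitlines():
        if line.startswith(name + "{"):
            return float(line.rsplit(" ", 1)[1])
    raise AssertionError(f"metric {name} not found")

before = counter_value("vllm:prompt_tokens_total")
# ... run one inference request against the vLLM model here (elided) ...
after = counter_value("vllm:prompt_tokens_total")
assert after > before, "prompt token counter should grow after a request"
```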

Caveats:

  1. Update user guide and docs

Background

Customers requested this feature.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

n/a

@yinggeh yinggeh added the enhancement New feature or request label Jul 29, 2024
@yinggeh yinggeh requested a review from GuanLuo July 29, 2024 22:55
@yinggeh yinggeh self-assigned this Jul 29, 2024
@yinggeh yinggeh force-pushed the yinggeh-DLIS-7061-add-vllm-metrics branch from e867687 to 0686a7c on July 29, 2024 23:03
@yinggeh yinggeh changed the title from "Add vLLM metrics access" to "Add vLLM metrics access through Triton" on Jul 29, 2024
GuanLuo commented Jul 30, 2024

test?

@yinggeh yinggeh requested a review from krishung5 July 30, 2024 18:10
@yinggeh yinggeh force-pushed the yinggeh-DLIS-7061-add-vllm-metrics branch from acc216d to 21e2356 on July 30, 2024 18:15
src/model.py (outdated):

```python
    "version": self.args["model_version"],
}
logger = VllmStatLogger(labels=labels)
self.llm_engine.add_logger("triton", logger)
```
Contributor:

What is the cadence at which the logger gets called? CC @kthui as this will involve round trips with core, similar to your investigation with request cancellation frequency.

yinggeh (Author):

Can you elaborate on this?

Contributor:

How often will the metrics get updated? Every request, every token, every full response, etc.? In other words, how often will the vLLM engine call this attached Triton stats logger?

Reply:

every iteration

rmccorm4 (Contributor) commented Jul 30, 2024:

That will probably significantly affect the total throughput then, if the core round-trip communication interrupts generation at every iteration, based on Jacky and Iman's recent findings. We probably want this feature either way; just calling out that we'll likely need to make similar optimizations for this feature that @kthui is working on right now. Please work together to align on the best path forward for the metrics feature plus parity with vLLM performance.

yinggeh (Author):

@kthui will run benchmarks.

Contributor:

The current path forward is to allow metrics to be turned off. There is still room to improve in the future, i.e., performing the core round-trip communication on a side branch.

Contributor:

At this point, the performance impact of having metrics (counter and gauge) with the --disable-log-stats flag set is negligible when comparing FastAPI completion vs. Triton generate_stream. The delta between FastAPI completion and Triton generate_stream with metrics added and --disable-log-stats set is approximately the same as the delta without any metrics functionality added.
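
To make the cadence discussion concrete, here is a rough sketch of a per-iteration stat logger, assuming vLLM's StatLoggerBase interface at the time of this PR. The Stats field names and the update_counter helper are assumptions, not the PR's actual code:

```python
from vllm.engine.metrics import StatLoggerBase, Stats

def update_counter(name: str, value: int, labels: dict) -> None:
    # Stand-in for reporting into Triton's custom metrics API.
    print(f"{name}{labels} += {value}")

class TritonStatLogger(StatLoggerBase):
    """Sketch: the vLLM engine calls log() once per engine iteration."""

    def __init__(self, labels: dict) -> None:
        self.labels = labels

    def log(self, stats: Stats) -> None:
        # Each call crosses into Triton core to update counters; that
        # per-iteration round trip is the overhead discussed above.
        update_counter("vllm:prompt_tokens_total",
                       stats.num_prompt_tokens_iter, self.labels)
        update_counter("vllm:generation_tokens_total",
                       stats.num_generation_tokens_iter, self.labels)

    def info(self, type: str, obj) -> None:
        # Required by the interface; unused for counters.
        pass
```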

@rmccorm4 rmccorm4 changed the title from "Add vLLM metrics access through Triton" to "feat: Add vLLM metrics access through Triton" on Jul 30, 2024
@yinggeh yinggeh requested a review from kthui July 31, 2024 01:21
oandreeva-nv (Contributor):

Do we have a corresponding PR on the server side? Right now during the build, the container only copies model.py from this GitHub repo; if we're adding metrics.py, that should be copied as well. Ideally, I suggest putting metrics.py under the utils/ folder and adding logic to build.py that copies the utils folder as well.

oandreeva-nv (Contributor):

I would like to have a look at this PR as well before it gets merged.

@yinggeh yinggeh force-pushed the yinggeh-DLIS-7061-add-vllm-metrics branch from 4f6e9f7 to 321faa0 on August 3, 2024 01:30
yinggeh (Author) commented Aug 3, 2024:

> Do we have a corresponding PR on the server side? Right now during the build, the container only copies model.py from this GitHub repo; if we're adding metrics.py, that should be copied as well. Ideally, I suggest putting metrics.py under the utils/ folder and adding logic to build.py that copies the utils folder as well.

@oandreeva-nv Updated. triton-inference-server/server#7493
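
The real server-side change is in triton-inference-server/server#7493; purely as an illustration of the suggestion above, a shutil-based sketch with hypothetical paths:

```python
import shutil
from pathlib import Path

backend_src = Path("vllm_backend/src")     # hypothetical checkout location
backend_dst = Path("build/backends/vllm")  # hypothetical install location

shutil.copy(backend_src / "model.py", backend_dst / "model.py")
# Copy the whole utils/ folder so metrics.py (and future helpers) ship too.
shutil.copytree(backend_src / "utils", backend_dst / "utils", dirs_exist_ok=True)
```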

@yinggeh yinggeh requested a review from oandreeva-nv August 3, 2024 01:33
yinggeh (Author) commented Aug 3, 2024:

> test?

Added.

@yinggeh yinggeh requested a review from oandreeva-nv August 9, 2024 23:25
GuanLuo previously approved these changes Aug 10, 2024
kthui previously approved these changes Aug 14, 2024
src/utils/metrics.py:

```python
class TritonMetrics:
    def __init__(self, labels):
        # Initialize metric families
        # Iteration stats
```
Contributor:

Can you elaborate on the meaning of "Iteration stats"?

@yinggeh yinggeh dismissed stale reviews from kthui and GuanLuo via 89ca6f4 August 15, 2024 02:57
README.md (outdated excerpt):

following lines to its config.pbtxt.

```bash
parameters: {
  key: "REPORT_CUSTOM_METRICS"
```
Contributor:

nit: Not sure if it should be in caps.

yinggeh (Author):

To be consistent with the only parameters example found in our code:

```bash
parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {
    string_value: "no"
  }
}
```

yinggeh (Author) commented Aug 15, 2024:

I am more inclined to uppercase for boolean keys.

Contributor:

Can use key_value.upper() before the comparison:

```python
>>> "nO".upper()
'NO'
```

yinggeh (Author):

I'd rather make all parameters either case-insensitive at the time config.pbtxt is loaded, or all case-sensitive.
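
For illustration only, a sketch of what a case-insensitive check could look like; the merged code, shown below, compares the raw string against "yes":

```python
def wants_custom_metrics(model_config: dict) -> bool:
    # Hypothetical helper: case-insensitive variant of the check below.
    params = model_config.get("parameters", {})
    value = params.get("REPORT_CUSTOM_METRICS", {}).get("string_value", "")
    return value.upper() == "YES"
```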

"REPORT_CUSTOM_METRICS" in self.model_config["parameters"]
and self.model_config["parameters"]["REPORT_CUSTOM_METRICS"]["string_value"]
== "yes"
):
oandreeva-nv (Contributor) commented Aug 15, 2024:

nit: Potentially we can also check if disable_log_stats is true.

yinggeh (Author):

Nice catch. If FORCE_CPU_ONLY_INPUT_TENSORS = true but disable_log_stats = true, add_logger() throws an exception. Test added.

Contributor:

You mean REPORT_CUSTOM_METRICS, not FORCE_CPU_ONLY_INPUT_TENSORS?

yinggeh (Author):

Sorry, yes. I meant REPORT_CUSTOM_METRICS.
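
Putting this thread together, a sketch of the guarded setup being described. The helper, the disable_log_stats attribute access, and the import path are hypothetical; VllmStatLogger is the logger class this PR adds:

```python
from utils.metrics import VllmStatLogger  # the PR's logger class (path assumed)

def maybe_attach_stats_logger(model) -> None:
    # Attach the logger only when the model opts in AND vLLM stat logging is
    # enabled, since add_logger() raises if the engine was created with
    # disable_log_stats=True.
    params = model.model_config.get("parameters", {})
    opted_in = (
        params.get("REPORT_CUSTOM_METRICS", {}).get("string_value") == "yes"
    )
    if opted_in and not model.disable_log_stats:  # assumed attribute
        labels = {
            "model": model.args["model_name"],
            "version": model.args["model_version"],
        }
        model.llm_engine.add_logger("triton", VllmStatLogger(labels=labels))
```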

Tabrizian previously approved these changes Aug 15, 2024

Tabrizian (Member):

LGTM, please make sure @oandreeva-nv's comments are addressed. Thanks Yingge!

oandreeva-nv (Contributor) left a comment:

LGTM! Thanks for this work!

kthui (Contributor) left a comment:

LGTM, nice work!

@yinggeh yinggeh merged commit 3829366 into main Aug 16, 2024
3 checks passed
yinggeh added a commit that referenced this pull request Aug 16, 2024
mc-nv pushed a commit that referenced this pull request Aug 19, 2024: "Report vLLM counter metrics through Triton server"
mc-nv added a commit that referenced this pull request Aug 19, 2024
8 participants