Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add histogram metric type #386

Merged
merged 10 commits into from
Aug 16, 2024

Conversation

yinggeh
Copy link
Contributor

@yinggeh yinggeh commented Aug 7, 2024

What does the PR do?

Support histogram metric type and add tests.

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type
box here and add the label to the github PR.

  • feat

Related PRs:

triton-inference-server/vllm_backend#56
triton-inference-server/python_backend#374
triton-inference-server/server#7525

Where should the reviewer start?

n/a

Test plan:

n/a

  • CI Pipeline ID:
    17487728

Caveats:

n/a

Background

Customer requested histogram metrics from vLLM.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

n/a

include/triton/core/tritonserver.h Outdated Show resolved Hide resolved
/// \param metric The metric object to update.
/// \param value The amount to observe metric's value to.
/// \return a TRITONSERVER_Error indicating success or failure.
TRITONSERVER_DECLSPEC struct TRITONSERVER_Error* TRITONSERVER_MetricObserve(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reuse TRITONSERVER_MetricSet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rmccorm4 Any thought? Basically we need to merge two funcions below into one Metric::Set(double value). It works but may add confusion.

core/src/metric_family.cc

Lines 338 to 404 in fd5c44b

TRITONSERVER_Error*
Metric::Set(double value)
{
if (metric_ == nullptr) {
return TRITONSERVER_ErrorNew(
TRITONSERVER_ERROR_INTERNAL,
"Could not set metric value. Metric has been invalidated.");
}
switch (kind_) {
case TRITONSERVER_METRIC_KIND_COUNTER: {
return TRITONSERVER_ErrorNew(
TRITONSERVER_ERROR_UNSUPPORTED,
"TRITONSERVER_METRIC_KIND_COUNTER does not support Set");
}
case TRITONSERVER_METRIC_KIND_GAUGE: {
auto gauge_ptr = reinterpret_cast<prometheus::Gauge*>(metric_);
gauge_ptr->Set(value);
break;
}
case TRITONSERVER_METRIC_KIND_HISTOGRAM: {
return TRITONSERVER_ErrorNew(
TRITONSERVER_ERROR_UNSUPPORTED,
"TRITONSERVER_METRIC_KIND_HISTOGRAM does not support Set");
}
default:
return TRITONSERVER_ErrorNew(
TRITONSERVER_ERROR_UNSUPPORTED,
"Unsupported TRITONSERVER_MetricKind");
}
return nullptr; // Success
}
TRITONSERVER_Error*
Metric::Observe(double value)
{
if (metric_ == nullptr) {
return TRITONSERVER_ErrorNew(
TRITONSERVER_ERROR_INTERNAL,
"Could not set metric value. Metric has been invalidated.");
}
switch (kind_) {
case TRITONSERVER_METRIC_KIND_COUNTER: {
return TRITONSERVER_ErrorNew(
TRITONSERVER_ERROR_UNSUPPORTED,
"TRITONSERVER_METRIC_KIND_COUNTER does not support Observe");
}
case TRITONSERVER_METRIC_KIND_GAUGE: {
return TRITONSERVER_ErrorNew(
TRITONSERVER_ERROR_UNSUPPORTED,
"TRITONSERVER_METRIC_KIND_GAUGE does not support Observe");
}
case TRITONSERVER_METRIC_KIND_HISTOGRAM: {
auto histogram_ptr = reinterpret_cast<prometheus::Histogram*>(metric_);
histogram_ptr->Observe(value);
break;
}
default:
return TRITONSERVER_ErrorNew(
TRITONSERVER_ERROR_UNSUPPORTED,
"Unsupported TRITONSERVER_MetricKind");
}
return nullptr; // Success
}

Copy link
Contributor

@rmccorm4 rmccorm4 Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would need to take a closer look, but my gut reaction is that Guan is probably right and we can probably just reuse MetricValue and MetricSet which will call Collect and Observe internally when kind == KIND_HISTOGRAM if functionally equivalent

Copy link
Contributor

@rmccorm4 rmccorm4 Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MetricValue may not work if Collect returns multiple values (one per bucket?), but again will need to take a closer look. Let me know if you already know more details on this from your research.

But similar to the new C API for MetricV2, keep in mind how this would work if we added support for Summary metric and wanted to get the values for each quantile, which is basically same as values for each bucket. Ideally the same API would work for both or all types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MetricValue cannot be reused for histogram.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Copy link
Contributor Author

@yinggeh yinggeh Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to revisit this change. The consensus is to keep C API and python_backend API 1:1 matched. I am inclined to add a new C API TRITONSERVER_MetricObserve for histogram instead of reusing TRITONSERVER_MetricSet for three reasons.

  1. Both Histogram and Summary types call Observe to record new value. We can reuse observe for Summary type if we add it in the future.
  2. Histogram also has ObserveMultiple API which may be added in the future. I don't like the idea that Histogram.Set and Histogram.ObserveMultiple coexist.
  3. Setting histogram to a value aka Histogram.set(val) is semantically wrong. It is confusing to users familiar with Prometheus APIs. The description of TRITONSERVER_MetricSet can be verbose as well in order to describe different behaviors for counter/gauge and histogram/summary.

cc @Tabrizian @rmccorm4 @GuanLuo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well.. Triton metrics API is not supposed to be mirroring "Prometheus API", Prometheus is one of the "forms" we can exhibit the metrics as. So we should design the API to be using generic terms for statistics, the meaning of gauge/counter/histogram (and summary?) is not affected by the fact that Prometheus or other metrics libraries are used.

Thinking from this mindset, my question is if observe is the generic verb for recording the statistics of histogram. If that is the case, then I am fine to add XXXObserve, otherwise, we should use the proper verb

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think either sample or observe. Voting for Observe for simplicity.

A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.

Similar to a histogram, a summary samples observations (usually things like request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.

include/triton/core/tritonserver.h Outdated Show resolved Hide resolved
src/test/metrics_api_test.cc Outdated Show resolved Hide resolved
@yinggeh yinggeh force-pushed the yinggeh-DLIS-7113-support-histogram-metric-type branch 2 times, most recently from d0fed63 to fd5c44b Compare August 13, 2024 01:53
@yinggeh yinggeh force-pushed the yinggeh-DLIS-7113-support-histogram-metric-type branch from 79566e7 to edb0533 Compare August 13, 2024 02:30
include/triton/core/tritonserver.h Show resolved Hide resolved
include/triton/core/tritonserver.h Outdated Show resolved Hide resolved
include/triton/core/tritonserver.h Outdated Show resolved Hide resolved
/// Supports metrics of kind TRITONSERVER_METRIC_KIND_GAUGE and returns
/// Set the current value of metric to value or observe the value to metric.
/// Supports metrics of kind TRITONSERVER_METRIC_KIND_GAUGE and
/// TRITONSERVER_METRIC_KIND_HISTOGRAM. Returns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to explain what it does when TRITONSERVER_METRIC_KIND_HISTOGRAM is histogram (i.e. increment the counter for the bucket that value matches)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "observe" mean? Can we add more details?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why I still think we need a new C API TRITONSERVER_MetricObserve.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 54 to 55
buckets_.resize(bucket_count);
std::memcpy(buckets_.data(), buckets, sizeof(double) * bucket_count);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
buckets_.resize(bucket_count);
std::memcpy(buckets_.data(), buckets, sizeof(double) * bucket_count);
buckets_ = std::vector<double>(buckets, buckets + bucket_count);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Comment on lines 250 to 251
std::vector<std::uint64_t> cumulative_counts = {1, 1, 2, 2, 3, 3};
ASSERT_EQ(buckets.size() + 1, cumulative_counts.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cumulative_counts is depending on the buckets you split for the histogram, you should initialize cumulative_counts according to the buckets and data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@yinggeh yinggeh requested a review from GuanLuo August 16, 2024 19:32
GuanLuo
GuanLuo previously approved these changes Aug 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

5 participants