-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace Summary algorithm with improved implementation #62
Conversation
Note: fixes #20 |
|
||
instance NFData Summary where | ||
rnf (MkSummary a b) = a `seq` b `seq` () |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seq
for b
doesn't look right here - don't you need to deepseq the entire list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
d'oh, should be fixed now.
@fimad I'm not really familiar with summaries at all here, so I'm not sure I'm in the best position to review. That said, the quality seems very high and presumably @iand675 is happy with the data that's eventually reaching Prometheus with this change. The extra deps of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really awesome, thank you for contributing this!
I just have one request for documentation on determineK
otherwise I think this looks good to merge.
Just published prometheus-client version 1.1.0 with this change. I also bumped the version for the other libraries so that they would pick up the expanded version range. @ocharles, could you also publish prometheus-proc? I don't have hackage access to that one. |
Context
Workers systems in our Haskell codebase were exhibiting dramatic slowdowns over time when using summary metrics. Through heap profiling, we determined that the culprit was the particular algorithm in use by
prometheus-client
having linear memory growth for "adversarial" inputs, such as repeating duplicate numbers, monotonically increasing values, and monotonically decreasing values:This was our worker heap profile for repeated observations of the value
1
:We found through a Rust implementation of the same algorithm (CKMS) that the targeted quantile variant in use is fundamentally flawed, and it pointed us to bq-pods.pdf this paper as the followup research by the same authors that solved the problem. Unfortunately, the paper assumed the use of effectively bounded ranges of possible inputs to make its sublinear memory growth guarantees, and wasn't practically implementable. We contacted the authors of the paper, and Graham Cormode provided us with a link to the latest work in this line, and thankfully it also had publicly available code under the Apache foundation. We ported over the
ReqSketch
implementation to Haskell and have it published as a package on HackageHow We Tested
We ported the tests from the Java implementation to our version here: https://github.com/iand675/datasketches-haskell/tree/main/test
Benchmarks in the general case (ReqSketch/insert/mvar) vs (Prometheus/insert/existing) show dramatic performance increases (~257x) over the existing Prometheus code.
We also ran tests to ensure sublinear growth, and in the adversarial case of repeated inserts of the same value, we are seeing the desired behaviour for the new implementation:
Returns
We also tested against our workers locally with the new algorithm, and we're seeing consistent garbage collection / memory usage when processing 100k webhooks. To be clear, this is the same workload as the graph posted above, the only difference is the usage of the new quantile estimation algorithm.