Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace Summary algorithm with improved implementation #62

Merged
merged 7 commits into from
Aug 27, 2021

Conversation

iand675
Copy link
Contributor

@iand675 iand675 commented Aug 24, 2021

Context

Workers systems in our Haskell codebase were exhibiting dramatic slowdowns over time when using summary metrics. Through heap profiling, we determined that the culprit was the particular algorithm in use by prometheus-client having linear memory growth for "adversarial" inputs, such as repeating duplicate numbers, monotonically increasing values, and monotonically decreasing values:

This was our worker heap profile for repeated observations of the value 1:

image

We found through a Rust implementation of the same algorithm (CKMS) that the targeted quantile variant in use is fundamentally flawed, and it pointed us to bq-pods.pdf this paper as the followup research by the same authors that solved the problem. Unfortunately, the paper assumed the use of effectively bounded ranges of possible inputs to make its sublinear memory growth guarantees, and wasn't practically implementable. We contacted the authors of the paper, and Graham Cormode provided us with a link to the latest work in this line, and thankfully it also had publicly available code under the Apache foundation. We ported over the ReqSketch implementation to Haskell and have it published as a package on Hackage

How We Tested

We ported the tests from the Java implementation to our version here: https://github.com/iand675/datasketches-haskell/tree/main/test

Benchmarks in the general case (ReqSketch/insert/mvar) vs (Prometheus/insert/existing) show dramatic performance increases (~257x) over the existing Prometheus code.

image

We also ran tests to ensure sublinear growth, and in the adversarial case of repeated inserts of the same value, we are seeing the desired behaviour for the new implementation:

sk <- mkReqSketch 6 HighRanksAreAccurate
replicateM_ 100_000_000 (insert sk 1)
print =<< getRetainedItems

Returns

937

We also tested against our workers locally with the new algorithm, and we're seeing consistent garbage collection / memory usage when processing 100k webhooks. To be clear, this is the same workload as the graph posted above, the only difference is the usage of the new quantile estimation algorithm.

image

@iand675
Copy link
Contributor Author

iand675 commented Aug 24, 2021

Note: fixes #20


instance NFData Summary where
rnf (MkSummary a b) = a `seq` b `seq` ()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seq for b doesn't look right here - don't you need to deepseq the entire list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

d'oh, should be fixed now.

@ocharles
Copy link
Collaborator

@fimad I'm not really familiar with summaries at all here, so I'm not sure I'm in the best position to review. That said, the quality seems very high and presumably @iand675 is happy with the data that's eventually reaching Prometheus with this change. The extra deps of data-sketches look reasonable to me. What do you think?

Copy link
Owner

@fimad fimad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really awesome, thank you for contributing this!

I just have one request for documentation on determineK otherwise I think this looks good to merge.

@iand675 iand675 requested a review from fimad August 26, 2021 14:58
@fimad fimad merged commit 7146fd5 into fimad:master Aug 27, 2021
@fimad
Copy link
Owner

fimad commented Aug 27, 2021

Just published prometheus-client version 1.1.0 with this change.

I also bumped the version for the other libraries so that they would pick up the expanded version range. @ocharles, could you also publish prometheus-proc? I don't have hackage access to that one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants