Replace Summary algorithm with improved implementation #62

iand675 · 2021-08-24T21:02:01Z

Context

Worker systems in our Haskell codebase were exhibiting dramatic slowdowns over time when using summary metrics. Through heap profiling, we determined that the culprit was the algorithm used by prometheus-client, which has linear memory growth for "adversarial" inputs such as repeated duplicate values, monotonically increasing values, and monotonically decreasing values.

This was our worker heap profile for repeated observations of the value 1:

We found through a Rust implementation of the same algorithm (CKMS) that the targeted quantile variant in use is fundamentally flawed, and it pointed us to this paper (bq-pods.pdf), follow-up research by the same authors that solved the problem. Unfortunately, the paper assumed effectively bounded ranges of possible inputs in order to make its sublinear memory growth guarantees, and wasn't practically implementable. We contacted the authors of the paper, and Graham Cormode provided us with a link to the latest work in this line; thankfully, it also had publicly available code under the Apache foundation. We ported the ReqSketch implementation to Haskell and published it as a package on Hackage.

How We Tested

We ported the tests from the Java implementation to our version here: https://github.com/iand675/datasketches-haskell/tree/main/test

Benchmarks in the general case (ReqSketch/insert/mvar) vs (Prometheus/insert/existing) show dramatic performance increases (~257x) over the existing Prometheus code.

We also ran tests to ensure sublinear growth, and in the adversarial case of repeated inserts of the same value, we are seeing the desired behaviour for the new implementation:

sk <- mkReqSketch 6 HighRanksAreAccurate
replicateM_ 100_000_000 (insert sk 1)
print =<< getRetainedItems sk

Returns

We also tested against our workers locally with the new algorithm, and we're seeing consistent garbage collection / memory usage when processing 100k webhooks. To be clear, this is the same workload as the graph posted above; the only difference is the use of the new quantile estimation algorithm.
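For readers who want to reproduce the shape of this check, here is a minimal sketch of the three adversarial input patterns from the Context section and of a retained-item measurement. Everything named here (`NaiveStore`, `observe`, `retained`, `retainedAfter`) is hypothetical and is not part of prometheus-client or data-sketches. The naive store deliberately keeps every observation, so it shows the linear growth that a sublinear sketch is meant to avoid.

```haskell
import Data.Foldable (for_)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Hypothetical stand-in for an estimator that never compacts:
-- it keeps every observation, so retained items grow linearly.
newtype NaiveStore = NaiveStore (IORef [Double])

newNaiveStore :: IO NaiveStore
newNaiveStore = NaiveStore <$> newIORef []

observe :: NaiveStore -> Double -> IO ()
observe (NaiveStore ref) x = modifyIORef' ref (x :)

retained :: NaiveStore -> IO Int
retained (NaiveStore ref) = length <$> readIORef ref

-- The three adversarial patterns: repeated duplicates,
-- monotonically increasing, and monotonically decreasing values.
duplicates, increasing, decreasing :: Int -> [Double]
duplicates n = replicate n 1
increasing n = map fromIntegral [1 .. n]
decreasing n = map fromIntegral [n, n - 1 .. 1]

-- Feed a pattern into a fresh store and report how many items it retains.
retainedAfter :: [Double] -> IO Int
retainedAfter xs = do
  s <- newNaiveStore
  for_ xs (observe s)
  retained s
```

With this naive store, `retainedAfter (duplicates 1000)` reports 1000 retained items. Swapping the store for a real sketch and comparing the retained count against the number of observations is one way to check for sublinear growth.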

iand675 · 2021-08-24T21:02:41Z

Note: fixes #20

ocharles · 2021-08-25T08:49:57Z

prometheus-client/src/Prometheus/Metric/Summary.hs


+instance NFData Summary where
+  rnf (MkSummary a b) = a `seq` b `seq` ()


seq for b doesn't look right here - don't you need to deepseq the entire list?

iand675 · 2021-08-25

d'oh, should be fixed now.
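To illustrate the point raised in this thread: `seq` evaluates the list only to its first constructor, whereas `deepseq` forces every element. The sketch below uses a hypothetical stand-in for `Summary` (an `IORef Int` plus a list of rational pairs); the real field types in the PR may differ, and this is not necessarily the exact fix that was committed.

```haskell
import Control.DeepSeq (NFData (..), deepseq)
import Data.IORef (IORef, newIORef)

-- Hypothetical stand-in for the PR's Summary type: a mutable
-- reference plus a list of (quantile, error) pairs.
data Summary = MkSummary (IORef Int) [(Rational, Rational)]

instance NFData Summary where
  -- The reference only needs evaluating to weak head normal form;
  -- the list is forced all the way down with deepseq.
  rnf (MkSummary ref qs) = ref `seq` qs `deepseq` ()

-- Build a small example value (the quantile pairs are illustrative).
exampleSummary :: IO Summary
exampleSummary = do
  ref <- newIORef 0
  pure (MkSummary ref [(0.5, 0.05), (0.9, 0.01), (0.99, 0.001)])
```

With ``b `seq` ()``, an unevaluated thunk in the tail of the list would survive `rnf`; with `deepseq`, it is forced.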

ocharles · 2021-08-25T08:51:23Z

@fimad I'm not really familiar with summaries at all here, so I'm not sure I'm in the best position to review. That said, the quality seems very high and presumably @iand675 is happy with the data that's eventually reaching Prometheus with this change. The extra deps of data-sketches look reasonable to me. What do you think?

fimad

This is really awesome, thank you for contributing this!

I just have one request for documentation on determineK; otherwise I think this looks good to merge.

prometheus-client/src/Prometheus/Metric/Summary.hs

fimad · 2021-08-27T02:44:58Z

Just published prometheus-client version 1.1.0 with this change.

I also bumped the version for the other libraries so that they would pick up the expanded version range. @ocharles, could you also publish prometheus-proc? I don't have Hackage access to that one.

iand675 added 5 commits August 23, 2021 15:50

Implement experimental alternative to existing Summary algorithm

b753567

Fix tests by using <= as rank criteria

1a02fd8

Update Summary to calculate appropriate precision requirements

e84c8ef

Start determineK with slightly higher accuracy

67636ef

Clean up some comments and bump version due to estimator removal

9151269

ocharles reviewed Aug 25, 2021

Fix NFData instance

3f1f93c

fimad reviewed Aug 26, 2021

Add note about what determineK does

eaf7d11

iand675 requested a review from fimad August 26, 2021 14:58

fimad approved these changes Aug 27, 2021

fimad merged commit 7146fd5 into fimad:master Aug 27, 2021

fimad mentioned this pull request Aug 27, 2021

Excessive GC usage for Summary metric in wai-prometheus-middleware #20

Open
