Excessive GC usage for Summary metric in wai-prometheus-middleware #20
Comments
I actually see now that you moved away from …; I'll check if this fixes our performance issues. |
However, it does mean there seems to be something seriously wrong with the … |
The impact of the metrics using histograms is much smaller than that of summaries. Here are some measurements we made last week; I think this one can be closed (CC @arianvp):
[Graph: Without per-request metrics]
[Graph: With per-request metrics (version 0.1.1, using summaries)]
[Graph: With per-request metrics (version 0.2.0, using histograms)]
|
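As a rough illustration of why the histogram version is cheaper: a histogram only increments a fixed set of bucket counters per observation, so its memory use does not grow with traffic, whereas a summary has to maintain a quantile estimate over the stream of observations. The sketch below is a minimal, hypothetical cumulative-bucket histogram; `Histogram`, `newHistogram`, `observe`, and `cumulative` are illustrative names, not the prometheus-haskell API.

```haskell
import Data.IORef (IORef, newIORef, readIORef, atomicModifyIORef')
import qualified Data.Map.Strict as Map

-- Hypothetical histogram state: a non-cumulative count per upper bound,
-- plus the running sum and count of all observations.
data HistState = HistState
  { hsBuckets :: !(Map.Map Double Int)
  , hsSum     :: !Double
  , hsCount   :: !Int
  }

newtype Histogram = Histogram (IORef HistState)

-- The bucket bounds are fixed up front; a +Inf bucket is always added.
newHistogram :: [Double] -> IO Histogram
newHistogram bounds = Histogram <$> newIORef (HistState buckets 0 0)
  where
    inf = 1 / 0
    buckets = Map.fromList [(b, 0) | b <- bounds ++ [inf]]

-- Constant work per observation: find the smallest bound >= x and bump it.
-- The strict fields and atomicModifyIORef' keep thunks from piling up.
observe :: Histogram -> Double -> IO ()
observe (Histogram ref) x = atomicModifyIORef' ref $ \s ->
  let bucket = maybe (1 / 0) fst (Map.lookupGE x (hsBuckets s))
      s' = s { hsBuckets = Map.adjust (+ 1) bucket (hsBuckets s)
             , hsSum     = hsSum s + x
             , hsCount   = hsCount s + 1
             }
  in (s', ())

-- Cumulative counts per upper bound, the shape exposed as le="..." buckets.
cumulative :: Histogram -> IO [(Double, Int)]
cumulative (Histogram ref) = do
  s <- readIORef ref
  let (bs, cs) = unzip (Map.toAscList (hsBuckets s))
  pure (zip bs (scanl1 (+) cs))
```

For example, after `newHistogram [0.1, 0.5, 1.0]` and observing `[0.05, 0.2, 0.2, 0.7, 3.0]`, `cumulative` returns the counts `[1, 3, 4, 5]` for the bounds 0.1, 0.5, 1.0, and +Inf. The trade-off mentioned later in the thread is visible here: the bounds must be chosen before any data arrives.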
@fimad, it seems like a good idea to deprecate … |
To add another data point to the discussion, I'm working on a new library (axman6/servant-prometheus), which is a rewrite of the servant-ekg package. Once I added summaries to the metrics for each endpoint, the benchmarks went from handling 40k+ req/sec to ~5k. I have been looking at the summary code and believe the main culprit is the use of Rationals. I did a quick test of making all … Is there a particularly good justification for using Rationals? Summaries are a very useful metric because they don't require an arbitrary a priori decision about which histogram bins make sense for a given application. I would like to see both summaries and histograms kept because they serve different purposes, but at the moment the overhead is too great. For now, servant-prometheus provides a way for an app to disable summary calculation, but it would be nice not to need it. |
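A minimal sketch of the cost difference described above, assuming observations arrive as `Double`s: `toRational` converts each one to an exact fraction whose denominator is a power of two, and every `Rational` addition multiplies arbitrary-precision `Integer`s and reduces the result with a gcd, allocating on the heap each time, while `Double` addition is typically a single machine instruction once GHC unboxes it. `sumRational`, `sumDouble`, and `denominatorBits` are hypothetical helpers for experimentation; this is not the library's summary code.

```haskell
import Data.List (foldl')
import Data.Ratio (denominator)

-- Exact sum: each step builds new Integers and reduces the fraction by a gcd.
sumRational :: [Double] -> Rational
sumRational = foldl' (\acc x -> acc + toRational x) 0

-- Floating-point sum: one addition per step, no Integer allocation.
sumDouble :: [Double] -> Double
sumDouble = foldl' (+) 0

-- Bit length of the denominator, a rough proxy for how big the
-- Integers involved in each Rational operation are.
denominatorBits :: Rational -> Int
denominatorBits r = length (takeWhile (> 0) (iterate (`div` 2) (denominator r)))
```

Running `sumRational` and `sumDouble` over, say, a million samples with `+RTS -s` should make the difference in allocation visible. For reference, `denominator (toRational (0.1 :: Double))` is `2^55`, so even a single converted observation already carries a 56-bit denominator.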
I think the original reason I used … Given the stark difference in performance, I think switching to Doubles would be reasonable. |
A possible middle ground might be using something like Ed Kmett’s compensated library to get … |
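For the middle-ground idea, here is a sketch of classic Kahan (compensated) summation over plain `Double`s. It is a stand-in for the general technique, not the compensated library's API: a second `Double` carries the low-order bits lost in each addition, which recovers most of the precision that motivated `Rational` while keeping floating-point speed. `Kahan`, `kahanAdd`, and `kahanSum` are hypothetical names.

```haskell
import Data.List (foldl')

-- Running state: the sum so far and a compensation term holding the
-- low-order bits that the last addition rounded away. Strict fields keep
-- foldl' from building thunks.
data Kahan = Kahan !Double !Double

kahanAdd :: Kahan -> Double -> Kahan
kahanAdd (Kahan s c) x =
  let y  = x - c        -- correct the input by the previously lost error
      t  = s + y        -- low-order bits of y may be lost here
      c' = (t - s) - y  -- recover what was lost (zero in exact arithmetic)
  in Kahan t c'

kahanSum :: [Double] -> Double
kahanSum xs = let Kahan s _ = foldl' kahanAdd (Kahan 0 0) xs in s
```

For instance, summing `1 : replicate 10 1e-16` naively yields exactly `1.0`, because each `1e-16` is below half an ulp of 1, whereas `kahanSum` returns a value slightly above 1, close to the true sum. The cost is roughly four floating-point operations per observation instead of one, which is still far cheaper than `Integer` arithmetic.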
We've recently had some issues with higher-than-expected CPU load on a couple of our servers. After reproducing the problem locally and profiling with GHC, our two largest cost centers are: …
We're currently collecting mostly summary metrics, so it seems related to this old ticket. We can move more metrics to histograms instead, but it would be nice if there were a simple way to improve the performance of summary metrics too. @fimad or @axman6, did either of you ever have success transitioning from …? I'd be happy to dive in here more and open a PR if anyone has ideas that can point me in the right direction. |
We have a WAI application in production that serves only around 5 req/s, and we had serious performance issues because the server was spending 95% of its time in garbage collection.
Before
You can see that we're spending almost all clock cycles on garbage collection, and that garbage collection happens often.
Profiler output
After profiling, the culprit turned out to be the summaries that are kept for each request in wai-prometheus-middleware.
After disabling wai-prometheus-middleware
We commented out the middleware in our application, and instead of using 400% CPU, we're back to 2% CPU, and the GC contribution is only 0.02 instead of 1.0. See also the graph below: on the left is before removing the middleware, and on the right is after.
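The "GC contribution" above is presumably the share of CPU time spent in garbage collection. One way to check that from inside a Haskell process is to read the RTS statistics from `GHC.Stats`, which are only collected when the program runs with `+RTS -T`. `gcCpuFraction` is a hypothetical helper, and whether the number quoted above was computed this way is an assumption.

```haskell
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- Fraction of the process's CPU time spent in GC since startup.
-- Returns Nothing when the RTS was started without statistics (+RTS -T)
-- or when no CPU time has been recorded yet.
gcCpuFraction :: IO (Maybe Double)
gcCpuFraction = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then pure Nothing
    else do
      stats <- getRTSStats
      let total = cpu_ns stats
      pure $
        if total <= 0
          then Nothing
          else Just (fromIntegral (gc_cpu_ns stats) / fromIntegral total)
```

These counters are cumulative since process start, so to get a time series like the one described, sample them periodically and divide the difference in `gc_cpu_ns` by the difference in `cpu_ns` between samples.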