Excessive GC usage for Summary metric in wai-prometheus-middleware #20

Open
arianvp opened this issue Jul 4, 2017 · 10 comments

arianvp commented Jul 4, 2017

We have a WAI application in production that serves only around 5 reqs/s, and we had serious performance issues: the server was GC'ing 95% of the time.

Before

You can see that we're spending almost all clock cycles on garbage collection, and that garbage collection happens often.
[Graph: GC CPU usage and GC frequency before removing the middleware]

Profiler output

After profiling, the culprit turned out to be the summaries that are kept for each request in wai-prometheus-middleware.

	Tue Jul  4 11:17 2017 Time and Allocation Profiling Report  (Final)

	   warpmachine +RTS -N -T -p -RTS --proxy-host localhost --proxy-port 8000

	total time  =       17.73 secs   (17725 ticks @ 1000 us, 1 processor)
	total alloc = 10,233,152,016 bytes  (excludes profiling overheads)

COST CENTRE                           MODULE                          SRC                                                   %time %alloc

invariant.fj                          Prometheus.Metric.Summary       src/Prometheus/Metric/Summary.hs:(197,9)-(198,58)      17.9   13.1
wrapCompile.\                         Text.Regex.Posix.Wrap           Text/Regex/Posix/Wrap.hsc:(480,49)-(484,42)             7.9    0.0
throwSocketErrorIfMinus1RetryMayBlock Network.Socket.Internal         Network/Socket/Internal.hsc:(161,1)-(162,53)            6.2    0.6
getAddrInfo.\.\.\.\                   Network.Socket                  Network/Socket.hsc:(1460,36)-(1470,43)                  5.4    0.0
connect.\.\.connectLoop               Network.Socket                  Network/Socket.hsc:(444,9)-(458,29)                     4.7    0.0
throwSocketErrorIfMinus1_             Network.Socket.Internal         Network/Socket/Internal.hsc:166:1-47                    3.0    0.0
invariant                             Prometheus.Metric.Summary       src/Prometheus/Metric/Summary.hs:(193,1)-(198,58)       2.4    6.0
wrapTest.\                            Text.Regex.Posix.Wrap           Text/Regex/Posix/Wrap.hsc:(489,45)-(495,40)             1.9    0.0
compress.compressPair.inv             Prometheus.Metric.Summary       src/Prometheus/Metric/Summary.hs:172:17-60              1.8    5.1
connect.\.\.connectBlocked            Network.Socket                  Network/Socket.hsc:(460,9)-(465,67)                     1.6    0.4
compress                              Prometheus.Metric.Summary       src/Prometheus/Metric/Summary.hs:(156,1)-(172,60)       1.3    3.9
decodeWithTable/fill                  Data.ByteString.Base64.Internal Data/ByteString/Base64/Internal.hs:(171,62)-(194,68)    1.1    2.3
encode/fill                           Data.ByteString.Base64.Internal Data/ByteString/Base64/Internal.hs:(61,55)-(69,52)      1.1    2.6
urlDecode                             Network.HTTP.Types.URI          Network/HTTP/Types/URI.hs:(209,1)-(228,34)              1.0    2.3
peek8_32                              Data.ByteString.Base64.Internal Data/ByteString/Base64/Internal.hs:43:1-36              0.9    2.8
decodeWithTable.\.\.look              Data.ByteString.Base64.Internal Data/ByteString/Base64/Internal.hs:(165,11)-(168,49)    0.8    2.8
asCString                             Text.Regex.Posix.ByteString     Text/Regex/Posix/ByteString.hs:(150,1)-(152,40)         0.5    1.1
memConstEqual.loop                    Data.Memory.PtrMethods          Data/Memory/PtrMethods.hs:(103,5)-(107,34)              0.5    1.7
toByteString                          Blaze.ByteString.Builder        Blaze/ByteString/Builder.hs:151:1-46                    0.3    1.4
urlDecode.go                          Network.HTTP.Types.URI          Network/HTTP/Types/URI.hs:(211,5)-(221,40)              0.2    1.5
toByteStringIOWith.getBuffer          Blaze.ByteString.Builder        Blaze/ByteString/Builder.hs:(187,5)-(204,40)            0.2   10.8
toByteStringIOWith                    Blaze.ByteString.Builder        Blaze/ByteString/Builder.hs:(184,1)-(204,40)            0.2   10.7
recv                                  Network.Socket.ByteString       Network/Socket/ByteString.hsc:(226,1)-(231,62)          0.1    3.3

After disabling wai-prometheus-middleware

We commented out the middleware in our application, and instead of using 400% CPU we're back to 2% CPU, and the GC contribution dropped from 1.0 to 0.02. Also see the graph below: on the left is before removing the middleware, and on the right is after removing it.

[Graph: CPU usage before (left) and after (right) removing the middleware]
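For reference, here is a minimal sketch of how the middleware is wired in, assuming wai-middleware-prometheus's `prometheus` wrapper and the `Default` instance for its settings (the app and port here are made up). "Commenting out the middleware" amounts to dropping the `prometheus def` wrapper:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Default (def)
import Network.HTTP.Types (status200)
import Network.Wai (Application, responseLBS)
import Network.Wai.Handler.Warp (run)
import Network.Wai.Middleware.Prometheus (prometheus)

app :: Application
app _req respond = respond (responseLBS status200 [] "ok")

main :: IO ()
main =
  -- With the middleware, every request's latency is observed (into a
  -- Summary per handler in 0.1.x, a Histogram from 0.2.0 onwards).
  -- Without it, this line is just `run 8080 app`.
  run 8080 (prometheus def app)
```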

arianvp commented Jul 4, 2017

I actually see now that you moved away from Summary to Histogram in the newest release (8d8a804)!

I'll check if this fixes our performance issues.
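For reference, a hedged sketch of what that switch looks like at the prometheus-client level, assuming a recent API where the metric is the first argument to `observe` and the default quantile/bucket helpers are available (the metric names are illustrative):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Prometheus

main :: IO ()
main = do
  -- Summary: maintains streaming quantile estimates on every observation,
  -- which is the code path dominating the profile above.
  reqSummary <- register $
    summary (Info "http_request_seconds" "Request latency.") defaultQuantiles
  -- Histogram: only bumps one bucket counter per observation, far cheaper.
  reqHistogram <- register $
    histogram (Info "http_request_seconds_hist" "Request latency.") defaultBuckets
  observe reqSummary 0.042
  observe reqHistogram 0.042
```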

arianvp commented Jul 5, 2017

However, it does mean there seems to be something seriously wrong with the Summary metric: it creates and releases a lot of thunks at the same time, causing the GC to go nuts.

fimad commented Jul 13, 2017

Did you get a chance to check if Histogram solved the GC issues?

Given that we've moved the automatically generated metrics to histograms, and given the other recently reported problems with Summary (#19, #14), I think it would make sense to deprecate or remove the Summary metric type.

duijf commented Jul 18, 2017

The impact of the metrics is much smaller when using histograms than when using summaries. Below are some measurements we made last week. I think this one can be closed (CC @arianvp):

Without per-request metrics

	Mon Jul 10 10:58 2017 Time and Allocation Profiling Report  (Final)

	   warpmachine +RTS -N -T -I0 -qg -A128m -p -RTS --proxy-host localhost --proxy-port 8000 --log-disabled

	total time  =        2.03 secs   (2033 ticks @ 1000 us, 1 processor)
	total alloc = 441,122,656 bytes  (excludes profiling overheads)

COST CENTRE                           MODULE                                  SRC                                                         %time %alloc

throwSocketErrorIfMinus1_             Network.Socket.Internal                 Network/Socket/Internal.hsc:166:1-47                         16.3    0.1
MAIN                                  MAIN                                    <built-in>                                                   15.0    8.6
throwSocketErrorIfMinus1RetryMayBlock Network.Socket.Internal                 Network/Socket/Internal.hsc:(161,1)-(162,53)                 14.8    3.4
receiveloop                           Network.Wai.Handler.Warp.Recv           Network/Wai/Handler/Warp/Recv.hs:(115,1)-(129,20)             4.6    2.4
socketConnection                      Network.Wai.Handler.Warp.Run            Network/Wai/Handler/Warp/Run.hs:(59,1)-(73,7)                 3.6    1.4
close.\                               Network.Socket                          Network/Socket.hsc:(1119,4)-(1124,80)                         2.6    2.1
receive                               Network.Wai.Handler.Warp.Recv           Network/Wai/Handler/Warp/Recv.hs:(96,1)-(99,48)               1.7    1.7
parseRequestLine.\                    Network.Wai.Handler.Warp.RequestHeader  Network/Wai/Handler/Warp/RequestHeader.hs:(60,80)-(90,33)     1.5    0.6
wrapTest.\                            Text.Regex.Posix.Wrap                   Text/Regex/Posix/Wrap.hsc:(489,45)-(495,40)                   1.4    0.0
serveConnection                       Network.Wai.Handler.Warp.Run            Network/Wai/Handler/Warp/Run.hs:(324,1)-(482,47)              1.3    2.4
sendResponse                          Network.Wai.Handler.Warp.Response       Network/Wai/Handler/Warp/Response.hs:(138,1)-(186,35)         1.3    1.5
serveConnection.processRequest        Network.Wai.Handler.Warp.Run            Network/Wai/Handler/Warp/Run.hs:(426,5)-(482,47)              1.1    0.6
asCString                             Text.Regex.Posix.ByteString             Text/Regex/Posix/ByteString.hs:(150,1)-(152,40)               1.0    5.7
foldCaseBS.toLower8'                  Data.CaseInsensitive.Internal           Data/CaseInsensitive/Internal.hs:(190,7)-(194,37)             0.9    6.3
push                                  Network.Wai.Handler.Warp.Request        Network/Wai/Handler/Warp/Request.hs:(230,1)-(291,48)          0.8    4.4
serveConnection.http1                 Network.Wai.Handler.Warp.Run            Network/Wai/Handler/Warp/Run.hs:(405,5)-(424,96)              0.8    2.3
defaultSettings                       Network.Wai.Handler.Warp.Settings       Network/Wai/Handler/Warp/Settings.hs:(126,1)-(149,5)          0.8    4.3
recvRequest                           Network.Wai.Handler.Warp.Request        Network/Wai/Handler/Warp/Request.hs:(68,1)-(106,54)           0.8    3.3
push.mnl                              Network.Wai.Handler.Warp.Request        Network/Wai/Handler/Warp/Request.hs:(237,5)-(249,28)          0.6    4.4
traverseHeader                        Network.Wai.Handler.Warp.Header         Network/Wai/Handler/Warp/Header.hs:(89,1)-(98,26)             0.5    1.6
parseHeader                           Network.Wai.Handler.Warp.RequestHeader  Network/Wai/Handler/Warp/RequestHeader.hs:(131,1)-(134,24)    0.5    1.7
parseHeader.(...)                     Network.Wai.Handler.Warp.RequestHeader  Network/Wai/Handler/Warp/RequestHeader.hs:132:9-37            0.4    1.9
acceptConnection.acceptLoop           Network.Wai.Handler.Warp.Run            Network/Wai/Handler/Warp/Run.hs:(226,5)-(242,26)              0.4    1.5
foldCaseBS                            Data.CaseInsensitive.Internal           Data/CaseInsensitive/Internal.hs:(187,1)-(194,37)             0.2    2.5
composeHeader                         Network.Wai.Handler.Warp.ResponseHeader Network/Wai/Handler/Warp/ResponseHeader.hs:(21,1)-(28,45)     0.2    1.2
parseHeaderLines.hdr                  Network.Wai.Handler.Warp.RequestHeader  Network/Wai/Handler/Warp/RequestHeader.hs:38:9-40             0.2    2.0

With per-request metrics (version 0.1.1, using summaries)

	Mon Jul 10 11:01 2017 Time and Allocation Profiling Report  (Final)

	   warpmachine +RTS -N -T -I0 -qg -A128m -p -RTS --proxy-host localhost --proxy-port 8000 --log-disabled

	total time  =        8.08 secs   (8082 ticks @ 1000 us, 1 processor)
	total alloc = 4,762,860,168 bytes  (excludes profiling overheads)

COST CENTRE                           MODULE                        SRC                                                %time %alloc

invariant.fj                          Prometheus.Metric.Summary     src/Prometheus/Metric/Summary.hs:(197,9)-(198,58)   57.2   42.3
invariant                             Prometheus.Metric.Summary     src/Prometheus/Metric/Summary.hs:(193,1)-(198,58)    7.5   18.0
compress.compressPair.inv             Prometheus.Metric.Summary     src/Prometheus/Metric/Summary.hs:172:17-60           5.4   15.4
throwSocketErrorIfMinus1RetryMayBlock Network.Socket.Internal       Network/Socket/Internal.hsc:(161,1)-(162,53)         3.9    0.3
compress                              Prometheus.Metric.Summary     src/Prometheus/Metric/Summary.hs:(156,1)-(172,60)    3.7   10.9
throwSocketErrorIfMinus1_             Network.Socket.Internal       Network/Socket/Internal.hsc:166:1-47                 3.6    0.0
MAIN                                  MAIN                          <built-in>                                           3.3    0.7
receiveloop                           Network.Wai.Handler.Warp.Recv Network/Wai/Handler/Warp/Recv.hs:(115,1)-(129,20)    1.2    0.2
compress.compressPair                 Prometheus.Metric.Summary     src/Prometheus/Metric/Summary.hs:(166,9)-(172,60)    0.9    1.2
invariant.n                           Prometheus.Metric.Summary     src/Prometheus/Metric/Summary.hs:196:9-30            0.4    2.0

With per-request metrics (version 0.2.0, using histograms)

	Mon Jul 10 10:35 2017 Time and Allocation Profiling Report  (Final)

	   warpmachine +RTS -N -T -I0 -qg -A128m -p -RTS --proxy-host localhost --proxy-port 8000

	total time  =        1.81 secs   (1815 ticks @ 1000 us, 1 processor)
	total alloc =  92,676,640 bytes  (excludes profiling overheads)

COST CENTRE                           MODULE                                 SRC                                                         %time %alloc

MAIN                                  MAIN                                   <built-in>                                                   13.9    5.2
throwSocketErrorIfMinus1RetryMayBlock Network.Socket.Internal                Network/Socket/Internal.hsc:(161,1)-(162,53)                 13.4    2.0
throwSocketErrorIfMinus1_             Network.Socket.Internal                Network/Socket/Internal.hsc:166:1-47                          8.3    0.1
mkRequestLogger.(...)                 Network.Wai.Middleware.RequestLogger   Network/Wai/Middleware/RequestLogger.hs:(93,9)-(97,44)        5.1    1.4
receiveloop                           Network.Wai.Handler.Warp.Recv          Network/Wai/Handler/Warp/Recv.hs:(115,1)-(129,20)             3.6    1.5
allocaAndPeek.\                       System.Clock                           System/Clock.hsc:166:36-52                                    2.7    0.0
observeSeconds.latency                Network.Wai.Middleware.Prometheus      src/Network/Wai/Middleware/Prometheus.hs:94:9-96              1.9    1.4
close.\                               Network.Socket                         Network/Socket.hsc:(1119,4)-(1124,80)                         1.7    1.3
apacheLogStr                          Network.Wai.Logger.Apache              Network/Wai/Logger/Apache.hs:(45,1)-(70,36)                   1.6    1.5
receive                               Network.Wai.Handler.Warp.Recv          Network/Wai/Handler/Warp/Recv.hs:(96,1)-(99,48)               1.5    1.0
toLogStr                              System.Log.FastLogger.LogStr           System/Log/FastLogger/LogStr.hs:75:5-54                       1.2    1.9
asCString                             Text.Regex.Posix.ByteString            Text/Regex/Posix/ByteString.hs:(150,1)-(152,40)               1.1    3.4
wrapTest.\                            Text.Regex.Posix.Wrap                  Text/Regex/Posix/Wrap.hsc:(489,45)-(495,40)                   1.1    0.0
fromBuilder                           System.Log.FastLogger.LogStr           System/Log/FastLogger/LogStr.hs:48:1-46                       1.1    9.8
serveConnection.http1                 Network.Wai.Handler.Warp.Run           Network/Wai/Handler/Warp/Run.hs:(405,5)-(424,96)              1.0    1.4
mappend                               System.Log.FastLogger.LogStr           System/Log/FastLogger/LogStr.hs:64:5-69                       1.0    4.9
serveConnection                       Network.Wai.Handler.Warp.Run           Network/Wai/Handler/Warp/Run.hs:(324,1)-(482,47)              0.8    1.4
defaultSettings                       Network.Wai.Handler.Warp.Settings      Network/Wai/Handler/Warp/Settings.hs:(126,1)-(149,5)          0.8    2.5
recvRequest                           Network.Wai.Handler.Warp.Request       Network/Wai/Handler/Warp/Request.hs:(68,1)-(106,54)           0.8    2.0
encodeUtf8Builder.\                   Data.Text.Lazy.Encoding                Data/Text/Lazy/Encoding.hs:164:26-59                          0.7    1.2
push                                  Network.Wai.Handler.Warp.Request       Network/Wai/Handler/Warp/Request.hs:(230,1)-(291,48)          0.7    2.6
toLogStr                              System.Log.FastLogger.LogStr           System/Log/FastLogger/LogStr.hs:79:5-33                       0.7    2.4
encodeUtf8                            Data.Text.Lazy.Encoding                Data/Text/Lazy/Encoding.hs:(150,1)-(160,70)                   0.6    2.4
foldCaseBS                            Data.CaseInsensitive.Internal          Data/CaseInsensitive/Internal.hs:(187,1)-(194,37)             0.4    1.5
foldCaseBS.toLower8'                  Data.CaseInsensitive.Internal          Data/CaseInsensitive/Internal.hs:(190,7)-(194,37)             0.4    3.7
showIPv4                              Network.Wai.Logger.IP                  Network/Wai/Logger/IP.hs:(15,1)-(26,26)                       0.4    2.0
parseHeaderLines.hdr                  Network.Wai.Handler.Warp.RequestHeader Network/Wai/Handler/Warp/RequestHeader.hs:38:9-40             0.3    1.2
parseHeader.(...)                     Network.Wai.Handler.Warp.RequestHeader Network/Wai/Handler/Warp/RequestHeader.hs:132:9-37            0.3    1.1
push.mnl                              Network.Wai.Handler.Warp.Request       Network/Wai/Handler/Warp/Request.hs:(237,5)-(249,28)          0.3    2.6
parseHeader                           Network.Wai.Handler.Warp.RequestHeader Network/Wai/Handler/Warp/RequestHeader.hs:(131,1)-(134,24)    0.2    1.0
show                                  Network.HTTP.Types.Version             Network/HTTP/Types/Version.hs:24:5-79                         0.1    1.5
encodeUtf8.strategy                   Data.Text.Lazy.Encoding                Data/Text/Lazy/Encoding.hs:160:5-70                           0.1    1.1

arianvp commented Jul 18, 2017

@fimad, it seems like a good idea to deprecate Summary. As you can see, moving away from it fixed our issues :)

axman6 commented Nov 24, 2017

To add another datapoint to the discussion, I'm working on a new library (axman6/servant-prometheus), which is a rewrite of the servant-ekg package. Once I added summaries to the metrics for each endpoint, the benchmarks went from handling 40k+ req/sec to ~5k.

I have been looking at the summary code and believe the main culprit is the use of Rationals. I did a quick test of making all Rationals Doubles, and performance jumped back to about 38k req/sec, which seems like an acceptable overhead for most applications (the benchmark app simply returns a string from its endpoint, so the percentage overhead from monitoring is going to be a lot higher than in real-world apps).
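As a toy illustration of the gap (this is not the library's quantile estimator, just two equivalent folds over standard types): every Rational operation allocates a boxed numerator/denominator pair and normalises it with a gcd, while the Double fold is a single floating-point addition per element.

```haskell
import Data.List (foldl')

-- Summing with Rational: every (+) builds a fresh Integer pair and runs a
-- gcd to normalise it, so the loop allocates on every step.
sumRational :: Int -> Rational
sumRational n = foldl' (+) 0 (map fromIntegral [1 .. n])

-- Summing with Double: one floating-point add per step; with -O the
-- accumulator can stay unboxed.
sumDouble :: Int -> Double
sumDouble n = foldl' (+) 0 (map fromIntegral [1 .. n])

main :: IO ()
main = do
  -- Run with `+RTS -s` to compare the allocation totals of the two loops.
  print (sumRational 1000000)
  print (sumDouble 1000000)
```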

Is there a particularly good justification for using Rationals? I think summaries are a very useful metric type, since they don't require an arbitrary a priori decision about which histogram bins make sense for a given application. I would like to see both summaries and histograms kept because they serve different purposes, but at the moment the overhead is too great. For now I've provided a way in servant-prometheus to disable summary calculation for an app, but it would be nice not to have to.

fimad commented Dec 7, 2017

I think the original reason I used Rationals was that I was having trouble getting the implementation to maintain the invariants the algorithm was supposed to uphold during testing, and I wanted to eliminate rounding errors as a possible cause.

Given the stark difference in performance, I think switching to Doubles would be reasonable.

axman6 commented Dec 7, 2017

A possible middle ground might be using something like Ed Kmett's compensated library to get double-doubles; this should be fast but still give you the accuracy you probably want for this sort of metric.
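For the curious, here is a hand-rolled sketch of the idea using plain Kahan summation (the compensated package generalises this; its actual API isn't shown here): a second Double carries the rounding error the running total can't represent, giving much better accuracy at Double speed.

```haskell
import Data.List (foldl')

-- Running total plus a compensation term for the rounding error.
data KSum = KSum { ksTotal :: !Double, ksErr :: !Double }

kahanAdd :: KSum -> Double -> KSum
kahanAdd (KSum total err) x =
  let y  = x - err            -- re-apply the error lost on the previous step
      t  = total + y          -- big + small: low-order bits of y are dropped...
      e' = (t - total) - y    -- ...and recovered here as the new error term
  in KSum t e'

kahanSum :: [Double] -> Double
kahanSum = ksTotal . foldl' kahanAdd (KSum 0 0)

main :: IO ()
main = do
  let xs = replicate 10000000 (0.1 :: Double)
  print (foldl' (+) 0 xs)  -- plain sum: visible accumulated rounding error
  print (kahanSum xs)      -- compensated sum: much closer to 1.0e6
```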

jkeuhlen commented Aug 6, 2021

We've recently had some issues with higher than expected CPU load on a couple of our servers.

After reproducing the problem locally and profiling with GHC, our two largest cost centers are:

COST CENTRE         MODULE                               SRC                                                          %time %alloc

compress            Prometheus.Metric.Summary            src/Prometheus/Metric/Summary.hs:(153,1)-(169,60)             51.1   79.6
invariant           Prometheus.Metric.Summary            src/Prometheus/Metric/Summary.hs:(190,1)-(195,58)             10.4    3.6

We're currently collecting primarily summary metrics, so this seems related to this old ticket. We can transition more metrics to histograms instead, but it would be nice if there were a simple way to improve the performance of summary metrics too.

@fimad or @axman6, did either of you ever have success transitioning from Rational to Double? I played around with it a little bit, but I didn't see a noticeable change in the existing benchmark times. It also seems to cause the tests to fail in the way @fimad mentioned.

I'd be happy to dive in here more and open a PR if anyone has ideas that can point me in the right direction.

fimad commented Aug 27, 2021

This should be fixed as of #62, which is available in version 1.1.0, thanks to @iand675.
