Performance issue with /metrics endpoint #28

Open · xsb opened this issue Jul 25, 2019 · 7 comments

@xsb (Contributor) commented Jul 25, 2019

I am trying to use lnd+lndmon on a rock64 board (similar to a Raspberry Pi, with arm64 and 4GB RAM), but Grafana only shows data points coming directly from lnd (Go Runtime + Performance dashboard). Everything that is supposed to come from lndmon is missing.

I noticed that when running simple queries with PromQL I immediately got the error: "the queries returned no data for a table". I then went to the Explore section and checked up, where I can see the lndmon process reported as down, which is not true.

After that I tried to fetch the metrics directly and realized I was getting slow response times on the /metrics endpoint (usually between 10s and 12s):

$ time curl -s --output /dev/null localhost:9092/metrics

real	0m10.717s
user	0m0.022s
sys	0m0.015s

I haven't investigated this deeply yet, but the instance has more than enough RAM, and the CPU usage and load average don't look that bad.

I will try to spend more time on this later, but wanted to report it early in case it's happening to more people.

@valentinewallace (Contributor) commented:

Hm, admittedly lndmon has not been tested on rpi-type hardware.

@Roasbeef (Member) commented Aug 7, 2019

Is this you attempting to hit the /metrics endpoint on lnd?

@xsb (Contributor, Author) commented Aug 7, 2019

@Roasbeef lnd uses port :8989 for its metrics. I forgot to mention that that part works fine; I get the output in just a few milliseconds.

Honestly I haven't spent much time trying to debug this, but neither Prometheus nor I (from the CLI) can hit the metrics endpoint on lndmon (port :9092) fast enough.

@xsb (Contributor, Author) commented Aug 8, 2019

After some time debugging, I found out that what is taking so long is the GraphCollector's DescribeGraph request against lnd. The scrape frequency seems to be too high for such an expensive call.
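
For intuition, here is a minimal, hypothetical sketch of a Prometheus collector that issues the full graph RPC on every scrape. The names (graphClient, lnd_graph_node_count, etc.) are made up for illustration and do not reflect lndmon's actual code, but the shape shows why every scrape pays the full DescribeGraph cost:

package graphsketch

import (
	"github.com/prometheus/client_golang/prometheus"
)

// graphClient stands in for the lnd RPC client; DescribeGraph is the
// expensive call that returns the entire channel graph.
type graphClient interface {
	DescribeGraph() (numNodes, numChannels int, err error)
}

type graphCollector struct {
	client       graphClient
	numNodesDesc *prometheus.Desc
	numChansDesc *prometheus.Desc
}

func newGraphCollector(c graphClient) *graphCollector {
	return &graphCollector{
		client: c,
		numNodesDesc: prometheus.NewDesc(
			"lnd_graph_node_count", "Number of nodes in the graph.", nil, nil),
		numChansDesc: prometheus.NewDesc(
			"lnd_graph_chan_count", "Number of channels in the graph.", nil, nil),
	}
}

func (g *graphCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- g.numNodesDesc
	ch <- g.numChansDesc
}

// Collect runs once per Prometheus scrape, so the whole DescribeGraph
// round trip (several seconds on slow hardware) happens every interval.
func (g *graphCollector) Collect(ch chan<- prometheus.Metric) {
	nodes, channels, err := g.client.DescribeGraph()
	if err != nil {
		return
	}
	ch <- prometheus.MustNewConstMetric(
		g.numNodesDesc, prometheus.GaugeValue, float64(nodes))
	ch <- prometheus.MustNewConstMetric(
		g.numChansDesc, prometheus.GaugeValue, float64(channels))
}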

@xsb (Contributor, Author) commented Aug 8, 2019

GraphCollector is taking more than 30% of the CPU time (understandable, since this is the biggest dataset being ingested). pprof does not take I/O into account, so reality is much worse than what the flame graph shows. The main issue then seems to be that lnd takes a few seconds to serve the whole graph. Would it be possible to make this call less often?

[Screenshot: pprof flame graph showing GraphCollector CPU usage, 2019-08-08]
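
One possible way to make the call less frequent (a sketch only, reusing the hypothetical graphClient from the earlier snippet, and not lndmon's actual approach) would be to cache the DescribeGraph result and refresh it only when it is older than some TTL, so most scrapes are served from memory:

package graphsketch

import (
	"sync"
	"time"
)

// graphSnapshot holds the handful of numbers the collector actually
// needs from the (much larger) DescribeGraph response.
type graphSnapshot struct {
	numNodes    int
	numChannels int
}

// cachedGraphSource wraps the expensive RPC behind a TTL cache.
type cachedGraphSource struct {
	client graphClient // hypothetical client from the sketch above
	ttl    time.Duration

	mu        sync.Mutex
	snapshot  graphSnapshot
	fetchedAt time.Time
}

// get returns the cached snapshot and only calls DescribeGraph again
// when the cached copy is older than the TTL.
func (c *cachedGraphSource) get() (graphSnapshot, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if !c.fetchedAt.IsZero() && time.Since(c.fetchedAt) < c.ttl {
		return c.snapshot, nil
	}

	nodes, channels, err := c.client.DescribeGraph()
	if err != nil {
		return graphSnapshot{}, err
	}
	c.snapshot = graphSnapshot{numNodes: nodes, numChannels: channels}
	c.fetchedAt = time.Now()
	return c.snapshot, nil
}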

@xsb (Contributor, Author) commented Aug 8, 2019

I changed my Prometheus config (longer scrape interval + higher timeout) and I am now running lndmon on mainnet without issues 😄.

diff --git a/prometheus.yml b/prometheus.yml
index 01797c0..81d781c 100755
--- a/prometheus.yml
+++ b/prometheus.yml
@@ -1,6 +1,7 @@
 scrape_configs:
 - job_name: "lndmon"
-  scrape_interval: "20s"
+  scrape_interval: "30s"
+  scrape_timeout: "15s"
   static_configs:
   - targets: ['lndmon:9092']
 - job_name: "lnd"

I am not saying this should be merged, because these values are totally arbitrary. A bigger network and/or slower hardware would require even more conservative defaults.

@menzels commented Aug 13, 2019

Thanks for the research @xsb, I had the same problem. For me the scrape time was 30-50 seconds.
I am using an rpi3 for lnd, connected to lndmon running in the cloud. The uplink bandwidth is about 2-3 Mb/s, so I guess the slowdown is a combination of CPU load and the bandwidth limit.
I set the scrape interval and timeout to 60s; like this it seems to be working for now.
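
For reference, a config along these lines (adapting the diff above to the 60s values mentioned here; the exact numbers are illustrative and should be tuned per setup) would look roughly like:

scrape_configs:
- job_name: "lndmon"
  scrape_interval: "60s"
  scrape_timeout: "60s"
  static_configs:
  - targets: ['lndmon:9092']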
