Exposing Metrics #1356

Open
13 of 16 tasks
marten-seemann opened this issue Mar 11, 2022 · 21 comments
Assignees
Labels
effort/days (Estimated to take multiple days, but less than a week)
exp/intermediate (Prior experience is likely helpful)
kind/enhancement (A net-new feature or improvement to an existing feature)
P1 (High: Likely tackled by core team if no one steps up)

Comments

@marten-seemann
Contributor

marten-seemann commented Mar 11, 2022

libp2p currently collects a few metrics (for example, for TCP and QUIC). We also collect metrics for the resource manager; however, exposing those is currently left to the application. This is how we ended up with slightly different implementations (and metric names) in go-ipfs and lotus.
In the (very near!) future, we want to expose more metrics from the swarm regarding transports, security protocols and muxers. Having such metrics would have allowed us to detect the muxer prioritisation bug (ipfs/kubo#8750) a lot earlier.

We therefore should have a coherent libp2p metrics story.


Open questions:

  1. In the past, we debated if the metrics code should live within the respective repository, or if that code should live separately and hook in via tracers. Note that this issue mostly goes away once we make progress with our repo consolidation (and that's what we should base our designs on).
  2. Some people have expressed concern that introducing an OpenCensus / Prometheus dependency into the default build adds too much bloat. Is that a valid concern? Does it really add that much overhead? If so, would it make sense to hide all metrics collection across our code base behind a build flag? (A sketch of what that could look like follows this list.)
  3. OpenCensus vs. Prometheus: In the past, we've used both tools, very inconsistently. If I understand correctly, OpenCensus is essentially another abstraction layer between the tracer and Prometheus, and would in theory allow the usage of other tools than Prometheus, although it might be debatable how much we care about that.
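To make question 2 concrete, here is a minimal sketch of how metrics collection could be hidden behind a Go build tag. All names here (the withmetrics tag, the swarm package, the dials counter, the recordDial helper) are hypothetical illustrations, not existing go-libp2p code:

```go
// metrics_enabled.go: compiled only with `go build -tags withmetrics`
//go:build withmetrics

package swarm

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical counter; promauto registers it with the default registry.
var dialsTotal = promauto.NewCounter(prometheus.CounterOpts{
	Namespace: "libp2p",
	Subsystem: "swarm",
	Name:      "dials_total",
	Help:      "Total number of dial attempts.",
})

func recordDial() { dialsTotal.Inc() }
```

```go
// metrics_disabled.go: the default build; no Prometheus code is pulled in
//go:build !withmetrics

package swarm

// recordDial compiles down to a no-op when metrics are disabled.
func recordDial() {}
```

Call sites inside the swarm would then invoke recordDial() unconditionally, and the build flag decides whether that is a real metric update or a no-op.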

Thoughts, @vyzo @aschmahmann @mxinden @Stebalien @lanzafame?


Tracking the various components we want to instrument:

Supporting work (needed to close this issue):

defer:

@marten-seemann added the kind/enhancement, P1, exp/intermediate and effort/days labels on Mar 11, 2022
@mxinden
Member

mxinden commented Mar 14, 2022

3. OpenCensus vs. Prometheus: In the past, we've used both tools, very inconsistently. If I understand correctly, OpenCensus is essentially another abstraction layer between the tracer and Prometheus, and would in theory allow the usage of other tools than Prometheus, although it might be debatable how much we care about that.

Slight preference for Prometheus for the sake of simplicity. I have not had a need for tracing (enabled e.g. through OpenCensus) in a peer-to-peer application.

@Stebalien
Member

The latest is "opentelemetry" (a merger of OpenCensus and OpenTracing). I'd prefer that because, as you say, it's backend agnostic and is just a facade unless used.

In go-ipfs, we used to rely heavily on https://github.com/ipfs/go-metrics-interface/ to avoid depending on a specific metrics implementation, but opentelemetry should make that point moot.
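For illustration, here is a minimal sketch of that facade behaviour using the OpenTelemetry Go metrics API as it later stabilised (the Go metrics API was still in flux at the time of this discussion); the meter and instrument names are made up:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
)

func main() {
	// With no SDK/exporter configured, the global provider is effectively
	// a no-op, so this instrumentation records nothing and costs little.
	meter := otel.Meter("libp2p-example")

	dials, err := meter.Int64Counter("libp2p_example_dials_total")
	if err != nil {
		panic(err)
	}
	dials.Add(context.Background(), 1)
}
```

An application that wants real metrics plugs in an SDK MeterProvider (e.g. with a Prometheus exporter), and the same instrumentation starts producing data.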

@marten-seemann
Contributor Author

The latest is "opentelemetry" (a merger of OpenCensus and OpenTracing). I'd prefer that because, as you say, it's backend agnostic and is just a facade unless used.

Are we already using that somewhere in our stack?

@marten-seemann
Contributor Author

This doesn't look very reassuring at this point: https://opentelemetry.io/docs/instrumentation/go/manual/#creating-metrics

[screenshot of the linked OpenTelemetry metrics documentation]

@vyzo
Contributor

vyzo commented Mar 15, 2022

I very much dislike the idea of going with the flavour of the day; applications should be free to use whatever they want, and we already have two different systems in prod (lotus uses opencensus, ipfs uses raw prometheus).

Can't we just define a very thin interface and use that?

@aschmahmann
Collaborator

aschmahmann commented Mar 15, 2022

Are we already using that somewhere in our stack?

I haven't personally made much use of it, but it shows up in some consumers of go-libp2p. A few I'm aware of:

Can't we just define a very thin interface and use that?

I don't have a strong preference here, but it seems like the options are:

  1. Don't have exported metrics - not a viable option
  2. Have hard coded metrics system(s) in every package - not great given people like multiple different backends
  3. Use some popular logging interface (e.g. OpenTelemetry) - nice, but some people think it's overkill, and the fads and interfaces keep changing over time (I think the explanation in Initial switch to OpenTelemetry filecoin-project/lotus#7725 gives some view into the pain here)
  4. Roll our own interface for metrics that's totally better than everyone else's stupid interface (e.g. https://github.com/ipfs/go-metrics-interface/) - ok, but also https://xkcd.com/927/

If people are interested in 4, but not in using go-metrics-interface, I'd like to understand why, and how that same argument is unlikely to be reused in 6-12 months to argue that our new interface isn't good enough and needs an alternative.

I happen to be inclined to use some standard interface so we don't have to make all of our own wrappers + interfaces here and we can let other people mess around with creating "the perfect logging interface™️" while we work on p2p networking 😄. If people want to roll our own here then that's fine by me too.

@marten-seemann
Contributor Author

I haven't personally made much use of it, but it shows up in some consumers of go-libp2p. A few I'm aware of:

Thank you for the links! These are all tracer integrations though, not metrics. I'd be happy to use OpenTelemetry for metrics as well, and it looks like their metrics support is currently in the alpha phase.

I'm a bit wary of rolling our own interface, mostly for https://xkcd.com/927/ reasons, as @aschmahmann pointed out.

@guseggert
Contributor

guseggert commented Mar 16, 2022

Unfortunately there isn't a good answer to OpenTelemetry vs. OpenCensus. OpenCensus is on a deprecation path, but the Go implementation of the OpenTelemetry Metrics API is not yet stable, so pick your poison.

I don't think creating our own wrapper would be a good idea. OpenTelemetry is already a wrapper and enables writing bridges to unknown future implementations. And those are likely to be written by the experts instead of us. There is an OpenCensus -> OpenTelemetry metrics bridge, but it has some limitations around which metric types are supported, due to complexities in translation. If the experts can't write a complete bridge, then I don't think we should burden ourselves with that either.

Note that the Metrics API Spec for OpenTelemetry is marked as "stable". It is the Go implementation that is not yet stable. Here is the milestone for the stable release of the Metrics API for opentelemetry-go: https://github.com/open-telemetry/opentelemetry-go/milestone/14 (it is scheduled for March 31). Also, the metrics Collector API and wire format (OTLP) are stable.

It's pretty clear that industry resources are now devoted to OpenTelemetry, and a stable Go Metrics API is imminent, so I think we should hop on that bandwagon.

@BigLep
Contributor

BigLep commented Mar 16, 2022

Good discussion here - thanks folks for contributing.

One thing that will be useful to get defined is the metrics we're going to measure. Things that I think are important:

  1. names - ideally the names (or other dimensions of the metric) make clear whether this is a metric that is consistently implemented across libp2p implementations or whether it's an implementation-specific metric (e.g., a metric implemented only in go-libp2p). In addition, the names should be as self-documenting as possible.
  2. definitions - be clear about what the metrics are measuring. Especially for any timing metrics, we need to be clear about when the stopwatch starts and when it ends. I don't want to leave that up for interpretation.

In terms of where this kind of info lives, I leave it up to others but maybe this warrants a spec or a docs page?

Good stuff!

@vyzo
Contributor

vyzo commented Mar 17, 2022

I would like to propose something different.
How about a monitoring interface that can also be used for collecting metrics?
That way the users can do whatever they like, and we can provide implementations for prometheus, open* and whatnot.
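A rough sketch of what such an interface could look like in Go (names are hypothetical, purely for illustration); the library would ship the no-op default and optional Prometheus / OpenCensus backed implementations:

```go
package metrics

// Reporter is the hook that libp2p code would call into.
type Reporter interface {
	// CountAdd increments a named counter by delta.
	CountAdd(name string, delta int64)
	// ObserveDuration records a latency sample, in seconds.
	ObserveDuration(name string, seconds float64)
}

// noopReporter is the default: instrumentation is a cheap no-op unless
// a real backend implementation is plugged in.
type noopReporter struct{}

func (noopReporter) CountAdd(string, int64)          {}
func (noopReporter) ObserveDuration(string, float64) {}
```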

@lanzafame
Contributor

As the person who introduced OpenCensus, here are some background thoughts I had at the time.

  • I didn't overly care about tracing at the time but appreciated the fact that tracing was bundled in.
  • I was mostly concerned about having something that was backend agnostic because I was working on ipfs-cluster at the time and didn't want to have to force our users to choose prometheus over influx, for example (granted, things have come a long way since then and everyone supports prometheus scraping, even in push-model systems)
  • I really appreciated the concept of each library exposing the metrics it cared about while the top-level application chose which ones to actually record. This is why I pushed hard on the Views being exposed, as well as a DefaultViews slice, so that if an application dev only wanted 1 of 10 metrics in DefaultViews they had the ability to register just that 1 metric (a sketch of this pattern follows the list).
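For readers unfamiliar with that pattern, here is a small sketch of how it looks with OpenCensus in Go; the measure and view names are invented for illustration and are not the actual resource manager metrics:

```go
package rcmgrmetrics

import (
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// A hypothetical measure recorded by the library.
var connsOpened = stats.Int64(
	"libp2p_rcmgr_conns_opened",
	"Number of connections opened",
	stats.UnitDimensionless,
)

// The corresponding view; the library exposes it but never registers it itself.
var ConnsOpenedView = &view.View{
	Name:        "libp2p_rcmgr_conns_opened",
	Measure:     connsOpened,
	Description: "Number of connections opened",
	Aggregation: view.Count(),
}

// DefaultViews lets the application register everything at once via
// view.Register(DefaultViews...), or cherry-pick only the views it wants.
var DefaultViews = []*view.View{ConnsOpenedView}
```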

Personally, I really don't like where OpenTelemetry took metrics; last I looked, OTel did away with a lot of the nice things about OC metrics. From what I have followed of OTel, they didn't really care about the metrics side of things and still don't. My guess is that ego and NIH syndrome are involved. On top of that, the spec is very Java-esque, and that has resulted in less than ideal situations in the Go implementation.

Currently, still being in the position of consuming metrics from libp2p applications and libraries, the things I care about are consistent prefixes, i.e. libp2p_<subsystem>_<metric>, and not having to put up PRs to remove peer IDs from metric labels, because that kind of high-cardinality label kills prometheus performance and costs money when using cloud prometheus offerings.
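To illustrate that convention, a Prometheus declaration along those lines might look like the following (a hypothetical metric; note the single low-cardinality transport label and the deliberate absence of any peer ID label):

```go
package swarmmetrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Exposed as libp2p_swarm_streams_opened_total{transport="..."}.
// Peer IDs are intentionally not used as a label value.
var streamsOpened = promauto.NewCounterVec(prometheus.CounterOpts{
	Namespace: "libp2p",
	Subsystem: "swarm",
	Name:      "streams_opened_total",
	Help:      "Streams opened, by transport.",
}, []string{"transport"})
```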

How about a monitoring interface that can also be used for collecting metrics?
That way the users can do whatever they like, and we can provide implementations for prometheus, open* and whatnot.

@vyzo Please don't. It just means another package needs to be written, and then the ability to discover that package will be reduced. For example, Estuary just implemented the metrics for the libp2p resource manager, but they are now specific to the Estuary codebase, and anyone else wanting resource manager metrics has to duplicate the exact same code. There is zero user benefit to that course of action.

@vyzo
Contributor

vyzo commented Mar 24, 2022

@lanzafame I didn't suggest that the application implements the interface; the library would provide the default implementation.

I just want to avoid polluting the code with metrics invocations, which would be very ugly and tied to the particular choice.

Our code should be agnostic of such things.

@marten-seemann
Contributor Author

OpenTelemetry Metrics support has shipped: https://opentelemetry.uptrace.dev/guide/go-metrics.html

@lanzafame
Contributor

@marten-seemann it still hasn't officially shipped; metrics are still listed as Alpha: https://github.com/open-telemetry/opentelemetry-go.

@achingbrain
Member

FWIW js-libp2p has a metrics capability similar to what's being discussed here. It collects various named metrics as simple key/value pairs that are updated over time but not retained. There's no dependency on Prometheus or any other monitoring system which keeps it lightweight and flexible.

When there's a libp2p app that runs as a daemon, it makes the metrics available in some way. For example, js-ipfs has an HTTP API that sets up an endpoint that Prometheus calls periodically to record the libp2p metrics, plus additional metrics about the environment - memory usage, CPU, etc.

An app could write these stats out to a file instead and skip Prometheus integration entirely, whatever works for the app author.
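The Go-side wiring for the scrape-endpoint approach is similarly small; a sketch for a daemon using the Prometheus client library (port and path are arbitrary examples):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Serve whatever has been registered with the default Prometheus
	// registry; Prometheus scrapes this endpoint periodically.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```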

@marten-seemann
Contributor Author

Adding another consideration to the list: performance. As it turns out, OpenCensus generates tons of garbage: #1955.

@vyzo
Contributor

vyzo commented Dec 15, 2022

once more i want to say i told ya so...

@vyzo
Contributor

vyzo commented Dec 15, 2022

Master Fu you should listen to sometimes...

@marten-seemann
Contributor Author

once more i want to say i told ya so...

@vyzo This is not really helpful, even if you did say so.

Which in this case you didn't.

You suggested providing a default implementation, without specifying which library to use. Chances are we would have used OpenCensus for that.

Please don't take this the wrong way. I'm still happy about constructive comments that help us to come to a decision here.

@marten-seemann
Contributor Author

To finish this project, we should have a libp2p folder in the PL Grafana instance. That would allow us to pull up the libp2p metrics for every PL-run libp2p node.

In addition, @sukunrt suggested making the metrics of one libp2p node (probably on the IPFS network) publicly available, possibly linked from libp2p.io. @dhuseby, wdyt?

@p-shahi
Member

p-shahi commented May 11, 2023

That sounds like a good idea; there should be an accompanying blog post as well.
