The app_start label of octoprint_infos_info should be a separate metric #22

Open
hairyhenderson opened this issue Jan 30, 2021 · 9 comments

@hairyhenderson

Hi, and first off - thanks for a useful plugin!

The octoprint_infos_info metric has a label app_start, which seems to be the time (in seconds since epoch) that the plugin started. That's handy but I'd like to be able to use it as a value.

Can I suggest a new metric: octoprint_app_start_seconds?

If that sounds like a good idea, I could issue a PR.

@tg44 (Owner) commented Feb 1, 2021

I think it should be a counter (uptime), increased by x every x ms. (As I remember, cadvisor, node_exporter, and the others do this.) It's an interesting question when it should grow, but a 10s timer is probably good enough. (Also, Prometheus can "continue" counters if you write the queries well, so total uptime and things like that can be extracted later.)
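
A rough sketch of how that could look with the prometheus_client library and a plain threading.Timer (the metric name and interval are placeholders, not anything from the plugin):

```python
import threading
from prometheus_client import Counter

INTERVAL = 10  # seconds between ticks, as suggested above

# Hypothetical metric; the Python client exposes it as octoprint_uptime_seconds_total.
UPTIME = Counter('octoprint_uptime_seconds', 'Seconds the plugin has been running')

def _schedule_tick():
    timer = threading.Timer(INTERVAL, _on_tick)
    timer.daemon = True       # don't keep the process alive just for this timer
    timer.start()

def _on_tick():
    UPTIME.inc(INTERVAL)      # grow by the interval that just elapsed
    _schedule_tick()

_schedule_tick()
```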

PRs always welcome!

@hairyhenderson (Author)

The convention is that timestamp-type metrics are a gauge - less effort that way, and more accurate. If you want to find out "how long has this been running?" you can do a query like time() - octoprint_app_start_seconds.
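
For illustration, a minimal sketch of that gauge approach with the prometheus_client library (the metric name is just the one proposed above, not anything the plugin currently exposes):

```python
import time
from prometheus_client import Gauge

# Set once at startup and never touched again; queries derive uptime from it.
APP_START = Gauge('octoprint_app_start_seconds',
                  'Unix timestamp at which the plugin started')
APP_START.set(time.time())    # equivalently: APP_START.set_to_current_time()
```

Uptime then falls out of the time() - octoprint_app_start_seconds query instead of needing a metric that keeps counting.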

Tracking uptime is usually done with the synthetic up metric, which is generated by Prometheus when it scrapes successfully (i.e. if the process can be scraped, it's probably "up").

@tg44 (Owner) commented Feb 1, 2021

I checked what you said. It turns out that almost all of my currently running exporters expose process_start_time_seconds as a gauge (why a gauge? why not a counter? we can't travel back in time...). This means that if we expose it, we should probably name it the same way(?). I can't find any source that says exactly how we should expose uptime/start time (but I'm interested if you can link any post about this; I also reread the Prometheus documentation about writing exporters without much success). So we could expose the start time as process_start_time_seconds?

Also, the Python lib has built-in process exporter functionality, which is also strange because the documentation discourages doing this. So I'm a bit puzzled right now :D

Somewhat irrelevant, but I think up is a bad metric to calculate uptime from. There could be a lot of reasons why you can't scrape the endpoint (a network error, or downtime on the scraper side) that have nothing to do with the actual application uptime.

@hairyhenderson (Author)

@tg44 ah - if app_start is the process start time, then indeed process_start_time_seconds makes more sense. I wasn't totally sure what it was 😉

> why a gauge? why not a counter? we can't travel back in time...

Counters are intended to be monotonically increasing, which time generally isn't (think leap seconds, etc). Counters also imply that you can do rate calculations on them, and when the value goes down it's considered a reset. That doesn't make sense for a timestamp - a timestamp may decrease following an NTP time synchronization.

This leaves a gauge as the best option. Especially when you consider that a value like process_start_time_seconds is static (the start time should never change during the life of a process), a counter makes no sense, since there's no situation where the value should be increasing.

> Also, the Python lib has built-in process exporter functionality, which is also strange because the documentation discourages doing this. So I'm a bit puzzled right now :D

Simply enabling the process collector is probably the best option - I thought it was enabled by default but I'm not seeing its metrics from octoprint.

As for that document discouraging the practice... IMO it's not super-well-worded - I think the intent is more to ensure that people don't duplicate node_exporter's functionality in-process. Since node_exporter doesn't report per-process cpu/mem/etc usage, there's no conflict here.
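
For what it's worth, a quick way to see what the client's built-in process collector exposes (assuming prometheus_client is installed; the collector reads /proc, so this only produces output on Linux):

```python
from prometheus_client import REGISTRY, generate_latest

# The default registry ships with a process collector already registered,
# so process_start_time_seconds and friends show up with no extra code.
for line in generate_latest(REGISTRY).decode().splitlines():
    if line.startswith('process_'):
        print(line)
```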

@tg44 (Owner) commented Feb 2, 2021

Then I think we should just enable the process collector. If we enable it, it will automatically add process_start_time_seconds too, and we're good to go.

> (think leap seconds, etc)

I really hate our time system...

> Since node_exporter doesn't report per-process cpu/mem/etc usage, there's no conflict here.

Node exporter doesn't, but cadvisor, for example, does. I think the main intention is to not duplicate data, and that every collector should collect the right scope.

Thank you! I learned a lot from your comments!

@hairyhenderson (Author)

> Node exporter doesn't, but cadvisor, for example, does.

cAdvisor only works if you're running in a container, and even so, cAdvisor tracks in-container usage, not the usage of individual processes inside the container (of which there can be many).

Ok, sounds like enabling the process collector is the thing to do. It's enabled by default on the default registry, but this exporter is using a custom registry - is there a reason for that? Seems to me that simply removing _registry (and references to it) will do the trick.
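
If keeping a custom registry does turn out to be the way to go, another option (a sketch, assuming the exporter keeps its registry in a variable like _registry) is to attach the process collector to it explicitly:

```python
from prometheus_client import CollectorRegistry, ProcessCollector

_registry = CollectorRegistry()         # stand-in for the exporter's custom registry
ProcessCollector(registry=_registry)    # adds process_start_time_seconds and friends
```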

@hairyhenderson (Author)

I've issued #23 for this, though feel free to take a different approach if you really want a custom registry!

@tg44 (Owner) commented Feb 2, 2021

We started to use _registry to resolve #13. I'm not sure this is the wisest approach, though.

@hairyhenderson (Author)

Ah... I see, that sort of makes sense 😉 - I'll rework #23 to account for that.
