💡 [Feature] monitoring: Add a second Prometheus to scrape the first Prometheus to keep stats for the long term at a lower granularity #277

Open
tlvu opened this issue Jan 17, 2023 · 8 comments · May be fixed by #461
Assignees
Labels
enhancement New feature or request

Comments

@tlvu
Collaborator

tlvu commented Jan 17, 2023

Description

Currently, the ./components/monitoring component scrapes every 5 minutes and keeps the stats for 90 days.

If we want to keep stats longer for high-level trends (e.g. 5 years), we could use a second Prometheus to scrape the first Prometheus daily (averaging the values), so we can keep longer stats without consuming too much disk space.
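To put rough numbers on the disk-space argument, here is a minimal sketch that counts samples per time series under both schemes. It assumes one sample per scrape and a 5-year window of 5 × 365 days (leap days ignored), and it counts samples only, not bytes on disk.

```python
from datetime import timedelta


def samples_per_series(scrape_interval: timedelta, retention: timedelta) -> int:
    """Number of samples one time series accumulates over the retention window."""
    return retention // scrape_interval


# Current setup described above: 5-minute scrape, 90-day retention.
current = samples_per_series(timedelta(minutes=5), timedelta(days=90))

# Proposed long-term setup: one sample per day, kept for 5 years (assumed 5 * 365 days).
long_term = samples_per_series(timedelta(days=1), timedelta(days=5 * 365))

print(current)    # 25920
print(long_term)  # 1825
```

Under these assumptions, five years of daily values need roughly 14 times fewer samples per series than the existing 90 days of 5-minute values.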

This second Prometheus should be on a different machine than PAVICS so that when PAVICS is down, we can still access those long-term stats.

To explore: whether the federation feature of Prometheus is simpler to implement/deploy or gives better long-term stats results.
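For reference, Prometheus federation works by having one server scrape another server's `/federate` endpoint, with one or more `match[]` selectors choosing which series to pull. A minimal sketch of building such a URL follows; the host name and the `cadvisor` job label are hypothetical placeholders, not values taken from the PAVICS configuration.

```python
from urllib.parse import urlencode


def federate_url(base_url: str, selectors: list[str]) -> str:
    """Build a /federate URL; each selector becomes one match[] query parameter."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical host and job label, for illustration only.
url = federate_url("http://pavics.example.org:9090", ['{job="cadvisor"}'])
print(url)
```

In a real deployment the second Prometheus would list this endpoint as a scrape target rather than fetching the URL by hand; the sketch only shows how the selectors are passed.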

@tlvu tlvu added the enhancement New feature or request label Jan 17, 2023
@tlvu tlvu self-assigned this Jan 17, 2023
@huard huard pinned this issue Apr 19, 2023
@huard
Collaborator

huard commented Apr 5, 2024

I suspect that it is possible to simply add new rules that aggregate metrics at a lower frequency. That is, I don't think we need a second instance.

https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
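To illustrate what aggregating at a lower frequency means, here is a minimal sketch that collapses 5-minute samples into one mean value per day. It is plain Python over made-up data, not a Prometheus recording rule; in Prometheus, a rule built on a function such as `avg_over_time` would play roughly this role, although it averages over a sliding window rather than fixed calendar days.

```python
from collections import defaultdict
from statistics import fmean

SECONDS_PER_DAY = 86_400


def daily_average(samples):
    """Collapse (unix_timestamp, value) samples into one mean value per UTC day."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[timestamp // SECONDS_PER_DAY].append(value)
    # Key each bucket by the timestamp of the start of its day.
    return {day * SECONDS_PER_DAY: fmean(values) for day, values in sorted(buckets.items())}


# Two days of made-up samples at a 5-minute (300 s) interval.
samples = [(t, 1.0) for t in range(0, 86_400, 300)]
samples += [(t, 3.0) for t in range(86_400, 172_800, 300)]

print(daily_average(samples))  # {0: 1.0, 86400: 3.0}
```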

@fmigneault
Collaborator

I would also prefer a lower-frequency logging approach over having a duplicate instance.

@tlvu
Collaborator Author

tlvu commented Apr 8, 2024

The reason I suggested a second Prometheus is that the retention policy is instance-wide and not per metric, see

However, that was a while ago. Maybe newer versions of Prometheus allow per-metric data retention. To explore.

So yes, we can lower the polling frequency, but if we cannot increase the retention duration, then we still do not have long-term stats, which is our ultimate goal.

Also, I suggested a second instance of Prometheus on a different machine. So it's not really "duplicated" because it's not the same role:

  • it can still provide stats even if the real PAVICS host is down (hardware failure, data corruption, ...)
  • it can aggregate all other PAVICS hosts (staging, tests, ...)

@fmigneault
Collaborator

Ok. If it is a limitation of Prometheus, then let's try with a second one.

@huard
Collaborator

huard commented Apr 10, 2024

Post on solutions to this problem, which in the jargon seems to be known as "downsampling":
https://last9.io/blog/downsampling-aggregating-metrics-in-prometheus-practical-strategies-to-manage-cardinality-and-query-performance/

@mishaschwartz
Collaborator

Instead of a second Prometheus to scrape the first, we could also use one of the other technologies they recommend for long-term storage (https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage). Both Thanos and M3 seem to be recommended elsewhere.

We could create recording rules in Prometheus for the metrics we care about storing, then use an external tool to store only those metrics over the longer term and query them over larger time scales as needed.
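One way to tell recorded metrics apart from raw ones is by name. The sketch below assumes the recording rules follow the `level:metric:operations` naming convention recommended in the Prometheus documentation, so that recorded series always contain a colon. The metric names are made up for illustration, and the filter is plain Python rather than a feature of any storage backend.

```python
import re

# Assumption: long-term series come from recording rules named
# level:metric:operations, so their names contain at least one colon.
RECORDED_NAME = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*(:[a-zA-Z0-9_]+)+$")


def select_long_term(metric_names):
    """Keep only the names that look like recording-rule outputs."""
    return [name for name in metric_names if RECORDED_NAME.match(name)]


# Made-up metric names; only the middle one follows the convention.
names = ["container_cpu_usage_seconds_total", "instance:cpu_usage:avg1d", "up"]
print(select_long_term(names))  # ['instance:cpu_usage:avg1d']
```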

@mishaschwartz
Collaborator

I've been playing around with Thanos. One issue with Thanos is that it stores all metrics from a Prometheus instance, which can end up being a very large amount of data (even if the data is compacted somewhat: https://thanos.io/tip/thanos/quick-tutorial.md/#compactor).

As discussed here, there are potential ways to store only the metrics we care about, which would reduce the additional disk space needed to store the data.

It looks like we will likely need to introduce a second Prometheus instance even if we also use Thanos, so that we can select which metrics we store long term.


4 participants