Autorefresh limits #40

Open
jimfrey opened this issue Nov 30, 2016 · 11 comments

@jimfrey

jimfrey commented Nov 30, 2016

Kentik is experiencing heavy load on our back end from Grafana users who set up auto-refreshing dashboards. There are two ways to deal with this:

  1. The best long-term answer for Kentik would be for Grafana to retain the data it has already retrieved and then pull only incremental updates. Ideally, incremental updates would be <=2 minutes, because we keep the last 2 minutes in RAM and can serve those responses very fast. I talked to Torkel about this, and he advised that it is not possible at present because Grafana has no mechanism for saving data, calculating averages, etc. We understand that this requires long-term work.
  2. The near-term request is this: in the Kentik plug-in, please disable all auto-refresh options shorter than 1 minute.
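
The incremental approach in item 1 could look roughly like the sketch below: keep the points already retrieved, request only the newest stretch, and merge. This is an illustration only. The `Point` shape and the `mergeIncremental` and `nextQueryStart` names are hypothetical and are not part of Grafana or the Kentik plug-in.

```typescript
// Hypothetical sample shape; ts is milliseconds since the epoch.
type Point = { ts: number; value: number };

/**
 * Merge freshly fetched points into the retained series and trim the result
 * to the dashboard window [now - windowMs, now].
 * A fresh point replaces a retained point that has the same timestamp.
 */
function mergeIncremental(cached: Point[], fresh: Point[], windowMs: number, now: number): Point[] {
  const byTs = new Map<number, number>();
  for (const p of cached) byTs.set(p.ts, p.value);
  for (const p of fresh) byTs.set(p.ts, p.value);
  const start = now - windowMs;
  return Array.from(byTs.entries())
    .filter(([ts]) => ts >= start && ts <= now)
    .sort((a, b) => a[0] - b[0])
    .map(([ts, value]) => ({ ts, value }));
}

/**
 * Start time for the next incremental request: the timestamp of the newest
 * retained point (assumes `cached` is sorted by ts), or the window start
 * when nothing has been retained yet.
 */
function nextQueryStart(cached: Point[], windowStart: number): number {
  return cached.length > 0 ? cached[cached.length - 1].ts : windowStart;
}
```

With a refresh every minute or two, each request would then cover only the most recent couple of minutes rather than the whole dashboard range.
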
@jimfrey
Author

jimfrey commented Dec 16, 2016

Update: for the last several days, one of our customers has been absolutely hammering the Kentik back end through the Grafana plugin.

The issue is actually pretty crazy. They are hitting us with what looks like 45 dashboards, most of which query 1 month's worth of data; some may be shorter. But they have refresh intervals set at either 30s or 1 min - hard to tell for sure. We just see this regular hammer hitting every minute in regular repeated batches at :00, :15, :45 seconds. Each time, the month-long queries spawn around 2.5 million subqueries against our back end.

They are the only ones hitting so bleeping hard, but clearly anyone using the plug-in could do the same, so what we propose is this: limit auto-refresh on a sliding scale.

  • For queries that are 1 month or longer, no faster than 1h refresh
  • For queries that are between 1 day and one month, no faster than 15 min refresh
  • For queries less than 1 full day, no faster than 5 min refresh

This would need to be implemented on the Grafana side. This would give our platform some breathing room as adoption ramps, until such time as incremental querying can be implemented within Grafana.
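
A minimal sketch of the sliding scale above, for illustration. It assumes "1 month" means 30 days and that a range of exactly 1 day falls into the 15-minute tier; the function and constant names are hypothetical.

```typescript
const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;
const DAY_MS = 24 * HOUR_MS;
const MONTH_MS = 30 * DAY_MS; // assumption: "1 month" is treated as 30 days

/** Minimum allowed auto-refresh interval for a query covering rangeMs. */
function minRefreshMs(rangeMs: number): number {
  if (rangeMs >= MONTH_MS) return HOUR_MS;      // 1 month or longer
  if (rangeMs >= DAY_MS) return 15 * MINUTE_MS; // 1 day up to 1 month
  return 5 * MINUTE_MS;                         // less than 1 full day
}

/** Clamp the refresh interval the user asked for to the allowed minimum. */
function effectiveRefreshMs(requestedMs: number, rangeMs: number): number {
  return Math.max(requestedMs, minRefreshMs(rangeMs));
}
```

Under these assumptions, a dashboard covering 30 days with a 30s refresh would be slowed to one refresh per hour.
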

@alexanderzobnin
Contributor

There's no way to change auto-refresh behaviour in Grafana right now. But I can implement this through an incremental queries feature. I'll add a proxy layer (with a cache) for the queries in the plugin. This layer will store data from previous requests. I can also add these limits to the part that invokes API queries.

For example:

  1. Grafana invokes a panel refresh.
  2. The Kentik plugin checks the limits.
  3. If not enough time has passed, get the data from the cache (show the previously returned data).
  4. Otherwise, invoke an API request and write the data to the cache.

Then I can expand this pattern to incremental updates by adding an incremental query to step 3.
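
The four steps above can be sketched as a small cache in front of the API call. This is only an illustration of the pattern, not the code in the plugin: `ThrottledQueryCache`, the string cache key, and the `fetchFromApi` callback are all hypothetical, and concurrent in-flight requests for the same key are not deduplicated.

```typescript
type CacheEntry<T> = { fetchedAt: number; data: T };

class ThrottledQueryCache<T> {
  private entries = new Map<string, CacheEntry<T>>();
  private fetchFromApi: (key: string) => Promise<T>; // hypothetical API call
  private minIntervalMs: number;
  private now: () => number;

  constructor(
    fetchFromApi: (key: string) => Promise<T>,
    minIntervalMs: number,
    now: () => number = () => Date.now() // injectable clock, handy for tests
  ) {
    this.fetchFromApi = fetchFromApi;
    this.minIntervalMs = minIntervalMs;
    this.now = now;
  }

  /** Called on every panel refresh (step 1). */
  async query(key: string): Promise<T> {
    const cached = this.entries.get(key);
    // Steps 2 and 3: inside the limit, return the previously fetched data.
    if (cached !== undefined && this.now() - cached.fetchedAt < this.minIntervalMs) {
      return cached.data;
    }
    // Step 4: limit has elapsed (or nothing cached), so hit the API and store the result.
    const data = await this.fetchFromApi(key);
    this.entries.set(key, { fetchedAt: this.now(), data });
    return data;
  }
}
```

The cache key would need to capture everything that changes the result (query definition and time range, for example); how to build it is left open here.
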

@jimfrey
Author

jimfrey commented Dec 17, 2016

Awesome strategy - please proceed!

@alexanderzobnin
Contributor

@jimfrey please try the incremental-data-update branch. I've added simple data caching and auto-refresh limits.

@alexanderzobnin
Contributor

alexanderzobnin commented Dec 21, 2016

@jimfrey I'm working on incremental queries and want to discuss a question.
Let's assume we're querying data for the last 24 hours, and I use Average aggregation for the time series data. When I refresh the panel, for example after 10 mins, I get averaged data for the last 10 mins, but the previous set holds averages over the slices for the 24-hour range (1 hour interval for this period). After a few updates I'll end up with a graph made of 10 min average slices, which is incorrect.
What do you think about it?

[Image: kentik agg 001]

@alexanderzobnin
Contributor

I think that for incremental queries we need to know the aggregation slice size (for the given time range). How can I get this information?
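
To illustrate why the slice size matters: if it were known, finer-grained incremental samples could be folded back into slices of that size before being appended to the graph. The sketch below assumes evenly spaced samples and a plain arithmetic mean, which may not match how the back end actually computes its averages; `averageIntoSlices` is a hypothetical helper.

```typescript
type Sample = { ts: number; value: number }; // ts in milliseconds

/**
 * Group samples into slices of sliceMs (aligned to multiples of sliceMs)
 * and return one averaged sample per slice, stamped with the slice start.
 * Assumes samples are evenly spaced, so a plain mean is a fair average.
 */
function averageIntoSlices(samples: Sample[], sliceMs: number): Sample[] {
  const sums = new Map<number, { total: number; count: number }>();
  for (const s of samples) {
    const sliceStart = Math.floor(s.ts / sliceMs) * sliceMs;
    const acc = sums.get(sliceStart) || { total: 0, count: 0 };
    acc.total += s.value;
    acc.count += 1;
    sums.set(sliceStart, acc);
  }
  return Array.from(sums.entries())
    .sort((a, b) => a[0] - b[0])
    .map(([ts, acc]) => ({ ts, value: acc.total / acc.count }));
}
```

For the 24-hour example, `sliceMs` would be one hour, so six 10-minute samples collapse into a single point comparable with the existing 1-hour slices.
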

@jimfrey
Author

jimfrey commented Jan 4, 2017

Tests completed - looks like the incremental pull is working as requested. Thanks! I'd suggest we close this issue, and pick up your questions above about aggregation slice size on a separate thread. If you agree, feel free to close.

@alexanderzobnin
Contributor

@jimfrey It still isn't an incremental pull, just request caching. We need more time for true incremental queries, but I hope these changes help to reduce the load on your servers.

@nopzor1200

This is currently under internal discussion. It's an important feature for Kentik, but implementing incremental/chunked/staggered (these are deliberately hazy words, for now) query capabilities in data sources will require a not insignificant amount of work. Part of our discussion is whether this functionality applies to other data sources (e.g. Splunk, InfluxDB), and how best to abstract it out while still meeting Kentik's needs.

@jimfrey
Author

jimfrey commented Feb 7, 2018

11 months have passed since this request. Can we get an update? Thanks.

@alexanderzobnin
Contributor

@mattttt do we have any estimates for this?

@briangann briangann self-assigned this Oct 27, 2018