Add Prometheus metrics that track cumulative counts of handler calls #8696
Comments
We should keep an eye on cardinality here if we exposed all the stream handlers. Right now, I'm counting about 80 on the scheduler, excluding extensions, so if we exposed all handlers that would easily be ~100 counters. I guess the most valuable would be to instrument the scheduler, so having this once per cluster might be fine, but having this across an entire cluster (i.e., on every worker as well) might be too much. @ntabris any strong reactions to these numbers?
Do we know if it's possible (and common practice) to filter Prometheus metrics by dimensions upon collection? I think it would be nice if we could expose metrics for all handlers (more importantly, I'm also talking about worker handlers, where we'd also see an explosion due to the worker count) and let the user decide which handlers are important to them.
IIUC, scrape_config allows one to scrape only parts of the metrics, but I would prefer not to make it too difficult for users to hook up with Prometheus.
If the main goal is not to make it too difficult for users to hook up Dask with Prometheus, my suggestion is to provide either a […]. We don't know which handlers will be useful, so I wouldn't want to artificially limit what users can see, but rather give them the ability to see everything and pick and choose themselves.
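A minimal sketch (not the actual distributed implementation) of how "expose everything" could stay manageable: a single counter family with a handler label yields one time series per handler name, roughly the ~100 discussed above, and users can then pick the handlers they care about on the Prometheus side via PromQL or relabeling. The metric and handler names below are assumptions for illustration.

```python
from prometheus_client import Counter

# Hypothetical metric: one Counter family covers all handlers, so the number
# of time series grows with the number of distinct handler names (~100 on the
# scheduler), not with the number of metrics we have to define by hand.
HANDLER_CALLS = Counter(
    "dask_scheduler_handler_calls_total",  # assumed name, for illustration
    "Cumulative number of calls per scheduler handler",
    labelnames=["handler"],
)

def count_handler_call(handler_name: str) -> None:
    # Called from the dispatch path; increments the series for this handler.
    HANDLER_CALLS.labels(handler=handler_name).inc()

# Example: a burst of who_has calls shows up as a fast-growing series.
count_handler_call("who_has")
```

Users who only need a few handlers could then filter by the handler label at query or scrape time, instead of Dask deciding up front which handlers are worth exposing.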
To make distributed more observable and allow users to see what's happening under the hood, it would be nice to count handler calls and expose those counts as Prometheus metrics. This would help understand various scenarios, e.g., frequent who_has calls if data goes missing.
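A rough sketch of what the counting itself could look like, assuming a dict-of-handlers dispatch; the wrapper, handler names, metric name, and port are illustrative and not the actual distributed code paths.

```python
import functools
from prometheus_client import Counter, start_http_server

# Same shape as the counter sketched above; repeated here so the example is
# self-contained. The metric name is hypothetical.
HANDLER_CALLS = Counter(
    "dask_handler_calls_total",
    "Cumulative number of handler calls",
    labelnames=["handler"],
)

def instrument_handlers(handlers: dict) -> dict:
    """Wrap each handler so every call increments its counter before running."""
    def wrap(name, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            HANDLER_CALLS.labels(handler=name).inc()
            return func(*args, **kwargs)
        return wrapper
    return {name: wrap(name, func) for name, func in handlers.items()}

if __name__ == "__main__":
    # Toy stand-ins for real stream handlers such as who_has.
    handlers = instrument_handlers({
        "who_has": lambda keys: {},
        "ping": lambda: "pong",
    })
    start_http_server(8000)          # serve /metrics for Prometheus to scrape
    handlers["who_has"](keys=["x"])  # each call shows up as a counter increment
```

With something like this in place, a scenario such as missing data would surface as a high rate on the who_has series.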