Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Prometheus metrics that tracks cumulative counts of handler calls #8696

Open
hendrikmakait opened this issue Jun 13, 2024 · 4 comments
Open
Labels
diagnostics feature Something is missing

Comments

@hendrikmakait
Copy link
Member

To make distributed more observable and allows users to see what's happening under the hood, it would be nice to count handler calls and expose those counts as Prometheus metrics. This would help understand various scenarios, e.g., frequent who_has calls if data goes missing.

@hendrikmakait hendrikmakait added diagnostics feature Something is missing labels Jun 13, 2024
@fjetter
Copy link
Member

fjetter commented Jun 13, 2024

We should keep an eye onto cardinality here if we exposed all the stream handlers. Right now, I'm counting about 80 on the scheduler excluding extensions. So if we exposed all handlers that'd be easily ~100 counters

I guess the most valuable would be to instrument the scheduler so having this once per cluster might be fine but having this for an entire cluster might be too much

@ntabris any strong reactions to these numbers?

@hendrikmakait
Copy link
Member Author

We should keep an eye onto cardinality here if we exposed all the stream handlers. Right now, I'm counting about 80 on the scheduler excluding extensions. So if we exposed all handlers that'd be easily ~100 counters

Do we know if it's possible (and common practice) to filter Prometheus metrics upon collection by dimensions? I think it would be nice if we could expose metrics for all handlers (more importantly, I'm also talking about worker handlers where we'd also see an explosion due to the worker count) and let the user decide which handlers are important to them.

@fjetter
Copy link
Member

fjetter commented Jun 13, 2024

IIUC scrape_config allows one to only scrape parts of the metrics but I would prefer to not make it too difficult for users to hook up with prometheus.

@hendrikmakait
Copy link
Member Author

IIUC scrape_config allows one to only scrape parts of the metrics but I would prefer to not make it too difficult for users to hook up with prometheus.

If the main goal is not to make it too difficult for users to hook up Dask with Prometheus, my suggestion is to provide either a scrape_config with sane defaults or to implement some customization mechanism that allows users to configure which metric they want to Dask to expose in the first place. The latter should also choose sane defaults.

We don't know which handlers will be useful, so I wouldn't want to limit what users can see artificially but rather give them the ability to see everything and pick and choose themselves.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
diagnostics feature Something is missing
Projects
None yet
Development

No branches or pull requests

2 participants