Add Prometheus metrics that track cumulative counts of handler calls #8696

hendrikmakait · 2024-06-13T11:36:42Z

To make distributed more observable and allow users to see what's happening under the hood, it would be nice to count handler calls and expose those counts as Prometheus metrics. This would help us understand various scenarios, e.g., frequent who_has calls if data goes missing.
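To illustrate the idea, here is a minimal sketch of per-handler call counting rendered in the Prometheus text exposition format. It uses only the standard library and is not how distributed implements its metrics: the metric name `dask_handler_calls_total`, the `HandlerCallCounter` class, and the toy handlers are all hypothetical, and real handlers are often coroutines, which this sketch ignores.

```python
from collections import Counter
from typing import Any, Callable

# Hypothetical metric name; not an existing distributed metric.
METRIC_NAME = "dask_handler_calls_total"


class HandlerCallCounter:
    """Count calls per handler name and render them as Prometheus text."""

    def __init__(self) -> None:
        self.counts = Counter()

    def wrap(self, name: str, handler: Callable[..., Any]) -> Callable[..., Any]:
        """Return a handler that bumps its counter before delegating."""

        def counted(*args: Any, **kwargs: Any) -> Any:
            self.counts[name] += 1
            return handler(*args, **kwargs)

        return counted

    def render(self) -> str:
        """Render all counters, one series per handler label value."""
        lines = [
            f"# HELP {METRIC_NAME} Cumulative number of handler calls",
            f"# TYPE {METRIC_NAME} counter",
        ]
        for name, count in sorted(self.counts.items()):
            lines.append(f'{METRIC_NAME}{{handler="{name}"}} {count}')
        return "\n".join(lines) + "\n"


# Toy synchronous handlers standing in for real ones (illustrative only).
calls = HandlerCallCounter()
handlers = {
    "who_has": lambda keys: {k: [] for k in keys},
    "identity": lambda: {"type": "Scheduler"},
}
handlers = {name: calls.wrap(name, fn) for name, fn in handlers.items()}

handlers["who_has"](["x"])
handlers["who_has"](["y"])
handlers["identity"]()
print(calls.render())
```

Each distinct `handler` label value becomes its own time series, which is what the cardinality discussion in this thread is about.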

fjetter · 2024-06-13T11:43:26Z

We should keep an eye on cardinality here if we expose all the stream handlers. Right now, I'm counting about 80 on the scheduler, excluding extensions. So if we exposed all handlers, that would easily be ~100 counters.

I guess the most valuable thing would be to instrument the scheduler, so having this once per cluster might be fine, but having it on every node of an entire cluster might be too much.
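As a rough back-of-the-envelope sketch of why this matters: the number of time series is roughly handlers per node times the number of nodes. The ~100 scheduler figure comes from the estimate above; the worker handler count and the worker count below are made-up placeholders, not measurements.

```python
def estimated_series(handlers_per_node: int, nodes: int) -> int:
    """One counter series per (handler, node) pair."""
    return handlers_per_node * nodes


# ~100 handlers on the single scheduler (estimate from this thread).
scheduler_only = estimated_series(100, 1)

# Hypothetical: 50 handlers per worker on a 200-worker cluster.
with_workers = scheduler_only + estimated_series(50, 200)

print(scheduler_only)  # 100
print(with_workers)    # 10100
```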

@ntabris any strong reactions to these numbers?

hendrikmakait · 2024-06-13T11:47:37Z

> We should keep an eye on cardinality here if we expose all the stream handlers. Right now, I'm counting about 80 on the scheduler, excluding extensions. So if we exposed all handlers, that would easily be ~100 counters.

Do we know if it's possible (and common practice) to filter Prometheus metrics upon collection by dimensions? I think it would be nice if we could expose metrics for all handlers (more importantly, I'm also talking about worker handlers where we'd also see an explosion due to the worker count) and let the user decide which handlers are important to them.

fjetter · 2024-06-13T12:02:11Z

IIUC, scrape_config allows one to scrape only parts of the metrics, but I would prefer not to make it too difficult for users to hook up with Prometheus.

hendrikmakait · 2024-06-13T12:16:58Z

> IIUC, scrape_config allows one to scrape only parts of the metrics, but I would prefer not to make it too difficult for users to hook up with Prometheus.

If the main goal is not to make it too difficult for users to hook up Dask with Prometheus, my suggestion is either to provide a scrape_config with sane defaults or to implement some customization mechanism that lets users configure which metrics they want Dask to expose in the first place. The latter should also come with sane defaults.

We don't know which handlers will be useful, so I wouldn't want to limit what users can see artificially but rather give them the ability to see everything and pick and choose themselves.
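One way such a customization mechanism could look, as a sketch only: an allowlist of glob patterns with a default set that users can widen to everything. Nothing here is an existing Dask configuration option; `DEFAULT_PATTERNS`, `is_exposed`, `filter_counts`, and the handler names other than who_has are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable

# Hypothetical default allowlist; a real default would need discussion.
DEFAULT_PATTERNS = ("who_has", "heartbeat_*")


def is_exposed(handler: str, patterns: Iterable[str] = DEFAULT_PATTERNS) -> bool:
    """Return True if the handler name matches any allowlist pattern."""
    return any(fnmatchcase(handler, pattern) for pattern in patterns)


def filter_counts(
    counts: Dict[str, int], patterns: Iterable[str] = DEFAULT_PATTERNS
) -> Dict[str, int]:
    """Keep only the counters whose handler name is allowed."""
    patterns = tuple(patterns)  # allow one-shot iterables
    return {name: n for name, n in counts.items() if is_exposed(name, patterns)}


# Illustrative counts; the handler names are placeholders.
counts = {"who_has": 3, "heartbeat_worker": 10, "identity": 1}

default_view = filter_counts(counts)          # only the default subset
everything = filter_counts(counts, ["*"])     # user opts in to all handlers
print(default_view)
print(everything)
```

With this shape, the defaults keep cardinality low while `["*"]` still lets users see every handler and pick for themselves.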

hendrikmakait added the diagnostics and feature labels on Jun 13, 2024
