Possibility of clearing metrics every X seconds (memory problem) #280

gciria · 2023-10-31T10:24:30Z

I am using version v0.9.2, with the variables CE_WORKER_TIMEOUT and CE_PURGE_OFFLINE_WORKER_METRICS modified, the time was changed to 20 seconds.

In my structure every X minutes, several nodes in batches are started in Kubernetes with dozens of pods/celery consuming X queues.
Prometheus scrapes the metrics from the celery-exporter (9808/metrics) and stores them.
Apparently the purge variables don't work very well in my structure. In the logs I see purge of 1, 2 pods after many hours.

Would you like to know if there is a possibility to add a new parameter to purge all /metrics every X seconds? Or any tips for another solution.

Thanks and crongrats on the great project.

The text was updated successfully, but these errors were encountered:

danihodovic · 2023-10-31T14:05:11Z

@adinhodovic

adinhodovic · 2023-11-06T12:18:12Z

If your workers go offline (rotate) metrics should be quickly cleaned up. Works fine for us with up to ~100 pods. On new releases all metrics get cleaned quite quickly. We do it every 5 minute and a worker times out at 2.5 minutes. You are not seeing the purge message enough?

Maybe CE_GENERIC_HOSTNAME_TASK_SENT_METRIC=true will help with cardinality aswell?

we dont have an option to clean all metrics atm.

DvdChe · 2024-01-23T16:59:56Z

Hey,

I have same problem on my side,

I tried to activate CE_GENERIC_HOSTNAME_TASK_SENT_METRIC=true and some metrics has their hostname set as generic but there is still other that are labelled with pod name. I also tried to cutomize CE_PURGE_OFFLINE_WORKER_METRICS and CE_WORKER_TIMEOUT as well but there is no purge.

I tried to find how garbage collecting is working and I think i partially found the cause :

self.track_timed_out_workers() is called at every scrap.
This method will iterate on self.worker_last_seen to call self.purge_worker_metrics()

On my side, problem is that self.worker_last_seen remains empty and it never get updated so metrics are never purged.

If your workers go offline (rotate) metrics should be quickly cleaned up. Works fine for us with up to ~100 pods. On new releases all metrics get cleaned quite quickly. We do it every 5 minute and a worker times out at 2.5 minutes. You are not seeing the purge message enough?

Maybe CE_GENERIC_HOSTNAME_TASK_SENT_METRIC=true will help with cardinality aswell?

we dont have an option to clean all metrics atm.

What do you mean by go offline ? Is it a gracefull disconnection made by workers or something like that ? ( sorry for this question but I absolutely know nothing about celery )

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possibility of clearing metrics every X seconds (memory problem) #280

Possibility of clearing metrics every X seconds (memory problem) #280

gciria commented Oct 31, 2023

danihodovic commented Oct 31, 2023

adinhodovic commented Nov 6, 2023

DvdChe commented Jan 23, 2024 •

edited

Loading

Possibility of clearing metrics every X seconds (memory problem) #280

Possibility of clearing metrics every X seconds (memory problem) #280

Comments

gciria commented Oct 31, 2023

danihodovic commented Oct 31, 2023

adinhodovic commented Nov 6, 2023

DvdChe commented Jan 23, 2024 • edited Loading

DvdChe commented Jan 23, 2024 •

edited

Loading