With the increased number of Celery worker processes and the memory available to each of them, we're starting to notice a new pattern emerge with task retries in Celery when trying to send messages or retrieve their information.
There's a fairly consistent spike of celery.exceptions:Retry exceptions being thrown at set intervals (see New Relic for more details), which points to our handling of task retries. Looking into the Celery docs around message (task) sending, it seems tasks should generally be retried automatically without us needing to do much on our part. That's a bit at odds with the guidance on how to configure a task, though, so I looked a little further into this.
What I found in this Stack Overflow post seems to point to us doing a bit too much with our retry logic in the Celery tasks. A bit more investigation led me to this article about Celery retry logic, which points to a few different approaches (and some based on the version you're using - we're on 5.4.x).
We need to make some adjustments to make sure we're letting Celery properly handle task retries and that we're not causing additional churn with extra exception handling.
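To make the difference concrete, here's a minimal sketch of the two patterns (task names, the send() helper, and the TransientProviderError exception are hypothetical placeholders, not our actual code): calling self.retry() in an except block raises celery.exceptions.Retry ourselves, while autoretry_for lets Celery do the same re-queueing without the extra exception handling in the task body.

```python
from celery import shared_task


class TransientProviderError(Exception):
    """Placeholder for a retryable provider/network error."""


def send(notification_id):
    """Placeholder for the actual delivery call."""


# The pattern we're currently using: catch the error and call self.retry()
# ourselves. self.retry() raises celery.exceptions.Retry, which is what shows
# up as the error spikes in New Relic even though it's expected control flow.
@shared_task(bind=True, max_retries=3)
def deliver_manual(self, notification_id):
    try:
        send(notification_id)
    except TransientProviderError as exc:
        raise self.retry(exc=exc, countdown=30)


# Letting Celery handle it: autoretry_for re-queues the task on the listed
# exceptions with no try/except needed in the task body.
@shared_task(
    bind=True,
    autoretry_for=(TransientProviderError,),
    max_retries=3,
    default_retry_delay=30,
)
def deliver_auto(self, notification_id):
    send(notification_id)
```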
Implementation Sketch and Acceptance Criteria
Look further into these articles and other sources to see what the right approach would be for handling retry logic in our Celery tasks, especially accounting for things like max retries, retry delays, and retry back-off/jitter (particularly for things like S3 interactions); see the sketch after this list for one possible shape
Take a look at our code in app/celery (especially tasks.py and provider_tasks.py) and see where we might need to make adjustments
Create a PR with the proposed adjustments so we can talk through them and determine if they'd be appropriate, and how we're going to test them and validate that the changes will have the desired effect
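For the S3-style interactions mentioned above, a rough sketch of what declaring the retry behavior on the task decorator might look like (the task name and S3 call are illustrative only; the decorator options themselves are standard Celery 5.x task options):

```python
import boto3
from botocore.exceptions import ClientError
from celery import shared_task

s3 = boto3.client("s3")


# Illustrative only -- the real tasks live in app/celery. The point is that
# max retries, delay, back-off, and jitter can all be declared up front,
# leaving the task body free of retry-specific exception handling.
@shared_task(
    bind=True,
    autoretry_for=(ClientError,),  # retry automatically on S3/API errors
    max_retries=5,                 # give up after five attempts
    retry_backoff=True,            # exponential back-off: ~1s, 2s, 4s, ...
    retry_backoff_max=300,         # never wait more than 5 minutes between tries
    retry_jitter=True,             # randomize delays so retries don't pile up together
)
def fetch_message_contents(self, bucket, key):
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
```

With retry_backoff and retry_jitter set together, each attempt's delay is drawn from a randomized range rather than every failed task hitting the broker again in lockstep, which should smooth out the interval spikes we're seeing.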
Security Considerations
We want to make sure our asynchronous tasks are handled correctly so we don't introduce additional errors or performance issues in the system that cause instability.
@terrazoon and I are in agreement, this is probably a lower priority for now. Not really sure this will resolve any issues we have at this time, and there are so many complex issues in getting this to work that it is taking too much time/effort when there are better choices for priority; this can be addressed later. Unlike the SQLAlchemy 2.0 upgrade, where things will be deprecated/removed, it does not look like Celery is doing the same thing with this retry logic. And... I am still not certain that this will solve the problems we are currently seeing in the way the application is running.