Investigate 'adaptive' mode for retries and improve concurrency of message sends from multiple services #1452

Closed
terrazoon opened this issue Dec 3, 2024 · 1 comment
terrazoon commented Dec 3, 2024

Right now, when we do large sends, we are hitting some ThrottlingExceptions with a "max retries exceeded: 4" message. These are not our Celery retries; they are the retries built into the AWS client. This occurs both when publishing messages and when calling FilterLogEvents to get the delivery receipts.

Apparently there is a new, and currently experimental, 'adaptive' mode for AWS retries which takes into account service limits and hopefully reduces the number of AWS retries:

https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html

Investigate switching over to this adaptive mode when it becomes less experimental.
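For reference, a minimal sketch of what opting in could look like, based on the `retries` option of `botocore.config.Config` described in the guide linked above. The helper names, the region, and the `max_attempts` value are assumptions for illustration, not settings taken from our code:

```python
def adaptive_retry_settings(max_attempts: int = 10) -> dict:
    """Build the `retries` dict for botocore's Config.

    `max_attempts` is the number of retries after the initial call.
    The default of 10 is an illustrative assumption, not a recommendation.
    """
    return {"mode": "adaptive", "max_attempts": max_attempts}


def make_client(service_name: str, region_name: str, max_attempts: int = 10):
    """Create a boto3 client that uses the adaptive retry mode (hypothetical helper)."""
    # Imported here so the settings helper above works without boto3 installed.
    import boto3
    from botocore.config import Config

    config = Config(
        region_name=region_name,
        retries=adaptive_retry_settings(max_attempts),
    )
    return boto3.client(service_name, config=config)


# Example (not executed here; the region is an assumption):
# logs = make_client("logs", "us-west-2")
# logs.filter_log_events(logGroupName="...")
```

The same `Config` object can be passed to any client, so the clients used for publishing and for FilterLogEvents could share one retry policy.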

Some additional guidance is also available here:

Ultimately, we also want to make sure that a large batch from one partner doesn't become a blocker for all other work to go through the system from other partners.
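One possible way to think about this fairness goal is round-robin interleaving: take one message from each partner's pending batch in turn, so a small batch is not stuck behind a large one. The sketch below is only an illustration of that idea; the function name and data shapes are hypothetical and it is not how our Celery tasks are actually structured:

```python
from collections import deque
from typing import Dict, Iterator, List, Tuple


def interleave_by_service(batches: Dict[str, List[str]]) -> Iterator[Tuple[str, str]]:
    """Yield (service, message) pairs, one message per service per turn."""
    # Each entry pairs a service name with its queue of pending messages.
    queues = deque(
        (service, deque(messages))
        for service, messages in batches.items()
        if messages
    )
    while queues:
        service, pending = queues.popleft()
        yield service, pending.popleft()
        # Services with work left go to the back of the line.
        if pending:
            queues.append((service, pending))


if __name__ == "__main__":
    demo = {"partner_a": ["a1", "a2", "a3"], "partner_b": ["b1"]}
    for service, message in interleave_by_service(demo):
        print(service, message)
```

With this ordering, `partner_b`'s single message goes out second instead of waiting for all of `partner_a`'s messages.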

@terrazoon terrazoon converted this from a draft issue Dec 3, 2024
@ccostino ccostino changed the title from "Investigate 'adaptive' mode for retries" to "Investigate 'adaptive' mode for retries and improve concurrency of message sends from multiple services" Dec 6, 2024
@terrazoon terrazoon self-assigned this Jan 2, 2025
@terrazoon terrazoon moved this from Issue Backlog (More than 3 Months) to 🏗 In progress (WIP: ≤ 3 per person) in Notify.gov product board Jan 2, 2025
@terrazoon terrazoon moved this from 🏗 In progress (WIP: ≤ 3 per person) to 👀 In review in Notify.gov product board Jan 16, 2025
terrazoon (Contributor, Author) commented

So adaptive mode is beta-ish, and we don't really have problems with the retries themselves as we currently have them. The problem was that the 'incomplete jobs' task was flagging jobs still running after 30 minutes as incomplete and trying to reprocess them, which we have already fixed.

Regarding improving concurrency, we've done all that and had a successful load test with 5 jobs of 2000 messages each running simultaneously. So closing as done.

@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in Notify.gov product board Jan 16, 2025