Flow run with many concurrent tasks intermittently crashing, ECS Task doesn't spin down #9837
Comments
Hm, that last log (Lines 2407 to 2414 in 179afa0) should result in our process exiting. It seems very weird that the ECS task would not exit. Can you see the status of the container on AWS?
I'm also curious about the container logs for the ECS task - is the issue here just that the tasks keep going after a crash, or do you think Prefect has something to do with the tasks crashing in the first place?
Well, Prefect is definitely crashing due to an error in
I do think Prefect is responsible for the tasks crashing, and that's part of my issue here. Prefect's concurrent task runner very reliably has trouble with more than a few hundred concurrent tasks submitted at once. I see this kind of crash about half the time I try to run any flow that sends more than a few hundred tasks to a concurrent task runner. What I sent above are the container logs for the ECS task. The traceback does not show up in the Prefect logs. Let me know if you're asking for something else that I'm not understanding.
I'll post a container status next time I see one of these crashes - I've got to catch it before my nightly script spins down hanging containers.
Ok, I've got one. The container status is "Running". Logs look essentially the same as above.
Just as a note, the task batcher in this package is solving my problem with crashes caused by submitting too many tasks at once, so that particular bugginess is less urgent on my end. I'm still seeing most crashed tasks fail to spin down.
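For context, a minimal sketch of the batching pattern I mean (not the linked package's actual implementation; `process_record` and `batch_size` are hypothetical names) looks roughly like this:

```python
# Rough sketch of batched submission (illustrative only; `process_record`
# and `batch_size` are hypothetical, not from the package linked above).
from prefect import flow, task


@task
def process_record(record: int) -> int:
    return record * 2


@flow
def batched_flow(records: list[int], batch_size: int = 100):
    results = []
    # Submit tasks in batches and wait for each batch to finish before
    # submitting the next, instead of submitting all ~200-500 futures at once.
    for start in range(0, len(records), batch_size):
        futures = [process_record.submit(r) for r in records[start : start + batch_size]]
        results.extend(f.result() for f in futures)
    return results
```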
This issue is stale because it has been open 30 days with no activity. To keep this issue open, remove the stale label or comment.
This issue was closed because it has been stale for 14 days with no activity. If this issue is important or you have more to add feel free to re-open it. |
Commenting here as I've recently run into the same issues using the
@WillRaphaelson would you mind reopening this issue or pointing me in the direction of a related issue that's open? |
Hi @austinweisgrau, we've added this to our backlog, but would also welcome a contributor. |
This issue is different from - but has the same resolution as - #10149. Essentially, there is a problem in the lower-level library we use to handle HTTP/2. As a temporary measure, you can set
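The specific setting is cut off above; assuming it refers to Prefect's HTTP/2 toggle, `PREFECT_API_ENABLE_HTTP2` (an assumption, not confirmed by the comment), a sketch of the workaround is to set it to false in the ECS task definition's environment, or early in the process before any Prefect client is created:

```python
# Hedged sketch: assuming the relevant setting is PREFECT_API_ENABLE_HTTP2,
# disabling HTTP/2 makes the Prefect client fall back to HTTP/1.1 and avoids
# the lower-level h2/httpcore issue. This must take effect before the client
# is created - ideally set it in the ECS task definition's environment.
import os

os.environ["PREFECT_API_ENABLE_HTTP2"] = "false"
```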
First check
Bug summary
A Prefect flow that submits several hundred concurrent Prefect tasks for execution intermittently crashes due to an exception raised in the Prefect engine/runner. This flow runs on prefect_aws.ECSTask infrastructure, which normally spins down and deregisters after a task finishes, but the ECS Task stays running indefinitely after the crash. The final exception shows up in the Prefect logs, but the stack trace does not, although it does show up in the CloudWatch/ECS logs.
Note: not sure if it's relevant, but I'm using my semaphore implementation described here to rate-limit the execution of these concurrent Prefect tasks - only 3 Prefect tasks execute at a time, but all ~200-500 are submitted initially.
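A rough sketch of that rate-limiting pattern (not necessarily the linked implementation; assumes Prefect 2.x async tasks on the ConcurrentTaskRunner): a shared semaphore gates execution while everything is submitted up front.

```python
# Rough sketch of semaphore rate limiting (assumes Prefect 2.x; not the exact
# implementation linked above). All task runs are submitted up front, but the
# shared semaphore allows only 3 to do work at a time.
import asyncio

from prefect import flow, task
from prefect.task_runners import ConcurrentTaskRunner

EXECUTION_SLOTS = asyncio.Semaphore(3)  # shared gate across all task runs


@task
async def rate_limited_work(item: int) -> int:
    async with EXECUTION_SLOTS:
        await asyncio.sleep(1)  # placeholder for the real work
        return item * 2


@flow(task_runner=ConcurrentTaskRunner())
async def semaphore_flow(n_items: int = 300):
    # Submit every task immediately; the semaphore, not the runner,
    # limits how many execute concurrently.
    futures = [await rate_limited_work.submit(i) for i in range(n_items)]
    return [await future.result() for future in futures]
```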
Reproduction
Error
Additional context
No response