[ENT Batches] - Problem with Batches getting stuck and never completing #486
Comments
Tell me about your Redis setup. Are you using a remote Redis? Can you show me the
We're using a remote Redis provided by Redislabs, hosted in the same AZ and datacenter as our Faktory server. Info output: Redis RTT: 497.387 µs
Your Redis looks perfect. How often does this problem happen? Every time you run this workflow? The other tricky thing to look at is deployments. Do you see problems popping up around the times when you start/stop your worker processes?
It happens intermittently. It's not correlated to the workflow itself, but the workflow runs long enough for it to occur, since once a batch finishes another one is instantly created and enqueued, ad infinitum.
That might be the case, as our k8s cluster uses a mix of regular and spot instances as nodes, so both the Faktory server and the worker processes can be rescheduled to other nodes regularly. Another small thing we found in the logs: many i/o timeout errors have been logged, both for
Is your k8s cluster shutting down Faktory? You should not shut down Faktory until all worker processes are gone.
Hmm, by the nature of our k8s cluster we don't have any way to control that. Does Faktory provide any way to perform "graceful shutdowns" to counter node rescheduling?
That's just it: you need to treat Faktory like a database. You can't arbitrarily take it down unless all workers/clients are down too. I think this is called a StatefulSet? Your worker pods should declare a dependency on the Faktory service so that worker pods are shut down before Faktory.
Hmm okay, we'll try to configure our workers as a dependency of the Faktory server and will keep you posted.
I have a follow-up question about this, though. If Faktory's state is stored in Redis, why does the shutdown order matter? Wouldn't that scenario be similar to, for example, a transient networking error? The logged errors that originate from this issue are network errors, after all.
I've been reviewing another Batch that failed today for the same reason, and this time the Server pod hasn't been rescheduled, as both the
This time the Batch is stuck with 6 jobs pending; all of them appear in consecutive error logs printed by the Faktory server.
@Tylerian I would make sure you are shutting down your Faktory worker processes cleanly. If they are being killed, it's possible there's a bug in Faktory leaving job data in a half-baked state.
All our worker processes are configured to shut down gracefully. We've set a
Bear in mind that neither the Faktory Server process nor the Application Worker processes were killed by k8s the last time this issue popped up, which hints that it could be triggered by transient network errors such as connection timeouts.
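For reference, a minimal faktory_workers_go worker sketch of the "graceful shutdown" being discussed: `mgr.Run()` traps TERM/INT and stops fetching new work before exiting. The `BulkChunkJob` name and handler body are hypothetical placeholders, not code from this issue.

```go
package main

import (
	"context"
	"log"

	worker "github.com/contribsys/faktory_worker_go"
)

// bulkChunkJob is a hypothetical handler standing in for the jobs
// processed inside the batches described in this issue.
func bulkChunkJob(ctx context.Context, args ...interface{}) error {
	help := worker.HelperFor(ctx)
	log.Printf("working on job %s in batch %s", help.Jid(), help.Bid())
	// ... do one chunk of the bulk work here ...
	return nil
}

func main() {
	mgr := worker.NewManager()
	mgr.Concurrency = 20
	mgr.ProcessStrictPriorityQueues("critical", "default", "bulk")
	mgr.Register("BulkChunkJob", bulkChunkJob)

	// Run() blocks and handles TERM/INT signals, letting in-flight
	// jobs finish (or be FAILed back to the server) before exit.
	mgr.Run()
}
```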
Do you think the issue is transient network errors between Client <-> Faktory or Faktory <-> Redis? I would think the latter; I will review the batch code and make sure we're using MULTI where necessary to ensure transactional changes.
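For readers unfamiliar with the suggestion above, this is the general shape of grouping related Redis writes into a MULTI/EXEC transaction, sketched here with the go-redis client. The key and field names are illustrative assumptions, not Faktory's actual internal schema.

```go
package main

import (
	"context"
	"log"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Hypothetical hash key for a batch's bookkeeping.
	bid := "b-xyz123"

	// TxPipeline wraps the queued commands in MULTI/EXEC so the two
	// counter updates are applied as one atomic unit; if the
	// connection drops before EXEC, neither write happens.
	pipe := rdb.TxPipeline()
	pipe.HIncrBy(ctx, bid, "pending", -1)
	pipe.HIncrBy(ctx, bid, "succeeded", 1)
	if _, err := pipe.Exec(ctx); err != nil {
		log.Fatalf("batch update failed: %v", err)
	}
}
```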
I don't know, to be honest; both Faktory Server and Redis IP addresses have been reported in
What I keep observing is that Faktory fails to acknowledge the result of jobs within the Batch when a transient error occurs, be it a k8s deployment or a network error, even though the jobs are successfully processed later and removed from Faktory's queue, which might hint at some kind of bug in the Batching internals. Sorry for the lack of specificity, but it's all I can do as a mere observer from the outside.
I've not been able to track this down; I'll need more data if we want to solve this.
Checklist
Faktory v1.9.0
faktory_workers_go v1.9.0
Are you using an old version? No
Have you checked the changelogs to see if your issue has been fixed in a later version? Yes
Context
We're running a bulk process with Faktory which triggers millions of individual Jobs wrapped in Batches to split the work into manageable chunks.
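As a rough illustration of that pattern, here is a sketch using the batch helpers in the Faktory Go client (github.com/contribsys/faktory/client). The job types `ChunkJob` and `BatchDone`, the queue name, and the chunk size are hypothetical; error handling is abbreviated.

```go
package main

import (
	"log"

	faktory "github.com/contribsys/faktory/client"
)

func main() {
	cl, err := faktory.Open()
	if err != nil {
		log.Fatal(err)
	}

	// One batch per chunk of work; "ChunkJob" and "BatchDone" stand
	// in for the real job types used in the bulk process.
	batch := faktory.NewBatch(cl)
	batch.Description = "bulk import, chunk 1"
	batch.Success = faktory.NewJob("BatchDone", "chunk-1")

	err = batch.Jobs(func() error {
		for i := 0; i < 1000; i++ {
			job := faktory.NewJob("ChunkJob", i)
			job.Queue = "bulk"
			if err := batch.Push(job); err != nil {
				return err
			}
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```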
Problem
Sometimes the Batches UI page shows 1 pending Job which is neither running nor waiting to be processed in the queue, leaving the Batch stuck and never completing. The success/complete callbacks on the Batch aren't being called either.
When searching the logs, there is little to be seen. The most I've managed to find are networking error logs like the following:
Worker logs:
Unable to report JID Qtjl6r_Ifxh32kUQ result to Faktory: read tcp 172.16.114.109:56062->10.100.183.42:7419: i/o timeout
Server logs:
Unable to process timed job: cannot retry reservation: Job not found Qtjl6r_Ifxh32kUQ
No such job to acknowledge Qtjl6r_Ifxh32kUQ