Stuck Jobs #343
We've been using Gearman for years and years, but lately, over the last 6 months or so, we sometimes see jobs get "stuck". They will sit on a queue for hours or days if left alone; as soon as gearmand is restarted, or one of the workers for that queue, the job goes out right away. Worker logs show no initial attempts to do the job, and the theory that the worker died while the server still thinks it's working on the job is not the case: the server just never hands out the job, apparently.
I've always thought of gearmand as this rock-solid piece of infrastructure that I don't have to worry about, and that the problem is always with our client or worker code, but this seems like something else...
What could cause this? Is it related to #301?
Comments
Well, are you setting timeouts for any jobs? #301 only applies if you are using timeouts. If you are, then my advice is: don't. If you don't, then you shouldn't encounter that problem with gearmand. Personally, I have my workers wrap commands which do heavy work and may potentially exceed certain time constraints. Other questions needed for clarification:
The latest: 1.1.19.1+ds.
No, it's been over 2 years since the latest came out, right? :) Of course, in that time there have been multiple major Debian upgrades, etc., but no, no changes to gearmand.
Now THAT's a good question. Besides the Debian upgrades mentioned above, starting a year ago we began submitting jobs from a new application and added 2 new queues/functions. It's unclear, but it seems the "stuck" jobs are only happening with those 2 new queues. So one theory is that the new client software is constructing the job data in some problematic way, but it's so intermittent that it seems like only part of the explanation... By the way, the "6 months or so" is so vague because reports of this come from a junior sysadmin with a literal brain injury and a terrible memory problem, so it's just not clear how long this has been going on. It might have been a year; hard to say. More and more I'm thinking it must be a bug in our client code, but I just wanted to check that there's not some known issue with jobs "sticking," like something in the job data that's known to make gearmand choke, along the lines of "definitely don't ever put a Unicode umlaut in a job" or some such... Anyway, I'm so happy someone is actively dealing with this repo! Thanks.
Appreciate all the answers, but you didn't answer the most important question of whether you're setting timeouts on any of your jobs. Again, I recommend that you don't, since that's a known (and still unresolved) issue.
Is there anything notable in your gearmand.log file when a job gets stuck?
gearmand uses the OS's libevent*.so libraries for important aspects of its operation, so any changes to libevent could impact gearmand. So that's always a possibility, I feel, when OS upgrades are involved. You could try downgrading to gearmand 1.1.18, I suppose. Frankly, that's the version I've been using in production for ~4 years now, and it's been rock solid. I haven't had any issues, so I never bothered to upgrade my production environment to 1.1.19.1. (I do use 1.1.19.1 in my development environment, though, and haven't encountered any problems there either.) That said, I don't do non-basic things like setting timeouts on jobs, using any of the (quasi-deprecated) persistence plugins, or dynamically scaling the number of workers.
If a worker is disconnected from gearmand (i.e., killed) while a job is in the middle of processing, I think it might be possible for the job to get stuck in that scenario. Can you test that? I recommend avoiding killing workers while they are in the middle of working on a job and only killing workers when they are idle. Sometimes you have to, of course. I'm just saying I would try to avoid that whenever feasible.
I don't think we're using timeouts. They are not set by default, right? You've given us several interesting things to think about, thank you. A few other answers:
Correct.
I do the same thing, I think. I have my workers die after processing a certain number of jobs, and then I spawn new workers to replace them.
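A minimal sketch of that recycle-after-N-jobs pattern, using the same PHP pecl/gearman API that appears later in this thread (the 'heavy_task' function name and the do_heavy_work() helper are illustrative, not from this thread):

$maxJobs = 100; // recycle the worker after this many jobs
$jobsDone = 0;

$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('heavy_task', function (GearmanJob $job) {
    return do_heavy_work($job->workload()); // placeholder for the real job logic
});

while ($jobsDone < $maxJobs && $worker->work()) {
    if ($worker->returnCode() !== GEARMAN_SUCCESS) {
        break;
    }
    $jobsDone++;
}
// Exiting here lets a supervisor (systemd, supervisord, or a parent script)
// spawn a fresh worker in this one's place.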
So upon further digging into our code and adding more logging, I found that part of the problem is that sometimes the worker child was dying quietly and never returning info to the Gearman server, so the job never gets "released." And it was dying because of some "garbage" characters in the data it gets from Gearman. Oddly, though, now that I've changed that code to catch the error and return properly, the job gets tried a second time and works just fine, so that "garbage" is not inherent to the job data. Is it possible that gearmand is putting it at the end of the data, but only intermittently? The garbage data is: a newline, then the characters "RES", and then another newline, a single quote, and a semicolon. These characters are not visible in the Gearman persistent store (we're using SQLite) data for the job.
Could you post a full packet dump? RES should always be preceded by a null:
4 byte magic code - This is either "\0REQ" for requests or "\0RES" for responses
See http://gearman.org/protocol/
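For reference, the header described there is 12 bytes: the 4-byte magic, a 4-byte big-endian packet type, and a 4-byte big-endian payload size. A small PHP sketch of a sanity check against a captured buffer ($buffer is assumed to hold raw bytes read from the gearmand connection):

$magic = substr($buffer, 0, 4);
if ($magic !== "\0RES" && $magic !== "\0REQ") {
    echo "unexpected magic bytes: " . bin2hex($magic) . PHP_EOL;
}
$header = unpack('Ntype/Nsize', substr($buffer, 4, 8));
echo "packet type {$header['type']}, payload size {$header['size']}" . PHP_EOL;
// A bare "RES" without the leading null byte would suggest the 12-byte header
// was not stripped before the payload reached the job callback.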
This is so intermittent that it's difficult to get a packet dump. Is there a way to make gearmand log the raw packets it sends? From our worker logs it looks a lot like sometimes there's a RES at the end of job data, and sometimes not... But actually, maybe this is an issue with the Perl Gearman::Worker module (our worker code uses that to connect to the server)? Maybe it's not chopping off the magic code from the end of the data properly, sometimes? Thanks!
That would be difficult for me to believe. Are you using Gearman::Worker version 2.004.015? (2.004.013, from 2018-02-05, fixed a related bug.) That said, all of my workers are implemented using Gearman::Worker (2.004.015), and I've never seen that in many thousands of processed jobs over 4+ years. Then again, I'm using SSL connections and passing strictly ASCII payloads. Is there a possibility some of your jobs have UTF-8 payloads? (Just spit-ballin'.)
That is totally possible. But the real mystery is why it would only happen sometimes. If there's bad data, it seems like it would have the same problem each time it tries the job, but the second try always sails through, no problem... We'll check the Gearman::Worker version tomorrow. Thanks!
That's a good point about payloads. Not sure if it helps, but I'm a Gearman fan and really happy it's being continued; @esabol is an absolute legend. So I will add my 50 cents. The way to avoid any issues with any payloads is to send the payload to file storage and reference it by a GUID. You have to remember what Gearman is for: it's not its job to handle payloads, it's to pass a reference to a payload to use. @steevhise, good luck and hope you get it sorted.
That's very kind of you to say, @tomcoxcreative, but I'm standing on the shoulders of giants, and @SpamapS is much, much taller than me. To be honest, multithreaded event loop programming isn't my forté. I'm really only good at fixing portability issues and analyzing bug reports. 😄
As for @steevhise's question: yeah, I don't know. Maybe the UTF-8 encoding is invalid and that could screw up the parsing? I'm just trying to find differences between what you're doing and what I'm doing, and I know handling Unicode in Perl can be tricky. I recently watched this talk about that from the most recent Perl Conference: https://youtu.be/yH5IyYyvWHU. I base64-encode a lot of my job payloads to avoid having to deal with non-ASCII characters.
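A sketch of that base64 approach in PHP (function, field, and payload names are illustrative):

// Client side: JSON-encode, then base64-encode, so the payload is pure ASCII.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$payload = base64_encode(json_encode(['user_id' => 42, 'name' => 'Müller']));
$client->doBackground('update_user', $payload);

// Worker side: reverse the encoding before touching the data.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('update_user', function (GearmanJob $job) {
    $data = json_decode(base64_decode($job->workload()), true);
    // ... do the real work with $data ...
    return 'ok';
});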
So far, looking at the jobs that have been stuck, I haven't seen any Unicode. One additional trickiness about our use case is that these 2 new queues are supplied jobs by Node.js clients, whereas previously everything was Perl; we're using Abraxas as the Node client library. One other thing that's different is that these new queues' jobs contain a lot more data, so @tomcoxcreative's point is well taken. Previous queues only put a single ID that gets looked up in a database by the workers; these new ones have lots of unique data encoded as JSON. Maybe that's a mistake, but I didn't see any warnings against that in the docs, IIRC...
That the jobs come from newer/different clients could be a clue. Check to make sure that client doesn't set timeouts silently, and also check what REQ commands it is sending. gearmand shouldn't allow a weird or newish client to break older, stable workers, but it's possible that something deep in the protocol isn't mapping properly.
As for the larger JSON payloads: I think you're OK with 1 MB or less of data. It just gets ugly because each bit of data is a contiguous section of memory from the heap, and that can get pretty expensive, especially if your workload is intermittent.
And to @steevhise's "Thanks everyone for all the help, and I echo Tom's sentiments, I'm really glad there's such talented folks still working on Gearman!": I'm just here to ship releases, something I haven't done lately but hopefully will get going again soon. In the meantime, @esabol is a humble rockstar, and I think I speak for both of us when I say we really appreciate the kind words and your feedback as users.
@steevhise, if you are sending JSON in the messages, then it will be very difficult to measure your payload size. As @SpamapS mentions, the memory use for large payloads will end up exceeding your use case, and probably randomly, as you are seeing. So if, in front of Gearman, you send the payload to file storage and then read the payload back in from your Gearman trigger, you shouldn't see the random issues around payload size. You can also tidy up by using the OS's tmp directory or by removing the payloads as part of your trigger.
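A sketch of that offloading idea in PHP (the paths, the 'process_payload' function name, and $bigPayload are illustrative; it also assumes the temp directory is storage every worker can reach):

// Client: stash the payload, submit only a small reference.
$guid = bin2hex(random_bytes(16));
$path = sys_get_temp_dir() . '/gearman-payload-' . $guid . '.json';
file_put_contents($path, json_encode($bigPayload));

$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('process_payload', $guid);

// Worker: resolve the reference, do the work, then tidy up.
// (Assumes $worker is a GearmanWorker connected as in the other snippets.)
$worker->addFunction('process_payload', function (GearmanJob $job) {
    $path = sys_get_temp_dir() . '/gearman-payload-' . $job->workload() . '.json';
    $data = json_decode(file_get_contents($path), true);
    // ... do the real work with $data ...
    unlink($path); // remove the payload once the job is done
    return 'ok';
});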
I got stuck jobs very simply. My test function is called 'diameter'. The work-around is to run gearmand without threads.
Update: that was on 1.1.18+ds-3.1ubuntu1.
So we're good then? 😄 My guess as to why you would see this issue with 1.1.18 but not 1.1.19.1 is this commit: 45bd954
Well, my case is using 1.1.19.1, so it must be 2 different issues. And we're still seeing it and still haven't figured out why. Additional weirdness observed: sometimes a stuck job seems to make the whole server get whacked out and hold onto other queues' jobs too, or just not hand them out as fast (different queues than the function where the first stuck job appeared). Also, while investigating this, I noticed that sometimes a stuck job could be taken by a worker (whose child thread is just hanging, doing nothing) but also still be on the queue. My understanding of the protocol is that this shouldn't happen: a job taken by a worker would mean the server takes it out of the persistent store... Am I wrong?
"So we're good then?" was directed to @derde. Not dismissing your issue @steevhise !
I don't think you're wrong, but I do wonder if the problem you're seeing is a problem with the persistent storage layer and the whole interaction there. As you may know, the persistent storage support in gearmand has been quasi-deprecated in favor of a design pattern where job persistence is implemented by workers. It scales better and is generally more robust. (There are two frameworks for implementing such a system: one is called Gearstore and another is called Garavini. You might want to look into them. It's also straightforward to implement your own persistent storage tasks once you understand the design pattern.) I say "quasi-deprecated" because they are not officially deprecated and we still accept PRs for maintaining the persistent storage layers, but we've stopped active development on them.
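One sketch of moving persistence out of gearmand (this is not how Gearstore or Garavini work specifically; the table, column, and variable names are made up): the application records the job in its own database before submitting it, and a sweeper re-submits anything that never got marked done.

// Record the job first; the database row, not gearmand, is the source of truth.
$pdo = new PDO('sqlite:/var/lib/myapp/jobs.db'); // illustrative DSN
$guid = bin2hex(random_bytes(16));
$stmt = $pdo->prepare('INSERT INTO pending_jobs (guid, fn, payload, created_at) VALUES (?, ?, ?, ?)');
$stmt->execute([$guid, 'update_user', json_encode($payload), time()]); // $payload is whatever the job needs

$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('update_user', $guid, $guid); // pass the guid as the unique ID too

// In the worker, mark the pending_jobs row done only after the work succeeds;
// a periodic sweeper can re-submit rows that stay pending past some threshold.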
For us it's scaled pretty well to just have multiple Gearman servers (other than this particular problem), and we have some database things to make sure the same job never gets given to more than one of those. It seems against the basic idea of a queue server to offload storage of the jobs to the workers. I would think that the "authority" that hands out jobs should also be the arbiter of what each job is; otherwise, if a worker takes a job and screws it up, it potentially just goes away, and the server never knows what happened. Sorry, going off on a philosophical tangent...
Gearman isn't really a queue server; it just gets used that way because background jobs do queue. The state backends were added in part because so many people misused Gearman and painted themselves into a corner with background jobs.
However, there are very few guarantees in the protocol. It was designed for efficient job distribution and low latency, not the guarantees typically offered by pure messaging systems; it can't really do "at least once" or "only once".
The protocol does have a way to use multiple Gearman servers without assignment from a database, which is to hash the unique ID on the client side. When you submit a job, if the unique ID already exists, no new job is created and the same handle is returned to the client. This is one of Gearman's super powers. You can read more here: http://fewbar.com/2020/04/gearman-is-misunderstood/
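A short PHP sketch of submitting with an explicit unique ID (the function name and payload are illustrative): two submissions with the same unique coalesce into one job and return the same handle.

$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);

$payload = json_encode(['user_id' => 42, 'action' => 'refresh']);
// Derive the unique ID from the job's identity, not from random data, so that
// repeated submissions of the same logical job coalesce on the server.
$unique = sha1('update_user|42|refresh');
$handle = $client->doBackground('update_user', $payload, $unique);
echo "job handle: $handle" . PHP_EOL;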
I missed this the first time. Yes, you are mistaken. The point of the persistent store is to hold on to the job until it has been completed, in case there is a loss of state in gearmand itself. When background jobs are assigned to workers, they are marked as such only in memory on the gearmand server. They are only removed from the persistent queue when they are done, which is only when a WORK_FAIL, WORK_COMPLETE, or WORK_EXCEPTION packet has been received from a worker, or the job has exceeded the retry count. And likewise, they are only retrieved from the queue backend when gearmand is starting up, trying to recover its in-memory state as best it can.
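Tying that to the earlier finding about worker children dying quietly: a sketch of a worker callback that always reports a terminal status, so gearmand receives one of those packets and can finish the job's lifecycle. The do_update_user() helper is a placeholder, and $worker is assumed to be set up as in the other snippets.

$worker->addFunction('update_user', function (GearmanJob $job) {
    try {
        $result = do_update_user(json_decode($job->workload(), true)); // placeholder
        return $result; // a normal return sends WORK_COMPLETE
    } catch (Throwable $e) {
        // Report the failure instead of dying silently; a WORK_FAIL packet lets
        // gearmand remove or retry the job as described above.
        $job->sendFail();
        return null;
    }
});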
I also hit this kind of intermittent stuck job.
The worker is idle, but the server does not wake up the worker or send a job to it. GDB shows the worker's backtrace as follows:
(gdb) bt
#0 0x00007f479d729dc0 in __poll_nocancel () from /lib64/libc.so.6
#1 0x00007f477fba9c93 in gearman_wait(gearman_universal_st&) () from /lib64/libgearman.so.8
#2 0x00007f477fbac891 in gearman_worker_grab_job () from /lib64/libgearman.so.8
#3 0x00007f477fbad377 in gearman_worker_work () from /lib64/libgearman.so.8
#4 0x00007f477fdcfd7d in zif_gearman_worker_work () from /usr/lib64/php/modules/gearman.so
#5 0x000000000064eadc in ZEND_DO_FCALL_SPEC_RETVAL_UNUSED_HANDLER ()
#6 0x00000000005fcb1b in execute_ex ()
Even when I run the worker with ... Totally confused...
[root@WRDNMS tasks]# gearmand -V
gearmand 1.1.20 - https://github.com/gearman/gearmand/issues
[root@WRDNMS tasks]# php --ri gearman
gearman
gearman support => enabled
extension version => 2.0.6
libgearman version => 1.1.19.1
Default TCP Host => localhost
Default TCP Port => 4730
Hello, @HBSpy!
Your libgearman (1.1.19.1) is out of date; our latest release is 1.1.20. Your PHP extension is also very out of date. I recommend using the master branch of https://github.com/php/pecl-networking-gearman. (Don't use the current PECL release, since it's old and does not include the many bug fixes in the unreleased master branch.)
Also, your backtrace is for the worker; what's actually relevant to us is the server. If you are up to compiling your own gearmand and libgearman, please try this patch and report your findings here.
Thx, @esabol. With all components updated, including the gearmand-cpu-spinning patch, it still gets stuck.
Other jobs are normal, so running gdb or strace against gearmand is quite a mess. Temporarily, I added a timeout to the worker as a workaround:
$worker = new GearmanWorker();
$worker->addServer($this->config->gearman->host);
$worker->setId('user');
$worker->setTimeout(30 * 1000);
$worker->addFunction('update_user', [$this, 'update_user_func']);
do {
    $worker->work(); // blocks until a job arrives or the 30-second timeout elapses
    $returnCode = $worker->returnCode();
    if ($returnCode == GEARMAN_TIMEOUT) {
        echo "timeout" . PHP_EOL;
        continue;
    }
    if ($returnCode != GEARMAN_SUCCESS) {
        echo "returncode:" . $returnCode . PHP_EOL;
        break;
    }
} while (true);
I strongly do not recommend setting a timeout. Timeouts are kind of broken. That's a separate, known issue: #301. But maybe the PHP extension's setTimeout() is something different...
@esabol, the implementation calls ... If it is the case that we can't use it, does that not present an issue?
$gmworker->setTimeout($gminterval);
gc_disable();
$strat = new \STS\Backoff\Strategies\LinearStrategy(1);
$strat_count = 1;
Signal::init();
while ($gmworker->work() || $gmworker->returnCode() == GEARMAN_TIMEOUT) {
    if (Signal::isExit()) {
        $logger->log($this->getScriptName()."|EXIT CALLED via SIGNAL", Zend_Log::INFO);
        $this->doExit(0, $exitCallback);
    } elseif ($maxrun > 0 && (($this->started->getDuration() / 1000) > $maxrun)) {
        $logger->log($this->getScriptName()."|TIMED EXIT", Zend_Log::INFO);
        $this->doExit(3, $exitCallback);
    }
    switch ($gmworker->returnCode()) {
        case GEARMAN_NO_ACTIVE_FDS:
            sleep($strat->getWaitTime($strat_count));
            $strat_count++;
            break;
        case GEARMAN_TIMEOUT:
            $strat_count = 0;
            if (is_callable($timeoutCallback)) {
                $timeoutCallback();
                gc_collect_cycles();
            }
            break;
        case GEARMAN_SUCCESS:
        case GEARMAN_WORK_FAIL:
        case GEARMAN_WORK_EXCEPTION:
            $strat_count = 0;
            \Zend_Registry::getInstance()->get('db')->closeConnection();
            // tidy after completion
            gc_collect_cycles();
            // single process
            if (isset($options['single'])) {
                exit(0);
            }
            break;
        default:
            $logger->log($this->getScriptName()."|retcode: ".$gmworker->returnCode(), Zend_Log::ALERT);
            if (is_callable($exitCallback)) $exitCallback();
            break 2;
    }
}
This is an example of how we use the timeout so the worker does not hang forever in work().
Sure. But I might have confused this timeout with CAN_DO_TIMEOUT tasks? I don't remember anymore if those are the same things or not. CAN_DO_TIMEOUT is what is broken. Personally, I do not use any gearmand-level timeouts, and, after having tried to debug issue #301 for many days, I advise everyone else not to use gearmand's timeouts either. What I did instead was implement timeouts for jobs in my workers, rather than relying on gearmand to do it. The worker itself is responsible for enforcing a timeout on the work being done (often by simply wrapping the heavy command, as I described earlier). Anyway, that's just what I've done to avoid gearmand's problematic implementation of timeouts and/or jobs taking forever. YMMV. Frankly, I couldn't determine, after reading your post multiple times, whether you were reporting a problem with timeouts or reporting that timeouts are working well for you.
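A sketch of that worker-enforced timeout, wrapping the heavy work in coreutils timeout(1); the command path, the 300-second limit, and the 'heavy_task' function name are illustrative, and $worker is assumed to be set up as in the other snippets.

$worker->addFunction('heavy_task', function (GearmanJob $job) {
    $arg = escapeshellarg($job->workload());
    // Let the OS kill the command after 300 seconds; GNU timeout exits
    // with status 124 when the limit is hit.
    exec("timeout 300 /usr/local/bin/heavy-command $arg", $output, $status);
    if ($status === 124) {
        $job->sendFail(); // report the timeout as a failed job
        return null;
    }
    return implode("\n", $output);
});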