Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High CPU load of gearman worker and daemon #232

Open
zacek opened this issue Mar 28, 2019 · 56 comments
Open

High CPU load of gearman worker and daemon #232

zacek opened this issue Mar 28, 2019 · 56 comments

Comments

@zacek
Copy link

zacek commented Mar 28, 2019

After some time (hours) both my workers and the gearman daemon start using CPU even though workers are mainly in waiting loop. It is very hard to reproduce but I've compared the standard situation with the berserk mode and can share some details. My workers are connected to 2 gearman daemons.

Standard worker in waiting loop - it is waiting for a new job with timeout of 5 second. When no job arrives within this period, some stats are written and the worker calls gearman_worker_work() again etc. This is the strace of one loop:

01 06:53:02.018383 clock_gettime(CLOCK_REALTIME, {1553773982, 18402886}) = 0
02 06:53:02.018449 sendto(12, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
03 06:53:02.018530 recvfrom(12, 0xdad079, 8192, 16384, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
04 06:53:02.018619 poll([{fd=12, events=POLLIN}, {fd=11, events=POLLIN}, {fd=9, events=POLLIN}], 3, 5000) = 1 ([{fd=12, revents=POLLIN}])
05 06:53:02.018690 recvfrom(12, "\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 12
06 06:53:02.018751 sendto(11, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
07 06:53:02.018811 recvfrom(11, 0xda8a59, 8192, 16384, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
08 06:53:02.018870 poll([{fd=11, events=POLLIN}, {fd=9, events=POLLIN}], 2, 5000) = 1 ([{fd=11, revents=POLLIN}])
09 06:53:02.018938 recvfrom(11, "\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 12
10 06:53:02.019001 sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
11 06:53:02.019072 sendto(11, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
12 06:53:02.019127 poll([{fd=12, events=POLLIN}, {fd=11, events=POLLIN}, {fd=9, events=POLLIN}], 3, 5000) = 0 (Timeout)
13 06:53:07.023813 clock_gettime(CLOCK_REALTIME, {1553773987, 24649765}) = 0

the clock_gettime (lines 1 and 13) is called directly before and after the gearman_worker_work() method (lines 2 - 12). Descriptors 11 and 12 are tcp connections to 2 different gearman daemons (G1 and G2). As you can see, it works as expected:

02: GRAB_JOB_ALL from G2
03: read response - failure (not ready yet)
04: wait for the response from G2
05: response from G2: NO_JOB
06-09: the same as 02-05 with G1
10 and 11: report sleep mode to G1 and G2
12: wait for new job with 5 second timeout (and it really timed out - no job arrived within 5 s)
13: see the clock_gettime is called exactly 5 seconds after step 12

But later this changes. I did not find why and what the trigger is (if any) but this is the new strace in the berserk mode (here the G1 and G2 are descriptors 12 and 13, it is a different process):

     1  07:51:14.237032 sendto(12, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
     2  07:51:14.237108 recvfrom(12, 0x25e74a9, 8192, 16384, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
     3  07:51:14.237237 poll([{fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}], 3, 5000) = 1 ([{fd=12, revents=POLLIN}])
     4  07:51:14.244517 recvfrom(12, "\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 12
     5  07:51:14.244757 sendto(13, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
     6  07:51:14.244828 sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
     7  07:51:14.245146 sendto(13, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
     8  07:51:14.245214 poll([{fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}], 3, 5000) = 1 ([{fd=13, revents=POLLIN}])
     9  07:51:14.245629 sendto(12, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    10  07:51:14.245701 recvfrom(12, 0x25e74a9, 8192, 16384, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
    11  07:51:14.245935 poll([{fd=12, events=POLLIN}, {fd=10, events=POLLIN}], 2, 5000) = 1 ([{fd=12, revents=POLLIN}])
    12  07:51:14.255829 recvfrom(12, "\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 12
    13  07:51:14.255896 sendto(13, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    14  07:51:14.255980 sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    15  07:51:14.256065 sendto(13, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    16  07:51:14.256128 poll([{fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}], 3, 5000) = 1 ([{fd=13, revents=POLLIN}])
    17  07:51:14.256257 sendto(12, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    18  07:51:14.256333 recvfrom(12, 0x25e74a9, 8192, 16384, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
    19  07:51:14.256450 poll([{fd=12, events=POLLIN}, {fd=10, events=POLLIN}], 2, 5000) = 1 ([{fd=12, revents=POLLIN}])
    20  07:51:14.266611 recvfrom(12, "\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 12
    21  07:51:14.266677 sendto(13, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    22  07:51:14.266748 sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    23  07:51:14.267042 sendto(13, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    24  07:51:14.267154 poll([{fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}], 3, 5000) = 1 ([{fd=13, revents=POLLIN}])
    25  07:51:14.267333 sendto(12, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    26  07:51:14.267426 recvfrom(12, 0x25e74a9, 8192, 16384, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
    27  07:51:14.267487 poll([{fd=12, events=POLLIN}, {fd=10, events=POLLIN}], 2, 5000) = 1 ([{fd=12, revents=POLLIN}])
    28  07:51:14.272557 recvfrom(12, "\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 12
    29  07:51:14.272622 sendto(13, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    30  07:51:14.272692 sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    31  07:51:14.272766 sendto(13, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    32  07:51:14.272827 poll([{fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}], 3, 5000) = 1 ([{fd=13, revents=POLLIN}])
    33  07:51:14.272940 sendto(12, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    34  07:51:14.273166 recvfrom(12, 0x25e74a9, 8192, 16384, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
    35  07:51:14.273264 poll([{fd=12, events=POLLIN}, {fd=10, events=POLLIN}], 2, 5000) = 1 ([{fd=12, revents=POLLIN}])
    36  07:51:14.278751 recvfrom(12, "\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 12
    37  07:51:14.278815 sendto(13, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    38  07:51:14.278943 sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    39  07:51:14.279020 sendto(13, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    40  07:51:14.279090 poll([{fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}], 3, 5000) = 1 ([{fd=13, revents=POLLIN}])
    41  07:51:14.279306 sendto(12, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    42  07:51:14.279468 recvfrom(12, 0x25e74a9, 8192, 16384, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
    43  07:51:14.279663 poll([{fd=12, events=POLLIN}, {fd=10, events=POLLIN}], 2, 5000) = 1 ([{fd=12, revents=POLLIN}])
    44  07:51:14.285060 recvfrom(12, "\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 12
    45  07:51:14.285177 sendto(13, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    46  07:51:14.285256 recvfrom(13, "\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n"..., 8192, MSG_NOSIGNAL, NULL, NULL) = 60
    47  07:51:14.285331 sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    48  07:51:14.285420 sendto(13, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    49  07:51:14.285486 poll([{fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}], 3, 5000) = 1 ([{fd=13, revents=POLLIN}])
    50  07:51:14.285648 sendto(12, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    51  07:51:14.285722 recvfrom(12, 0x25e74a9, 8192, 16384, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
    52  07:51:14.285913 poll([{fd=12, events=POLLIN}, {fd=10, events=POLLIN}], 2, 5000) = 1 ([{fd=12, revents=POLLIN}])
    53  07:51:14.291351 recvfrom(12, "\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 12
    54  07:51:14.291438 sendto(13, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    55  07:51:14.291513 sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    56  07:51:14.291587 sendto(13, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
    57  07:51:14.291649 poll([{fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}], 3, 5000) = 1 ([{fd=13, revents=POLLIN}])
...
  3648  07:51:17.203002 poll([{fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}], 3, 5000) = 1 ([{fd=13, revents=POLLIN}])
  3649  07:51:17.203081 sendto(12, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
  3650  07:51:17.203151 recvfrom(12, 0x25e74a9, 8192, 16384, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
  3651  07:51:17.203207 poll([{fd=12, events=POLLIN}, {fd=10, events=POLLIN}], 2, 5000) = 1 ([{fd=12, revents=POLLIN}])
  3652  07:51:17.208592 recvfrom(12, "\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 12
  3653  07:51:17.208660 sendto(13, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
  3654  07:51:17.208837 recvfrom(13, "\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n"..., 8192, MSG_NOSIGNAL, NULL, NULL) = 72
  3655  07:51:17.208933 sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
  3656  07:51:17.209006 sendto(13, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
  3657  07:51:17.209116 poll([{fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=10, events=POLLIN}], 3, 5000) = 0 (Timeout)

As you can see, lines 1 - 4 are normal (communication with G1) but after line 5 (GRAB_JOB_ALL on G2) there is no recvfrom(13, ...) call (not waiting for response from G2, why?) and directly it goes to sleep mode (lines 6 for G1 and 7 for G2) and waits for any input on both connections. G2 sends back the NO JOB response (line 8) which causes that the poll() call on line 8 returns immediately (diff time between line 9 and 8) and it again starts with polling G1, sending request to G2, but again without waiting for response etc. This repeats several times and then finally the descriptor of G2 is read (line 46) returning accumulated NO JOB responses (60 characters) and it goes this way for some time until probably everything G2 has sent is read (last 72 bytes on line 3654) and poll call finally wait without CPU load for the timeout.

In this mode, both the worker and gearman daemon use lots of CPU even though doing nothing because the gearman daemon G2 is called too often only to learn there is nothing to do. This is the summary comparison (10 seconds), in normal mode:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         5           poll
  0.00    0.000000           0         8           sendto
  0.00    0.000000           0         8         4 recvfrom

and in berserk mode:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0        98           poll
  0.00    0.000000           0       199           sendto
  0.00    0.000000           0       148        49 recvfrom

The program works, there is no disruption but the CPU load on the gearman daemon server goes up with every worker that goes berserk. The workers use 1.5% CPU each (normally it is not measurable) and gearman daemon can go to several hundreds (multi CPU server).

@zacek
Copy link
Author

zacek commented Mar 28, 2019

I forgot to mention that I've experienced this problem with both 1.1.18+6 (commit fa2a4ea) and 1.1.18+137 (commit 64716ae).

@esabol
Copy link
Member

esabol commented Mar 28, 2019

Operating system/distro/kernel versions?
libevent version?
gcc version?

@zacek
Copy link
Author

zacek commented Mar 29, 2019

OS: linux
distro: debian 8.11
kernel: 3.16.0-7-amd64 #1 SMP Debian 3.16.59-1 (2018-10-03) x86_64 GNU/Linux
libevent-2.0-5: 2.0.21-stable-2+deb8u1
gcc version 4.9.2 (Debian 4.9.2-10+deb8u2)

@zacek
Copy link
Author

zacek commented Apr 1, 2019

I thought that the problem could be caused by the fact that my clients are connected to 2 gearman servers. So I've connected some of them to one gearman server, some clients to the other and some to both. It did not help - after a few hours it went mad again and the strace of the clients connected to one server only looked as the example above, just with one connection. Additionally, some clients are written in C++, some in PHP and both are affected so the problem must be in the gearman library or the daemon itself.

@esabol
Copy link
Member

esabol commented Apr 1, 2019

I forgot to mention that I've experienced this problem with both 1.1.18+6 (commit fa2a4ea) and 1.1.18+137 (commit 64716ae).

I'm guessing you have, but, just to be clear, have you tried the straight up 1.1.18 release? Or any other releases?

I thought that the problem could be caused by the fact that my clients are connected to 2 gearman servers.

You mean workers, I presume?

So I've connected some of them to one gearman server, some clients to the other and some to both. It did not help - after a few hours it went mad again and the strace of the clients connected to one server only looked as the example above, just with one connection. Additionally, some clients are written in C++, some in PHP and both are affected so the problem must be in the gearman library or the daemon itself.

My workers are written in Perl, and the jobs are submitted via PHP or Perl. Fwiw, I'm not seeing this. I'm on CentOS 7 though, not Debian.

@zacek
Copy link
Author

zacek commented Apr 2, 2019

I've built the 1.1.18 version and installed it. I'll let you know if it works or not. The problem takes usually a few hours or 1 day to show.

@SpamapS
Copy link
Member

SpamapS commented Apr 2, 2019

Thanks for working through it @zacek . I had something similar a while back and found a path through the state machine that ended in a busy wait loop in some cases. I wonder if your problem is related to #230 .. do you know if any signals are being delivered to gearmand, like maybe systemd sending it a HUP?

@zacek
Copy link
Author

zacek commented Apr 3, 2019

Hi @SpamapS, thanks for your feedback.

I have several types of workers that deliver part of the application functionality. There is always one "manager" process that manages workers of one type - starts the workers, checks if they are still running (using signal 0 sent to worker), restarting missing workers and collecting and sending statistics that workers write to redis. The manager is not connected to gearman. When manager receives the SIGINT, SIGERM or SIGHUP, it sends the signal to all running workers and waits for all of them to finish. All workers do install signal handlers for SIGINT, SIGTERM and SIGHUP and if such signal arrives they finish their job (if any) and then exit.

It is possible, that some manager is restarted and thus signals workers connected to gearman daemon while the other workers are running. I do not send any signals to the gearman daemon. I'm only regularly (in cron) using gearadmin to control the queue sizes. This is not sending any signals, AFAIK.

But the problem you've reported is in the common libgearman code and could affect both daemon and workers. So my answer is yes, IMHO, it could be related.

@zacek
Copy link
Author

zacek commented Apr 3, 2019

The stable version 1.1.18 has the same problem - after approx. 14 hours of normal operation the CPU is high again.

@zacek
Copy link
Author

zacek commented Apr 3, 2019

cpu_usage

@zacek
Copy link
Author

zacek commented Apr 5, 2019

I have to re-verify the 1.1.18 stable build - I've found out that some clients were running on another server and used still the latest version of libgearman. Now I've been running the stable 1.1.18 build for 16 hours and the problem did not appear yet. I'll check it again after the weekend and let you know.

@zacek
Copy link
Author

zacek commented Apr 15, 2019

I've tested the 1.1.18 stable build for the whole week an the problem did not re-appear.

That's fine but it doesn't prove anything. We are running 3 environments - alpha, staging and production. All of them were running the same version of gearman (1.1.18 + 6 commits) but the problem was detected in production only. We were not able to reproduce it anywhere else. After some time the problem appeared on alpha as well. I have full access there and could do some problem analysis. I've also built and tested the latest gearman version (master branch - 1.1.18 + 136 commits) but it did not help - still broken. I must admit - we do not know what the trigger is, staging is still ok (for several months), production not.

We try to deploy the stable version to production with the next release.

@SpamapS
Copy link
Member

SpamapS commented Apr 15, 2019

Thanks for all of the awesome data zacek. This might be a tough one to nail down, and all the data we can get, we'll take.

@esabol
Copy link
Member

esabol commented Apr 15, 2019

Could this issue be related to https://bugs.launchpad.net/bugs/1176324 ?

@SpamapS
Copy link
Member

SpamapS commented Apr 15, 2019

@esabol it's possible, but seems unlikely. That one was centered around an incorrect state machine handling of a race, which was corrected.

@tsmgeek
Copy link

tsmgeek commented Aug 24, 2021

I seem to be getting this problem as well, I have had the same issue for years but have just restarted nodes as needed, server does not seem to need restarting.

For the most part it seems to happen after periods of high persistent load, sometimes it just clears itself up other times I have to restart the offending workers to clear it.

The high CPU is also accompanied with high network load between server/workers, I did a pcap and its just stuck in a very fast loop.

What I found odd though is the worker nodes that it can affect may be completely dormant and just waiting for jobs, by that I mean I can say for certain that there will be no jobs of the task that is assigned to workers on that node, but the high network traffic will still affect it.

image

@gfto
Copy link

gfto commented Nov 11, 2021

See this strace which demonstrates the problem: https://gist.github.com/gfto/474c25c3585fa20f24f0f95b9afa8120

The issue is that the client starts looping (the poll() is being interrupted) if it does not receive the answers to the two packets that it always sends in single response packet:

It goes like this - the client always sends:

GEARMAN_COMMAND_GRAB_JOB - sendto(75, "\0REQ\0\0\0\t\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
GEARMAN_COMMAND_PRE_SLEEP - sendto(75, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12

then it goes to sleep:

poll([{fd=75, events=POLLIN}, {fd=19, events=POLLIN}], 2, 10000) = 1 ([{fd=75, revents=POLLIN}])

once it's woken up by the server it reads the answer but if the answer does not contain two responses it starts looping until it gets two responses in single packet.

This is OK as response from the server:

recvfrom(75, "\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 24

This response causes looping in the client (and high cpu usage in the client and in the server):

1636624475.880816 recvfrom(75, "\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 12

See https://gist.github.com/gfto/474c25c3585fa20f24f0f95b9afa8120 it is from a normal worker that just waits for jobs. See how sometimes the cycle takes just a single iteration and sometimes how it loops until it gets the two response packets as single network packet.

@gfto
Copy link

gfto commented Nov 11, 2021

The following fixes the issue for good:

diff --git a/libgearman-server/server.cc b/libgearman-server/server.cc
index cc9be0b..b88be9a 100644
--- a/libgearman-server/server.cc
+++ b/libgearman-server/server.cc
@@ -579,18 +579,6 @@ gearmand_error_t gearman_server_run_command(gearman_server_con_st *server_con,
         /* Remove any timeouts while sleeping */
         gearman_server_con_delete_timeout(server_con);
       }
-      else
-      {
-        /* If there are jobs that could be run, queue a NOOP packet to wake the
-          worker up. This could be the result of a race codition. */
-        ret= gearman_server_io_packet_add(server_con, false,
-                                          GEARMAN_MAGIC_RESPONSE,
-                                          GEARMAN_COMMAND_NOOP, NULL);
-        if (gearmand_failed(ret))
-        {
-          return gearmand_gerror("gearman_server_io_packet_add", ret);
-        }
-      }
     }

     break;

@tsmgeek
Copy link

tsmgeek commented Nov 11, 2021

@gfto why does it only receive one response, does that not also indicate an issue on the server side sending the message, as its TCP we should not be seeing a loss of these messages ide have thought?

@esabol
Copy link
Member

esabol commented Nov 11, 2021

@gfto wrote:

The issue is that the client starts looping ...

You mean the worker, not the client, right? Using client and worker interchangeably was confusing me for a minute there. 😄

The following fixes the issue for good:

It's not clear to me why sending a NOOP to a worker who has just sent a PRE_SLEEP message to the server would cause a problem. Why do you think it's freaking out?

once it's woken up by the server it reads the answer but if the answer does not contain two responses it starts looping until it gets two responses in single packet.

Why does the worker require two responses in a single packet? I'm showing my ignorance here. Is this a requirement of the Gearman/worker protocol? If not, shouldn't the worker be able to handle this? Does this indicate a problem with the worker or rather the library used to implement the worker? Not saying the server (i.e., gearmand) shouldn't be able to handle this more gracefully as well. I'm just wondering if the issue is multi-faceted.

What language is this worker implemented in?

@SpamapS
Copy link
Member

SpamapS commented Nov 12, 2021

I seem to be getting this problem as well, I have had the same issue for years but have just restarted nodes as needed, server does not seem to need restarting.

For the most part it seems to happen after periods of high persistent load, sometimes it just clears itself up other times I have to restart the offending workers to clear it.

@tsmgeek seems to have also found that there's some connection between worker interactions and the server. I wonder if it's as simple as a worker and server sort of getting into a game of "give me a job, I'll sleep till you do" and then the server saying "Nope, no jobs" is misconstrued by the client to mean "I should send another GRAB JOB and the loop just goes on forever.

@gfto seems to have a found that the opposite is also part of it.. the server says "I DO have jobs" and the client expects a JOB_ASSIGN but doesn't get it and sends a GRAB_JOB just as it's about to receive JOB_ASSIGN and now they're out of sync.

These can probably be tested, but it feels like we might have a state machine race. Ugh.

@gfto
Copy link

gfto commented Nov 12, 2021

@gfto why does it only receive one response, does that not also indicate an issue on the server side sending the message, as its TCP we should not be seeing a loss of these messages ide have thought?

The client does receive all the responses but in different network packets. The second one triggers wake-up and then the client asks the server again and this leads to the issue at hand.

I guess the server receives the first packet and responds then receives the second one and responds.

Sometimes the server receives the two packets and processes them responding with a single packet that contains the two responses. At least that is what I'm see while strace-ing the client and the server (unfortunately the code is hard to follow, contains too many abstractions for my liking and it's just plain weird in the places where the bug is).

@gfto
Copy link

gfto commented Nov 12, 2021

These can probably be tested, but it feels like we might have a state machine race. Ugh.

It's exactly that, unfortunately. I've tried to fix the client and the server state machines but the way they're coded makes it pretty hard. I've stared at the code for several days and still can not understand it :/ See for example https://github.com/gearman/gearmand/blob/master/libgearman/worker.cc#L768 (simplified bellow):

  while (1)
    switch (worker->state)
      case GEARMAN_WORKER_STATE_START:
        for (worker->con= (&worker->universal)->con_list; worker->con; worker->con= worker->con->next_connection())
          if (worker->con->socket_descriptor_is_valid() == false)
            for (worker->function= worker->function_list; worker->function; worker->function= worker->function->next)
              case GEARMAN_WORKER_STATE_CONNECT:

or this https://github.com/gearman/gearmand/blob/master/libgearman/worker.cc#L830
I've never seen such "creative" use of switch/case in C/C++. I even re-indented the code but it doesn't help, it's just weird.

The good news is that the patch fixes the biggest issue which just wastes CPU cycles in both server and the client. The bad news is that the fix uncovered other problems related to EAGAIN handling in server io routines. I'm working on fixing that.

I'm lucky because the great QA guys I'm working with are able to reproduce the problems using small VM and multiple stress-ng processes which after running for some time cause gearmand to start spinning more and more (much less with the patch).

@gfto
Copy link

gfto commented Nov 12, 2021

By "client" I mean the worker (the one that calls gearman_worker_work() function). In my setup I have one worker per gearmand client that's why I'm using client interchangeably with worker.

It's not clear to me why sending a NOOP to a worker who has just sent a PRE_SLEEP message would cause a problem. Why do you think it's freaking out?

Because it interrupts the poll() used for sleeping in the client (the PRE_SLEEP phase).

Once the poll() it's interrupted (because there is data in the socket) the client/worker send again "GRAB_JOB/PRE_SLEEP" to the server and goes to sleep. The problem happens when the client receives the answer to only GRAB_JOB and then another answer to PRE_SLEEP which wakes the client again and the loop happens. See the strace.

This is kind of OK:

1636624445.850040 poll([{fd=75, events=POLLIN}, {fd=19, events=POLLIN}], 2, 10000


) = 0 (Timeout)
1636624455.859470 sendto(75, "\0REQ\0\0\0\t\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
1636624455.859560 sendto(75, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
1636624455.859617 poll([{fd=75, events=POLLIN}, {fd=19, events=POLLIN}], 2, 10000) = 1 ([{fd=75, revents=POLLIN}])
1636624455.859665 sendto(75, "\0REQ\0\0\0\t\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
1636624455.859712 recvfrom(75, "\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 24
1636624455.859778 sendto(75, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
1636624455.859842 poll([{fd=75, events=POLLIN}, {fd=19, events=POLLIN}], 2, 10000

This is a problem:

) = 0 (Timeout)
1636624505.914500 sendto(75, "\0REQ\0\0\0\t\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
1636624505.914604 sendto(75, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
1636624505.914659 poll([{fd=75, events=POLLIN}, {fd=19, events=POLLIN}], 2, 10000) = 1 ([{fd=75, revents=POLLIN}])
1636624505.914836 sendto(75, "\0REQ\0\0\0\t\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
1636624505.914954 recvfrom(75, "\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 12
1636624505.915062 sendto(75, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
1636624505.915187 poll([{fd=75, events=POLLIN}, {fd=19, events=POLLIN}], 2, 10000) = 1 ([{fd=75, revents=POLLIN}])
1636624505.915245 sendto(75, "\0REQ\0\0\0\t\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
1636624505.915374 recvfrom(75, "\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 12
1636624505.915432 sendto(75, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
1636624505.915549 poll([{fd=75, events=POLLIN}, {fd=19, events=POLLIN}], 2, 10000) = 1 ([{fd=75, revents=POLLIN}])
1636624505.915661 sendto(75, "\0REQ\0\0\0\t\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
1636624505.915722 recvfrom(75, "\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 12
1636624505.915777 sendto(75, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
1636624505.915912 poll([{fd=75, events=POLLIN}, {fd=19, events=POLLIN}], 2, 10000) = 1 ([{fd=75, revents=POLLIN}])
1636624505.916515 sendto(75, "\0REQ\0\0\0\t\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
1636624505.916648 recvfrom(75, "\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 24
1636624505.916708 sendto(75, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
1636624505.916760 poll([{fd=75, events=POLLIN}, {fd=19, events=POLLIN}], 2, 10000

After the patch the poll() isn't getting interrupted.

@SpamapS
Copy link
Member

SpamapS commented Nov 12, 2021

@gfto is there any way you can try out https://github.com/SpamapS/rustygear in your QA scenario? It's not 100% functionally equivalent to gearmand.. but if it can be slotted in.. it might tell us something. Or if you can contribute a test case (can just be a documentation file) that shows us how to reproduce, I can try that.

I'm with you that the state machine is really hard to read and debug. I tried to make rustygear easier to read for that reason.

@esabol
Copy link
Member

esabol commented Nov 13, 2021

After the patch the poll() isn't getting interrupted.

OK, but is there any downside to removing that code? If I'm reading the code correctly, the intention here is to inform the worker that it shouldn't sleep because there are jobs waiting. So what happens if the server doesn't do that? The worker goes to sleep, right? And the job sits in the queue potentially longer until a worker (probably a different worker) grabs it. So this might increase job latency in some scenarios, right? I could live with that if it prevents this bug, but it seems like the ideal solution is to fix the race condition.... (Easier said than done.) I 100% agree that the case statements at different levels of indentation are dizzying.

@SpamapS
Copy link
Member

SpamapS commented Nov 13, 2021

I agree, @esabol that it doesn't make sense to leave the worker asleep if there are jobs. The whole point of PRE_SLEEP is basically to let the server know that it needs waking up if there are jobs.

One thing is that this may also be an accident of the protocol. There's not a lot of flow control in this multiplexed protocol, and it's possible that when gearman's protocol was designed, the slowness of systems and networks just didn't allow this scenario to happen very often if ever. I've oft wondered if we should redesign gearman around websockets and this is at least a data point for trying out something like that.

@gfto
Copy link

gfto commented Nov 15, 2021

I'm running gearmand using:

runuid -s gearmand \
  /usr/sbin/gearmand -L 127.0.0.1 -p 4730 --verbose NOTICE --log-file stderr \
	--keepalive --keepalive-idle 120 --keepalive-interval 120 --keepalive-count 3 \
	--queue-type libsqlite3 --libsqlite3-db=/var/lib/gearmand/gearmand.queue.db \
	--store-queue-on-shutdown \
	--libsqlite3-table gearman_queue

To solve the issue reported here I ended up applying the following patch: https://georgi.unixsol.org/programs/gearmand-cpu-spinning.patch

In my testing this reduced the CPU usage after multple OOM in the VM in the workers and gearmand at least 10 times. I know that adding usleep() is a nasty hack but in this case it works.

Here is what the process is consuming during normal operation before any OOM killer action has taken place:

10:05:31 AM       PID    %usr %system  %guest    %CPU   CPU  Command
10:06:01 AM      4259    0.00    0.03    0.00    0.03     6  gearmand
10:06:31 AM      4259    0.03    0.03    0.00    0.07     6  gearmand
10:07:01 AM      4259    0.00    0.07    0.00    0.07     6  gearmand
10:07:31 AM      4259    0.03    0.03    0.00    0.07     6  gearmand
10:08:01 AM      4259    0.03    0.03    0.00    0.07     6  gearmand
10:08:31 AM      4259    0.00    0.03    0.00    0.03     7  gearmand
10:09:01 AM      4259    0.03    0.07    0.00    0.10     7  gearmand
10:09:31 AM      4259    0.03    0.03    0.00    0.07     4  gearmand
10:10:01 AM      4259    0.03    0.07    0.00    0.10     4  gearmand
10:10:31 AM      4259    0.00    0.07    0.00    0.07     1  gearmand
10:11:01 AM      4259    0.03    0.03    0.00    0.07     1  gearmand
10:11:31 AM      4259    0.00    0.07    0.00    0.07     1  gearmand

Here is the consumption of the idle gearmand service, without the applied fixes, after serious OOM killing:

09:53:30 AM       PID    %usr %system  %guest    %CPU   CPU  Command
09:54:00 AM     31379    4.27   20.17    0.00   24.43     7  gearmand
09:54:30 AM     31379    3.27   10.97    0.00   14.23     6  gearmand
09:55:00 AM     31379    5.63   17.67    0.00   23.30     6  gearmand
09:55:30 AM     31379    3.03    9.63    0.00   12.67     7  gearmand
09:56:00 AM     31379    3.23    9.53    0.00   12.77     7  gearmand
09:56:30 AM     31379    1.20    3.63    0.00    4.83     5  gearmand
09:57:00 AM     31379    6.00   18.13    0.00   24.13     5  gearmand
09:57:30 AM     31379    2.67    9.03    0.00   11.70     5  gearmand
09:58:00 AM     31379    4.77   15.90    0.00   20.67     5  gearmand
09:58:30 AM     31379    3.00   10.13    0.00   13.13     5  gearmand
09:59:00 AM     31379    2.63    8.70    0.00   11.33     5  gearmand
09:59:30 AM     31379    3.77   11.67    0.00   15.43     5  gearmand

And, finally, this is the consumption with the applied fix (again after after serious OOM kill action):

10:25:43 AM       PID    %usr %system  %guest    %CPU   CPU  Command
10:26:13 AM      4259    0.40    1.73    0.00    2.13     4  gearmand
10:26:43 AM      4259    0.33    1.03    0.00    1.37     4  gearmand
10:27:13 AM      4259    0.27    0.77    0.00    1.03     4  gearmand
10:27:43 AM      4259    0.27    0.80    0.00    1.07     4  gearmand
10:28:13 AM      4259    0.27    0.83    0.00    1.10     4  gearmand
10:28:43 AM      4259    0.30    0.73    0.00    1.03     5  gearmand
10:29:13 AM      4259    0.27    0.80    0.00    1.07     5  gearmand
10:29:43 AM      4259    0.40    1.60    0.00    2.00     6  gearmand
10:30:13 AM      4259    0.30    1.37    0.00    1.67     6  gearmand
10:30:43 AM      4259    0.27    0.83    0.00    1.10     6  gearmand
10:31:13 AM      4259    0.30    0.73    0.00    1.03     6  gearmand
10:31:43 AM      4259    0.33    0.77    0.00    1.10     6  gearmand

As can be seen, the gearmand service's CPU consumption still gets affected after OOM kills on the node, but the effect seems to be at least 10 times lighter.

@tsmgeek
Copy link

tsmgeek commented Nov 15, 2021

This is the command my server is running with.
gearmand -d --log-file none --syslog --job-retries=0 --round-robin --file-descriptors 4096 --http-port=4731 --keepalive --keepalive-idle=30 --keepalive-interval=10 --keepalive-count=6

@SpamapS
Copy link
Member

SpamapS commented Nov 16, 2021

Neither of you mentioned whether or not these particular functions have new jobs constantly coming in or not. If there aren't, it doesn't make sense that the NOOP is coming out consistently, and it makes me think that there's a bug in the _peek function.

Anyway, absent a _peek bug, I see clients behaving weirdly like this:
client -> GRAB_JOB
server -> NOOP /* From some previous interaction, causes fallthrough here: https://github.com/gearman/gearmand/blob/master/libgearman/worker.cc#L856 /
client -> PRE_SLEEP
poll()
server -> NO_JOB /
responding to GRAB_JOB, this wakes up the poll() the way NOOP is supposed to /
client -> GRAB_JOB
server -> NOOP /
From the PRE_SLEEP before, cycle repeats from here */

...

The code for waiting for a packet after GRAB_JOB is hard to follow, but the only two ways it breaks out into PRE_SLEEP that I can tell is disconnection or receiving a NOOP when we were expecting JOB_ASSIGN or NO_JOB. I wonder if this patch, then would work by simply ignoring the NOOP while in this state and waiting for another packet (which should be a NO_JOB, and thus will force the actual PRE_SLEEP cycle to reset):

index 21019b61..c6573f00 100644
--- a/libgearman/worker.cc
+++ b/libgearman/worker.cc
@@ -856,6 +856,7 @@ gearman_job_st *gearman_worker_grab_job(gearman_worker_st *worker_shell,
                 if (worker->job()->impl()->assigned.command == GEARMAN_COMMAND_NOOP)
                 {
                   gearman_log_debug(worker->universal, "Received NOOP");
+                  break;
                 }
 
                 if (worker->job()->impl()->assigned.command == GEARMAN_COMMAND_JOB_ASSIGN or

I'd love it if somebody with a reproducing environment could try that patch out. I also wonder if we can construct a functional test for this, as it would be good to confirm this is the sequence.

@SpamapS
Copy link
Member

SpamapS commented Nov 16, 2021

On second thought don't try that patch, it makes clients lock up. Damn. I'll keep looking for this answer though.

@SpamapS
Copy link
Member

SpamapS commented Nov 16, 2021

Ok I'll be honest here, debugging this is beyond unfun. If any of you can figure out how to do it without crippling gearmand or the clients (usleeps and removing NOOP-on-job-available are both crippling to workloads IMO) I'll be forever thankful.

However, a few years back I set out to rewrite gearmand in Rust precisely because this weird mixed bag of C and C++ in gearmand is just impossibly complex.

If anyone affected by this is feeling bold, please do see if rustygeard will work for you. I'll be quite happy to help. I created a Matrix chat room for Gearman (I'll be socializing on the mailing lists and website soonish) so come by and say hi and we'll see if we can't get something simpler going for you. https://matrix.to/#/#gearman:spamaps.ems.host

@ant-SR
Copy link

ant-SR commented Mar 28, 2022

Unfortunately i'm finding myself in a similar situation to this...

High load on the instances running the daemon and workers.
Load is reduced when workers are running on another instances but network traffic goes through the roof because of the repeated calls for work. Load goes away when workers are killed off.

Setup:

3x gearmand ( currently 1.1.18 ) sat behind 1 GCP load balancer
Redis persistence setup and configured ( known working, but more on that later )
25x workers ( perl based via libgearman ) - tested with both X connected workers and locally connected ones too

Some workers are numerous ( up to 6 with the same name ) some are singular. All workers are managed by systemd.

Everything is centos7 based and is mainly up to date with packages.

Oddly this only seemed to pick up about 1600 on the 21st - we think we've ruled out any possible changes for this in the application code however.

I'm not sure if this is entirely related but...

Noticing a segfault's occurring roughly about the same time the workers appear to be stuck in the request loop:

[96238.995244] perl[14910]: segfault at 868 ip 00007f307de0a4e0 sp 00007ffc94be42b8 error 4 in libgearman.so.8.0.0[7f307de04000+27000]

When querying the segfault's it's always in the same location ( 0x64e0 ) which resolves to

gearman_aggegator_context
/usr/src/debug/gearmand-1.1.18/libgearman/allocator.cc:52

regarding the redis persistence stuff - all 3 gearmand's have been using either the same remote redis instance, local only and no persistence at some point during the tests - this doesn't seem to have any bearing on this issue ... I was hoping it could be used as a message bus to solve the timeouts we're seeing when submitting the job's when the daemon itself is busy.

Extra threads to help lower the load a little but the issue is still present

Are there any other idea's or things to try here to try and get more about the why it started doing this ?

@esabol
Copy link
Member

esabol commented Mar 29, 2022

Clearly, gearmand doesn't handle the scenario of workers crashing very well.

25x workers ( perl based via libgearman )

You're not using the CPAN module, Gearman::Worker? That's surprising to me. How did you implement a Perl worker using libgearman? I've never heard of anyone doing that.

Are there any other idea's or things to try here to try and get more about the why it started doing this ?

Well, I implemented my workers using Gearman::Worker and Parallel::ForkManager. If you did the same, that would eliminate the segfault since it's pure Perl. Of course, you might just end up with a different error that still kills the worker.

https://metacpan.org/pod/Gearman::Worker
https://metacpan.org/pod/Parallel::ForkManager

@ant-SR
Copy link

ant-SR commented Mar 30, 2022

From what I can see its plumbed in via Gearman::XS and Gearman::mesh which appears to be hooking the libgearman stuff

I'm not a writer of perl stuff so apologies if my answers in that area aren't on target

We've been running further tests today and there's no change in the behaviour when 1.1.19 is used

The only time there's been difference is when we switched up from 2 core AMD Milan to 8 core
the load has gone away and it appears to be working "normally" with minimal complaints

The issue is also present on intel based machines too

@SpamapS
Copy link
Member

SpamapS commented Mar 30, 2022 via email

@esabol
Copy link
Member

esabol commented Apr 26, 2022

@gfto wrote:

To solve the issue reported here I ended up applying the following patch:
https://georgi.unixsol.org/programs/gearmand-cpu-spinning.patch

Should we consider merging this patch? It's not an optimal solution, but, if it prevents the gearmand server from freaking out, maybe it's better than nothing.

@gfto
Copy link

gfto commented Apr 27, 2022

The patch solves the issue, so in absence of a better fix I guess it's better to be merged.

@esabol
Copy link
Member

esabol commented Jul 6, 2022

The patch solves the issue, so in absence of a better fix I guess it's better to be merged.

Could you submit a PR?

@easterism
Copy link

easterism commented Oct 27, 2023

@gfto wrote:

To solve the issue reported here I ended up applying the following patch: https://georgi.unixsol.org/programs/gearmand-cpu-spinning.patch

I did it for 1.1.20 and now works much better. But one subprocess still using 100% without any jobs

image

@easterism
Copy link

Just noticed...
Go back to version 1.1.20 without any patch. Run a clean gearmand -d. Run GearmanManager with 50 workers

  1. Run 500 background jobs. Each is a simple sleep for 5 seconds
  2. Run 200 background jobs. Each is a simple sleep for 5 sec
  3. For the next hour gearman uses about 70% of the CPU (spread across 4 cores) without any jobs. After that everything goes back to normal.

image

P.S. Additionaly you need to self-control is job already in progress, because suddenly you can catch the same job several times

@esabol
Copy link
Member

esabol commented Oct 28, 2023

P.S. Additionaly you need to self-control is job already in progress, because suddenly you can catch the same job several times

Huh? I don't understand what you wrote there, @easterism. Could you clarify what you meant?

Your findings are very interesting. If you repeat the exact same procedure but using the patched version of 1.1.20, what happens?

I'd also be interested in the same test procedure with the patch but leaving libgearman-server/server.cc unchanged. (In other words, keep the code that sends the NOOP.) Would you be willing to try that?

@easterism
Copy link

easterism commented Oct 28, 2023

@esabol If it looks like offtop, I'm sorry... I just trying to find an answer why gearman use so much CPU without any jobs. This thread is only one.
So maybe this behavior is not a bug. Furthermore, if it's normal gearman behavior, run gearman on dedicated server is preferenced way.
I tried to make this test on patched version. Unfortunately, sometimes GearmanClient hanged during this test, and i don't know why, so noone attepts were finished.

P.S. Additionaly you need to self-control is job already in progress, because suddenly you can catch the same job several times

Well... I use gearman v1.1.20, PHP wrapper to libgearman v2.1.0 and GearmanManager

Case 1
Run simple client which call 10 same background jobs for 5 same workers (managed by GearmanManager). Display handlers each of them. You will see something like this

H:example.com:1
H:example.com:2
...
H:example.com:10

At the same time inside worker receive and display job handler GearmanJob::handle. If each job is fast you will see the same result as inside client

H:example.com:1
H:example.com:2
...
H:example.com:10

Case 2

Add 5sec sleep inside job from Case 1 and run it. Now you may see duplicate handler inside worker, like this:

H:example.com:1
H:example.com:2
H:example.com:2
H:example.com:3
H:example.com:4
H:example.com:4
H:example.com:4
...
H:example.com:10
H:example.com:10

I wasted a day to find out why... and i have no clue (

@esabol
Copy link
Member

esabol commented Oct 28, 2023

I wouldn't say it's normal behavior at all. It's not something I see on my gearmand server, but my setup is a bit idiosyncratic.

@easterism
Copy link

@esabol Could you share your gearmand? What version you use? Is it already binary? For what platform?

@esabol
Copy link
Member

esabol commented Oct 29, 2023

@easterism asked:

Could you share your gearmand? What version you use? Is it already binary? For what platform?

gearmand 1.1.19.1 in an Ubuntu 16.04.6 LTS Docker container. I compiled it myself using the Dockerfiles provided in the repository. It's been running for 5 months straight and handles a couple hundred jobs per day.

The thing I've noticed is that everyone (I think) who has reported this problem is using PHP for their workers. I use Perl for our workers. (Our clients are mostly PHP though.) The PHP library is very buggy, especially so for the release version as opposed to the GItHub repository. Multiple people have been begging the maintainer to release a new version for over a year, and there's been no reply. I recommend compiling it from the GitHub repo source and I would advise using the libgearman.so from the version of gearmand you've compiled instead of from whatever package manager you are getting it from.

Oh, I also don't do any background jobs. My jobs are all foreground.

What platform are you on?

@easterism
Copy link

What platform are you on?

Bare Debian 9,10,11. Thank you for advise. For me is no reason to use foreground jobs at all. I use gearman client in web-apps and common case is: get data->process each of data->bring to client and do it fast
I will try to compile from source, and make some tests with different php versions.
Thank you again.

@SpamapS
Copy link
Member

SpamapS commented Oct 29, 2023 via email

@easterism
Copy link

easterism commented Oct 30, 2023

@SpamapS Yes, the job submited by PHP client

How do you get the result back to the client with background jobs?

No need to back result at all. You can choose:

  1. Ask job about their status, when client (php script) call from browser next time. Not realy usefull for web app, but sometimes its possible
  2. Tell worker to store (to MySQL whatever) result of each job by unique job id. So client can check job and get it's result in anytime later

@domel5
Copy link

domel5 commented Nov 30, 2023

Hi,
I have the same problem with REQ & RES death loop . In my opinion we should focus on the following lines in worker.cc:

            *ret_ptr= worker->con->send_packet(worker->grab_job, true);
            if (gearman_failed(*ret_ptr))
            {
              if (*ret_ptr == GEARMAN_IO_WAIT)
              {
                worker->state= GEARMAN_WORKER_STATE_GRAB_JOB_SEND;
              }
              else if (*ret_ptr == GEARMAN_LOST_CONNECTION)
              {
                continue;
              }

              assert(*ret_ptr != GEARMAN_MAX_RETURN);
              return NULL;
            }

            if (worker->job() == NULL)
            {
              assert(job == NULL);
              worker->job(gearman_job_create(worker, job));
              if (worker->job() == NULL)
              {
                *ret_ptr= GEARMAN_MEMORY_ALLOCATION_FAILURE;
                return NULL;
              }
              assert(worker->job()->impl());
            }

worker->con->send_packet sends the GRAB_JOB_ALL request. If the next if's are true the loop is interrupted without taking the reply from the socket (no recvfrom() syscall). Next the worker sends PRE_SLEEP request and is going to sleep on poll(). The NO_JOB reply still waits on the socket, so the poll() takes it and the loop starts again and again.

There is the problem in a nutshell:

17:34:29 sendto(6, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
17:34:29 sendto(6, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
17:34:29 poll([{fd=6, events=POLLIN}, {fd=4, events=POLLIN}], 2, 5000) = 1 ([{fd=6, revents=POLLIN}])
17:34:29 sendto(6, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
17:34:29 recvfrom(6, "\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 24
17:34:29 sendto(6, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
17:34:29 poll([{fd=6, events=POLLIN}, {fd=4, events=POLLIN}], 2, 5000) = 0 (Timeout)
17:34:29.015913 IP 127.0.0.1.20086 > 127.0.0.1.4730: Flags [P.], seq 240:252, ack 121, win 32264, length 12
E..4z.@.@..     ........Nv.z*Y&.....P.~..(...REQ...'....
17:34:29.015977 IP 127.0.0.1.4730 > 127.0.0.1.20086: Flags [.], ack 252, win 22, length 0
E..(."@.@.#..........zNv....*Y&.P.......
17:34:29.016124 IP 127.0.0.1.4730 > 127.0.0.1.20086: Flags [P.], seq 121:133, ack 252, win 22, length 12
E..4.#@.@.#..........zNv....*Y&.P....(...RES...
....
17:34:29.016176 IP 127.0.0.1.20086 > 127.0.0.1.4730: Flags [P.], seq 252:264, ack 133, win 32252, length 12
E..4z.@[email protected]*Y&.....P.}..(...REQ........
17:34:29.016601 IP 127.0.0.1.20086 > 127.0.0.1.4730: Flags [P.], seq 264:276, ack 133, win 32252, length 12
E..4z.@[email protected]*Y&.....P.}..(...REQ...'....
17:34:29.016688 IP 127.0.0.1.4730 > 127.0.0.1.20086: Flags [.], ack 276, win 22, length 0
E..(.$@.@.#..........zNv....*Y&(P.......
17:34:29.016781 IP 127.0.0.1.4730 > 127.0.0.1.20086: Flags [P.], seq 133:145, ack 276, win 22, length 12
E..4.%@.@.#..........zNv....*Y&(P....(...RES...
....
17:34:29.017094 IP 127.0.0.1.20086 > 127.0.0.1.4730: Flags [P.], seq 276:288, ack 145, win 32252, length 12
E..4z.@[email protected]*Y&(....P.}..(...REQ........
17:34:29.057982 IP 127.0.0.1.4730 > 127.0.0.1.20086: Flags [.], ack 288, win 22, length 0
E..(.&@.@.#..........zNv....*Y&4P.......

Between 1st and 2nd send() should always be a recvfrom() / poll() as it is when the code runs correctly:

gettimeofday({tv_sec=1701365334, tv_usec=434255}, NULL) = 0
sendto(6, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(6, 0x16ceafeb51a9, 8192, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=6, events=POLLIN}, {fd=4, events=POLLIN}], 2, 5000) = 1 ([{fd=6, revents=POLLIN}])
recvfrom(6, "\0RES\0\0\0\n\0\0\0\0", 8192, MSG_NOSIGNAL, NULL, NULL) = 12
sendto(6, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=6, events=POLLIN}, {fd=4, events=POLLIN}], 2, 5000) = 0 (Timeout)

Every death loop I've seen started with two following GRAB_JOB_ALL and PRE_SLEEP reqs.

@esabol
Copy link
Member

esabol commented Nov 30, 2023

@domel5 wrote:

worker->con->send_packet sends the GRAB_JOB_ALL request. If the next if's are true the loop is interrupted without taking the reply from the socket (no recvfrom() syscall). Next the worker sends PRE_SLEEP request and is going to sleep on poll(). The NO_JOB reply still waits on the socket, so the poll() takes it and the loop starts again and again.

That's some insightful analysis, @domel5 ! It makes sense. Do you have a suggestion for how to fix it? A PR would be most welcome!

@domel5
Copy link

domel5 commented Dec 1, 2023

I have no clue at the moment, I'm not familiar with C++. It requires to debug what is going on after worker->con->send_packet when it starts looping.

@SpamapS
Copy link
Member

SpamapS commented Dec 10, 2023

Hey, @domel5 thanks for bringing fresh eyes to this very old, very annoying issue.

I think what you say is plausible. It's important to note that this code is in libgearman, and as such, that would suggest that only clients who use libgearman would cause this death spiral. I don't know if we have established that, but it could also be that all clients implement similar logic and thus it wouldn't be limited to libgearman.

For the libgearman case though, it's also important to note that it's flagging GEARMAN_IO_WAIT as a/the culprit, and that narrows it a bit further to non-blocking usage, though IIRC that is the default.

So, if you're following this issue and you're not using libgearman in non-blocking mode, please speak up! But for those who are...

I think that the assumption that the code is making is that any GEARMAN_IO_WAIT means nothing was sent, and thus, we should repeat the send. So I've gone looking for all the ways this call site could result in a GEARMAN_IO_WAIT to see if there's any time that's not true or maybe doesn't work as intended. I didn't find any.

send_packet calls into the connection flush method, which is essentially a big state machine, and as such, there are many opportunities to return GEARMAN_IO_WAIT. This state seems to come only on GRAB_JOB though, which comes after successful CAN_DO generally, and as such, would eliminate any connecting states, so we can jump to GEARMAN_CON_UNIVERSAL_CONNECTED. The SSL code doesn't seem to work in no-blocking mode, so I'm jumping to the cleartext sending:

write_size= ::send(fd, send_buffer_ptr, send_buffer_size, MSG_NOSIGNAL);
}
if (write_size == 0) // Zero value on send()
{ }
else if (write_size == SOCKET_ERROR)
{
switch (errno)
{
case ENOTCONN:
#if defined(EWOULDBLOCK) && EWOULDBLOCK != EAGAIN
case EWOULDBLOCK:
#endif
case EAGAIN:
{
set_events(POLLOUT);
if (gearman_universal_is_non_blocking(universal))
{
return gearman_gerror(universal, GEARMAN_IO_WAIT);
}
gearman_return_t gret= gearman_wait(universal);
if (gearman_failed(gret))
{
return gret;
}
continue;

This gets into send.c, which, frankly, is really complicated. I do recall @BrianAker telling me about how the hostile random socket protection works a long, long time ago, but, it has very few code comments. So, let's call that "a little Sus". If I'm reading libhostile correctly, it would print some things in stderr to suggest it had "hostilized" the server (I guess it has something to do with assuming the server is doing something evil). This would actually set errno to ECONNRESET, which calls gearman_perror, which doesn't seem to have any path to GEARMAN_IO_WAIT, so I don't think this is what's happening.

But, just in case, affected users following along, have you seen this type of error in your stderr?:

https://github.com/gearman/gearmand/blob/master/libhostile/function.c#L102-L105

So, assuming libhostile wasn't the culprit, the POSIX sendto() call is called with MSG_NOSIGNAL, and if we're in non-blocking mode, we can get an EAGAIN here, but the packet will not be sent if that is the case. The way send works, either you can send the whole message, or not. So yes, we can get a GEARMAN_IO_WAIT here, but it should not elicit a NO_JOB ever. It's entirely likely that a busy worker would get into this state, sending a large WORK_COMPLETE and then immediately asking for more work, the send buffer may still be flushing. When this happens, we return quickly, and the main entry point for workers, gearman_worker_work, is pretty quick to return as well.

At that point the docs suggest one should call gearman_continue() and if it returns true, call gearman_worker_work() again, which will poll the socket. There's no reason, from the client side, to expect a NO_JOB, but nonetheless, the client gets a NO_JOB. But, because it is not in a state expecting that, it sends GRAB_JOB, eliciting a second NO_JOB, then process the first NO_JOB as usual, by sending a PRE_SLEEP and goes in to sleep state. But this immediately wakes up because of the second NO_JOB, and thus being in SLEEP state, the worker asks for more work... boom.

There's some kind of state bug in all of this. Why are we getting NO_JOB when we're in GRAB_JOB state? It's possible we're double-sending GRAB_JOB's due to some state-machine bug. But it could also be that the server is confused or delayed in some way.

So, perhaps what might fix this is with a defensive measure. Perhaps workers should ignore NO_JOB if they are in a SLEEP or GRAB_JOB_SEND state.

I'll ponder on this and look at making a patch in the next few days.

@esabol
Copy link
Member

esabol commented Dec 10, 2023

@SpamapS : All of my workers are implemented in Perl using p-alik's Perl library, and I've never encountered this (or any) problem. That's why it made sense to me that the problem would be in libgearman's worker code.

@SpamapS
Copy link
Member

SpamapS commented Dec 10, 2023

Thanks Ed, Reading that code, it doesn't really implement a state machine, it reads when it expects a response, and errors if it gets something unexpected. So for instance, if it got a NO_JOB while in PRE_SLEEP, it would just fall through to handling it normally, which is to just send PRE_SLEEP again and mark the connection as sleeping.

https://github.com/p-alik/perl-Gearman/blob/master/lib/Gearman/Worker.pm#L299-L317

That is what we want libgearman to do, so I'm more convinced that we should make libgearman resilient to this unexpected sequence(even though I'm still not sure why it's happening).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants