
Draining of concourse worker taking very long #44

Open
Marc-Assmann opened this issue Jul 5, 2019 · 7 comments

Comments

@Marc-Assmann

Hi there!

Bug Report

When deploying an update of Concourse (using concourse/bosh), the update/recreation of the first worker did not finish for more than 4 hours.
BOSH showed the respective worker as failing (via bosh instances).
Logging in to the worker via bosh ssh, we observed:

So for some reason, the drain behaviour described at https://concourse-ci.org/concourse-worker.html#gracefully-removing-a-worker does not seem to work.

We then manually sent another USR2 signal to the worker process using kill -USR2 <worker pid>.
This made the worker finish its running builds and shut down.
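
For reference, the manual intervention amounts to roughly the following; the pgrep pattern is an assumption about how the worker process is launched on the VM:

```sh
# Manual workaround, run on the stuck worker VM (reached via bosh ssh);
# the pgrep pattern is an assumption about how the worker job is started.
WORKER_PID=$(pgrep -f "concourse worker" | head -n1)
sudo kill -USR2 "$WORKER_PID"   # re-trigger retiring; the worker finishes its builds and exits
```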
You can see the deploy log below; the worker recreation took 4 hrs 29 min:

12:53:26 Task 1708127 | 10:53:20 | Preparing deployment: Preparing deployment (00:00:03)
12:53:26 Task 1708127 | 10:53:23 | Preparing deployment: Rendering templates (00:00:02)
12:53:26 Task 1708127 | 10:53:25 | Preparing package compilation: Finding packages to compile (00:00:00)
12:53:26 Task 1708127 | 10:53:25 | Compiling packages: btrfs_tools/797f8df53d2881f034366408b3f043a57f8f4c51
12:53:26 Task 1708127 | 10:53:25 | Compiling packages: postgres-9.6.10/04ecac16e7e53e17d1a1799c0fe874f262f1960ba37514da1b3a30d1c58c13c0
12:53:26 Task 1708127 | 10:53:25 | Compiling packages: postgres-common/9e812f515167406f22e2f983a6c325b0a54e1bd6128aa44e1b8f8bc44034d01f
12:54:22 Task 1708127 | 10:53:25 | Compiling packages: postgres-11.3/c0604a42bdaa3ce61d1b13f7b1017005794c18bb1307cabb30cacb49f30b36ac
12:54:24 Task 1708127 | 10:53:25 | Compiling packages: concourse/faaac11289457bdd4fb8d177051a7d8f03d9ff63
12:56:31 Task 1708127 | 10:54:22 | Compiling packages: postgres-common/9e812f515167406f22e2f983a6c325b0a54e1bd6128aa44e1b8f8bc44034d01f (00:00:57)
12:57:45 Task 1708127 | 10:54:24 | Compiling packages: btrfs_tools/797f8df53d2881f034366408b3f043a57f8f4c51 (00:00:59)
12:57:56 Task 1708127 | 10:56:31 | Compiling packages: concourse/faaac11289457bdd4fb8d177051a7d8f03d9ff63 (00:03:06)
12:58:31 Task 1708127 | 10:57:45 | Compiling packages: postgres-9.6.10/04ecac16e7e53e17d1a1799c0fe874f262f1960ba37514da1b3a30d1c58c13c0 (00:04:20)
12:58:31 Task 1708127 | 10:57:55 | Compiling packages: postgres-11.3/c0604a42bdaa3ce61d1b13f7b1017005794c18bb1307cabb30cacb49f30b36ac (00:04:30)
12:58:31 Task 1708127 | 10:58:31 | Updating instance worker-maintenance: worker-maintenance/d8c3c95a-0353-45d8-85ad-43ba00809758 (0) (canary)
12:58:32 Task 1708127 | 10:58:31 | Updating instance db: db/e768317d-f8f0-4462-a1c3-c9c51c76385e (0) (canary)
13:00:55 Task 1708127 | 10:58:31 | Updating instance web: web/f9fe22ad-e517-4816-8495-41dd69a92e4e (0) (canary)
13:01:18 Task 1708127 | 10:58:31 | Updating instance worker: worker/e1e957d6-355e-43a6-8cde-3610b98fb1dd (0) (canary)
13:01:18 Task 1708127 | 11:00:54 | Updating instance db: db/e768317d-f8f0-4462-a1c3-c9c51c76385e (0) (canary) (00:02:23)
13:01:19 Task 1708127 | 11:01:18 | Updating instance web: web/f9fe22ad-e517-4816-8495-41dd69a92e4e (0) (canary) (00:02:47)
16:01:38 Task 1708127 | 11:01:18 | Updating instance web: web/c10c5e72-0d53-474a-9a75-be897df157df (1)
17:28:03 Task 1708127 | 11:01:18 | Updating instance web: web/c10c5e72-0d53-474a-9a75-be897df157df (1) (00:03:29)
17:28:03 Task 1708127 | 14:01:37 | Updating instance worker-maintenance: worker-maintenance/d8c3c95a-0353-45d8-85ad-43ba00809758 (0) (canary) (03:03:06)
17:31:32 Task 1708127 | 15:28:03 | Updating instance worker: worker/e1e957d6-355e-43a6-8cde-3610b98fb1dd (0) (canary) (04:29:32)
  • Concourse version: v5.2.0
  • Deployment type (BOSH/Docker/binary): BOSH
  • Infrastructure/IaaS: AWS/GCP/Openstack
@deniseyu
Contributor

Hi @Marc-Assmann, sorry it's taken a while for someone to get back to you.

Without being able to poke around in this environment now, it's difficult to say what exactly went wrong. There are questions I would ask about the deployment, but I can appreciate that since this was 2 months ago, that data might not be around anymore! Have you run into this problem since / is this an issue your team is still facing?

@alexbakar

alexbakar commented Sep 30, 2019

I was able to reproduce this issue several times on 5.2.0 and 5.5.0. Our setup contains two web nodes and six workers. I have analyzed the issue and this is my theory of what happens:

  • a redeployment of Concourse is triggered (at the BOSH level);
  • when draining starts, the USR2 signal triggers the retirement of the worker. A corresponding retire command is correctly sent to the TSA, which then forwards it to the ATC (both running on a web node);
  • as the whole Concourse deployment is being redeployed, the web and worker nodes may end up being redeployed at the same time;
  • in some cases the web node VM (to which the retire command was sent) is recreated before the ATC on that old web node has marked the worker as "retiring" in the DB;
  • while the web node is being recreated there seems to be no meaningful response to the calling worker node, so the worker's drain logic doesn't react;
  • the drain logic on the worker node just waits until the timeout passes (in our case it is 12 hours). In that period new builds keep being scheduled on this worker (and executed). After the timeout, the drain script signals the worker process with TERM, QUIT, and finally KILL, and only then can the VM be recreated (see the sketch below).
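
To make that sequence concrete, here is a simplified shell sketch of what the drain side effectively does under my reading of the behaviour; this is illustrative only, not the actual concourse-bosh-release drain script, and the pgrep pattern and 12-hour value are assumptions from our setup:

```sh
WORKER_PID=$(pgrep -f "concourse worker" | head -n1)
kill -USR2 "$WORKER_PID"                    # ask the worker to retire via the TSA

DRAIN_TIMEOUT=$((12 * 60 * 60))             # our drain timeout: 12 hours
elapsed=0
while kill -0 "$WORKER_PID" 2>/dev/null && [ "$elapsed" -lt "$DRAIN_TIMEOUT" ]; do
  sleep 30                                  # worker keeps picking up new builds meanwhile
  elapsed=$((elapsed + 30))
done

# If the retire request never reached a live ATC, the worker is still running
# at this point, so the script escalates: TERM, then QUIT, then KILL.
for sig in TERM QUIT KILL; do
  kill -0 "$WORKER_PID" 2>/dev/null || break
  kill -s "$sig" "$WORKER_PID"
  sleep 30
done
```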

I think concourse/concourse#2523 might be the same.

@alexbakar

The same issue also happens when the draining performs "landing" rather than "retiring". I am wondering how to get more details from the web node, for example the logs from the "old" web node (since the old VM is destroyed, its logs are gone). Any ideas?

@radoslav-tomov

Hi,

I've debugged this and managed to reproduce it locally via docker-compose. It happens when:

  • a SIGUSR1/SIGUSR2 signal has been sent to the worker process, and
  • the worker receives the signal, but the call from the TSA client fails.

The workerBeacon then fails with an error and is restarted.

The bottom line is that, in the case of a BOSH deployment, the drain script keeps running until it hits the drain timeout, followed by TERM, QUIT, and finally KILL.
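
For anyone who wants to try this locally, a rough reproduction with docker-compose looks like the following; the service names web and worker are assumptions based on the standard Concourse docker-compose example:

```sh
docker-compose stop web                   # make the TSA/ATC unreachable
docker-compose kill -s SIGUSR2 worker     # ask the worker to retire while nobody is listening
docker-compose logs -f worker             # the retire call fails and the beacon restarts;
                                          # the worker never transitions to retiring
docker-compose start web                  # once web is back, the worker simply re-registers
```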

@radoslav-tomov

It seems the behavior can be explained by the web and worker instance groups being updated in parallel. In some cases the timing is enough for a web node and a worker node to be updated simultaneously, which causes the issue if the worker is connected to the specific web node being updated. In our case, having just two web nodes makes the probability of hitting this exact race condition quite high.

It can be avoided by not updating both instance groups in parallel.
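
One way to express that in a BOSH deployment is to force serial instance-group updates, for example via an ops file; the file name, deployment name, and manifest path below are assumptions:

```sh
cat > serial-update.yml <<'EOF'
# Update instance groups one at a time instead of in parallel.
- type: replace
  path: /update/serial?
  value: true
EOF

bosh -d concourse deploy concourse.yml -o serial-update.yml
```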

@deniseyu
Contributor

Thanks @radoslav-tomov and @alexbakar for spending time investigating this!

I am wondering how to get more details from the web node, for example the logs from the "old" web node (as the old VM is being destroyed the logs are gone).

I think the only way to get logs from a destroyed VM is to enable log forwarding at deploy time and export them somewhere like Papertrail 😕

The next part is speculative, not necessarily a request for contributors to implement:

I wonder, since BOSH introduced a new pre_stop lifecycle hook, whether there's another way to address the timing issue without turning off parallel deployment. Maybe web nodes could wait (up to a configurable threshold) until there are no more registered workers before restarting, for example. That should tolerate some time overlap with the slowest BOSH operations without prolonging the overall deployment time by too much, I think.
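
Purely as a speculative sketch (not an existing Concourse feature), such a pre_stop hook on the web job might look something like this; the fly target, the threshold variable, and the reliance on fly workers --json are all assumptions:

```sh
#!/bin/bash
# Hypothetical pre_stop for the web job: wait (bounded) until no workers are
# still registered with this cluster before letting BOSH restart the node.
THRESHOLD="${WORKER_WAIT_SECONDS:-600}"     # configurable upper bound (assumed name)
elapsed=0
while [ "$elapsed" -lt "$THRESHOLD" ]; do
  # Count registered workers; if fly is unreachable, keep waiting.
  count="$(fly -t main workers --json 2>/dev/null | jq 'length' 2>/dev/null)"
  [ "${count:-1}" -eq 0 ] && break
  sleep 15
  elapsed=$((elapsed + 15))
done
exit 0                                      # never block the deployment forever
```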

cc @cirocosta and @pivotal-jamie-klassen for additional input!

@alexbakar

Thanks @deniseyu for the help.

For the analysis of the issue I forwarded the logs and later checked them using Kibana.
I've checked the pre_stop stage description. I wonder whether we always know the timing scheme of the redeployment (i.e. when the recreation of the group of web instances starts and when the group of workers does). Can we miss some of the workers when the pre_stop of the web node is called? In other words, what is the chance of a worker recreation (and its draining, respectively) being triggered right after the pre_stop of the web node has finished? If there is no such chance, then the approach using pre_stop is fine. Otherwise, the race condition might still happen in some cases.
