
Draining of concourse worker taking very long #44

Open
Marc-Assmann opened this issue Jul 5, 2019 · 7 comments

Comments

@Marc-Assmann

Hi there!

Bug Report

When deploying an update of Concourse (using concourse/bosh), the update/recreation of the first worker did not finish for more than 4 hours.
BOSH showed the respective worker as failing (via bosh instances).
Logging in to the worker via bosh ssh, we observed:

So for some reason, the drain behaviour described at https://concourse-ci.org/concourse-worker.html#gracefully-removing-a-worker does not seem to work.

We then manually sent another USR2 signal to the worker process using kill -USR2 <worker pid>.
This made the worker finish its running builds and shut down.
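
For reference, the manual intervention amounts to roughly the following; the pgrep pattern is an assumption about how the worker process is launched on the VM:

```sh
# Manual workaround, run on the stuck worker VM (reached via bosh ssh);
# the pgrep pattern is an assumption about how the worker job is started.
WORKER_PID=$(pgrep -f "concourse worker" | head -n1)
sudo kill -USR2 "$WORKER_PID"   # re-trigger retiring; the worker finishes its builds and exits
```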
You can see the deploy log below; the worker recreation took 4 hrs 29 min:

12:53:26 Task 1708127 | 10:53:20 | Preparing deployment: Preparing deployment (00:00:03)
12:53:26 Task 1708127 | 10:53:23 | Preparing deployment: Rendering templates (00:00:02)
12:53:26 Task 1708127 | 10:53:25 | Preparing package compilation: Finding packages to compile (00:00:00)
12:53:26 Task 1708127 | 10:53:25 | Compiling packages: btrfs_tools/797f8df53d2881f034366408b3f043a57f8f4c51
12:53:26 Task 1708127 | 10:53:25 | Compiling packages: postgres-9.6.10/04ecac16e7e53e17d1a1799c0fe874f262f1960ba37514da1b3a30d1c58c13c0
12:53:26 Task 1708127 | 10:53:25 | Compiling packages: postgres-common/9e812f515167406f22e2f983a6c325b0a54e1bd6128aa44e1b8f8bc44034d01f
12:54:22 Task 1708127 | 10:53:25 | Compiling packages: postgres-11.3/c0604a42bdaa3ce61d1b13f7b1017005794c18bb1307cabb30cacb49f30b36ac
12:54:24 Task 1708127 | 10:53:25 | Compiling packages: concourse/faaac11289457bdd4fb8d177051a7d8f03d9ff63
12:56:31 Task 1708127 | 10:54:22 | Compiling packages: postgres-common/9e812f515167406f22e2f983a6c325b0a54e1bd6128aa44e1b8f8bc44034d01f (00:00:57)
12:57:45 Task 1708127 | 10:54:24 | Compiling packages: btrfs_tools/797f8df53d2881f034366408b3f043a57f8f4c51 (00:00:59)
12:57:56 Task 1708127 | 10:56:31 | Compiling packages: concourse/faaac11289457bdd4fb8d177051a7d8f03d9ff63 (00:03:06)
12:58:31 Task 1708127 | 10:57:45 | Compiling packages: postgres-9.6.10/04ecac16e7e53e17d1a1799c0fe874f262f1960ba37514da1b3a30d1c58c13c0 (00:04:20)
12:58:31 Task 1708127 | 10:57:55 | Compiling packages: postgres-11.3/c0604a42bdaa3ce61d1b13f7b1017005794c18bb1307cabb30cacb49f30b36ac (00:04:30)
12:58:31 Task 1708127 | 10:58:31 | Updating instance worker-maintenance: worker-maintenance/d8c3c95a-0353-45d8-85ad-43ba00809758 (0) (canary)
12:58:32 Task 1708127 | 10:58:31 | Updating instance db: db/e768317d-f8f0-4462-a1c3-c9c51c76385e (0) (canary)
13:00:55 Task 1708127 | 10:58:31 | Updating instance web: web/f9fe22ad-e517-4816-8495-41dd69a92e4e (0) (canary)
13:01:18 Task 1708127 | 10:58:31 | Updating instance worker: worker/e1e957d6-355e-43a6-8cde-3610b98fb1dd (0) (canary)
13:01:18 Task 1708127 | 11:00:54 | Updating instance db: db/e768317d-f8f0-4462-a1c3-c9c51c76385e (0) (canary) (00:02:23)
13:01:19 Task 1708127 | 11:01:18 | Updating instance web: web/f9fe22ad-e517-4816-8495-41dd69a92e4e (0) (canary) (00:02:47)
16:01:38 Task 1708127 | 11:01:18 | Updating instance web: web/c10c5e72-0d53-474a-9a75-be897df157df (1)
17:28:03 Task 1708127 | 11:01:18 | Updating instance web: web/c10c5e72-0d53-474a-9a75-be897df157df (1) (00:03:29)
17:28:03 Task 1708127 | 14:01:37 | Updating instance worker-maintenance: worker-maintenance/d8c3c95a-0353-45d8-85ad-43ba00809758 (0) (canary) (03:03:06)
17:31:32 Task 1708127 | 15:28:03 | Updating instance worker: worker/e1e957d6-355e-43a6-8cde-3610b98fb1dd (0) (canary) (04:29:32)
  • Concourse version: v5.2.0
  • Deployment type (BOSH/Docker/binary): BOSH
  • Infrastructure/IaaS: AWS/GCP/Openstack
@deniseyu
Contributor

Hi @Marc-Assmann, sorry it's taken a while for someone to get back to you.

Without being able to poke around in this environment now, it's difficult to say what exactly went wrong. There are questions I would ask about the deployment, but I can appreciate that since this was 2 months ago, that data might not be around anymore! Have you run into this problem since / is this an issue your team is still facing?

@alexbakar

alexbakar commented Sep 30, 2019

I was able to reproduce this issue several times on 5.2.0 and 5.5.0. Our setup contains two web nodes and six workers. I have analyzed the issue and this is my theory of what happens:

  • a redeployment of Concourse is triggered (at the BOSH level);
  • when draining starts, the USR2 signal triggers the retirement of the worker. A corresponding retire command is correctly sent to the TSA, which then forwards it to the ATC (both running on a web node);
  • as the whole Concourse deployment is being redeployed, the web and worker nodes may end up being redeployed at the same time;
  • in some cases the web node VM (to which the retire command was sent) is recreated before the ATC on that old web node has marked the worker as "retiring" in the DB;
  • while the web node is being recreated there seems to be no meaningful response to the calling worker node, so the worker's drain logic doesn't react;
  • the drain logic on the worker node just waits until the timeout passes (in our case it is 12 hours). In that period new builds keep being scheduled on this worker (and executed). After the timeout, the drain script signals the worker process with TERM, QUIT, and finally KILL, and only then can the VM be recreated (see the sketch below).
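
To make that sequence concrete, here is a simplified shell sketch of what the drain side effectively does under my reading of the behaviour; this is illustrative only, not the actual concourse-bosh-release drain script, and the pgrep pattern and 12-hour value are assumptions from our setup:

```sh
WORKER_PID=$(pgrep -f "concourse worker" | head -n1)
kill -USR2 "$WORKER_PID"                    # ask the worker to retire via the TSA

DRAIN_TIMEOUT=$((12 * 60 * 60))             # our drain timeout: 12 hours
elapsed=0
while kill -0 "$WORKER_PID" 2>/dev/null && [ "$elapsed" -lt "$DRAIN_TIMEOUT" ]; do
  sleep 30                                  # worker keeps picking up new builds meanwhile
  elapsed=$((elapsed + 30))
done

# If the retire request never reached a live ATC, the worker is still running
# at this point, so the script escalates: TERM, then QUIT, then KILL.
for sig in TERM QUIT KILL; do
  kill -0 "$WORKER_PID" 2>/dev/null || break
  kill -s "$sig" "$WORKER_PID"
  sleep 30
done
```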

I think concourse/concourse#2523 might be the same.

@alexbakar

The same issue also happens when the draining performs "landing" rather than "retiring". I am wondering how to get more details from the web node, for example the logs from the "old" web node (since the old VM is destroyed, its logs are gone). Any ideas?

@radoslav-tomov

Hi,

I've debugged this and managed to reproduce it locally via docker-compose. It happens when:

  • a SIGUSR1/SIGUSR2 signal has been sent to the worker process, and
  • the worker receives the signal, but the call from the TSA client fails.

The workerBeacon then fails with an error and is restarted.

The bottom line is that, in the case of a BOSH deployment, the drain script keeps running until it hits the drain timeout, followed by TERM, QUIT, and finally KILL.
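
For anyone who wants to try this locally, a rough reproduction with docker-compose looks like the following; the service names web and worker are assumptions based on the standard Concourse docker-compose example:

```sh
docker-compose stop web                   # make the TSA/ATC unreachable
docker-compose kill -s SIGUSR2 worker     # ask the worker to retire while nobody is listening
docker-compose logs -f worker             # the retire call fails and the beacon restarts;
                                          # the worker never transitions to retiring
docker-compose start web                  # once web is back, the worker simply re-registers
```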

@radoslav-tomov

It seems the behavior can be explained by the web and worker instance groups being updated in parallel. In some cases the timing is enough for a web node and a worker node to be updated simultaneously, which causes the issue if the worker is connected to the specific web node being updated. In our case, having just two web nodes makes the probability of hitting this exact race condition quite high.

It can be avoided by not updating both instance groups in parallel.
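
One way to express that in a BOSH deployment is to force serial instance-group updates, for example via an ops file; the file name, deployment name, and manifest path below are assumptions:

```sh
cat > serial-update.yml <<'EOF'
# Update instance groups one at a time instead of in parallel.
- type: replace
  path: /update/serial?
  value: true
EOF

bosh -d concourse deploy concourse.yml -o serial-update.yml
```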

@deniseyu
Contributor

Thanks @radoslav-tomov and @alexbakar for spending time investigating this!

I am wondering how to get more details from the web node, for example the logs from the "old" web node (as the old VM is being destroyed the logs are gone).

I think the only way to get logs from a destroyed VM is to enable log forwarding at deploy time and export them somewhere like Papertrail 😕

The next part is speculative, not necessarily a request for contributors to implement:

I wonder, since BOSH introduced a new pre_stop lifecycle hook, whether there's another way to address the timing issue without turning off parallel deployment. Maybe web nodes could wait (up to a configurable threshold) until there are no more registered workers before restarting, for example. That should tolerate some time overlap with the slowest BOSH operations without prolonging the overall deployment time by too much, I think.
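
Purely as a speculative sketch (not an existing Concourse feature), such a pre_stop hook on the web job might look something like this; the fly target, the threshold variable, and the reliance on fly workers --json are all assumptions:

```sh
#!/bin/bash
# Hypothetical pre_stop for the web job: wait (bounded) until no workers are
# still registered with this cluster before letting BOSH restart the node.
THRESHOLD="${WORKER_WAIT_SECONDS:-600}"     # configurable upper bound (assumed name)
elapsed=0
while [ "$elapsed" -lt "$THRESHOLD" ]; do
  # Count registered workers; if fly is unreachable, keep waiting.
  count="$(fly -t main workers --json 2>/dev/null | jq 'length' 2>/dev/null)"
  [ "${count:-1}" -eq 0 ] && break
  sleep 15
  elapsed=$((elapsed + 15))
done
exit 0                                      # never block the deployment forever
```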

cc @cirocosta and @pivotal-jamie-klassen for additional input!

@alexbakar

Thanks @deniseyu for the help.

For the analysis of the issue I forwarded the logs and later checked them using Kibana.
I've checked the pre_stop stage description. I wonder whether we always know the timing scheme of the redeployment (i.e. when the recreation of the group of web instances starts and when the group of workers does). Can we miss some of the workers when the pre_stop of the web node is called? In other words, what is the chance of a worker recreation (and its draining, respectively) being triggered right after the pre_stop of the web node has finished? If there is no such chance, then the approach using pre_stop is fine. Otherwise, the race condition might still happen in some cases.
