Lifecycle hooks can make the agent unresponsive #337

ionphractal · 2024-11-04T16:33:53Z

The bosh-agent already runs with a higher priority than BOSH/monit jobs so that CPU-intensive workloads do not block agent <-> director communication (see cloudfoundry/bosh-linux-stemcell-builder@00054bd).

However, lifecycle hooks such as pre-start scripts appear to have the same negative effect on communication with the director, because they are started by the bosh-agent itself and hence run with the same priority. At least this is my assumption: I wasn't able to find a line of code that lowers the priority, and looking at a VM while it runs a pre-start shows that the pre-start script and all of its sub-processes run with the same priority as the agent.

In our case, cloning a lot of data from the remaining part of a BOSH-managed PostgreSQL cluster triggers this issue intermittently. In extreme situations this extends downtime unnecessarily, because the bosh task itself errors with an agent timeout and the pre-start has to run from scratch again.

Of course, as a quick mitigation we could, for example, renice our pre-start script. Still, I would see a benefit in consistency, and hence predictability, if the bosh-agent started external scripts/binaries with a lower priority than itself.
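To make the proposal concrete, here is a minimal sketch of how a parent process could launch a hook with a lower priority than its own. It is not taken from the bosh-agent code base: the helper name `lowPriorityCommand` and the niceness increment of 10 are assumptions for illustration only. It wraps the hook in `nice(1)`, so the adjustment is applied before the script starts and is inherited by all of its sub-processes.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
)

// lowPriorityCommand is a hypothetical helper (not part of bosh-agent).
// It wraps the given program in nice(1), so the program and everything it
// spawns start with a niceness that is `niceness` higher than the caller's.
func lowPriorityCommand(niceness int, path string, args ...string) *exec.Cmd {
	niceArgs := append([]string{"-n", strconv.Itoa(niceness), path}, args...)
	return exec.Command("nice", niceArgs...)
}

func main() {
	// Running nice(1) without a command prints the niceness it inherited,
	// which makes it a convenient stand-in for a real pre-start script.
	out, err := lowPriorityCommand(10, "nice").Output()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("child niceness: %s", out)
}
```

Because the increment is relative to the caller, the hook always ends up at a lower priority than the agent, whatever niceness the agent itself runs with. Calling `setpriority(2)` on the child after starting it would also work, but leaves a short window in which the script can already spawn sub-processes at the agent's priority.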

rkoster · 2024-11-07T16:21:01Z

@ionphractal this seems like a good idea! Happy to review a PR.

cf-foundation-community-automation bot moved this to Inbox in Foundational Infrastructure Working Group Nov 4, 2024

cf-foundation-community-automation bot added this to Foundational Infrastructure Working Group Nov 4, 2024

rkoster moved this from Inbox to Waiting for Changes | Open for Contribution in Foundational Infrastructure Working Group Nov 7, 2024
