The bosh-agent itself already runs with a higher priority than BOSH/monit jobs, so that CPU-intensive workloads do not block the agent <-> director communication; see cloudfoundry/bosh-linux-stemcell-builder@00054bd.
However, it seems that lifecycle hooks such as pre-start scripts can have the same negative effect on the communication with the director, because they are started by the bosh-agent itself and hence run with the same priority. At least this is my assumption: I wasn't able to find a line of code that lowers that priority, and looking at a VM while a pre-start is running shows that the pre-start script and all of its sub-processes run with the same priority as the agent.
In our case, cloning a lot of data from the remaining part of a BOSH-managed PostgreSQL cluster triggers this issue intermittently. In extreme situations this extends downtime unnecessarily, because the bosh task itself fails with an agent timeout and the pre-start has to run again from scratch.
Of course, as a quick mitigation we could renice our pre-start script ourselves. Still, I would see a benefit in consistency, and hence predictability, if the bosh-agent started external scripts/binaries with a lower priority than itself.
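To illustrate the idea, here is a minimal sketch of a parent process that launches a child with a lower priority than its own. It is written in Python for brevity and is not how bosh-agent works today. The helper names `run_with_lower_priority` and `current_niceness` and the increment of 10 are assumptions made for this example only, and the probe command merely stands in for a pre-start script.

```python
import os
import subprocess
import sys

# Hypothetical increment; not a value taken from bosh-agent or the stemcell.
NICE_INCREMENT = 10


def current_niceness():
    """Return the niceness of the calling process."""
    return os.getpriority(os.PRIO_PROCESS, 0)


def run_with_lower_priority(argv, increment=NICE_INCREMENT):
    """Run argv with its niceness raised by `increment` relative to the caller."""

    def lower_priority():
        # Runs in the child between fork() and exec(), so the script and every
        # sub-process it spawns inherit the raised niceness. The parent's own
        # priority is left untouched.
        os.nice(increment)

    return subprocess.run(
        argv,
        preexec_fn=lower_priority,
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Stand-in for a pre-start script: it only reports its own niceness.
    probe = [sys.executable, "-c",
             "import os; print(os.getpriority(os.PRIO_PROCESS, 0))"]
    result = run_with_lower_priority(probe)
    print("parent niceness:", current_niceness())
    print("child niceness: ", result.stdout.strip())
```

Raising the niceness needs no extra privileges, so the same approach would also work from inside a pre-start script as the quick mitigation mentioned above. Doing it in the launcher has the advantage that every hook gets the lower priority without each release having to remember it.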