Competing HA takeover results in both instances becoming active #787
Comments
A strange HA behavior was reported in [icingadb-787], resulting in both instances being active. The logs contained an entry of the previous active instance exiting the HA.realize() method successfully after 1m9s. This, however, should not be possible, as the method's context is deadlined to one minute after the heartbeat was received. With the introduction of Settings.QuickContextExit for retry.WithBackoff in [icinga-go-library-69], which is now used here, the function returns a context.DeadlineExceeded error the moment the context expires. Doing so allows handing over directly, while the other instance can take over due to the expired heartbeat in the database. As a related change, the HA.insertEnvironment() method was inlined into the retryable function to use the deadlined context. Previously it was called within HA.realize() without the passed context and might therefore block afterwards. In addition, the main loop select cases for hactx.Done() and ctx.Done() were unified, as hactx is derived from ctx; a closed ctx case might otherwise be lost if the hactx case were chosen. [icinga-go-library-69]: Icinga/icinga-go-library#69 [icingadb-787]: #787
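To illustrate the idea behind such a quick context exit, here is a minimal, self-contained sketch of a retry-with-backoff loop. It is not the icinga-go-library API; withBackoff, quickExit, and the fixed backoff interval are illustrative assumptions.

```go
package retrysketch

import (
	"context"
	"time"
)

// withBackoff retries fn until it succeeds or ctx expires. With
// quickExit set, an expired context surfaces immediately as ctx.Err()
// (e.g. context.DeadlineExceeded) instead of the last attempt's error.
func withBackoff(ctx context.Context, fn func(context.Context) error, quickExit bool) error {
	var lastErr error
	for {
		if quickExit && ctx.Err() != nil {
			// Exit right away; do not start another attempt.
			return ctx.Err()
		}

		if lastErr = fn(ctx); lastErr == nil {
			return nil
		}

		// A fixed interval keeps the sketch short; a real helper would
		// back off exponentially between attempts.
		select {
		case <-time.After(100 * time.Millisecond):
		case <-ctx.Done():
			if quickExit {
				return ctx.Err()
			}
			return lastErr
		}
	}
}
```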
Over the last days, I was digging both into the HA realization logic and the context lifecycle. In the context domain, there is unfortunately no direct relationship between the HA context and the context used for database interactions. Due to the architecture, it is not easy to let all database queries honor a derived HA context, as I initially hoped when building Icinga/icinga-go-library#68 for that purpose. Details about the derived contexts:
A strange HA behavior was reported in #787, resulting in both instances being active. The logs contained an entry of the previous active instance exiting the HA.realize() method successfully after 1m9s. This, however, should not be possible, as the method's context is deadlined to one minute after the heartbeat was received. As it turns out, executing COMMIT on a database transaction is not bound to the transaction's context, allowing it to survive longer. To mitigate this, another context watch was introduced. Doing so allows handing over directly, while the other instance can take over due to the expired heartbeat in the database. As a related change, the HA.insertEnvironment() method was inlined into the retryable function to use the deadlined context. Previously it was called within HA.realize() without the passed context and might therefore block afterwards. Since the retryable HA function may be executed a few times before succeeding, the inserted heartbeat value would be outdated right away; the heartbeat logic was slightly altered to always use the latest heartbeat time value. In addition, the main loop select cases for hactx.Done() and ctx.Done() were unified, as hactx is derived from ctx; a closed ctx case might otherwise be lost if the hactx case were chosen.
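A minimal sketch of what such a context watch around COMMIT could look like, assuming database/sql; commitWithContext is a hypothetical helper, not the actual Icinga DB code. Since database/sql's Tx.Commit takes no context, waiting for it in a separate goroutine lets the caller return as soon as the deadline expires.

```go
package hasketch

import (
	"context"
	"database/sql"
)

// commitWithContext waits for tx.Commit, but returns early with
// ctx.Err() if the context expires first. The COMMIT itself keeps
// running in the background; we merely stop waiting for it so the
// instance can hand over HA immediately.
func commitWithContext(ctx context.Context, tx *sql.Tx) error {
	done := make(chan error, 1)
	go func() { done <- tx.Commit() }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}
```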
I'm probably having the same issue with 2 instances.
1st instance:
2nd instance:
Hi @deric, thanks for reporting! Your logs actually look fine except for the last crash. Note that both instances are allowed to write history data concurrently, i.e. while the active instance logs:
@deric: Thanks for posting your logs and please excuse the trouble. If possible, please supply the logs with additional fields via journalctl's JSON output (e.g. journalctl -o json). Even if #800 also has a MySQL spin, it addresses some general HA realization topics. Is it possible to deploy a custom build in your environment? There is also #830 with more PostgreSQL specifics. How often does this happen? Was this the first time or is this a more occasional event? Was there any additional load on the database during the event?
@oxzi It happens quite frequently (e.g. daily), the database runs on a dedicated physical server. There's basically no load at all until queries arrive from both icingadb instances.
icingadb version:
Yes, I can deploy a custom build. Do you have pre-release packages or should I compile it from a specific branch?
Yet another crash:
Describe the bug
Apparently there is a race condition in our HA realize handling that can occur from time to time, causing both Icinga DB instances to become active/responsible at the same time. It is possible for the following to happen if both instances access the icingadb_instance table at the same time. Since both instances are trying to update that table quite often[^1], it is to be expected that they will deadlock each other, as can be seen in the logs.
However, this retry mechanism also prevents new heartbeat events from being consumed from the channel and results in such a weird situation.
In order to prevent it from being retried endlessly, though, we have a deadline[^2] for the entire HA realization.
icingadb/pkg/icingadb/ha.go, lines 220–221 @ f9ce7a5
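The permalink above points at where that deadline is derived. As a hedged sketch of the shape such a deadline could take (the identifiers here are illustrative, not the actual ones in ha.go):

```go
package hasketch

import (
	"context"
	"time"
)

// realizeDeadline derives a context that expires one minute after the
// given heartbeat was received, so retried transactions cannot outlive
// the heartbeat's validity.
func realizeDeadline(ctx context.Context, heartbeatReceived time.Time) (context.Context, context.CancelFunc) {
	return context.WithDeadline(ctx, heartbeatReceived.Add(time.Minute))
}
```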
Unfortunately, if all this retrying lasts for almost exactly one minute and barely survives the context deadline, the time is wasted in the end: after this delay the heartbeat of the instance is already outdated, but has nevertheless been written to the database. Obviously, the current instance remains responsible; it just has written an outdated heartbeat timestamp to the database because of the deadlocks. When the passive instance finally reads that timestamp, it thinks that the other instance is no longer available and takes over the HA responsibility.
icingadb/pkg/icingadb/ha.go, lines 319–321 @ f9ce7a5
Having two active instances, however, does not produce any good results and brings down one of the two services.
Active IcingaDB Instance
Take a closer look at the ICINGADB_AFTER field of the above log entry: "ICINGADB_AFTER": "1m9.144891965s". It originated from this part of the code: icingadb/pkg/icingadb/ha.go, lines 410–413 @ f9ce7a5
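How such a field ends up in the journal can be sketched roughly like this, assuming zap-style structured logging (the logger setup and message text are assumptions, not the code behind the permalink):

```go
package main

import (
	"time"

	"go.uber.org/zap"
)

func main() {
	logger := zap.NewExample().Sugar()

	start := time.Now()
	// ... the retried HA realization would run here ...

	// A structured duration field like this is the kind of code that
	// yields the ICINGADB_AFTER value shown above.
	logger.Debugw("realized HA state", "after", time.Since(start))
}
```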
The other (passive) instance uses the heartbeat timestamp of this instance from the database to determine whether the currently active instance is absent, by verifying that the timestamp does not exceed the limit of 1m plus a grace period of 5s. However, this instance has spent over a minute and 5 seconds (to be precise, it took 1m9.144891965s) to record the heartbeat, so it exceeded the aforementioned limit immediately after the upsert:
icingadb/pkg/icingadb/ha.go, line 319 @ f9ce7a5
icingadb/pkg/icingadb/ha.go, lines 25–28 @ f9ce7a5
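The referenced lines presumably hold the timeout and grace period. A minimal sketch of the resulting takeover check, with illustrative constant and function names rather than the actual ha.go identifiers:

```go
package hasketch

import "time"

// Illustrative constants mirroring the limits described above; the
// actual definitions live in icingadb/pkg/icingadb/ha.go.
const (
	peerTimeout = time.Minute     // Redis heartbeat timeout
	gracePeriod = 5 * time.Second // extra slack before taking over
)

// otherInstanceAbsent reports whether the active peer's last recorded
// heartbeat is too old, i.e. whether the passive instance may take over.
func otherInstanceAbsent(lastHeartbeat time.Time) bool {
	return time.Since(lastHeartbeat) > peerTimeout+gracePeriod
}
```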
Subsequently, the other instance takes over the HA responsibility, even though this one is actually still responsible.
Passive IcingaDB Instance
Afterwards, the passive instance also becomes active, starts its initial config sync, and encounters numerous duplicate key and deadlock errors until one of the queries exceeds the retry limit of 5m and crashes the Icinga DB process.
ref/IP/48748
Additional context
Footnotes
[^1]: It's a well-known fact that the Icinga 2 heartbeats arrive in Redis every 1 second, but we only read them every 3 seconds, and these 3 seconds are the pacesetter for the HA realization.

[^2]: The overall Redis heartbeat timeout is set to 1m, i.e. if we do not receive an Icinga 2 heartbeat for more than one minute, we hand over the HA responsibility. Seemingly, that very same timeout is also used as a deadline for the HA transaction retries.