Cannot remove multiple IP addresses in a single check #30

Open
chrisrogerson opened this issue Mar 8, 2022 · 9 comments

Comments

@chrisrogerson

I have been implementing this project for a while with BIRD2 for a single IPv4 and IPv6 IP address using separate checks.
I have now run into a need to add more addresses.
I have tried using more checks and found that past 2 checks, the program doesn't withdraw and add the route properly.
This seems to be due to the "birdc configure" command being run 3 times in very quick succession, as my anycast-prefixes files do update properly.
I would like to add all of the addresses (I could end up with 12 or more on a single server) to a single check, so that "birdc configure" runs only once, but I can't figure out how to do this.

@unixsurfer
Owner

Having multiple IPs for a single check is something useful and practical.
But I want to better understand the failure you are describing, as it shouldn't happen. I have used it in environments with more than 50 checks and frequent updates, and I never noticed this. Moreover, we update the prefix file atomically, so there is no way that BIRD will see more than one view of the content at a given time.
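
For readers unfamiliar with the term, an atomic file update usually means writing the new content to a temporary file and then renaming it over the target. The following is a minimal sketch of that general technique, not the project's actual code; the function name `write_atomically` is hypothetical.

```python
import os
import tempfile


def write_atomically(path, lines):
    """Write lines to path so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final rename
    # stays on one filesystem, which is what makes the swap atomic on POSIX.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write("\n".join(lines) + "\n")
            tmp.flush()
            os.fsync(tmp.fileno())
        # Replace the target in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this pattern a reader that opens the file at any moment gets one complete version, never a half-written one.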

Could you please share some logs from when the problem occurred?

@chrisrogerson
Author

I use a single server as an anycast backup for the same service run on multiple addresses (AS path postpending in the peered router makes this server less preferred).
I run the same script for multiple checks covering multiple IP addresses (8 IPv4 and currently one IPv6).
When that script fails, all 9 checks fail and the server has to run the "birdc configure" command 9 times at once, and that fails. The same happens when the checks recover and attempt to re-add the addresses.
If I could add multiple IPs to a single check in the anycast-healthchecker config, it would likely resolve this issue for me, as "birdc configure" would run only once when the check fails or recovers rather than 9 times all at once.
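
To make the request concrete, here is a rough sketch of the behaviour being asked for: one check owns several prefixes, and a failure withdraws all of them with a single reconfigure. This is purely illustrative and is not an existing anycast-healthchecker feature; `withdraw_check` and the `reconfigure` callback are hypothetical names.

```python
def withdraw_check(announced, check_prefixes, reconfigure):
    """Remove every prefix owned by one failed check, then reconfigure once.

    announced      -- list of prefixes currently written to the prefix file
    check_prefixes -- all prefixes tied to the failed check (assumption)
    reconfigure    -- callable standing in for "birdc configure"
    """
    owned = set(check_prefixes)
    remaining = [prefix for prefix in announced if prefix not in owned]
    # Reconfigure only if something actually changed, and only one time.
    if remaining != announced:
        reconfigure()
    return remaining
```

With nine prefixes attached to one check, the callback would fire once instead of nine times.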

@unixsurfer
Owner

We run birdc configure only once at a given time, not multiple times. There is a queue where service checks put their results, and the main thread picks up items one by one, so we ensure that birdc configure runs only once at any given time. We may run it multiple times within a few seconds, but that hasn't been a problem.
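
The queue design described above can be pictured roughly as follows. This is a simplified sketch under assumptions, not the project's code: the function names, the tuple layout and the documentation-range prefixes are all placeholders.

```python
import queue
import threading


def service_check(name, prefix, action_queue):
    """Hypothetical check thread: report a failure by queueing a withdrawal."""
    action_queue.put((name, prefix, "del"))


def main_loop(action_queue, expected_items, reconfigure):
    """Take items off the queue one by one and act on each in turn.

    Only the caller's thread runs reconfigure, so invocations never overlap,
    although they can follow each other within milliseconds.
    """
    handled = []
    for _ in range(expected_items):
        name, prefix, action = action_queue.get()
        reconfigure(prefix, action)
        handled.append(name)
    return handled


def run_demo():
    """Start three check threads and process their results serially."""
    action_queue = queue.Queue()
    calls = []
    checks = [
        ("CHECK1", "192.0.2.1/32"),      # placeholder prefixes
        ("CHECK2", "192.0.2.2/32"),
        ("CHECK3", "2001:db8::1/128"),
    ]
    threads = [
        threading.Thread(target=service_check, args=(name, prefix, action_queue))
        for name, prefix in checks
    ]
    for thread in threads:
        thread.start()
    handled = main_loop(
        action_queue,
        len(checks),
        lambda prefix, action: calls.append((prefix, action)),
    )
    for thread in threads:
        thread.join()
    return handled, calls
```

The point of the sketch is that serialization comes from having a single consumer, not from any delay between invocations.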

Could you please share some logs where you notice multiple invocations of birdc configure?

@chrisrogerson
Author

The issue is that I am not running that command multiple times in a few seconds. I am running it multiple times in one second.
I have pasted sanitized logs below where you can see three checks removing their addresses at once. When this occurs, the first two addresses are withdrawn but the third is not. I have tried reordering the checks to see if that matters, and it is always the third check that does not withdraw its address. I can manually run the "birdc configure" command after this and have the address withdrawn. This is why I would like to attach multiple addresses to a single check.

Sanitized Logs:
2022-04-01 11:03:46,665 anycast-healthchecker[2745600] INFO MainThread returned an item from the queue for CHECK1 with IP prefix [IPv4 Address 1] and action to delete from Bird configuration
2022-04-01 11:03:46,666 anycast-healthchecker[2745600] INFO MainThread withdrawing [IPv4 Address 1] for CHECK1
2022-04-01 11:03:46,666 anycast-healthchecker[2745600] DEBUG MainThread going to write to /etc/bird/1648829026.666751
2022-04-01 11:03:46,667 anycast-healthchecker[2745600] INFO MainThread Bird configuration for IPv4 is updated
2022-04-01 11:03:46,668 anycast-healthchecker[2745600] INFO MainThread reconfiguring BIRD by running sudo /usr/sbin/birdc configure
2022-04-01 11:03:46,675 anycast-healthchecker[2745600] INFO MainThread reconfigured BIRD daemon
2022-04-01 11:03:46,675 anycast-healthchecker[2745600] INFO MainThread returned an item from the queue for CHECK2 with IP prefix [IPv4 Address 2] and action to delete from Bird configuration
2022-04-01 11:03:46,675 anycast-healthchecker[2745600] INFO MainThread withdrawing [IPv4 Address 2] for CHECK2
2022-04-01 11:03:46,675 anycast-healthchecker[2745600] DEBUG MainThread going to write to /etc/bird/1648829026.675965
2022-04-01 11:03:46,676 anycast-healthchecker[2745600] INFO MainThread Bird configuration for IPv4 is updated
2022-04-01 11:03:46,676 anycast-healthchecker[2745600] WARNING MainThread Bird configuration doesn't have IP prefixes for any of the services we monitor! It means local node doesn't receive any traffic
2022-04-01 11:03:46,676 anycast-healthchecker[2745600] INFO MainThread reconfiguring BIRD by running sudo /usr/sbin/birdc configure
2022-04-01 11:03:46,682 anycast-healthchecker[2745600] INFO MainThread reconfigured BIRD daemon
2022-04-01 11:03:46,682 anycast-healthchecker[2745600] INFO MainThread returned an item from the queue for CHECK3 with IP prefix [IPv6 Address 1] and action to delete from Bird configuration
2022-04-01 11:03:46,682 anycast-healthchecker[2745600] INFO MainThread withdrawing [IPv6 Address 1] for CHECK3
2022-04-01 11:03:46,682 anycast-healthchecker[2745600] DEBUG MainThread going to write to /etc/bird/1648829026.6829078
2022-04-01 11:03:46,683 anycast-healthchecker[2745600] INFO MainThread Bird configuration for IPv6 is updated
2022-04-01 11:03:46,683 anycast-healthchecker[2745600] WARNING MainThread Bird configuration doesn't have IP prefixes for any of the services we monitor! It means local node doesn't receive any traffic
2022-04-01 11:03:46,683 anycast-healthchecker[2745600] INFO MainThread reconfiguring BIRD by running sudo /usr/sbin/birdc configure
2022-04-01 11:03:46,689 anycast-healthchecker[2745600] INFO MainThread reconfigured BIRD daemon
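
To put a number on "multiple times in one second", the gaps between the three "reconfiguring BIRD" lines in the log above can be computed from their timestamps. A small sketch follows; the helper name `gaps_ms` is made up for illustration.

```python
from datetime import datetime

# Timestamps of the three "reconfiguring BIRD" lines in the log above.
stamps = [
    "2022-04-01 11:03:46,668",
    "2022-04-01 11:03:46,676",
    "2022-04-01 11:03:46,683",
]


def gaps_ms(timestamps):
    """Return the gaps between consecutive timestamps in milliseconds."""
    parsed = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S,%f") for t in timestamps]
    return [
        round((later - earlier).total_seconds() * 1000)
        for earlier, later in zip(parsed, parsed[1:])
    ]


print(gaps_ms(stamps))  # [8, 7]
```

So the three invocations are 8 ms and 7 ms apart, all within roughly 15 ms.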

@unixsurfer
Owner

Thanks for the log.
What do you see in the BIRD log? Have you seen this problem only with the IPv6 address?

For debugging purposes, can you add before https://github.com/unixsurfer/anycast_healthchecker/blob/master/anycast_healthchecker/healthchecker.py#L220

import time
time.sleep(1)

I am curious to see if that makes any difference.

Can you also try setting splay_startup but without the above code change?

@chrisrogerson
Author

This is the BIRD log from that time:
2022-04-01 11:03:46.674 Reconfiguring
2022-04-01 11:03:46.674 Reconfigured
2022-04-01 11:03:46.681 Reconfiguring
2022-04-01 11:03:46.681 Reloading channel SW5LAB.ipv4
2022-04-01 11:03:46.681 Reconfigured
2022-04-01 11:03:46.688 Reconfiguring
2022-04-01 11:03:46.688 Reloading channel SW5LAB.ipv4
2022-04-01 11:03:46.688 Reconfigured

This does not just affect the IPv6, it affects any checks after the first two as ordered in the configuration regardless of IP version. I just happened to have the ipv6 check as the third in order in that log.

If we are being honest, I have no idea how to implement the debug you have suggested.

I did apply splay_startup = 50 (no idea what the units are) and that resolved the issue I was having.

@chrisrogerson
Author

Actually, it would seem that the splay_startup setting only has the effect of randomizing the failure. Depending on the amount of splay between checks, it may or may not correct the issue.

@unixsurfer
Owner

From the README:


    splay_startup Unset by default

The maximum time to delay the startup of service checks. You can use either integer or floating-point number as a value.

In order to avoid launching all checks at the same time, after anycast-healthchecker is started, we can delay the 1st check in a random way. This can be useful in cases where we have a lot of service checks and launching all of them at the same time can overload the system. We randomize the delay of the 1st check for each service, and splay_startup sets the maximum time we can delay that 1st check.

The unit is seconds; it is a doc bug that we don't mention it :-)
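
As an illustration of what the README text describes, a random startup delay bounded by splay_startup could look like the sketch below. It is written from the description only and is not copied from the project; `startup_delay` and `run_first_check` are hypothetical names.

```python
import random
import time


def startup_delay(splay_startup, rng=random):
    """Pick a random delay in [0, splay_startup] seconds; 0 when unset."""
    if splay_startup is None:
        return 0.0
    return rng.uniform(0, float(splay_startup))


def run_first_check(check, splay_startup=None, sleep=time.sleep):
    """Delay only the first run of a check by a random amount, then run it."""
    sleep(startup_delay(splay_startup))
    return check()
```

Because each delay is drawn independently, two checks can still end up starting close together, which would fit the intermittent behaviour reported earlier in the thread.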

At least you now have a workaround.

@danpoltawski

Just to add a +1 to allow a single check to impact multiple prefixes - it feels somewhat wasteful healthchecking the same thing for this.
