
Redis probes list issues - Multi-instance API problems #120

Closed
jimaek opened this issue May 12, 2022 · 7 comments

@jimaek
Member

jimaek commented May 12, 2022

This is a task to track the issue where an API instance sometimes shows only a partial list of the connected probes.
It has happened both in a simple CLI script and in the production API.

Need to verify that our API is 100% stable when running multiple instances with a central Redis DB.

@jimaek

This comment was marked as outdated.

@patrykcieszkowski
Contributor

Here's the problem:

  • server A manages X probes
  • server B manages Y probes
  • the load balancer redirects the user to A or B
  • at random, server A or B wouldn't respond to the probe list query (pub/sub)

@jimaek jimaek changed the title Redis probes list issues Redis probes list issues - Multi-instance API problems Aug 18, 2022
@alexey-yarmosh
Member

Hey @patrykcieszkowski, I am trying to address that issue, and Dmitriy told me that you had a script to get the list of probes from a specific node instance/process. Could you share it, please, if you don't mind?

Also, if you have any info on how the issue can be reliably reproduced, that would be very helpful. Thanks!

@patrykcieszkowski
Contributor

I don't recall writing such a script, but it should be as simple as adding a node identifier key/value pair to the probe data.

https://github.com/jsdelivr/globalping/blob/master/src/probe/builder.ts#L90-L105
https://github.com/jsdelivr/globalping/blob/master/src/probe/route/get-probes.ts#L16-L33
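For illustration, a minimal sketch of what such tagging could look like, assuming socket.io and a simplified `Probe` type (this is not the actual `builder.ts` code):

```ts
// Hypothetical sketch, not the project's actual code: tag every probe with the
// API instance that accepted its connection, so GET /probes can show which
// node produced each entry.
import * as os from 'node:os';
import type { Socket } from 'socket.io';

// Simplified stand-in for the real Probe type from builder.ts.
type Probe = {
	client: string;
	nodeId: string; // identifier of the API instance/process handling the socket
};

// Stable per-process identifier for this API instance.
const nodeId = `${os.hostname()}-${process.pid}`;

export const buildProbe = (socket: Socket): Probe => ({
	client: socket.id,
	nodeId,
});
```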

I also never figured out how to consistently reproduce the issue. In fact, it never happened on my local network, even while running over 500 probes. One thing is certain: even when connecting to the WS pool externally and pulling the probe list while bypassing the HTTP server, the behaviour mentioned in the comment above was still present. I came to the conclusion that some nodes either never receive the pub message requesting the data, or don't respond to it in time (the redis-adapter has a timeout).
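For context, this is roughly how that round-trip works with `@socket.io/redis-adapter`; the setup below is an illustrative sketch (Redis URL and timeout value are assumptions), not the project's actual configuration:

```ts
// Illustrative sketch: fetchSockets() with the redis adapter publishes a
// request over Redis pub/sub and waits up to `requestsTimeout` for the other
// instances to reply. An instance that misses the pub message or replies late
// simply contributes no probes to the result (or the call rejects on timeout).
import { createServer } from 'node:http';
import { Server } from 'socket.io';
import { createClient } from 'redis';
import { createAdapter } from '@socket.io/redis-adapter';

const pubClient = createClient({ url: 'redis://localhost:6379' }); // assumed URL
const subClient = pubClient.duplicate();
await Promise.all([pubClient.connect(), subClient.connect()]);

const io = new Server(createServer(), {
	adapter: createAdapter(pubClient, subClient, { requestsTimeout: 5000 }),
});

// Local sockets plus whatever the remote instances managed to report in time.
const sockets = await io.fetchSockets();
console.log(`visible probes: ${sockets.length}`);
```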

@alexey-yarmosh
Member

I was constantly requesting the /probes endpoint from both APIs (see the comparison sketch after this list), and what I am observing is:

  • Under usual load, the diff between responses may be ~1-2 probes, because some probes are constantly reconnecting (IP limit). The current fetchSockets adapter first gets the local probes, then asks for the remote ones. While awaiting the remote ones, some local probes may disconnect, but we already got the list - that is why the desync happens.
  • Under high load, redis operations (and the pub/sub that the adapter uses) take more time, and there are more probes reconnecting, so the diff may be ~1-10 probes.
  • Also, under load there is issue timeout reached while waiting for fetchSockets response #234 - in that case a 500 HTTP error is returned.
  • Also, under load sometimes both APIs simultaneously respond only with their own probes.
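For reference, a rough sketch of the kind of comparison described above; the instance URLs and the `id` field are assumptions for illustration:

```ts
// Hypothetical comparison script: poll GET /probes on both API instances and
// report how many probes appear on only one of them.
const endpoints = [
	'http://api-a.internal/v1/probes', // assumed URLs of the two instances
	'http://api-b.internal/v1/probes',
];

const fetchIds = async (url: string): Promise<Set<string>> => {
	const res = await fetch(url);
	const probes = (await res.json()) as Array<{ id: string }>; // `id` field is assumed
	return new Set(probes.map(p => p.id));
};

setInterval(async () => {
	const [a, b] = await Promise.all(endpoints.map(fetchIds));
	const onlyA = [...a].filter(id => !b.has(id));
	const onlyB = [...b].filter(id => !a.has(id));
	console.log(`only on A: ${onlyA.length}, only on B: ${onlyB.length}`);
}, 5_000);
```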

As I see it, the only thing we can do here is to try another adapter implementation and compare the behaviour. For some teams the AMQP adapter showed really good results.

@alexey-yarmosh
Member

alexey-yarmosh commented Dec 1, 2022

The AMQP adapter does not support some of the required operations (e.g. fetchSockets()).
I've also tried the NATS adapter, but fetchSockets there does not work as expected either.

@alexey-yarmosh
Member

alexey-yarmosh commented Jan 3, 2023

I think we can close this, as under usual load GET /probes works without issues. Only under high load (when redis operations start to take >30 sec) do we observe the issues as well as the 500 error. So we should focus on the root cause (redis performance) in other GH issues, which we are already doing.

@jimaek jimaek closed this as completed Jan 3, 2023