
Pool distributes load unevenly across connections. #3648

Open
JustinTRoss opened this issue Sep 26, 2024 · 9 comments
Labels
bug Something isn't working Pool Features or fixes related to the `Pool` class

Comments

@JustinTRoss

Bug Description

Pool distributes load unevenly across connections, prioritizing them by creation time.

Where your connections are load-balanced across instances (pods/containers), this results in overuse of some instances and underuse of others.

Reproducible By

  1. Create a service, ServiceA, with an endpoint that takes 100ms - 200ms to respond.
  2. Using Kubernetes, deploy 5 pods with instances of ServiceA.
  3. In another service, ServiceB, create a Pool of connections to ServiceA, referencing it by Kubernetes Service address (https://service-name), such that we'll be load-balancing connections through kube-proxy.
  4. Send 1 request to ServiceB every 50ms for 1 minute.
  5. Observe balanced connections across ServiceA pods, but imbalanced request distribution and resource usage across ServiceA pods.

This may also be observable directly through internal diagnostics, e.g. the number of requests processed by each Client.

Expected Behavior

Roughly balanced request distribution and resource usage across Clients and ServiceA pods.

Environment

macOS, Node v18

Additional context

There are a number of ways we could resolve this intelligently. Most simply (if crudely), we could track the index of the last-used client and resume the next scan from that index, yielding a rough round-robin.

Something like:

...
const totalClients = clients.length;
// Resume scanning from the client after the last one used, wrapping
// around, so selection is roughly round-robin instead of always
// starting from the oldest connection.
let clientIndex = (this.lastUsedIndex + 1) % totalClients;

for (let i = 0; i < totalClients; i++) {
  const client = clients[clientIndex];

  // Pick the first client that isn't waiting to drain.
  if (!client[kNeedDrain]) {
    this.lastUsedIndex = clientIndex;
    return client;
  }

  clientIndex = (clientIndex + 1) % totalClients;
}
@JustinTRoss JustinTRoss added the bug Something isn't working label Sep 26, 2024
@Uzlopak Uzlopak added the Pool Features or fixes related to the `Pool` class label Sep 26, 2024
@Uzlopak
Contributor

Uzlopak commented Sep 26, 2024

Imho duplicate of #310

@mcollina
Member

The problem stems from the use of Keep-Alive connections. Essentially, we prioritize "alive" connections as much as possible because they are the ones we know are lively and less likely to timeout/error due to network disconnections. This is suboptimal in case of low traffic but minimizes errors. In a high-traffic scenario, this is equivalent to round-robin.

Have you tried using BalancedPool? It implements a different algorithm that should work better for your use case.

@JustinTRoss
Author

Thanks for the quick response.

I see the challenge of balancing against keeping connections alive, prioritizing sending requests through as few connections as possible to ensure maximum reuse. I don't understand the bit about being suboptimal for low traffic and round-robin for high-traffic. It seems the opposite to me.

As I reason about it, with low traffic, long request processing times, and short keep-alives, it will be round-robin, but there is an edge case in which we potentially fail to reuse a connection occasionally. With high traffic, we end up with a linear trend of more requests processed by earlier connections and fewer by later connections, with order of connection index use ending up something like: 1, 2, 3, 1, 4, 2, 1, 5, 2, 3, 1
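That skew can be sketched with a toy discrete-time simulation. This is not undici's actual scheduling code; the timings mirror the repro above (5 connections, a request every 50ms, ~150ms service time), and the `simulate`, `firstFree`, and `roundRobin` helpers are illustrative assumptions:

```javascript
// Toy model: requests arrive at fixed intervals; each occupies a
// connection for `serviceMs`. `pick` chooses which free connection
// serves the request. With these parameters a free connection always
// exists (at most 3 of 5 are busy at once).
function simulate(pick, { connections = 5, intervalMs = 50, serviceMs = 150, requests = 60 } = {}) {
  const busyUntil = new Array(connections).fill(0);
  const counts = new Array(connections).fill(0);
  let last = -1;

  for (let i = 0; i < requests; i++) {
    const now = i * intervalMs;
    const idx = pick(busyUntil, now, last);
    last = idx;
    busyUntil[idx] = now + serviceMs;
    counts[idx]++;
  }
  return counts;
}

// Current behavior: scan from index 0, take the first free connection.
const firstFree = (busyUntil, now) => busyUntil.findIndex((t) => t <= now);

// Proposed behavior: resume scanning after the last-used index.
const roundRobin = (busyUntil, now, last) => {
  const n = busyUntil.length;
  for (let i = 1; i <= n; i++) {
    const idx = (last + i) % n;
    if (busyUntil[idx] <= now) return idx;
  }
  return -1;
};

console.log('first-free :', simulate(firstFree));  // [20, 20, 20, 0, 0]
console.log('round-robin:', simulate(roundRobin)); // [12, 12, 12, 12, 12]
```

Under first-free selection, the two newest connections are never used at all, while round-robin spreads the same load evenly.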

I haven't used BalancedPool. I've looked at it, but it seemed intended for multiple distinct upstream targets, such that using it for a single upstream would need to be something like:

new BalancedPool([
  "http://service-a",
  "http://service-a",
  "http://service-a",
  "http://service-a",
  "http://service-a",
]);

Is my understanding off?

@JustinTRoss
Author

JustinTRoss commented Sep 26, 2024

I think I follow actually. We'd never let go of connections once created, because all connections would have roughly the same duration since the previous request. Then, once the pool got large enough and request volume dropped enough, it would timeout all connections at once.

Does that capture the reasoning?
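The trade-off can be illustrated with another toy model, this one of keep-alive behavior under low traffic. None of this is undici internals; the `reconnects` helper and all numbers (one request every 250ms, 300ms keep-alive timeout) are illustrative assumptions:

```javascript
// Toy model: count how often a connection has sat idle past the
// keep-alive timeout and must be reopened before serving a request.
function reconnects(order, { intervalMs = 250, keepAliveMs = 300, requests = 40, connections = 5 } = {}) {
  const lastUsed = new Array(connections).fill(-Infinity);
  let reopened = 0;
  for (let i = 0; i < requests; i++) {
    const now = i * intervalMs;
    const idx = order(i, connections);
    if (now - lastUsed[idx] > keepAliveMs) reopened++; // idle too long: reconnect
    lastUsed[idx] = now;
  }
  return reopened;
}

// Creation-order priority: under low traffic one connection absorbs
// everything and stays warm.
const alwaysFirst = () => 0;

// Round-robin: each connection only sees a request every
// connections * intervalMs = 1250ms, past the keep-alive timeout.
const roundRobin = (i, n) => i % n;

console.log(reconnects(alwaysFirst)); // 1 (just the initial connect)
console.log(reconnects(roundRobin)); // 40 (every request reconnects)
```

This is the "minimizes errors in low traffic" point: concentrating requests on the oldest connections keeps them alive, whereas naive round-robin lets every connection's keep-alive expire between uses.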

@mcollina
Member

I think I follow actually. We'd never let go of connections once created, because all connections would have roughly the same duration since the previous request. Then, once the pool got large enough and request volume dropped enough, it would timeout all connections at once.

Does that about sum it up?

Yes, exactly.


If we can improve BalancedPool to cater for the Kubernetes/load balancer case, please send a PR!

@JustinTRoss
Author

If we can improve BalancedPool to cater for the Kubernetes/load balancer case, please send a PR!

I'd be happy to if it makes sense and we can identify a sensible approach. It doesn't seem intended for that case though, as I share above.

How are you thinking it could be used to balance requests across a single upstream service?

@mcollina
Member

I think the algorithm used in balanced pool could be extended to work "in front" of a load balancer.

@JustinTRoss
Author

We can potentially steal inspiration from the http lib as well. Using Axios, the requests were distributed fairly evenly. We only noticed the change when switching to Undici. Request volume for this service is 500-1000 per second.

@mcollina
Member

A PR would be highly welcomed.
