
Pool distributes load unevenly across connections. #3648

Open
JustinTRoss opened this issue Sep 26, 2024 · 9 comments
Labels
bug Something isn't working Pool Features or fixes related to the `Pool` class

Comments

@JustinTRoss

Bug Description

Pool distributes load unevenly across connections, prioritizing them by creation time.

Where your connections are load-balanced across instances (pods/containers), this results in overuse of some instances and underuse of others.

Reproducible By

  1. Create a service, ServiceA, with an endpoint that takes 100ms - 200ms to respond.
  2. Using Kubernetes, deploy 5 pods with instances of ServiceA.
  3. In another service, ServiceB, create a Pool of connections to ServiceA, referencing it by Kubernetes Service address (https://service-name), such that we'll be load-balancing connections through kube-proxy.
  4. Send 1 request to ServiceB every 50ms for 1 minute.
  5. Observe balanced connections across ServiceA pods, but imbalanced request distribution and resource usage across ServiceA pods.

This may also be observable directly through internal diagnostics, e.g. the number of requests processed by each Client.

Expected Behavior

Roughly balanced request distribution and resource usage across Clients and ServiceA pods.

Environment

macOS, Node v18

Additional context

There are a number of ways we could resolve this intelligently. Most simply (if crudely), we could track the index of the last-used client and resume the next scan from that index, yielding a rough round-robin.

Something like:

...
const totalClients = clients.length;
// Resume scanning from the client after the last one used, wrapping
// around, so selection is roughly round-robin instead of always
// starting from the oldest connection.
let clientIndex = (this.lastUsedIndex + 1) % totalClients;

for (let i = 0; i < totalClients; i++) {
  const client = clients[clientIndex];

  // Pick the first client that isn't waiting to drain.
  if (!client[kNeedDrain]) {
    this.lastUsedIndex = clientIndex;
    return client;
  }

  clientIndex = (clientIndex + 1) % totalClients;
}
@JustinTRoss JustinTRoss added the bug Something isn't working label Sep 26, 2024
@Uzlopak Uzlopak added the Pool Features or fixes related to the `Pool` class label Sep 26, 2024
@Uzlopak
Contributor

Uzlopak commented Sep 26, 2024

Imho duplicate of #310

@mcollina
Member

The problem stems from the use of Keep-Alive connections. Essentially, we prioritize "alive" connections as much as possible because they are the ones we know are lively and less likely to timeout/error due to network disconnections. This is suboptimal in case of low traffic but minimizes errors. In a high-traffic scenario, this is equivalent to round-robin.

Have you tried using BalancedPool? It implements a different algorithm that should work better for your use case.

@JustinTRoss
Author

Thanks for the quick response.

I see the challenge of balancing against keeping connections alive, prioritizing sending requests through as few connections as possible to ensure maximum reuse. I don't understand the bit about being suboptimal for low traffic and round-robin for high-traffic. It seems the opposite to me.

As I reason about it, with low traffic, long request processing times, and short keep-alives, it will be round-robin, but there is an edge case in which we potentially fail to reuse a connection occasionally. With high traffic, we end up with a linear trend of more requests processed by earlier connections and fewer by later connections, with order of connection index use ending up something like: 1, 2, 3, 1, 4, 2, 1, 5, 2, 3, 1
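That skew can be sketched with a toy discrete-time simulation. This is not undici's actual scheduling code; the timings mirror the repro above (5 connections, a request every 50ms, ~150ms service time), and the `simulate`, `firstFree`, and `roundRobin` helpers are illustrative assumptions:

```javascript
// Toy model: requests arrive at fixed intervals; each occupies a
// connection for `serviceMs`. `pick` chooses which free connection
// serves the request. With these parameters a free connection always
// exists (at most 3 of 5 are busy at once).
function simulate(pick, { connections = 5, intervalMs = 50, serviceMs = 150, requests = 60 } = {}) {
  const busyUntil = new Array(connections).fill(0);
  const counts = new Array(connections).fill(0);
  let last = -1;

  for (let i = 0; i < requests; i++) {
    const now = i * intervalMs;
    const idx = pick(busyUntil, now, last);
    last = idx;
    busyUntil[idx] = now + serviceMs;
    counts[idx]++;
  }
  return counts;
}

// Current behavior: scan from index 0, take the first free connection.
const firstFree = (busyUntil, now) => busyUntil.findIndex((t) => t <= now);

// Proposed behavior: resume scanning after the last-used index.
const roundRobin = (busyUntil, now, last) => {
  const n = busyUntil.length;
  for (let i = 1; i <= n; i++) {
    const idx = (last + i) % n;
    if (busyUntil[idx] <= now) return idx;
  }
  return -1;
};

console.log('first-free :', simulate(firstFree));  // [20, 20, 20, 0, 0]
console.log('round-robin:', simulate(roundRobin)); // [12, 12, 12, 12, 12]
```

Under first-free selection, the two newest connections are never used at all, while round-robin spreads the same load evenly.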

I haven't used BalancedPool. I've looked at it, but it seemed intended for multiple distinct upstream targets, such that using it for a single upstream would need to be something like:

new BalancedPool([
  "http://service-a",
  "http://service-a",
  "http://service-a",
  "http://service-a",
  "http://service-a",
]);

Is my understanding off?

@JustinTRoss
Author

JustinTRoss commented Sep 26, 2024

I think I follow actually. We'd never let go of connections once created, because all connections would have roughly the same duration since the previous request. Then, once the pool got large enough and request volume dropped enough, it would timeout all connections at once.

Does that capture the reasoning?
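The trade-off can be illustrated with another toy model, this one of keep-alive behavior under low traffic. None of this is undici internals; the `reconnects` helper and all numbers (one request every 250ms, 300ms keep-alive timeout) are illustrative assumptions:

```javascript
// Toy model: count how often a connection has sat idle past the
// keep-alive timeout and must be reopened before serving a request.
function reconnects(order, { intervalMs = 250, keepAliveMs = 300, requests = 40, connections = 5 } = {}) {
  const lastUsed = new Array(connections).fill(-Infinity);
  let reopened = 0;
  for (let i = 0; i < requests; i++) {
    const now = i * intervalMs;
    const idx = order(i, connections);
    if (now - lastUsed[idx] > keepAliveMs) reopened++; // idle too long: reconnect
    lastUsed[idx] = now;
  }
  return reopened;
}

// Creation-order priority: under low traffic one connection absorbs
// everything and stays warm.
const alwaysFirst = () => 0;

// Round-robin: each connection only sees a request every
// connections * intervalMs = 1250ms, past the keep-alive timeout.
const roundRobin = (i, n) => i % n;

console.log(reconnects(alwaysFirst)); // 1 (just the initial connect)
console.log(reconnects(roundRobin)); // 40 (every request reconnects)
```

This is the "minimizes errors in low traffic" point: concentrating requests on the oldest connections keeps them alive, whereas naive round-robin lets every connection's keep-alive expire between uses.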

@mcollina
Member

I think I follow actually. We'd never let go of connections once created, because all connections would have roughly the same duration since the previous request. Then, once the pool got large enough and request volume dropped enough, it would timeout all connections at once.

Does that about sum it up?

Yes, exactly.


If we can improve BalancedPool to cater for the Kubernetes/load balancer case, please send a PR!

@JustinTRoss
Author

If we can improve BalancedPool to cater for the Kubernetes/load balancer case, please send a PR!

I'd be happy to if it makes sense and we can identify a sensible approach. It doesn't seem intended for that case though, as I share above.

How are you thinking it could be used to balance requests across a single upstream service?

@mcollina
Member

I think the algorithm used in balanced pool could be extended to work "in front" of a load balancer.

@JustinTRoss
Author

We can potentially steal inspiration from the http lib as well. Using Axios, the requests were distributed fairly evenly. We only noticed the change when switching to Undici. Request volume for this service is 500-1000 per second.

@mcollina
Member

A PR would be highly welcomed.
