Provide configurable option to queue requests when concurrency is limited with "max-concurrent-requests" #9229

Open
vasanth-bhat opened this issue Aug 30, 2024 · 11 comments · May be fixed by #9295
Labels
4.x Version 4.x webserver

Comments

@vasanth-bhat

vasanth-bhat commented Aug 30, 2024

Environment Details

  • Helidon Version: 4.x
  • Helidon SE or Helidon MP: SE & MP
  • JDK version: Java 21
  • OS: Linux
  • Docker version (if applicable):

In Helidon 4.x, the WebServer supports Loom-based virtual threads and uses the new thread-per-request model. So by design there is no longer a server thread pool or any associated queue where requests get queued.

By default there is no limit on concurrency, and this can lead to issues when resources such as DB connections, external system integrations, and other downstream resources are limited. This can lead to performance degradation and also to errors when requests time out waiting for such resources.

To address this, Helidon provides the "max-concurrent-requests" parameter on the listener configuration. While it helps to limit concurrency, services are running into issues when trying to use this parameter to limit concurrency.

When the "max-concurrent-requests" parameter is set, any surge requests beyond the limit get rejected and fail with 503. There can be occasional surges that can cause the concurrency to go beyond the configured limit, and such cases teh requests would error out.
This behaviour is not consistent with the behavior in earlier versions of Helidon where under this situation the requests would get queued in the queue associated with Helidon's server thread pool.

It would be good to have an additional configurable option in Helidon 4 that enables queueing of requests when a limit is configured for "max-concurrent-requests".
Something like below:

server:
  max-concurrent-requests: 40
  request-queue:
    enable: true
    max: 100

@tomas-langer
Member

This is not different from setting max-concurrent-requests to 140.
Virtual threads do not consume resources when waiting on a lock. As long as the data source supports queuing of requests, this will work as intended; only if you get 140 requests that access the same data source and have to queue would the server get overloaded (as it would with queuing on the server enabled).
Can you explain what the advantage of queuing at the server level is?
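
For illustration, a minimal sketch of that equivalence using the listener option already discussed in this issue (the numbers reuse the original proposal of 40 concurrent requests plus a queue of 100; treat the exact placement of the key in your listener configuration as an assumption):

server:
  # roughly "40 in flight + 100 waiting": excess requests block cheaply on their
  # virtual threads at the downstream resource instead of in a server-side queue
  max-concurrent-requests: 140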

@barchetta
Member

barchetta commented Sep 4, 2024

The Fault Tolerance Bulkhead feature (SE, MP) provides a mechanism for rate-limiting access to specific tasks. You control both parallelism and wait-queue length.

See the Helidon SE Rate Limiting example for examples of using a Bulkhead as well as a Java Semaphore for doing rate limiting.

I think of max-concurrent-requests as a hard cap to protect the integrity of the server. Then use Bulkheads or Semaphores for more fine-grained control of rate limiting on individual tasks.
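
As a rough illustration of the Semaphore approach (this is not the code from the linked Helidon example; the routing and status calls mirror the filter snippet later in this thread, and the path and limit values are made up):

import java.util.concurrent.Semaphore;

import io.helidon.http.Status;
import io.helidon.webserver.http.HttpRouting;

// guard one resource-heavy endpoint with a plain JDK Semaphore
Semaphore semaphore = new Semaphore(10); // hypothetical limit

HttpRouting.Builder routing = HttpRouting.builder();
routing.get("/expensive", (req, res) -> {
    if (!semaphore.tryAcquire()) {
        // over the limit: reject immediately instead of queueing
        res.status(Status.SERVICE_UNAVAILABLE_503).send();
        return;
    }
    try {
        res.send("done"); // the actual expensive work would go here
    } finally {
        semaphore.release();
    }
});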

@scottoaks17

The bulkhead feature requires programmatic changes, whereas providing the queue via max-concurrent-requests would be just a config change that lets old code keep behaving the same way.

@romain-grecourt
Contributor

You can set up a bulkhead for all requests with a filter:

// imports below assume Helidon 4 SE packages
import io.helidon.config.Config;
import io.helidon.faulttolerance.Bulkhead;
import io.helidon.faulttolerance.BulkheadException;
import io.helidon.http.Status;

// limit concurrent requests, with a wait queue twice the size of the limit
int rateLimit = Config.global().get("ratelimit").asInt().orElse(20);
Bulkhead bulkhead = Bulkhead.builder()
        .limit(rateLimit)
        .queueLength(rateLimit * 2)
        .build();
routing
        .addFilter((chain, req, res) -> {
            try {
                // run the rest of the chain inside the bulkhead;
                // excess requests wait in the bulkhead queue
                bulkhead.invoke(() -> {
                    chain.proceed();
                    return null;
                });
            } catch (BulkheadException ex) {
                // queue is full: reject with 503
                res.status(Status.SERVICE_UNAVAILABLE_503).send();
            }
        });

@vasanth-bhat
Author

Yes. This is not the same as having the capability at the Helidon level; at the Helidon level the behavior is not consistent with Helidon 3, and individual services have to make code changes to implement this.

@lettrung99

Hello team,

I am with @vasanth-bhat on this request, as we are also experiencing the same issue on Helidon 4.

Our team has identified that implementing the Bulkhead API with a queue is necessary to effectively manage the spike in load.

However, we've encountered a challenge: each team is required to implement and maintain the same logic independently. This approach is not only time-consuming but also potentially leads to inconsistencies across teams.

To streamline our process and ensure uniformity, I propose implementing this solution at the Helidon level. This approach would:

  1. Reduce redundant work across teams
  2. Ensure consistency in implementation
  3. Simplify maintenance and updates

I would greatly appreciate your thoughts on this proposal.

Thank you for your attention to this matter.

Best regards,
David

@lettrung99

lettrung99 commented Sep 24, 2024

Hi Team,
I propose implementing a Hierarchical Rate Limiting capability, taking into consideration the diverse characteristics of different applications and usage scenarios. The Hierarchical Rate Limiting feature would have:

  1. Application Level (Whole App) Rate Limiting - Easier to implement and manage since you set a global limit for all requests.
  2. Path Level Rate Limiting - Allows for different rate limits based on the sensitivity or resource intensity of different paths. For example, you might have a more lenient limit for read operations versus write operations.
  3. Hybrid Rate Limiting - This approach works well where you have a global rate limit plus specific limits for particularly sensitive or high-traffic paths. This combines the simplicity of app-level control with the precision of path-level where needed.
  4. Queue Buffer - Configurable per path or globally. Optimize queue capacity to handle load spikes, accommodate HPA scaling delays, and serve as a scaling metric.
  5. Retry and Backoff Strategy - Configurable per path or globally. Configure retry attempts and backoff strategies based on typical network latencies or service response times.
  6. Expose metrics for capacity planning and scaling decisions
    • Request Denial metric
    • Queue Utilization metric

Hierarchical rate limiting is intended to provide multiple levels of limiting: instead of a single rate limit applied uniformly across all requests, it applies different limits at different levels or scopes. In this context:

Application Level (Whole App) Rate Limiting:

Simplicity: Easier to implement and manage since you set a global limit for all requests.
Consistency: Ensures a fair distribution of resources across all users and all parts of the application.
Security: Can help prevent large-scale attacks like DDoS more effectively since the total request volume is controlled.
Less Granular Control: You might restrict legitimate traffic on one path because of high usage on another, which could affect user experience or functionality where high throughput is expected.

Path Level Rate Limiting:

Granularity: Allows for different rate limits based on the sensitivity or resource intensity of different paths. For example, you might have a more lenient limit for read operations versus write operations.
Customization: You can tailor the limits to match expected usage patterns or importance of different endpoints, which can optimize the user experience.
Resource Management: More efficient management of resources when certain paths are known to be resource-heavy or critical.
Complexity: More complex to manage, especially in large applications with many endpoints. Requires more configuration and potentially more maintenance.

Hybrid Approach:

Sometimes, a hybrid approach works well where you have a global rate limit plus specific limits for particularly sensitive or high-traffic paths. This combines the simplicity of app-level control with the precision of path-level where needed.

Example Scenario:
Global Limit: Set to 1000 requests per second across the entire application.
Path Limits:
/api/search might be limited to 500 requests per second due to its resource intensity.
/api/user/info might have a limit of 200 requests per second because it's less resource-intensive but still needs protection.

In this setup, even if /api/search is not hitting its limit, the global limit could still throttle requests if the total across all paths exceeds 1000 requests per second.
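
Expressed in the configuration style proposed at the end of this comment (purely a sketch: the requests-per-second keys are hypothetical and differ from the max-concurrent keys used in the example configuration further below):

app:
  rate-limit:
    global:
      max-requests-per-second: 1000
    paths:
      - path: "/api/search"
        max-requests-per-second: 500
      - path: "/api/user/info"
        max-requests-per-second: 200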
Metrics Instrumentation:

Counters:

globalRequestsDenied: This counter tracks the number of requests denied at the global level due to rate limiting.
pathRequestsDenied: This counter tracks the number of requests denied for specific paths due to rate limiting.

Gauges:

globalQueueLengthPercentage: This gauge provides the current percentage of the global queue length in use, calculated as (current queue size / maximum queue length) * 100.
pathQueueLengthPercentages: This is a map of gauges, where each gauge corresponds to a specific path and provides the percentage of that path's queue length currently in use.

Request Denial Tracking: By counting denied requests both globally and per path, you can monitor how often rate limits are hit, which can help in tuning the rate limiting thresholds or identifying paths that might need more resources.

Queue Utilization : The gauges for queue length percentages offer real-time information on how close the system is to reaching its rate limit capacity, both globally and for specific paths. This can be crucial for understanding system load and for capacity planning or scaling decisions.
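
A minimal plain-Java sketch of the proposed counters and gauges (not tied to Helidon's metrics API; the names follow the proposal above and the queue maximum is a hypothetical placeholder):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

// plain-Java sketch; a real implementation would register these with
// Helidon's metrics support instead of exposing raw fields
class RateLimitMetrics {
    // counters for denied requests, globally and per path
    final LongAdder globalRequestsDenied = new LongAdder();
    final Map<String, LongAdder> pathRequestsDenied = new ConcurrentHashMap<>();

    // fed by the rate limiter as requests enter and leave the global queue
    final AtomicInteger globalQueueSize = new AtomicInteger();
    final int globalQueueMax = 500; // hypothetical, matches the example config below

    // gauge value: percentage of the global queue currently in use
    double globalQueueLengthPercentage() {
        return 100.0 * globalQueueSize.get() / globalQueueMax;
    }
}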

Properly configuring the queue size offers several key benefits:
Handling Sudden Load Bursts:

  1. A well-sized queue can effectively manage unexpected spikes in incoming requests.
  2. It acts as a buffer, temporarily storing excess requests during peak periods.

Mitigating Horizontal Pod Autoscaler (HPA) Lag:

  1. The queue provides a cushion during the time it takes for the HPA to detect increased load and initiate the scaling process.
  2. This buffer ensures continued service availability while new pods are being provisioned.

Serving as a Key Scale-Out Indicator:

  1. Queue depth serves as an excellent metric for triggering horizontal scaling.
  2. As the queue fills, it signifies increasing demand, prompting the system to allocate additional resources.

By carefully tuning the queue size, you can enhance your system's resilience, responsiveness, and overall performance in the face of varying workloads.

CONCLUSION:

For small to medium-sized apps, or where simplicity is key: application-wide rate limiting might be preferable due to ease of management.
For larger, more complex systems, or where different paths have significantly different usage patterns or security needs: path-level or a hybrid model would be more appropriate.

# example configuration for hybrid rate limiting
app:
  rate-limit:
    global:
      max-concurrent: 100
      requests-queue: 500
      retry:
        max-attempts: 3
        initial-delay-ms: 100
        backoff-factor: 2.0
    paths:
      - path: "/api/expensive"
        max-concurrent: 20
        requests-queue: 50
        retry:
          max-attempts: 5
          initial-delay-ms: 200
          backoff-factor: 1.5

Thanks,
David

@romain-grecourt
Contributor

However, we've encountered a challenge: each team is required to implement and maintain the same logic independently.

You can create a shared module that implements ServerFeature to register a filter automatically.
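
For illustration, a rough sketch of such a shared module; it sidesteps the exact ServerFeature SPI (not shown in this thread) and simply packages the bulkhead filter from the earlier comment behind a static helper that each service calls when building its routing (the class name is made up, and the "ratelimit" config key reuses the earlier snippet):

import io.helidon.config.Config;
import io.helidon.faulttolerance.Bulkhead;
import io.helidon.faulttolerance.BulkheadException;
import io.helidon.http.Status;
import io.helidon.webserver.http.HttpRouting;

// shared module: every service calls RateLimitSupport.configure(routing)
// instead of re-implementing the bulkhead filter
public final class RateLimitSupport {

    private RateLimitSupport() {
    }

    public static void configure(HttpRouting.Builder routing) {
        int rateLimit = Config.global().get("ratelimit").asInt().orElse(20);
        Bulkhead bulkhead = Bulkhead.builder()
                .limit(rateLimit)
                .queueLength(rateLimit * 2)
                .build();
        routing.addFilter((chain, req, res) -> {
            try {
                bulkhead.invoke(() -> {
                    chain.proceed();
                    return null;
                });
            } catch (BulkheadException ex) {
                res.status(Status.SERVICE_UNAVAILABLE_503).send();
            }
        });
    }
}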

@lettrung99

Hi @romain-grecourt - yes, that is what we have to do, but in the interest of everyone else who is using Helidon outside of our organization, don't you think it would be valuable to have it at the Helidon level?

@romain-grecourt
Contributor

Yes, that is what we have to do, but in the interest of everyone else who is using Helidon outside of our organization, don't you think it would be valuable to have it at the Helidon level?

The reasoning that this feature must be provided by Helidon out of the box, because using Bulkhead requires programmatic changes or because it cannot be shared among projects, is not correct.

It is reasonable to expect Helidon to provide a more sophisticated feature for concurrency limits; that is currently addressed by #8897.

This issue overlaps with #8897, and given the prescribed workarounds it isn't clear what it represents other than sharing a single class.

@tomas-langer
Member

tomas-langer commented Sep 27, 2024

There is now a PR for Helidon.
See #9295 for details - both on how it would be configured and how it is implemented.
Please provide feedback!

@tomas-langer tomas-langer added webserver 4.x Version 4.x labels Oct 3, 2024
@tomas-langer tomas-langer added this to the 4.2.0 milestone Oct 3, 2024
Projects
Status: Sprint Scope

6 participants