Provide configurable option to queue requests when concurrency is limited with "max-concurrent-requests" #9229

Open
vasanth-bhat opened this issue Aug 30, 2024 · 11 comments · May be fixed by #9295
Labels
4.x Version 4.x webserver

Comments

@vasanth-bhat

vasanth-bhat commented Aug 30, 2024

Environment Details

  • Helidon Version: 4.x
  • Helidon SE or Helidon MP: SE & MP
  • JDK version: Java 21
  • OS: Linux
  • Docker version (if applicable):

In Helidon 4.x, the WebServer supports Loom-based virtual threads and uses the new thread-per-request model. So by design there is no longer a server thread pool or any associated queue where requests get queued.

By default there is no limit on concurrency, and this can lead to issues when resources such as DB connections, external system integrations, and other downstream resources are limited. This can lead to performance degradation and also to errors when requests time out waiting for such resources.

To address this, Helidon provides the "max-concurrent-requests" parameter on the listener configuration. While it helps to limit concurrency, services are running into issues when trying to use this parameter to limit concurrency.

When the "max-concurrent-requests" parameter is set, any surge requests beyond the limit get rejected and fail with 503. There can be occasional surges that can cause the concurrency to go beyond the configured limit, and such cases teh requests would error out.
This behaviour is not consistent with the behavior in earlier versions of Helidon where under this situation the requests would get queued in the queue associated with Helidon's server thread pool.

It would be good to have an additional configurable option in Helidon 4 that enables queueing of requests when a limit is configured for "max-concurrent-requests".
Something like below:

server:
  max-concurrent-requests: 40
  request-queue:
    enable: true
    max: 100

@tomas-langer
Member

This is not different from setting max-concurrent-requests to 140.
Virtual threads do not consume resources when waiting on a lock. As long as the data source supports queuing of requests, this will work as intended; only if you get 140 requests that access the same data source and have to queue would the server get overloaded (as it would with queuing on the server enabled).
Can you explain what the advantage of queuing at the server level is?
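
For illustration, a minimal sketch of that equivalence using the listener option already discussed in this issue (the numbers reuse the original proposal of 40 concurrent requests plus a queue of 100; treat the exact placement of the key in your listener configuration as an assumption):

server:
  # roughly "40 in flight + 100 waiting": excess requests block cheaply on their
  # virtual threads at the downstream resource instead of in a server-side queue
  max-concurrent-requests: 140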

@barchetta
Member

barchetta commented Sep 4, 2024

The Fault Tolerance Bulkhead feature (SE, MP) provides a mechanism for rate-limiting access to specific tasks. You control both parallelism and wait-queue length.

See the Helidon SE Rate Limiting example for examples of using a Bulkhead as well as a Java Semaphore for doing rate limiting.

I think of max-concurrent-requests as a hard cap to protect the integrity of the server. Then use Bulkheads or Semaphores for more fine-grained control of rate limiting on individual tasks.
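
As a rough illustration of the Semaphore approach (this is not the code from the linked Helidon example; the routing and status calls mirror the filter snippet later in this thread, and the path and limit values are made up):

import java.util.concurrent.Semaphore;

import io.helidon.http.Status;
import io.helidon.webserver.http.HttpRouting;

// guard one resource-heavy endpoint with a plain JDK Semaphore
Semaphore semaphore = new Semaphore(10); // hypothetical limit

HttpRouting.Builder routing = HttpRouting.builder();
routing.get("/expensive", (req, res) -> {
    if (!semaphore.tryAcquire()) {
        // over the limit: reject immediately instead of queueing
        res.status(Status.SERVICE_UNAVAILABLE_503).send();
        return;
    }
    try {
        res.send("done"); // the actual expensive work would go here
    } finally {
        semaphore.release();
    }
});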

@scottoaks17

The bulkhead feature requires programmatic changes, whereas providing the queue via max-concurrent-requests would be just a config change that lets old code keep behaving the same way.

@romain-grecourt
Contributor

You can set up a bulkhead for all requests with a filter:

// imports below assume Helidon 4 SE packages
import io.helidon.config.Config;
import io.helidon.faulttolerance.Bulkhead;
import io.helidon.faulttolerance.BulkheadException;
import io.helidon.http.Status;

// limit concurrent requests, with a wait queue twice the size of the limit
int rateLimit = Config.global().get("ratelimit").asInt().orElse(20);
Bulkhead bulkhead = Bulkhead.builder()
        .limit(rateLimit)
        .queueLength(rateLimit * 2)
        .build();
routing
        .addFilter((chain, req, res) -> {
            try {
                // run the rest of the chain inside the bulkhead;
                // excess requests wait in the bulkhead queue
                bulkhead.invoke(() -> {
                    chain.proceed();
                    return null;
                });
            } catch (BulkheadException ex) {
                // queue is full: reject with 503
                res.status(Status.SERVICE_UNAVAILABLE_503).send();
            }
        });

@vasanth-bhat
Author

Yes. This is not the same as having the capability at the Helidon level; at the Helidon level the behavior is not consistent with Helidon 3, and individual services have to make code changes to implement this.

@lettrung99

Hello team,

I am with @vasanth-bhat on this request, as we are also experiencing the same issue on Helidon 4.

Our team has identified that implementing the Bulkhead API with a queue is necessary to effectively manage the spike in load.

However, we've encountered a challenge: each team is required to implement and maintain the same logic independently. This approach is not only time-consuming but also potentially leads to inconsistencies across teams.

To streamline our process and ensure uniformity, I propose implementing this solution at the Helidon level. This approach would:

  1. Reduce redundant work across teams
  2. Ensure consistency in implementation
  3. Simplify maintenance and updates

I would greatly appreciate your thoughts on this proposal.

Thank you for your attention to this matter.

Best regards,
David

@lettrung99

lettrung99 commented Sep 24, 2024

Hi Team,
I propose implementing a Hierarchical Rate Limiting capability, taking into consideration the diverse characteristics of different applications and usage scenarios. The Hierarchical Rate Limiting feature would have:

  1. Application Level (Whole App) Rate Limiting - Easier to implement and manage since you set a global limit for all requests.
  2. Path Level Rate Limiting - Allows for different rate limits based on the sensitivity or resource intensity of different paths. For example, you might have a more lenient limit for read operations versus write operations.
  3. Hybrid Rate Limiting - This approach works well where you have a global rate limit plus specific limits for particularly sensitive or high-traffic paths. This combines the simplicity of app-level control with the precision of path-level where needed.
  4. Queue Buffer - Configurable per path or globally. Optimize queue capacity to handle load spikes, accommodate HPA scaling delays, and serve as a scaling metric.
  5. Retry and Backoff Strategy - Configurable per path or globally. Configure retry attempts and backoff strategies based on typical network latencies or service response times.
  6. Expose metrics for capacity planning and scaling decisions
    • Request Denial metric
    • Queue Utilization metric

Hierarchical rate limiting is intended to provide multiple levels of limiting: instead of a single rate limit applied uniformly across all requests, it applies different limits at different levels or scopes. In this context:

Application Level (Whole App) Rate Limiting:

Simplicity: Easier to implement and manage since you set a global limit for all requests.
Consistency: Ensures a fair distribution of resources across all users and all parts of the application.
Security: Can help prevent large-scale attacks like DDoS more effectively since the total request volume is controlled.
Less Granular Control: You might restrict legitimate traffic on one path because of high usage on another, which could affect user experience or functionality where high throughput is expected.

Path Level Rate Limiting:

Granularity: Allows for different rate limits based on the sensitivity or resource intensity of different paths. For example, you might have a more lenient limit for read operations versus write operations.
Customization: You can tailor the limits to match expected usage patterns or importance of different endpoints, which can optimize the user experience.
Resource Management: More efficient management of resources when certain paths are known to be resource-heavy or critical.
Complexity: More complex to manage, especially in large applications with many endpoints. Requires more configuration and potentially more maintenance.

Hybrid Approach:

Sometimes, a hybrid approach works well where you have a global rate limit plus specific limits for particularly sensitive or high-traffic paths. This combines the simplicity of app-level control with the precision of path-level where needed.

Example Scenario:
Global Limit: Set to 1000 requests per second across the entire application.
Path Limits:
/api/search might be limited to 500 requests per second due to its resource intensity.
/api/user/info might have a limit of 200 requests per second because it's less resource-intensive but still needs protection.

In this setup, even if /api/search is not hitting its limit, the global limit could still throttle requests if the total across all paths exceeds 1000 requests per second.
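
Expressed in the configuration style proposed at the end of this comment (purely a sketch: the requests-per-second keys are hypothetical and differ from the max-concurrent keys used in the example configuration further below):

app:
  rate-limit:
    global:
      max-requests-per-second: 1000
    paths:
      - path: "/api/search"
        max-requests-per-second: 500
      - path: "/api/user/info"
        max-requests-per-second: 200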
Metrics Instrumentation:

Counters:

globalRequestsDenied: This counter tracks the number of requests denied at the global level due to rate limiting.
pathRequestsDenied: This counter tracks the number of requests denied for specific paths due to rate limiting.

Gauges:

globalQueueLengthPercentage: This gauge provides the current percentage of the global queue length in use, calculated as (current queue size / maximum queue length) * 100.
pathQueueLengthPercentages: This is a map of gauges, where each gauge corresponds to a specific path and provides the percentage of that path's queue length currently in use.

Request Denial Tracking: By counting denied requests both globally and per path, you can monitor how often rate limits are hit, which can help in tuning the rate limiting thresholds or identifying paths that might need more resources.

Queue Utilization : The gauges for queue length percentages offer real-time information on how close the system is to reaching its rate limit capacity, both globally and for specific paths. This can be crucial for understanding system load and for capacity planning or scaling decisions.
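
A minimal plain-Java sketch of the proposed counters and gauges (not tied to Helidon's metrics API; the names follow the proposal above and the queue maximum is a hypothetical placeholder):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

// plain-Java sketch; a real implementation would register these with
// Helidon's metrics support instead of exposing raw fields
class RateLimitMetrics {
    // counters for denied requests, globally and per path
    final LongAdder globalRequestsDenied = new LongAdder();
    final Map<String, LongAdder> pathRequestsDenied = new ConcurrentHashMap<>();

    // fed by the rate limiter as requests enter and leave the global queue
    final AtomicInteger globalQueueSize = new AtomicInteger();
    final int globalQueueMax = 500; // hypothetical, matches the example config below

    // gauge value: percentage of the global queue currently in use
    double globalQueueLengthPercentage() {
        return 100.0 * globalQueueSize.get() / globalQueueMax;
    }
}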

Properly configuring the queue size offers several key benefits:
Handling Sudden Load Bursts:

  1. A well-sized queue can effectively manage unexpected spikes in incoming requests.
  2. It acts as a buffer, temporarily storing excess requests during peak periods.

Mitigating Horizontal Pod Autoscaler (HPA) Lag:

  1. The queue provides a cushion during the time it takes for the HPA to detect increased load and initiate the scaling process.
  2. This buffer ensures continued service availability while new pods are being provisioned.

Serving as a Key Scale-Out Indicator:

  1. Queue depth serves as an excellent metric for triggering horizontal scaling.
  2. As the queue fills, it signifies increasing demand, prompting the system to allocate additional resources.

By carefully tuning the queue size, you can enhance your system's resilience, responsiveness, and overall performance in the face of varying workloads.

CONCLUSION:

For small to medium-sized apps, or where simplicity is key: application-wide rate limiting might be preferable due to ease of management.
For larger, more complex systems, or where different paths have significantly different usage patterns or security needs: path-level or a hybrid model would be more appropriate.

# example configuration for hybrid rate limiting
app:
  rate-limit:
    global:
      max-concurrent: 100
      requests-queue: 500
      retry:
        max-attempts: 3
        initial-delay-ms: 100
        backoff-factor: 2.0
    paths:
      - path: "/api/expensive"
        max-concurrent: 20
        requests-queue: 50
        retry:
          max-attempts: 5
          initial-delay-ms: 200
          backoff-factor: 1.5

Thanks,
David

@romain-grecourt
Contributor

However, we've encountered a challenge: each team is required to implement and maintain the same logic independently.

You can create a shared module that implements ServerFeature to register a filter automatically.
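
For illustration, a rough sketch of such a shared module; it sidesteps the exact ServerFeature SPI (not shown in this thread) and simply packages the bulkhead filter from the earlier comment behind a static helper that each service calls when building its routing (the class name is made up, and the "ratelimit" config key reuses the earlier snippet):

import io.helidon.config.Config;
import io.helidon.faulttolerance.Bulkhead;
import io.helidon.faulttolerance.BulkheadException;
import io.helidon.http.Status;
import io.helidon.webserver.http.HttpRouting;

// shared module: every service calls RateLimitSupport.configure(routing)
// instead of re-implementing the bulkhead filter
public final class RateLimitSupport {

    private RateLimitSupport() {
    }

    public static void configure(HttpRouting.Builder routing) {
        int rateLimit = Config.global().get("ratelimit").asInt().orElse(20);
        Bulkhead bulkhead = Bulkhead.builder()
                .limit(rateLimit)
                .queueLength(rateLimit * 2)
                .build();
        routing.addFilter((chain, req, res) -> {
            try {
                bulkhead.invoke(() -> {
                    chain.proceed();
                    return null;
                });
            } catch (BulkheadException ex) {
                res.status(Status.SERVICE_UNAVAILABLE_503).send();
            }
        });
    }
}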

@lettrung99

Hi @romain-grecourt - yes, that is what we have to do, but in the interest of everyone else who is using Helidon outside of our organization, don't you think it would be valuable to have it at the Helidon level?

@romain-grecourt
Contributor

Yes, that is what we have to do, but in the interest of everyone else who is using Helidon outside of our organization, don't you think it would be valuable to have it at the Helidon level?

The reasoning that this feature must be provided by Helidon out of the box, because using Bulkhead requires programmatic changes or because it cannot be shared among projects, is not correct.

It is reasonable to expect Helidon to provide a more sophisticated feature for concurrency limits; that is currently addressed by #8897.

This issue overlaps with #8897, and given the prescribed workarounds it isn't clear what it represents other than sharing a single class.

@tomas-langer
Member

tomas-langer commented Sep 27, 2024

There is now a PR for Helidon.
See #9295 for details - both on how it would be configured and how it is implemented.
Please provide feedback!

@tomas-langer tomas-langer added webserver 4.x Version 4.x labels Oct 3, 2024
@tomas-langer tomas-langer added this to the 4.2.0 milestone Oct 3, 2024
Projects
Status: Sprint Scope

6 participants