Add RPC requests limits & implement centralized subscription handler #354
Conversation
@snormore could you please review the above briefly to see if I'm on the right track? If so, I'll go ahead and create some unit tests for these changes.
Thanks for jumping on this @tsachiherman 🙏
pkg/api/message/v1/service.go
Outdated
// maxTopicsPerBatch defines the maximum number of topics that can be queried in a batch query. This
// limit is imposed in addition to the per-query limit maxTopicsPerRequest.
maxTopicsPerBatch = 4096
Could you tell me more about these clients? Are these the notification servers?
The notification servers tend to use the SubscribeAll endpoint, which just listens to the whole firehose rather than specific topics. These clients are more likely chatbots that listen to a lot of topics for their users. We could push them to use SubscribeAll, but today they're listening to individual topics via this endpoint. I had a WIP branch from a few weeks ago that updates this Subscribe endpoint to use the wildcard (listen to all topics) when the batch size exceeds a threshold, to optimize memory usage (reduce open goroutines, since there's a goroutine per topic subscription now). That could be worth implementing here instead of a limit on the batch size.
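A rough sketch of that wildcard-above-a-threshold idea in Go. The Relay interface, Envelope type, and threshold value here are illustrative assumptions, not code from the PR or the linked branch:

```go
package api

// Envelope and Relay are illustrative stand-ins for the node's real types.
type Envelope struct{ ContentTopic string }

type Relay interface {
	Subscribe(topic string, out chan<- *Envelope) error
	SubscribeAll(handler func(*Envelope)) error
}

// wildcardThreshold is an assumed cut-over point, not a value from the PR.
const wildcardThreshold = 1000

// subscribeTopics opens per-topic subscriptions for small requests, but falls
// back to a single wildcard (firehose) subscription with in-process filtering
// when the topic list is large, so goroutine count stops growing with topics.
func subscribeTopics(relay Relay, topics []string, out chan<- *Envelope) error {
	if len(topics) <= wildcardThreshold {
		for _, topic := range topics {
			if err := relay.Subscribe(topic, out); err != nil {
				return err
			}
		}
		return nil
	}
	wanted := make(map[string]struct{}, len(topics))
	for _, t := range topics {
		wanted[t] = struct{}{}
	}
	return relay.SubscribeAll(func(env *Envelope) {
		if _, ok := wanted[env.ContentTopic]; ok {
			out <- env
		}
	})
}
```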
OK, so I think that now I understand where the memory issue is coming from.
I was thinking about #topics ~= number of concurrent conversations in the inbox, but I wasn't aware of chatbots.
That's a completely different use case, with (potentially) different attributes: we might want to favor slower operation at lower resource consumption for it.
I'll spend some more time looking into that, but I do think your direction is the way to go.
Agree with @snormore that a hard limit isn't an option. If we were to change the behaviour for existing apps, we'd want to give them a very generous warning and a reasonable alternative.
Would you be OK if we start off with a very high number and add configurable options later on? i.e. start with a 1M limit (far beyond what would crash the server).
The above metric mentioned by @snormore applies to subscribe, not to query, btw. We don't have a metric, afaik, that tracks the query size.
I updated the numbers here per a unit test I created, where I tested the maximum possible topic count.
This guarantees it won't break anything for the time being, and we will be able to adjust it in the future.
Something like 1M per request would be fine. It's way above what any valid user needs.
If we were to enforce the limit per IP, we'd just have to be very sure the bookkeeping is correct. For example, that a client can't subscribe to 100,000 topics, disconnect, resubscribe to the same list, and repeat until they push themselves over the limit.
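A minimal sketch of the per-IP bookkeeping being discussed, assuming a hypothetical ipTopicLimiter helper (names and structure are illustrative, not from the PR). The key point is that Release runs when the stream ends, so the disconnect-and-resubscribe pattern described above doesn't accumulate phantom counts:

```go
package api

import (
	"fmt"
	"sync"
)

// ipTopicLimiter tracks how many topics each client IP is currently subscribed to.
type ipTopicLimiter struct {
	mu     sync.Mutex
	limit  int
	counts map[string]int
}

func newIPTopicLimiter(limit int) *ipTopicLimiter {
	return &ipTopicLimiter{limit: limit, counts: make(map[string]int)}
}

// Acquire reserves n topic subscriptions for ip, rejecting the request if it
// would push the IP over the configured limit.
func (l *ipTopicLimiter) Acquire(ip string, n int) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.counts[ip]+n > l.limit {
		return fmt.Errorf("ip %s would exceed %d concurrent topic subscriptions", ip, l.limit)
	}
	l.counts[ip] += n
	return nil
}

// Release must be called when the subscription stream ends (e.g. via defer in
// the Subscribe handler), so disconnected clients stop counting against their IP.
func (l *ipTopicLimiter) Release(ip string, n int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.counts[ip] -= n
	if l.counts[ip] <= 0 {
		delete(l.counts, ip)
	}
}
```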
Agree - IP filtering isn't trivial. But it's also a bit beyond the scope of this PR. The goal here was to apply some basic and immediate remedies to the existing (crashing) node on the "happy" path.
I can look into IP filtering next.
pkg/api/message/v1/service.go
Outdated
// maxConcurrentSubscribedTopics limits the total number of subscribed topics concurrently registered by our node.
// Requests that would surpass that limit are rejected, as the server would not be able to scale to that limit.
maxConcurrentSubscribedTopics = 1024 * 1024
The upper bound for this depends on available memory in the node, which changes depending on the deployment environment (devnet vs prod). Currently in prod we've seen over 6M concurrent topic subscriptions before memory warnings start firing.
Alternatively, or in addition to this, I think it would be great to limit concurrent topic subscriptions by IP. Maybe even start with that vs a global limit across all IPs, but I'm not against having a global limit either; it just might need to be configurable via a CLI option instead of a const.
Curious about @neekolas's thoughts on this too.
Yes, I agree - we need multiple tiers of limits to ensure the service won't be disrupted by 1% of the clients placing unrealistic load on the server. I'm very much open to any other proposed default, which we would be able to dial in later on. Does 100M sound better as a starting point?
i.e. use a const for default, and override it via .env/config file/cli.
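A minimal sketch of that pattern, assuming an environment-variable override; the variable name and default value here are illustrative, not the PR's:

```go
package api

import (
	"os"
	"strconv"
)

// defaultMaxConcurrentSubscribedTopics is the compiled-in default.
const defaultMaxConcurrentSubscribedTopics = 1024 * 1024

// maxConcurrentSubscribedTopicsLimit returns the effective limit, allowing an
// override via an environment variable (a CLI flag or config file entry would
// work the same way).
func maxConcurrentSubscribedTopicsLimit() int {
	if v := os.Getenv("XMTP_MAX_CONCURRENT_SUBSCRIBED_TOPICS"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return defaultMaxConcurrentSubscribedTopics
}
```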
Sounds reasonable to me. If we implement something like main...snor/sub-nats-wildcard-for-many-topics, then 6M will no longer be the current prod bound either, since memory usage won't be linear in the number of topic subscriptions, or at least not to the degree we have today, with goroutines growing with topic subscriptions.
I was thinking of a much more fundamental change to address this level of scale. Having a goroutine per subscription is ridiculous and won't scale. Instead, I think that we should do the following:
- create a single subscription per node, which would receive all the messages.
- for each subscription (regardless of its size), create a matching topics map.
- when a message is received, compare it against all the maps after unmarshaling it once, and dispatch it to all the recipients.
This approach is less efficient in quiescent scenarios, but it would bound the dynamic load placed on the system.
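A minimal sketch of that dispatcher shape, with illustrative names and types rather than the PR's actual implementation:

```go
package api

import "sync"

// Envelope is an illustrative stand-in for the real message type.
type Envelope struct{ ContentTopic string }

// subscription represents one Subscribe stream: the topics it asked for and
// the buffered channel its messages are delivered on.
type subscription struct {
	topics map[string]struct{}
	out    chan *Envelope
}

// dispatcher is the node's single message subscriber: every received message
// is compared against each subscription's topic map and fanned out.
type dispatcher struct {
	mu   sync.RWMutex
	subs map[*subscription]struct{}
}

func newDispatcher() *dispatcher {
	return &dispatcher{subs: make(map[*subscription]struct{})}
}

// Register adds a subscription covering the given topics.
func (d *dispatcher) Register(topics []string, buffer int) *subscription {
	sub := &subscription{
		topics: make(map[string]struct{}, len(topics)),
		out:    make(chan *Envelope, buffer),
	}
	for _, t := range topics {
		sub.topics[t] = struct{}{}
	}
	d.mu.Lock()
	d.subs[sub] = struct{}{}
	d.mu.Unlock()
	return sub
}

// Unregister removes a subscription and closes its delivery channel.
func (d *dispatcher) Unregister(sub *subscription) {
	d.mu.Lock()
	defer d.mu.Unlock()
	delete(d.subs, sub)
	close(sub.out)
}

// Dispatch is called once per message received on the node's single upstream
// subscription, after it has been unmarshaled once.
func (d *dispatcher) Dispatch(env *Envelope) {
	d.mu.RLock()
	defer d.mu.RUnlock()
	for sub := range d.subs {
		if _, ok := sub.topics[env.ContentTopic]; ok {
			select {
			case sub.out <- env: // deliver without blocking the dispatcher
			default: // assumed policy: drop when the subscriber's buffer is full
			}
		}
	}
}
```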
It's going to be very difficult to determine what the right global limit is. Rejecting subscription requests might avoid one incident (nodes restarting), but it will create another incident (subscriptions mysteriously failing).
Per IP, we can at least get a sense of what is reasonable and put a cap in so that the 99th-percentile user doesn't break the network for everyone.
the updated code doesn't reject any subscription requests anymore.
With this model, do we actually need nats? We would need support for MLS subscriptions to be integrated here as well, before we could outright shut nats down.
I think that we'll still need it as long as we have multiple node instances, since it's the message synchronization layer, right? Even with MLS, we might want to keep it (or its functional equivalent) to support horizontal scaling. (Am I missing something?)
The way we use …
pkg/api/message/v1/service.go
Outdated
// NOTE: in our implementation, we implicitly limit batch size to 50 requests (maxSubscribeRequestLimitPerBatch = 50)
if len(req.Requests) > maxSubscribeRequestLimitPerBatch {
	return nil, status.Errorf(codes.InvalidArgument, "cannot exceed 50 requests in single batch")
BatchQuery and Subscribe are unrelated endpoints. I'd suggest we give each their own constants/limits.
I believe that @snormore asked for this variable name, although I'm very much open to renames. Do you want to suggest a different name?
Maybe maxQueriesPerBatch
ok, done.
Oh, yes. I wouldn't have found that without being told; it's quite a spaghetti path.
Given the scope of the changes here, I'd suggest we push this up as a docker image and try running one of our client SDK's test suites against it (maybe …).
Absolutely. That one is going to be delicate, since we need to make sure that messages still make it from Waku Relay -> apiV1 AND the MLS API, which …
That makes sense, and I'd like to pursue that, but I'm not sure how to. Is there any pipeline/GitHub action that does that?
pkg/api/message/v1/service.go
Outdated
// maxTopicsPerRequest defines the maximum number of topics that can be queried in a single request.
// The number is likely to be more than we want it to be, but it is a safe place to put it -
// per Test_LargeQueryTesting, the request decoding already fails before it reaches the handler.
maxTopicsPerRequest = 157733
Want to rename this to maxTopicsPerQueryBatch to make it clear it's for queries?
But it's not per batch. It's used both in the Query endpoint as well as in BatchQuery.
The subsequent one, maxTopicsPerBatch, is the one that is used per batch.
I see, the double use of the term request (as in grpc request and query request) was confusing me. Maybe we can prefix these request references with query to make it clearer, like QueryRequest?
EDIT: I see you already did exactly that 🙏
I already modified it to maxTopicsPerQueryRequest and maxTopicsPerBatchQueryRequest. Do you think it's clear enough?
Yep I think that works 👍
pkg/api/message/v1/service.go
Outdated
}

// are we still within limits?
if totalRequestedTopicsCount > maxTopicsPerBatch {
Should this be > maxTopicsPerRequest since we're comparing the total used topics in the request?
Maybe the variable name maxTopicsPerBatch wasn't clear enough: it should be the limit for the total number of topics per batch query. I'll update the variable name accordingly.
pkg/api/message/v1/service.go
Outdated
// Naive implementation: perform all sub-query requests sequentially
responses := make([]*proto.QueryResponse, 0)
for _, query := range req.Requests {
	if len(query.ContentTopics) > maxTopicsPerRequest {
Should this be > maxTopicsPerBatch since we're comparing topics in the specific batch?
I think not; it should align with the same limits we have in Service.Query (where we have a single query).
Ok I think maybe I'm interpreting the terms differently; you're saying that the grpc request is the batch and that there are multiple "requests" within that. That seems reasonable.
Exactly - we have:
- a grpc request
- containing one or more queries (i.e. either Query or BatchQuery)
- containing one or more topics.
The lint error in the CI pipeline for this PR is completely unrelated to your changes. I'm not sure how it made its way into a green …
You can repurpose this GitHub workflow if you want to build+push the Docker image with another tag from this branch.
…354)
* add requests limit.
* update per feedback 1.
* update per peer review.
* update testing.
* update
* update
* rollback change.
* additional rollback.
* few small tweaks.
* update test
* update tests
* update
* add wait.
* add minBacklogLength
* update
* update
* update per peer feedback
* cherry pick Steven's fix
* fix lint
* add docker hub deploy for branch.
* Deploy this branch to dev for testing
* fix bug in subscribed topic count
* remove temp deploy workflows.
---------
Co-authored-by: Steven Normore <[email protected]>
What?
This PR addresses two separate issues:
RPC Request Limits
This PR adds the following request limits to the RPC handlers:
On its own, this would "only" prevent DDoS attacks on the server; however, at scale, it would ensure the server can provide a sufficient amount of resources per client.
Separately from this change, we need to update the monitoring system to detect situations where we exceed the thresholds and generate an alert.
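For illustration, a sketch of the kind of limit checks this section describes, using stand-in request types and illustrative constant values (the real handler works against the generated proto types and the constants discussed in the review above):

```go
package api

import (
	"context"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// Stand-ins for the generated proto types, so the sketch is self-contained.
type QueryRequest struct{ ContentTopics []string }
type BatchQueryRequest struct{ Requests []*QueryRequest }
type BatchQueryResponse struct{}

type Service struct{}

// Illustrative defaults, not the PR's final numbers.
const (
	maxQueriesPerBatch            = 50
	maxTopicsPerQueryRequest      = 1024
	maxTopicsPerBatchQueryRequest = 4096
)

// BatchQuery rejects oversized requests up front: too many queries in a batch,
// too many topics in any single query, or too many topics across the batch.
func (s *Service) BatchQuery(ctx context.Context, req *BatchQueryRequest) (*BatchQueryResponse, error) {
	if len(req.Requests) > maxQueriesPerBatch {
		return nil, status.Errorf(codes.InvalidArgument, "cannot exceed %d requests in single batch", maxQueriesPerBatch)
	}
	totalTopics := 0
	for _, query := range req.Requests {
		if len(query.ContentTopics) > maxTopicsPerQueryRequest {
			return nil, status.Errorf(codes.InvalidArgument, "query exceeds %d topics", maxTopicsPerQueryRequest)
		}
		totalTopics += len(query.ContentTopics)
	}
	if totalTopics > maxTopicsPerBatchQueryRequest {
		return nil, status.Errorf(codes.InvalidArgument, "batch exceeds %d topics in total", maxTopicsPerBatchQueryRequest)
	}
	// ... perform the individual queries as before ...
	return &BatchQueryResponse{}, nil
}
```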
Subscription handler
The existing subscription model was quite naive; it would create a separate subscription for each topic. This worked well for a small number of topics, but would completely fail with a high number of topics.
To address that, I've created a subscription dispatcher, which is the only message subscriber. That handler compares each received message against the requested topics of every subscription and dispatches matching messages over a buffered channel.
Performance testing
Before
After
In summary, you'll notice that there is a notable reduction in memory consumption (30%).