
Add RPC requests limits & implement centralized subscription handler #354

Merged
tsachiherman merged 24 commits into main from tsachi/topics-count-limit on Feb 22, 2024

Conversation

tsachiherman
Contributor

@tsachiherman tsachiherman commented Feb 14, 2024

What?

This PR addresses two separate issues:

  1. It adds size limits to the RPC handlers.
  2. It implements a single subscription handler for higher efficiency.

RPC Requests Limits

This PR adds the following request limits to the RPC handlers:

	// maxTopicsPerQueryRequest defines the maximum number of topics that can be queried in a single request.
	// the number is likely higher than we want it to be, but it is a safe place to start -
	// per Test_LargeQueryTesting, request decoding already fails before it reaches the handler.
	maxTopicsPerQueryRequest = 157733

	// maxTopicsPerBatchQueryRequest defines the maximum number of topics that can be queried in a batch query. This
	// limit is imposed in addition to the per-query limit maxTopicsPerQueryRequest.
	// as a starting value, we're using the same value as above, since the entire request would be tossed
	// away before this limit is reached.
	maxTopicsPerBatchQueryRequest = maxTopicsPerQueryRequest

On its own, this would "only" prevent DDoS attacks on the server; at scale, however, it ensures the server can provide a sufficient amount of resources per client.

Separately from this change, we need to update the monitoring system to detect situations where these thresholds are exceeded and generate an alert.
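
For illustration, this is roughly how such a limit is enforced at the top of a handler; a minimal sketch, assuming the usual gRPC status/codes imports, with the exact error wording and the rest of the handler body elided:

	func (s *Service) Query(ctx context.Context, req *proto.QueryRequest) (*proto.QueryResponse, error) {
		// Reject oversized requests up front, before doing any store work.
		if len(req.ContentTopics) > maxTopicsPerQueryRequest {
			return nil, status.Errorf(codes.InvalidArgument,
				"the number of requested topics (%d) exceeds the limit of %d",
				len(req.ContentTopics), maxTopicsPerQueryRequest)
		}
		// ... existing query handling ...
	}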

Subscription handler

The existing subscription model was quite naive; it created a separate subscription for each topic. This worked well for a small number of topics, but would completely fail with a high number of topics.

To address that, I've created a subscription dispatcher, which is the node's only message subscriber. The dispatcher compares each received message against the requested topics and dispatches matching messages over buffered channels.
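
A minimal sketch of the dispatcher shape, assuming the envelope type exposes its content topic and the usual sync import; the type and field names here are illustrative rather than the exact code in this PR:

	// subscriptionDispatcher is the node's single message subscriber; it fans
	// incoming envelopes out to per-subscriber buffered channels.
	type subscriptionDispatcher struct {
		mu            sync.Mutex
		subscriptions map[*subscription]bool
	}

	// subscription records the topics requested by one Subscribe stream.
	type subscription struct {
		all        bool            // subscribed to all topics
		topics     map[string]bool // otherwise, the explicit topic set
		messagesCh chan *proto.Envelope
	}

	func (d *subscriptionDispatcher) dispatch(env *proto.Envelope) {
		d.mu.Lock()
		defer d.mu.Unlock()
		for sub := range d.subscriptions {
			if sub.all || sub.topics[env.ContentTopic] {
				select {
				case sub.messagesCh <- env: // deliver over the buffered channel
				default: // don't let one slow subscriber block the dispatcher
				}
			}
		}
	}

Each Subscribe call then registers a single subscription entry and reads from its channel, instead of opening a goroutine-per-topic subscription.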

Performance testing

Before

Benchmark_SubscribePublishQuery-10    	       1	31397530417 ns/op	87927504 B/op	 1346703 allocs/op

After

Benchmark_SubscribePublishQuery-10    	       1	1333448042 ns/op	60801280 B/op	 1014429 allocs/op

In summary, there is a notable reduction in memory consumption (about 30%).

@tsachiherman tsachiherman self-assigned this Feb 14, 2024
@tsachiherman
Contributor Author

@snormore could you please review the above briefly to see if I'm on the right track? If so, I'll go ahead and create some unit tests for these changes.

Contributor

@snormore snormore left a comment


Thanks for jumping on this @tsachiherman 🙏

pkg/api/message/v1/service.go (outdated, resolved)

// maxTopicsPerBatch defines the maximum number of topics that can be queried in a batch query. This
// limit is imposed in addition to the per-query limit maxTopicsPerRequest.
maxTopicsPerBatch = 4096
Contributor

We have a metric that shows up to 90k topics in the same request, so dropping it down to 4k would break that usage, and I doubt those clients currently have anything in place to auto-adjust down to something lower, because it'd be a new error.

Curious what @neekolas thinks about having a limit here too.

Contributor Author

Could you tell me more about these clients? Are these the notification servers?

Contributor

@snormore snormore Feb 14, 2024

The notification servers tend to use the SubscribeAll endpoint, which just listens to the whole firehose rather than specific topics. These clients are more likely chatbots that listen to a lot of topics for their users. We could push them to use SubscribeAll, but today they're listening to individual topics via this endpoint. I had a WIP branch from a few weeks ago that updates this Subscribe endpoint to use the wildcard (listen to all topics) when the batch size exceeds a threshold, to optimize memory usage (reducing open goroutines, since there's currently a goroutine per topic subscription). That could be worth implementing here instead of a limit on the batch size.

Contributor Author

OK, so I think I now understand where the memory issue is coming from.

I was thinking of #topics ~= number of concurrent conversations in the inbox, but I wasn't aware of chatbots.

That is a completely different use case, with (potentially) different attributes: we might want to favor slower operation at lower resource consumption for it.

I'll spend some more time looking into that, but I do think your direction is the way to go.

Collaborator

Agree with @snormore that a hard limit isn't an option. If we were to change the behaviour for existing apps, we'd want to give them a very generous warning and a reasonable alternative.

Contributor Author

Would you be OK if we start off with a very high number and add configurable options later on? I.e., start with a 1M limit (far beyond what would crash the server).

Contributor Author

The metric mentioned by @snormore above applies to subscribe, not to query, btw. We don't have a metric, AFAIK, that tracks the query size.

Contributor Author

I updated the numbers here per a unit test I created, where I tested the maximum possible topic count.
This guarantees it won't break anything for the time being, and we'll be able to adjust it in the future.

Collaborator

@neekolas neekolas Feb 17, 2024

Something like 1M per request would be fine. It's way above what any valid user needs.

If we were to enforce the limit per IP, we'd just have to be very sure the bookkeeping is correct. For example, that a client can't subscribe to 100,000 topics, disconnect, resubscribe to the same list, and repeat until they push themselves over the limit.
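
Purely to illustrate that concern (hypothetical, not part of this PR): a per-IP count has to be released when the subscription ends, otherwise the subscribe/disconnect/resubscribe cycle described above inflates it indefinitely. A rough sketch:

	// topicLimiter tracks concurrent subscribed topics per client IP (hypothetical helper).
	type topicLimiter struct {
		mu       sync.Mutex
		perIP    map[string]int
		maxPerIP int
	}

	func newTopicLimiter(maxPerIP int) *topicLimiter {
		return &topicLimiter{perIP: map[string]int{}, maxPerIP: maxPerIP}
	}

	// reserve counts n topics against ip, refusing the request if it would exceed the cap.
	func (l *topicLimiter) reserve(ip string, n int) bool {
		l.mu.Lock()
		defer l.mu.Unlock()
		if l.perIP[ip]+n > l.maxPerIP {
			return false
		}
		l.perIP[ip] += n
		return true
	}

	// release must be called when the subscription ends; without it, repeated
	// subscribe/disconnect cycles would push a client over the limit.
	func (l *topicLimiter) release(ip string, n int) {
		l.mu.Lock()
		defer l.mu.Unlock()
		l.perIP[ip] -= n
		if l.perIP[ip] <= 0 {
			delete(l.perIP, ip)
		}
	}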

Contributor Author

Agree - IP filtering isn't trivial. But it's also a bit beyond the scope of this PR. The goal here was to apply some basic and immediate remedies to the existing (crashing) node on the "happy" path.
I can look into IP filtering next.


// maxConcurrentSubscribedTopics is the maximum total number of subscribed topics that can be registered by our node.
// Requests that would surpass that limit are rejected, as the server would not be able to scale to that level.
maxConcurrentSubscribedTopics = 1024 * 1024
Contributor

@snormore snormore Feb 14, 2024

The upper bound for this depends on available memory in the node, which changes depending on the deployment environment (devnet vs prod). Currently in prod we've seen over 6M concurrent topic subscriptions before memory warnings start firing.

Alternatively, or in addition to this, I think it would be great to limit concurrent topic subscriptions by IP. Maybe even start with that vs. a global limit across all IPs; I'm not against having a global limit either, it just might need to be configurable via a CLI option instead of a const.

Curious about @neekolas's thoughts on this too.

Contributor Author

Yes, I agree - we need multiple tiers of limits to ensure the service won't be disrupted by 1% of the clients placing an unrealistic load on the server. I'm very much open to any other proposed default, which we'd be able to dial in later on. Does 100M sound better as a starting point?

Contributor Author

I.e., use a const for the default, and override it via .env/config file/CLI.

Contributor

@snormore snormore Feb 14, 2024

Sounds reasonable to me. If we implement something like main...snor/sub-nats-wildcard-for-many-topics, then 6M will no longer be the current prod bound either, since memory usage won't grow linearly with the number of topic subscriptions - or at least not to the degree it does today, with goroutines growing alongside topic subscriptions.

Contributor Author

I was thinking of a much more fundamental change to address this level of scale. Having a goroutine per subscription is ridiculous and won't scale. Instead, I think we should do the following:

  1. create a single subscription per node, which would receive all the messages.
  2. for each subscription (regardless of its size), create a matching topics map.
  3. when a message is received, Unmarshal it once, compare it against all the maps, and dispatch it to all the recipients.

This approach is less efficient in quiescent scenarios, but it caps the dynamic load the system can generate. A rough sketch of step 3 follows.
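
The sketch below reuses the dispatch helper sketched in the PR description above; the nats callback shape, the pb alias for the protobuf runtime, and the envelope type are assumptions:

	// handleIncoming is the node's single upstream message callback.
	func (d *subscriptionDispatcher) handleIncoming(msg *nats.Msg) {
		// Unmarshal the envelope exactly once, regardless of how many subscribers receive it.
		var env proto.Envelope
		if err := pb.Unmarshal(msg.Data, &env); err != nil {
			return // malformed message; nothing to dispatch
		}
		// Compare against each subscription's topic map and fan out over the buffered channels.
		d.dispatch(&env)
	}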

Collaborator

It's going to be very difficult to determine what the right global limit is. Rejecting subscription requests might avoid one incident (nodes restarting), but it will create another incident (subscriptions mysteriously failing).

Per IP, we can at least get a sense of what is reasonable and put a cap in so that the 99th-percentile user doesn't break the network for everyone.

Contributor Author

The updated code doesn't reject any subscription requests anymore.

@tsachiherman tsachiherman changed the title add rpc requests limits Add RPC requests limits & implement centralized subscription handler Feb 16, 2024
@tsachiherman tsachiherman marked this pull request as ready for review February 16, 2024 21:17
@neekolas
Collaborator

With this model, do we actually need nats at all? It seems like the subscriptionDispatcher has a lot of overlap with what we were previously using nats for.

We would need support for MLS subscriptions to be integrated here as well, before we could outright shut nats down.

@tsachiherman
Contributor Author

With this model, do we actually need nats at all? It seems like the subscriptionDispatcher has a lot of overlap with what we were previously using nats for.

We would need support for MLS subscriptions to be integrated here as well, before we could outright shut nats down.

I think we'll still need it as long as we have multiple node instances, since it's the message synchronization layer, right?

Even with MLS, we might want to keep it (or its functional equivalent) to support horizontal scaling. (Am I missing something?)

@neekolas
Collaborator

I think we'll still need it as long as we have multiple node instances, since it's the message synchronization layer, right?

The way we use nats here is just as a local subscription dispatcher. Message synchronization between nodes happens via Waku Relay (which we are going to need to keep)

	// NOTE: in our implementation, we implicitly limit batch size to 50 requests (maxSubscribeRequestLimitPerBatch = 50)
	if len(req.Requests) > maxSubscribeRequestLimitPerBatch {
		return nil, status.Errorf(codes.InvalidArgument, "cannot exceed 50 requests in single batch")
	}
Collaborator

BatchQuery and subscribe are unrelated endpoints. I'd suggest we give each their own constants/limits.

Contributor Author

I believe @snormore asked for this variable name, although I'm very much open to renames. Do you want to suggest a different name?

Collaborator

Maybe maxQueriesPerBatch

Contributor Author

ok, done.

@tsachiherman
Contributor Author

I think we'll still need it as long as we have multiple node instances, since it's the message synchronization layer, right?

The way we use nats here is just as a local subscription dispatcher. Message synchronization between nodes happens via Waku Relay (which we are going to need to keep)

Oh, yes. I wouldn't have found that without being told; it's quite a spaghetti path.
Mind if we do that in a separate PR? This one seems big enough already, and I want to avoid monster PRs.

@neekolas
Collaborator

Given the scope of the changes here, I'd suggest we push this up as a docker image and try running one of our client SDKs' test suites against it (maybe xmtp-js). That should validate that things are at least roughly working as expected.

@neekolas
Collaborator

Mind if we do that in a separate PR? This one seems big enough already, and I want to avoid monster PRs.

Absolutely. That one is going to be delicate, since we need to make sure that messages still make it from Waku Relay -> apiV1 AND the MLS API, which nats is currently facilitating.

@tsachiherman
Contributor Author

Given the scope of the changes here, I'd suggest we push this up as a docker image and try running one of our client SDKs' test suites against it (maybe xmtp-js). That should validate that things are at least roughly working as expected.

That makes sense, and I'd like to pursue it, but I'm not sure how. Is there any pipeline/GitHub action that does that?

pkg/api/message/v1/service.go (outdated, resolved)
// maxTopicsPerRequest defines the maximum number of topics that can be queried in a single request.
// the number is likely higher than we want it to be, but it is a safe place to start -
// per Test_LargeQueryTesting, request decoding already fails before it reaches the handler.
maxTopicsPerRequest = 157733
Contributor

Want to rename this to maxTopicsPerQueryBatch to make it clear it's for queries?

Contributor Author

But it's not per batch; it's used both in the Query endpoint and in BatchQuery.
The subsequent one, maxTopicsPerBatch, is the one that is used per batch.

Contributor

@snormore snormore Feb 20, 2024

I see; the double use of the term request (as in gRPC request and query request) was confusing me. Maybe we can prefix these request references with query to make them clearer, like QueryRequest?

EDIT: I see you already did exactly that 🙏

Contributor Author

I already modified it to maxTopicsPerQueryRequest and maxTopicsPerBatchQueryRequest. Do you think that's clear enough?

Contributor

Yep I think that works 👍

	}

	// are we still within limits?
	if totalRequestedTopicsCount > maxTopicsPerBatch {
Contributor

Should this be > maxTopicsPerRequest since we're comparing the total used topics in the request?

Contributor Author

Maybe the variable name maxTopicsPerBatch wasn't clear enough: it's meant to be the limit on the total number of topics per batch query. I'll update the variable name accordingly.

	// Naive implementation, perform all sub query requests sequentially
	responses := make([]*proto.QueryResponse, 0)
	for _, query := range req.Requests {
		if len(query.ContentTopics) > maxTopicsPerRequest {
Contributor

Should this be > maxTopicsPerBatch since we're comparing topics in the specific batch?

Contributor Author

I think not; it should align with the same limit we have in Service.Query (where we have a single query).

Contributor

Ok I think maybe I'm interpreting the terms differently; you're saying that the grpc request is the batch and that there are multiple "requests" within that. That seems reasonable.

Contributor Author

Exactly - we have:

  1. a gRPC request,
  2. containing one or more queries (i.e. either Query or BatchQuery),
  3. each containing one or more topics.

pkg/api/message/v1/subscription.go (outdated, resolved)
pkg/api/message/v1/subscription.go (outdated, resolved)
pkg/api/message/v1/subscription.go (outdated, resolved)
pkg/api/server_test.go (resolved)
@snormore
Contributor

snormore commented Feb 19, 2024

The lint error in the CI pipeline for this PR is completely unrelated to your changes. I'm not sure how it made its way into a green main, TBH, but you can cherry-pick cf16a95 to get rid of it if you want (it's from one of my open PRs).

@snormore
Contributor

snormore commented Feb 19, 2024

You can repurpose this GitHub workflow if you want to build and push the docker image with another tag from this branch.

@snormore snormore force-pushed the tsachi/topics-count-limit branch from 264e75b to e0d9f57 Compare February 21, 2024 22:01
@tsachiherman tsachiherman merged commit 1481d1d into main Feb 22, 2024
3 checks passed
@tsachiherman tsachiherman deleted the tsachi/topics-count-limit branch February 22, 2024 20:30
richardhuaaa pushed a commit that referenced this pull request Feb 27, 2024
…354)

* add requests limit.

* update per feedback 1.

* update per peer review.

* update testing.

* update

* update

* rollback change.

* additional rollback.

* few small tweaks.

* update test

* update tests

* update

* add wait.

* add minBacklogLength

* update

* update

* update per peer feedback

* cherry pick Steven's fix

* fix lint

* add docker hub deploy for branch.

* Deploy this branch to dev for testing

* fix bug in subscribed topic count

* remove temp deploy workflows.

---------

Co-authored-by: Steven Normore <[email protected]>