
Adjust receive window to make them linear to the count of streams #33913

Merged
lijunwangs merged 1 commit into solana-labs:master on Nov 14, 2023

Conversation

lijunwangs
Contributor

@lijunwangs lijunwangs commented Oct 27, 2023

Problem

In cross-region testing it has been found that transactions sent to remote regions sometimes fail with WriteError(Stopped). After a stream is opened, the client is blocked from sending because Quinn reports the stream as blocked by connection-level flow control. The connection's receive window is shared among all of the opened streams, and with a small window some streams spend a long time waiting for window capacity to become available. The problem is not as obvious when latency is low, because contending streams may release their share of the window quickly enough before the timeout is reached, but over a wide-area network with higher latency it becomes much more severe. Fundamentally, if we allow that many streams to be opened, we need to ensure they have enough receive-window capacity to be processed concurrently.

Summary of Changes

Make the connection receive window scale linearly with the per-connection stream limit.

Without the change, with the following configuration: 4 nodes in us-west and 1 node in eu-west, running bench-tps with thin-client and a duration > 300 seconds, we observe send-transaction failures in quic_client with WriteError(Stopped). With the fix, the problem is no longer observed.
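
As a rough sketch of the scaling rule (the constant values and the helper name below are illustrative stand-ins, not the exact code in this PR):

// Size the connection-level receive window so that every concurrently allowed
// stream can hold one maximum-size packet. Names here are stand-ins.
const PACKET_DATA_SIZE: u64 = 1232; // solana_sdk::packet::PACKET_DATA_SIZE
const MAX_CONCURRENT_STREAMS: u64 = 128; // assumed per-connection stream limit

fn connection_receive_window(max_concurrent_streams: u64) -> u64 {
    max_concurrent_streams * PACKET_DATA_SIZE
}

fn main() {
    // 128 * 1232 = 157_696 bytes (~154 KiB) per unstaked connection.
    println!("{} bytes", connection_receive_window(MAX_CONCURRENT_STREAMS));
}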

Fixes #


codecov bot commented Oct 27, 2023

Codecov Report

Merging #33913 (cb65405) into master (24a4670) will increase coverage by 0.0%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #33913   +/-   ##
=======================================
  Coverage    81.9%    81.9%           
=======================================
  Files         809      809           
  Lines      218062   218062           
=======================================
+ Hits       178616   178660   +44     
+ Misses      39446    39402   -44     

@lijunwangs
Contributor Author

In the bench-tps tests with thin-client, the following command was used:

lijun@lijun-dev:~/solana$ ./cargo run --release --bin solana-bench-tps -- -u http://35.233.177.221:8899 --identity /home/lijun/.config/solana/id.json --tx_count 1000 --thread-batch-sleep-ms 0 -t 20 --duration 300 -n 35.233.177.221:8001 --read-client-keys /home/lijun/gce-keypairs.yaml

The cluster has 5 nodes: 4 running in us-west and 1 in eu-west.

The results with and without the change are similar when the selected target node is local to the client.

With the changes:

[2023-10-30T18:37:32.379799783Z INFO solana_bench_tps::bench] Average TPS: 13596.006
[2023-10-30T18:47:45.479273306Z INFO solana_bench_tps::bench] Average TPS: 14842.48
[2023-10-30T22:09:32.425591207Z INFO solana_bench_tps::bench] Average TPS: 13733.603
[2023-10-30T22:52:55.438325354Z INFO solana_bench_tps::bench] Average TPS: 13025.998
[2023-10-30T23:16:19.227599708Z INFO solana_bench_tps::bench] Average TPS: 12082.257

Without:
[2023-10-31T00:23:51.799392269Z INFO solana_bench_tps::bench] Average TPS: 13629.625
[2023-10-31T00:40:32.005345530Z INFO solana_bench_tps::bench] Average TPS: 13446.362
[2023-10-31T01:15:02.894377583Z INFO solana_bench_tps::bench] Average TPS: 15698.498
[2023-10-31T05:15:19.921295026Z INFO solana_bench_tps::bench] Average TPS: 10978.847
[2023-10-31T05:21:02.765027475Z INFO solana_bench_tps::bench] Average TPS: 5734.667

There are material differences when the selected target is the single remote node in the cluster (eu-west).

With the change:
[2023-10-30T18:54:22.452562507Z INFO solana_bench_tps::bench] Average TPS: 1743.3228
[2023-10-30T19:12:23.275733223Z INFO solana_bench_tps::bench] Average TPS: 1781.0348
[2023-10-30T19:20:13.163388833Z INFO solana_bench_tps::bench] Average TPS: 1737.1093

Without the fix:
[2023-10-31T00:11:18.913058858Z INFO solana_bench_tps::bench] Average TPS: 125.85008
[2023-10-31T01:04:35.701763927Z INFO solana_bench_tps::bench] Average TPS: 125.05482
[2023-10-31T21:51:28.900832620Z INFO solana_bench_tps::bench] Average TPS: 125.72437

That is more than a 10x difference.

@@ -26,12 +26,12 @@ pub const QUIC_CONNECTION_HANDSHAKE_TIMEOUT: Duration = Duration::from_secs(60);

 /// The receive window for QUIC connection from unstaked nodes is
 /// set to this ratio times [`solana_sdk::packet::PACKET_DATA_SIZE`]
-pub const QUIC_UNSTAKED_RECEIVE_WINDOW_RATIO: u64 = 1;
+pub const QUIC_UNSTAKED_RECEIVE_WINDOW_RATIO: u64 = 128;
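
For context on how a ratio-derived window typically feeds into the QUIC endpoint, here is a minimal sketch assuming quinn's TransportConfig API (quinn 0.10-style receive_window and max_concurrent_uni_streams); it is not the repository's actual wiring, and the function name is illustrative:

// Sketch: derive the connection-level window from the ratio and apply it to
// the transport configuration. Constant values mirror the hunk above.
use std::sync::Arc;

use quinn::{TransportConfig, VarInt};

const PACKET_DATA_SIZE: u32 = 1232;
const QUIC_UNSTAKED_RECEIVE_WINDOW_RATIO: u32 = 128;

fn unstaked_transport_config() -> Arc<TransportConfig> {
    let mut config = TransportConfig::default();
    config
        // Allow as many concurrent unidirectional streams as the ratio assumes.
        .max_concurrent_uni_streams(VarInt::from_u32(QUIC_UNSTAKED_RECEIVE_WINDOW_RATIO))
        // Connection-level window: ratio * PACKET_DATA_SIZE bytes.
        .receive_window(VarInt::from_u32(
            QUIC_UNSTAKED_RECEIVE_WINDOW_RATIO * PACKET_DATA_SIZE,
        ));
    Arc::new(config)
}

The point is only that the two knobs move together: raising the stream limit without raising the window reintroduces the connection-level blocking described above.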
Contributor

Not very familiar with how we use quic, so forgive my long and possibly out of scope questions. I'm trying to understand the effect these constants have.

This means an unstaked peer can send 128 * PACKET_DATA_SIZE bytes before being blocked, right?
PACKET_DATA_SIZE does not include IP or QUIC header bytes, so this means slightly fewer than 128 max-length transactions?

This is a big jump from 1 to 128 for unstaked peers. Does this 128x the amount of unstaked traffic allowed by the quic-server? Are there concerns about overwhelming the server?

When does this receive window get reset? i.e. the thin-client is spamming transactions all the time during bench-tps, so I'd expect it to nearly always have more than 128 txs in-flight.

Would they get reset on acknowledgement?

What does that mean for my packets if, say, 1024 packets arrive very close together in time? Assume the software QUIC server would not be fast enough to respond before all the packets physically arrive. Would the first 128 be accepted and the rest just dropped?

Contributor Author
@lijunwangs lijunwangs Nov 6, 2023

Correct, the receive window is set to scale linearly with the number of streams allowed: each connection is allowed to consume up to its stream count limit times PACKET_DATA_SIZE. That amounts to about 128 * 1232 ≈ 154 KB at maximum for an unstaked connection. We allow 500 unstaked connections, so that comes to roughly 75 MB at most for unstaked connections, which I think is a reasonable number.

In any case, my thinking is that the number of streams and the receive window ratio should be set to scale together. Having a large number of streams with a small receive window can (1) reduce throughput because of fragmentation and (2) actually increase both CPU and memory load: for CPU, the client and server need to maintain state and keep polling the blocked streams; for memory, we already buffer the data in the quic streamer until the full stream data has been received, so in the worst case this does not increase the maximum memory consumption. The receive window is adjusted dynamically by the QUIC protocol layer as data is received and consumed -- it is a sliding window.

A good read explaining this mechanism is https://docs.google.com/document/d/1F2YfdDXKpy20WVKJueEf4abn_LVZHhMUMS5gX6Pgjl4/mobilebasic.

> When does this receive window get reset? i.e. the thin-client is spamming transactions all the time during bench-tps, so I'd expect it to nearly always have more than 128 txs in-flight.

Yes, if a client decides to do so -- we allow at most 128 concurrent streams. That is not a net new change, though.

> What does that mean for my packets if, say, 1024 packets arrive very close together in time? Assume the software QUIC server would not be fast enough to respond before all the packets physically arrive. Would the first 128 be accepted and the rest just dropped?

The QUIC protocol signals the stream as blocked when the receive window would be exceeded; the client is then blocked and has to wait for the server to extend the flow-control window. For a client that maliciously ignores the receive-window mechanism, the connection can be dropped by the server.
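
From the sender's side, the effect is easiest to see in a sketch (assuming a quinn 0.10-style client API and a caller-chosen timeout; this is not the actual quic_client code, and send_packet is a hypothetical helper):

// Illustrative only: while connection-level flow control has no window left,
// write_all simply waits for the server to extend the window; it does not
// fail by itself. A caller-side timeout is what turns a long wait into an error.
use std::time::Duration;

use quinn::Connection;
use tokio::time::timeout;

async fn send_packet(connection: &Connection, data: &[u8]) -> anyhow::Result<()> {
    let mut stream = connection.open_uni().await?;
    // Blocks here while the shared connection window is exhausted.
    timeout(Duration::from_secs(2), stream.write_all(data)).await??;
    // finish() signals the end of the stream to the receiver.
    timeout(Duration::from_secs(2), stream.finish()).await??;
    Ok(())
}

With a small connection window and a long round-trip time, the write spends most of its time parked in that first await, so long waits there are what eventually surface as failed sends in the cross-region runs.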

Contributor

It seems this will increase the receive window for all the streams of a given connection. So, just to understand: this can potentially increase the RAM usage of a given node, is that correct? If so, in the worst case, what is the increase in RAM usage?

Contributor Author
@lijunwangs lijunwangs Nov 14, 2023

In the worst-case scenario, where malicious clients intentionally hold off finalizing their streams and the streamer fails to pull the data off the Quinn protocol layer, the memory consumed would be:

128 x 1232 x 512 / 1024^2 ≈ 77 MB for unstaked
512 x 1232 x 2000 / 1024^2 ≈ 1203 MB for staked

But we are very unlikely to run into this situation, since we actively poll data from Quinn into the streamer's internal buffers. In that case, even if a client intentionally stops finalizing its data, it does not increase the net worst-case memory pressure: the data would have been buffered in the streamer's temporary buffers anyway, and those are dictated by stream_per_connection_count * connection_count.
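
For reference, the same worst-case arithmetic as a small sketch (the per-connection stream limits and connection counts are the figures quoted in this thread, treated here as assumptions rather than constants looked up from the code):

// Worst-case buffered bytes if every allowed connection fills its whole
// receive window and nothing is drained.
const PACKET_DATA_SIZE: u64 = 1232;

const UNSTAKED_STREAMS_PER_CONN: u64 = 128;
const UNSTAKED_CONNECTIONS: u64 = 512;

const STAKED_STREAMS_PER_CONN: u64 = 512;
const STAKED_CONNECTIONS: u64 = 2000;

const fn worst_case_bytes(streams_per_conn: u64, connections: u64) -> u64 {
    streams_per_conn * PACKET_DATA_SIZE * connections
}

fn main() {
    let mib = 1024.0 * 1024.0;
    // 128 * 1232 * 512  ≈ 77 MiB
    println!(
        "unstaked: {:.0} MiB",
        worst_case_bytes(UNSTAKED_STREAMS_PER_CONN, UNSTAKED_CONNECTIONS) as f64 / mib
    );
    // 512 * 1232 * 2000 ≈ 1203 MiB
    println!(
        "staked: {:.0} MiB",
        worst_case_bytes(STAKED_STREAMS_PER_CONN, STAKED_CONNECTIONS) as f64 / mib
    );
}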

@lijunwangs
Contributor Author

@ryleung-solana @pgarg66 -- can you please help review this?

@lijunwangs lijunwangs merged commit aa991b6 into solana-labs:master Nov 14, 2023
42 checks passed
lijunwangs added a commit to lijunwangs/solana that referenced this pull request Mar 18, 2024
Adjust receive window to make them linear to the count of streams (solana-labs#33913) to reduce fragmentation
lijunwangs added a commit to lijunwangs/solana that referenced this pull request Mar 18, 2024
Adjust receive window to make them linear to the count of streams (solana-labs#33913) to reduce fragmentation
lijunwangs added a commit to lijunwangs/solana that referenced this pull request Mar 21, 2024
Adjust receive window to make them linear to the count of streams (solana-labs#33913) to reduce fragmentation
lijunwangs added a commit to lijunwangs/solana that referenced this pull request Mar 27, 2024
Adjust receive window to make them linear to the count of streams (solana-labs#33913) to reduce fragmentation
lijunwangs added a commit to lijunwangs/solana that referenced this pull request Apr 3, 2024
Adjust receive window to make them linear to the count of streams (solana-labs#33913) to reduce fragmentation
@lijunwangs lijunwangs added the v1.17 PRs that should be backported to v1.17 label Apr 5, 2024

mergify bot commented Apr 5, 2024

Backports to the stable branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule.

mergify bot pushed a commit that referenced this pull request Apr 5, 2024
Adjust receive window to make them linear to the count of streams (#33913) to reduce fragmentation
(cherry picked from commit aa991b6)
lijunwangs added a commit to lijunwangs/solana that referenced this pull request Apr 5, 2024
Adjust receive window to make them linear to the count of streams (solana-labs#33913) to reduce fragmentation
jeffwashington pushed a commit to lijunwangs/solana that referenced this pull request Apr 5, 2024
Adjust receive window to make them linear to the count of streams (solana-labs#33913) to reduce fragmentation
lijunwangs added a commit to anza-xyz/agave that referenced this pull request Apr 5, 2024
Adjust receive window to make them linear to the count of streams (solana-labs#33913) (#595) to reduce fragmentation
willhickey pushed a commit that referenced this pull request Apr 8, 2024
Adjust receive window to make them linear to the count of streams (#33913) (#595) to reduce fragmentation