Gray/sockmap #10

jschwinger233 · 2024-03-24T10:43:52Z

Background

This PR introduces two new bpf programs to accelerate WAN TCP.

Overall, the data plane of the original WAN TCP hijacking path looks like this:

 ┌─────────┐                   ┌─────────┐ 
 │ process │                   │ process │ 
 └────┬────┘                   └────▲────┘ 
      │                             │      
 ┌────▼────┐                   ┌────┴────┐ 
 │ socket  │                   │ socket  │ 
 └────┬────┘                   └────▲────┘ 
      │                             │      
 ┌────▼────┐                   ┌────┴────┐ 
 │ tcp/ip  │                   │ tcp/ip  │ 
 └────┬────┘                   └────▲────┘ 
      │                             │      
 ┌────▼────┐    ┌────┬────┐    ┌────┴────┐ 
 │ routing ├────►veth│veth├────► routing │ 
 └─────────┘    └────┴────┘    └─────────┘

This PR optimizes the path above into:

 ┌─────────┐                   ┌─────────┐ 
 │ process │                   │ process │ 
 └────┬────┘                   └────▲────┘ 
      │                             │      
 ┌────▼────┐                   ┌────┴────┐ 
 │ socket  ├───────────────────► socket  │ 
 └─────────┘                   └─────────┘ 
                                           
 ┌─────────┐                   ┌─────────┐ 
 │ tcp/ip  │                   │ tcp/ip  │ 
 └─────────┘                   └─────────┘ 
                                           
 ┌─────────┐    ┌────┬────┐    ┌─────────┐ 
 │ routing │    │veth│veth│    │ routing │ 
 └─────────┘    └────┴────┘    └─────────┘

See Benchmark for the results.

Implementation details

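Two bpf programs are used together:

BPF_PROG_TYPE_SOCK_OPS: this program type is attached to a cgroup and is triggered when a TCP socket completes the three-way handshake. We check routing_tuples_map to decide whether the socket belongs to the WAN proxy; if it does, we add it to the sockmap with bpf_sock_hash_update.
BPF_PROG_TYPE_SK_MSG: this program type is attached to a sockmap, namely the one holding the WAN-proxy-hijacked sockets collected in the first step. It is triggered when a socket sends a message and calls bpf_msg_redirect_hash to deliver the TCP payload directly to the peer socket.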

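Note that the TCP handshake and teardown still go through the kernel stack and are not accelerated; only traffic on an established connection takes the fast path.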
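
The interplay of the two programs can be sketched as a plain userspace model in C. This is not the PR's BPF code: `tuple_key`, `sock_table`, `on_established` and `on_sendmsg` are hypothetical names, the linear table stands in for the kernel's sockhash, and the sketch assumes both endpoints of a hijacked connection are stored under mirrored 4-tuples, which the PR does not state explicitly.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 4-tuple key; the PR's real key layout may differ. */
struct tuple_key {
    uint32_t local_ip, remote_ip;
    uint16_t local_port, remote_port;
};

#define MAX_SOCKS 64

/* Userspace stand-in for a sockhash map: key -> socket id. */
struct sock_table {
    struct tuple_key keys[MAX_SOCKS];
    int sock_ids[MAX_SOCKS];
    size_t len;
};

static bool key_eq(const struct tuple_key *a, const struct tuple_key *b)
{
    return a->local_ip == b->local_ip && a->remote_ip == b->remote_ip &&
           a->local_port == b->local_port && a->remote_port == b->remote_port;
}

/* Models the sock_ops step, run once the handshake has completed.
 * is_wan_proxied stands in for the routing_tuples_map check. */
static bool on_established(struct sock_table *t, struct tuple_key key,
                           int sock_id, bool is_wan_proxied)
{
    if (!is_wan_proxied || t->len == MAX_SOCKS)
        return false; /* not tracked: the socket stays on the kernel path */
    t->keys[t->len] = key;
    t->sock_ids[t->len] = sock_id;
    t->len++;
    return true;
}

/* Models the sk_msg step: find the peer by the mirrored tuple.
 * Returns the peer socket id, or -1 to fall back to the normal stack. */
static int on_sendmsg(const struct sock_table *t, struct tuple_key sender)
{
    struct tuple_key peer = {
        .local_ip = sender.remote_ip,
        .remote_ip = sender.local_ip,
        .local_port = sender.remote_port,
        .remote_port = sender.local_port,
    };
    for (size_t i = 0; i < t->len; i++)
        if (key_eq(&t->keys[i], &peer))
            return t->sock_ids[i];
    return -1;
}
```

In this model, a message sent on a tracked socket resolves to its peer's id and would be queued there directly, while a socket that was never inserted gets -1 and keeps using the regular tcp/ip, routing and veth path.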

Benchmark

Latency measured with sockperf.

Result with dae-0.4.0:

# nsenter -t $(pidof dae-0.4.0) -n sockperf ping-pong -i 172.18.0.3 --tcp --time 10
sockperf: == version #3.7-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 172.18.0.3      PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=10.000 sec; Warm up time=400 msec; SentMessages=134874; ReceivedMessages=134873
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=9.550 sec; SentMessages=128877; ReceivedMessages=128877
sockperf: ====> avg-latency=37.006 (std-dev=5.955)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 37.006 usec
sockperf: Total 128877 observations; each percentile contains 1288.77 observations
sockperf: ---> <MAX> observation =  420.339
sockperf: ---> percentile 99.999 =  313.563
sockperf: ---> percentile 99.990 =  206.996
sockperf: ---> percentile 99.900 =   79.486
sockperf: ---> percentile 99.000 =   50.174
sockperf: ---> percentile 90.000 =   42.508
sockperf: ---> percentile 75.000 =   39.476
sockperf: ---> percentile 50.000 =   36.514
sockperf: ---> percentile 25.000 =   34.145
sockperf: ---> <MIN> observation =   21.565

Result with this PR:

# nsenter -t $(pidof dae) -n sockperf ping-pong -i 172.18.0.3 --tcp --time 10
sockperf: == version #3.7-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 172.18.0.3      PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=10.000 sec; Warm up time=400 msec; SentMessages=143488; ReceivedMessages=143487
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=9.550 sec; SentMessages=137069; ReceivedMessages=137069
sockperf: ====> avg-latency=34.788 (std-dev=6.701)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 34.788 usec
sockperf: Total 137069 observations; each percentile contains 1370.69 observations
sockperf: ---> <MAX> observation =  425.241
sockperf: ---> percentile 99.999 =  407.120
sockperf: ---> percentile 99.990 =  244.703
sockperf: ---> percentile 99.900 =   80.511
sockperf: ---> percentile 99.000 =   47.190
sockperf: ---> percentile 90.000 =   40.633
sockperf: ---> percentile 75.000 =   37.325
sockperf: ---> percentile 50.000 =   34.607
sockperf: ---> percentile 25.000 =   31.777
sockperf: ---> <MIN> observation =   20.779

Average TCP latency improves by about 6% (37.006 usec down to 34.788 usec).
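
For reference, the percentage can be reproduced from the two sockperf runs with a small helper. The function name `latency_reduction_pct` is made up for illustration; the inputs are the figures printed above.

```c
/* Relative latency reduction in percent.
 * Positive means the "after" run is faster than the "before" run. */
static double latency_reduction_pct(double before_usec, double after_usec)
{
    return (before_usec - after_usec) / before_usec * 100.0;
}
```

Applied to the averages (37.006 and 34.788 usec) this gives roughly 6%, and the median (36.514 and 34.607 usec) improves by roughly 5%. In this single run the 99.99th percentile goes the other way (206.996 to 244.703 usec), so the gain is best read as an average-case figure.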

But latency is only one part of performance. Running a TCP rr (round-trip) test with iperf on my VM simply exhausts memory:

[Mon Mar 25 18:17:02 2024] Out of memory: Killed process 1233 (dae) total-vm:1315492kB, anon-rss:86784kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:296kB oom_score_adj:0
[Mon Mar 25 18:17:02 2024] TCP: out of memory -- consider tuning tcp_mem

In real-world scenarios, such as redis-server with redis-benchmark, the p99 improvement often reaches 10% or more.

Checklist

The Pull Request has been fully tested
There's an entry in the CHANGELOGS
There is a user-facing docs PR against https://github.com/daeuniverse/dae

Full Changelogs

[Implement ...]

Issue Reference

Closes #[issue number]

Test Result

jschwinger233 added 2 commits March 21, 2024 01:42

bpf: Clean up useless code

63fe825

bpf: Implement local tcp fast redirect for ipv4

7eb11ab

jschwinger233 force-pushed the gray/sockmap branch from 5099a60 to 7eb11ab Compare March 24, 2024 11:07

bpf: support ipv6 for local tcp fast redirect

c1dcec7
