Gray/sockmap #10

Draft
wants to merge 3 commits into
base: main

@jschwinger233 jschwinger233 commented Mar 24, 2024

Background

This PR introduces two new bpf programs to accelerate WAN TCP.

Overall, the data plane of the original WAN TCP hijacking path looks like this:

 ┌─────────┐                   ┌─────────┐ 
 │ process │                   │ process │ 
 └────┬────┘                   └────▲────┘ 
      │                             │      
 ┌────▼────┐                   ┌────┴────┐ 
 │ socket  │                   │ socket  │ 
 └────┬────┘                   └────▲────┘ 
      │                             │      
 ┌────▼────┐                   ┌────┴────┐ 
 │ tcp/ip  │                   │ tcp/ip  │ 
 └────┬────┘                   └────▲────┘ 
      │                             │      
 ┌────▼────┐    ┌────┬────┐    ┌────┴────┐ 
 │ routing ├────►veth│veth├────► routing │ 
 └─────────┘    └────┴────┘    └─────────┘ 

This PR optimizes the path above to:

 ┌─────────┐                   ┌─────────┐ 
 │ process │                   │ process │ 
 └────┬────┘                   └────▲────┘ 
      │                             │      
 ┌────▼────┐                   ┌────┴────┐ 
 │ socket  ├───────────────────► socket  │ 
 └─────────┘                   └─────────┘ 
                                           
 ┌─────────┐                   ┌─────────┐ 
 │ tcp/ip  │                   │ tcp/ip  │ 
 └─────────┘                   └─────────┘ 
                                           
 ┌─────────┐    ┌────┬────┐    ┌─────────┐ 
 │ routing │    │veth│veth│    │ routing │ 
 └─────────┘    └────┴────┘    └─────────┘ 

See Benchmark for the results of this optimization.

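Implementation details

Two bpf programs have to be used together:

  1. BPF_PROG_TYPE_SOCK_OPS: this type of bpf program is attached to a cgroup and can be triggered when a TCP socket completes the three-way handshake. We check routing_tuples_map to decide whether a socket belongs to the WAN proxy; if it does, we add the socket to the sockmap with bpf_sock_hash_update.
  2. BPF_PROG_TYPE_SK_MSG: this type of bpf program is attached to a sockmap, namely the one populated in step 1 with the sockets hijacked by the WAN proxy. It is triggered when a socket sends a message, and it calls bpf_msg_redirect_hash to deliver the TCP segment directly to the peer socket.

Note that the TCP handshake and teardown still go through the kernel stack and are not accelerated; only traffic on an established connection takes the fast path.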
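The two-step flow above can be modelled in user space to make the control flow concrete. The following is only a toy Python model, not the bpf code in this PR: `Sock`, `ToySockmap`, `peer_key`, `on_established` and `on_sendmsg` are hypothetical names, and it assumes that the two ends of a hijacked connection are keyed by mirrored 4-tuples and that `routing_tuples_map` can be treated as a set of such tuples.

```python
from collections import deque
from dataclasses import dataclass, field


def peer_key(key):
    """Mirror a (src_ip, src_port, dst_ip, dst_port) tuple (assumed key layout)."""
    src_ip, src_port, dst_ip, dst_port = key
    return (dst_ip, dst_port, src_ip, src_port)


@dataclass
class Sock:
    key: tuple                                # 4-tuple as seen by this socket
    rx: deque = field(default_factory=deque)  # stand-in for the receive queue


class ToySockmap:
    def __init__(self, routing_tuples):
        self.routing_tuples = set(routing_tuples)  # stand-in for routing_tuples_map
        self.socks = {}                            # stand-in for the sockmap
        self.slow_path = []                        # messages left to the kernel stack

    def on_established(self, sock):
        """Step 1 (sock_ops analogue): collect WAN proxy sockets after the handshake."""
        if sock.key in self.routing_tuples or peer_key(sock.key) in self.routing_tuples:
            self.socks[sock.key] = sock  # analogue of bpf_sock_hash_update
            return True
        return False

    def on_sendmsg(self, sock, payload):
        """Step 2 (sk_msg analogue): hand the payload straight to the peer socket."""
        peer = self.socks.get(peer_key(sock.key))
        if sock.key not in self.socks or peer is None:
            self.slow_path.append((sock.key, payload))  # not accelerated
            return "stack"
        peer.rx.append(payload)  # analogue of bpf_msg_redirect_hash
        return "redirect"


if __name__ == "__main__":
    app = Sock(("172.18.0.2", 40000, "172.18.0.3", 11111))
    proxy = Sock(peer_key(app.key))
    sockmap = ToySockmap({app.key})
    sockmap.on_established(app)
    sockmap.on_established(proxy)
    print(sockmap.on_sendmsg(app, b"ping"), list(proxy.rx))
```

In this model a socket that was never collected in step 1 falls back to the ordinary stack, which mirrors the note that only established, hijacked connections are accelerated.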

Benchmark

Latency is measured with sockperf.

Result for dae-0.4.0:

# nsenter -t $(pidof dae-0.4.0) -n sockperf ping-pong -i 172.18.0.3 --tcp --time 10
sockperf: == version #3.7-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 172.18.0.3      PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=10.000 sec; Warm up time=400 msec; SentMessages=134874; ReceivedMessages=134873
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=9.550 sec; SentMessages=128877; ReceivedMessages=128877
sockperf: ====> avg-latency=37.006 (std-dev=5.955)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 37.006 usec
sockperf: Total 128877 observations; each percentile contains 1288.77 observations
sockperf: ---> <MAX> observation =  420.339
sockperf: ---> percentile 99.999 =  313.563
sockperf: ---> percentile 99.990 =  206.996
sockperf: ---> percentile 99.900 =   79.486
sockperf: ---> percentile 99.000 =   50.174
sockperf: ---> percentile 90.000 =   42.508
sockperf: ---> percentile 75.000 =   39.476
sockperf: ---> percentile 50.000 =   36.514
sockperf: ---> percentile 25.000 =   34.145
sockperf: ---> <MIN> observation =   21.565

Result for this PR:

# nsenter -t $(pidof dae) -n sockperf ping-pong -i 172.18.0.3 --tcp --time 10
sockperf: == version #3.7-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 172.18.0.3      PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=10.000 sec; Warm up time=400 msec; SentMessages=143488; ReceivedMessages=143487
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=9.550 sec; SentMessages=137069; ReceivedMessages=137069
sockperf: ====> avg-latency=34.788 (std-dev=6.701)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 34.788 usec
sockperf: Total 137069 observations; each percentile contains 1370.69 observations
sockperf: ---> <MAX> observation =  425.241
sockperf: ---> percentile 99.999 =  407.120
sockperf: ---> percentile 99.990 =  244.703
sockperf: ---> percentile 99.900 =   80.511
sockperf: ---> percentile 99.000 =   47.190
sockperf: ---> percentile 90.000 =   40.633
sockperf: ---> percentile 75.000 =   37.325
sockperf: ---> percentile 50.000 =   34.607
sockperf: ---> percentile 25.000 =   31.777
sockperf: ---> <MIN> observation =   20.779

TCP latency improves by about 6% (average latency drops from 37.006 usec to 34.788 usec).
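As a quick check of that figure, the relative change can be recomputed from the numbers printed by the two sockperf runs. This is a small sketch; `improvement_pct` is a hypothetical helper and the dictionary only copies values from the outputs above. Each run is a single 10-second sample, so the tail percentiles in particular should be read with caution.

```python
def improvement_pct(before, after):
    """Relative latency reduction in percent; positive means `after` is faster."""
    return (before - after) / before * 100.0


# usec values copied from the two sockperf runs: (dae-0.4.0, this PR)
results = {
    "avg": (37.006, 34.788),
    "p50": (36.514, 34.607),
    "p99": (50.174, 47.190),
    "p99.9": (79.486, 80.511),
    "p99.99": (206.996, 244.703),
}

for name, (before, after) in results.items():
    print(f"{name:>7}: {improvement_pct(before, after):+.2f}%")
```

The average, median and p99 all come out lower for this PR, while the percentiles from 99.9 upward are higher in this particular run.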

Latency is only one aspect of performance, though. Running tcp rr (round-trip) with iperf on my VM exhausts memory outright, and dae gets OOM-killed:

[Mon Mar 25 18:17:02 2024] Out of memory: Killed process 1233 (dae) total-vm:1315492kB, anon-rss:86784kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:296kB oom_score_adj:0
[Mon Mar 25 18:17:02 2024] TCP: out of memory -- consider tuning tcp_mem

In real-world scenarios, for example redis-server with redis-benchmark, the p99 improvement often reaches 10% or more.

Checklist

Full Changelogs

  • [Implement ...]

Issue Reference

Closes #[issue number]

Test Result
