bpf: Parse skb->data only once #8

Draft
wants to merge 3 commits into
base: main


@jschwinger233 jschwinger233 commented Mar 2, 2024

Background

This PR introduces four performance optimizations. First, a recap of the datapath:

                           ┌──────────────────┐ 
             a             │ b                │ 
┌────┐     ┌────┐     ┌────┼────┐      ┌───┐  │ 
│    ├─────►    ├─────►    │    ├──────►   │  │ 
│curl│     │wan0│     │dae0│peer│      │dae│  │ 
│    ◄─────┤    ◄─────┤    │    ◄──────┤   │  │ 
└────┘     └────┘     └────┼────┘      └───┘  │ 
                        c  │     dae netns    │ 
                           └──────────────────┘ 

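a. bpf_lan_ingress: makes the routing decision. Direct traffic is passed through into the network stack; traffic to be proxied is redirected to dae0 with bpf_redirect.
b. bpf_peer_ingress: only proxied traffic can reach this point. It calls bpf_skc_lookup and bpf_sk_assign to assign the traffic to the dae socket.
c. bpf_dae0_ingress: only **replies** to proxied traffic can reach this point. It calls bpf_redirect to send them back to wan0.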

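Optimization 1: The bpf programs at a and b each parse the L2/L3/L4 headers. Parsing twice is unnecessary: once the headers are parsed at a, the information needed at b can be carried along in skb->cb.
Optimization 2: The peer_ingress bpf program at b does not need to call bpf_skc_lookup to find the socket for established TCP, because the kernel can do the socket lookup itself. With tcp_early_demux enabled, the kernel can also skip the routing decision and go straight to local delivery.
Optimization 3: wan_egress at a redirects the skb from wan0 to dae0, and dae0 egress then crosses the netns boundary to reach peer. This can be simplified with bpf_redirect_peer: the skb is redirected from wan0 directly to peer inside the netns, which avoids the performance cost of enqueue_to_backlog.
Optimization 4: When dae replies to proxied traffic, the reply goes through the neighbor subsystem inside the netns and then crosses the netns boundary from peer to reach dae0 at c. This can be sped up by pointing all routes inside the netns at lo (ip r a default dev lo) to bypass the neighbor subsystem, and by attaching a bpf program to lo that calls bpf_redirect_peer to cross the netns boundary directly.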

Checklist

Full Changelogs

  • [Implement ...]

Issue Reference

Closes #[issue number]

Test Result

Previously we parsed skb->data twice: once at wan_egress/lan_ingress and
once at dae0peer_ingress. This was due to a limitation of bpf_sk_assign:
it has to be called within the netns where the socket lives.

This patch parses skb->data only once, at wan_egress/lan_ingress, where
we leave a value in skb->cb[1] to tell dae0peer_ingress what to do:
1. if skb->cb[1] == TCP, it's a new TCP conn: assign the skb to the TCP
   listener;
2. if skb->cb[1] == UDP, it's UDP: assign the skb to the UDP listener;
3. otherwise it's an established TCP conn, and the stack can take care
   of the socket lookup.

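The three cases above can be modelled in plain user-space C. This is only a sketch of the dispatch logic, not the code in this PR: `struct fake_skb`, `mark_for_peer`, `peer_dispatch` and the `MARK_*` values are hypothetical names, and treating "SYN without ACK" as the test for a new TCP conn is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the skb->cb scratch area. */
struct fake_skb {
    uint32_t cb[5];
};

/* Marker values; reusing the IP protocol numbers is an assumption. */
enum { MARK_NONE = 0, MARK_TCP = 6, MARK_UDP = 17 };

enum peer_action {
    ASSIGN_TCP_LISTENER, /* new TCP conn: hand skb to the TCP listener */
    ASSIGN_UDP_LISTENER, /* UDP: hand skb to the UDP listener */
    LET_STACK_LOOKUP,    /* established TCP: kernel finds the socket */
};

/* wan_egress/lan_ingress side: runs after the single parse of skb->data
 * and records what the peer side needs to know in cb[1]. */
static void mark_for_peer(struct fake_skb *skb, uint8_t l4proto,
                          bool syn, bool ack)
{
    if (l4proto == MARK_UDP)
        skb->cb[1] = MARK_UDP;
    else if (l4proto == MARK_TCP && syn && !ack)
        skb->cb[1] = MARK_TCP; /* assumed definition of "new TCP conn" */
    else
        skb->cb[1] = MARK_NONE;
}

/* dae0peer_ingress side: decides from cb[1] alone, without re-parsing. */
static enum peer_action peer_dispatch(const struct fake_skb *skb)
{
    switch (skb->cb[1]) {
    case MARK_TCP:
        return ASSIGN_TCP_LISTENER;
    case MARK_UDP:
        return ASSIGN_UDP_LISTENER;
    default:
        return LET_STACK_LOOKUP;
    }
}
```

In the real datapath the two listener cases would end in a bpf_sk_assign call, while the default case simply lets the packet continue into the stack.
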
bpf_redirect_peer can redirect an skb from the host directly into the
netns without going through the veth on the host side. This skips
enqueue_to_backlog() and improves performance.

The cost is that we have to attach a wan_ingress bpf program to perform
the bpf_redirect_peer, because this helper can only be called from tc
ingress.

wireguard requires kernel patch
https://lore.kernel.org/bpf/[email protected]/.
@jschwinger233 force-pushed the gray/datapath-perf branch 4 times, most recently from bc3fe38 to 773bc69 on March 2, 2024 08:46
Reply traffic is now routed to lo. The lo_egress bpf program redirects
packets to the lo_ingress bpf program, which calls bpf_redirect_peer to
pass them to the host.

This also bypasses the neighbor subsystem, because routing to lo
requires no L2 header to be filled in.