Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Populate client side trace's local address via tcp kprobes #1989

Conversation

ddelnano
Copy link
Member

@ddelnano ddelnano commented Aug 23, 2024

Summary: Populate client side trace's local address via tcp kprobes

This change populates client side trace's local_addr and local_port columns for the following use cases:

  1. To provide more consistency for the protocol data tables. Having columns that are empty make it difficult for end users to understand what is being traced and make them less useful
  2. To facilitate addressing a portion of the short lived process problems (Pixie Misses Events from Short Lived Processes/Pods #1638)

For 2, the root of the issue is that df.ctx["pod"] syntax relies on the px.upid_to_pod_name function. If a PEM misses the short lived process during its metadata update, this function fails to resolve the pod name. For client side traces where the pod is making an outbound connection (non localhost), the local_addr column provides an alternative pod name lookup for short lived processes when the pod is long lived. This means the following would be equivalent to the df.ctx["pod"] lookup: px.pod_id_to_pod_name(px.ip_to_pod_id(df.local_addr)).

I intend to follow this PR with a compiler change that will make df.ctx["pod"] try both methods should px.upid_to_pod_name fail to resolve. This will allow the existing pxl scripts to display the previously missed short lived processes.

Alternatives

Another approach I considered was expanding our use of the sock_alloc kprobe. I used ftrace on a simple curl command to see what other options could be used (sudo trace-cmd record -F -p function_graph http://google.com). The socket syscall calls sock_alloc, which would be another mechanism for accessing the struct sock. I decided against this approach because I don't think its viable to assume that the same thread/process that calls socket will be the one that does the later syscalls (how our BPF maps are set up). It's common to have a forking web server model, which means a different process/thread can call socket than the ones that later read/write to it.

Probe stability

These probes appear to be stable from our oldest and newest supported kernel. These functions exist in the tcp_prot, tcpv6_prot structs and I've seen that other projects and bcc tools use these probes. This makes me believe that these functions have a pretty well defined interface.

Relevant Issues: #1829, #1638

Type of change: /kind feature

Test Plan: New tests verify that ipv4 and ipv6 cases work

  • Ran for i in $(seq 0 1000); do curl http://google.com/$i; sleep 2; done within a pod and verified that local_addr is populated with this change and px.pod_id_to_pod_name(px.ip_to_pod_id(df.local_addr)) works for pod name resolution.

  • Verified the above curl test results in traces without local_addr without this change
    local-addr-testing

  • Tested on the following k8s offerings and machine images

  • GKE COS and Ubuntu

  • EKS Amazon Linux 2

Changelog Message: Populate socket tracer data table local_addr and local_port column for client side traces.

@ddelnano ddelnano requested a review from a team as a code owner August 23, 2024 18:37
@ddelnano ddelnano changed the title Populate client side traces local address via tcp kprobes Populate client side trace's local address via tcp kprobes Aug 23, 2024
Comment on lines +353 to +356
{"tcp_sendmsg", ProbeType::kEntry, "probe_entry_populate_active_connect_sock",
/*is_syscall*/ false},
{"tcp_sendmsg", ProbeType::kReturn, "probe_ret_populate_active_connect_sock",
/*is_syscall*/ false},
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we don't have an easy mechanism for testing this, I tried to test this mid stream case by creating a psql shell prior to the PEM starting and issuing queries after the socket tracer was initialized. I verified that a tcp connection was made and only sendto syscalls were issued when queries were executed (no connect syscalls) to check that this would simulate a mid stream case.

Surprisingly, I saw that the local_addr column was populated before I added the probe on tcp_sendmsg despite my thinking that it should be empty. I went ahead and added this probe since I verified with ftrace on the same psql process that tcp_v4_connect isn't called. Just raising this since I wasn't able to verify that this additional probe adds extra coverage from the testing I did.

@ddelnano
Copy link
Member Author

@oazizi000 would appreciate if you could give this a review. I still need to fix the bpf instruction issue for the 4.x kernels and the incorrect storage of the local port (it is stored in network byte order and is an existing issue), but that shouldn't impact the core details in the PR.

@@ -176,7 +176,7 @@ DEFINE_bool(
stirling_debug_tls_sources, gflags::BoolFromEnv("PX_DEBUG_TLS_SOURCES", false),
"If true, stirling will add additional prometheus metrics regarding the traced tls sources");

DEFINE_uint32(stirling_bpf_loop_limit, 42,
DEFINE_uint32(stirling_bpf_loop_limit, 41,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The iovec syscall return probes that rely on process_implicit_conn (recvmsg, recvmmsg, sendmsg, sendmmsg) had their instruction count increase beyond the 4.14 limit. This was my solution to getting the BPF instruction count on 4.14 kernels to work. The rationale is that cases that are right on the edge of this limit are likely to have problems already.

The other things I considered were the following:

  • Remove match_trace_tgid logic on 4.14 kernels -- did not lower the instruction count enough
  • Make the iovec ret syscalls provide a NULL sock for its submit_new_conn call (e.g. submit_new_conn(ctx, tgid, fd, addr, /*sock*/ NULL, role, source_fn)) -- not viable since it partially handles mid stream connections

I'm open to other suggestions if you have any, but this was the set of things I came up with.

@ddelnano ddelnano force-pushed the ddelnano/populate-laddr-for-socket-tracer-records branch from 9e10095 to c219f42 Compare August 29, 2024 17:58
@ddelnano ddelnano force-pushed the ddelnano/populate-laddr-for-socket-tracer-records branch from c219f42 to 7960135 Compare August 29, 2024 17:59
@oazizi000
Copy link
Contributor

Approach looks good. We should just manually test on a few more endpoints to gain more confidence. If that all looks good, then this should be good to go.

…local port (skc_num field)

Signed-off-by: Dom Del Nano <[email protected]>
@ddelnano
Copy link
Member Author

ddelnano commented Sep 4, 2024

@oazizi000 I've tested this change on EKS (Amazon Linux 2), GKE Ubuntu and GKE COS in addition to tracking down the endianness problem (kernel data structure stores the local port in host byte order).

I've also created #2002 to make that byte order more clear even though that case doesn't have a bug.

@ddelnano ddelnano merged commit 0c1fdd2 into pixie-io:main Sep 5, 2024
44 checks passed
@ddelnano ddelnano deleted the ddelnano/populate-laddr-for-socket-tracer-records branch September 5, 2024 16:16
ddelnano added a commit that referenced this pull request Sep 16, 2024
…#2002)

Summary: Use correct byte ordering function for kernel struct member
(skc_num)

This change doesn't result in a functional difference since `ntohs` and
`htons` are inverses of each other on little endian machines (and noops
for big endian machines). This field's byte order caused me confusion in
#1989, so I wanted to make this struct access consistent.

Relevant Issues: N/A

Type of change: /kind cleanup

Test Plan: Testing during #1989 (details
[here](#1989 (comment)))

Signed-off-by: Dom Del Nano <[email protected]>
ddelnano added a commit to ddelnano/pixie that referenced this pull request Sep 23, 2024
…1989)

Summary: Populate client side trace's local address via tcp kprobes

This change populates client side trace's `local_addr` and `local_port`
columns for the following use cases:
1. To provide more consistency for the protocol data tables. Having
columns that are empty make it difficult for end users to understand
what is being traced and make them less useful
2. To facilitate addressing a portion of the short lived process
problems (pixie-io#1638)

For 2, the root of the issue is that `df.ctx["pod"]` syntax relies on
the
[px.upid_to_pod_name](https://docs.px.dev/reference/pxl/udf/upid_to_pod_name/)
function. If a PEM misses the short lived process during its metadata
update, this function fails to resolve the pod name. For client side
traces where the pod is making an outbound connection (non localhost),
the `local_addr` column provides an alternative pod name lookup for
short lived processes when the pod is long lived. This means the
following would be equivalent to the `df.ctx["pod"]` lookup:
`px.pod_id_to_pod_name(px.ip_to_pod_id(df.local_addr))`.

I intend to follow this PR with a compiler change that will make
`df.ctx["pod"]` try both methods should `px.upid_to_pod_name` fail to
resolve. This will allow the existing pxl scripts to display the
previously missed short lived processes.

**Alternatives**

Another approach I considered was expanding our use of the `sock_alloc`
kprobe. I used ftrace on a simple curl command to see what other options
could be used (`sudo trace-cmd record -F -p function_graph
http://google.com`). The `socket` syscall calls `sock_alloc`, which
would be another mechanism for accessing the `struct sock`. I decided
against this approach because I don't think its viable to assume that
the same thread/process that calls `socket` will be the one that does
the later syscalls (how our BPF maps are set up). It's common to have a
forking web server model, which means a different process/thread can
call `socket` than the ones that later read/write to it.

**Probe stability**

These probes appear to be stable from our oldest and newest supported
kernel. These functions exist in the
[tcp_prot](https://elixir.bootlin.com/linux/v4.14.336/source/net/ipv4/tcp_ipv4.c#L2422),
[tcpv6_prot](https://elixir.bootlin.com/linux/v4.14.336/source/net/ipv6/tcp_ipv6.c#L1941)
structs and I've seen that other projects and bcc tools use these
probes. This makes me believe that these functions have a pretty well
defined interface.

Relevant Issues: pixie-io#1829, pixie-io#1638

Type of change: /kind feature

Test Plan: New tests verify that ipv4 and ipv6 cases work
- [x] Ran `for i in $(seq 0 1000); do curl http://google.com/$i; sleep
2; done` within a pod and verified that `local_addr` is populated with
this change and `px.pod_id_to_pod_name(px.ip_to_pod_id(df.local_addr))`
works for pod name resolution.

- [x] Verified the above curl test results in traces without
`local_addr` without this change

![local-addr-testing](https://github.com/user-attachments/assets/344be022-97a0-4096-8af7-8de20d741e40)
- Tested on the following k8s offerings and machine images
- [x] GKE COS and Ubuntu
- [x] EKS Amazon Linux 2

Changelog Message: Populate socket tracer data table `local_addr` and
`local_port` column for client side traces.

---------

Signed-off-by: Dom Del Nano <[email protected]>
GitOrigin-RevId: 0c1fdd2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants