Kubernetes pod labels are sometimes missing #1775

Open
Namnamseo opened this issue Apr 2, 2024 · 5 comments
Labels: kind/bug, lifecycle/rotten

Comments

Namnamseo commented Apr 2, 2024

Describe the bug

Hi!
I'm using Falco to monitor specific syscalls from Kubernetes pods on a GKE cluster.

It seemed to work well at first, but I've noticed that some events had incomplete fields.
These events:

  • do not have k8s.pod.labels (shows up as <NA>)
  • do not have k8s.pod.label[some.valid/label] (shows up as <NA>)
  • however, have:
    • k8s.ns.name
    • k8s.pod.cni.json
    • k8s.pod.name
    • container.id
    • container.name
    • container.image.repository
    • container.image.tag

Upon inspecting the logs, I think a pod sandbox status query sometimes fails, and the container then stays in that failed state.

cri (21sa65q9wwiq): Performing lookup
cri_async (21sa65q9wwiq): Starting synchronous lookup
cri (21sa65q9wwiq): Status from ContainerStatus: (an error occurred when try to find container "21sa65q9wwiq": not found)
cri (21sa65q9wwiq): id is neither a container nor a pod sandbox: an error occurred when try to find container "21sa65q9wwiq": not found
cri (21sa65q9wwiq): Failed to get metadata, returning successful=false
cri_async (21sa65q9wwiq): Source callback result=2
notify_new_container (21sa65q9wwiq): created CONTAINER_JSON event, queuing to inspector
...
(and then this line repeats:)
Checking IP address of container ad8d174831f6 with incomplete metadata (in context of a6fc074bf49c; state=2)
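
For context, the flow these logs suggest is roughly the following. This is my simplified sketch of the two-step lookup, with stand-in types and illustrative names, not the actual falcosecurity/libs source:

#include <string>

// Stand-in types for a sketch; the real code lives in falcosecurity/libs
// and talks to the CRI runtime over gRPC.
enum class lookup_state { NONE = 0, SUCCESSFUL = 1, FAILED = 2 };

struct rpc_result {
    bool ok;
    std::string error; // e.g. "not found"
};

// Hypothetical wrappers standing in for the CRI ContainerStatus and
// PodSandboxStatus calls (stubbed to always fail, as in my logs).
static rpc_result container_status(const std::string &) { return {false, "not found"}; }
static rpc_result pod_sandbox_status(const std::string &) { return {false, "not found"}; }

// Try the id as a container first; if that fails, fall back to treating
// it as a pod sandbox (e.g. the pause container).
lookup_state lookup_metadata(const std::string &id)
{
    rpc_result status = container_status(id);
    if (status.ok) {
        return lookup_state::SUCCESSFUL; // "Source callback result=1"
    }

    rpc_result status_pod = pod_sandbox_status(id);
    if (status_pod.ok) {
        return lookup_state::SUCCESSFUL;
    }

    // "id is neither a container nor a pod sandbox": the lookup is
    // recorded as failed ("result=2" / "state=2" in the logs above),
    // so the pod labels are never filled in.
    return lookup_state::FAILED;
}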

One thing I noticed: when I restart the Falco pod on that node, the labels are parsed fine.

My weak guesses (after quickly skimming through what I've seen) are:

  • a timing issue? (the pod sandbox had just been created; maybe we query its status too soon?)
  • cache misbehavior? (the pod sandbox later becomes queryable, but we keep insisting on our first impression that it is "neither a container nor a pod sandbox"? see the sketch after this list)
  • a whole other issue in containerd?
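
To make the second guess concrete, here is a purely hypothetical sketch (not the actual libs cache) of how a cache could pin the first failed result:

#include <string>
#include <unordered_map>

enum class lookup_state { NONE = 0, SUCCESSFUL = 1, FAILED = 2 };

// Purely hypothetical sketch of the suspected misbehavior: once a lookup
// is recorded as FAILED, later events for the same container id hit the
// cached entry and never trigger a retry, even after the pod sandbox has
// become queryable.
class container_cache {
public:
    bool needs_lookup(const std::string &id) const
    {
        // Only ids we have never seen get a fresh lookup; a FAILED entry
        // short-circuits here and keeps its "first impression" forever.
        return m_state.find(id) == m_state.end();
    }

    void record(const std::string &id, lookup_state s) { m_state[id] = s; }

private:
    std::unordered_map<std::string, lookup_state> m_state;
};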

One subtle issue: in these two log lines,

cri (21sa65q9wwiq): Status from ContainerStatus: (an error occurred when try to find container "21sa65q9wwiq": not found)
cri (21sa65q9wwiq): id is neither a container nor a pod sandbox: an error occurred when try to find container "21sa65q9wwiq": not found

the latter should explain why the PodSandboxStatus call (not ContainerStatus) failed, but there was a bug in libs v0.14.3: it logged the status variable defined earlier instead of status_pod.
This seems to have been fixed in the extensive refactoring since.
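
Roughly, the bug looked like this (my illustrative sketch; apart from the names status and status_pod, nothing here is the actual libs source):

#include <cstdio>
#include <string>

// Minimal stand-in for a gRPC status object.
struct grpc_status { std::string error_message; };

void log_sandbox_failure(const grpc_status &status,      // from ContainerStatus
                         const grpc_status &status_pod)  // from PodSandboxStatus
{
    // Buggy (v0.14.3): reuses the earlier ContainerStatus error, so the
    // message can never explain why PodSandboxStatus failed.
    std::printf("id is neither a container nor a pod sandbox: %s\n",
                status.error_message.c_str());

    // Fixed: reports the PodSandboxStatus error instead.
    std::printf("id is neither a container nor a pod sandbox: %s\n",
                status_pod.error_message.c_str());
}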

How to reproduce it

Sorry, I couldn't reproduce this consistently. It occurs from time to time with no discernible pattern.

Expected behaviour

k8s.pod.labels and k8s.pod.label[some.valid/label] are always filled.

Also, the log should look like this. (This is the log when everything is normal and the above fields are filled.)

cri (b4jswe6vtwr9): Performing lookup
cri_async (b4jswe6vtwr9): Starting synchronous lookup
cri (b4jswe6vtwr9): Status from ContainerStatus: (an error occurred when try to find container "b4jswe6vtwr9": not found)
cri_async (b4jswe6vtwr9): Source callback result=1  
identify_category (131398) (pause): initial process for container, assigning CAT_CONTAINER   
adding container [b4jswe6vtwr9] group: 65535

Environment

  • Falco version: 0.37.1
  • System info:
{
  "machine": "x86_64",
  "nodename": "falco-6kh8r",
  "release": "5.15.0-1049-gke",
  "sysname": "Linux",
  "version": "#54-Ubuntu SMP Thu Jan 18 02:57:35 UTC 2024"
}
  • Cloud provider or hardware configuration
    • GKE, Kubernetes v1.27.11-gke.1202000
    • $ ctr version says 1.7.12-0ubuntu0~22.04.1~gke1
  • OS: Ubuntu 22.04.3 LTS
  • Kernel: 5.15.0-1049-gke
  • Installation method: falcosecurity/falco Helm chart (of version 4.2.3)
    • additional options: --disable-cri-async
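
For completeness, this is roughly how I pass that option through the chart (from memory; if I recall correctly the chart exposes an extra.args value for additional Falco flags):

# Roughly my install command; exact values trimmed.
helm upgrade --install falco falcosecurity/falco \
  --version 4.2.3 \
  --namespace falco \
  --set "extra.args={--disable-cri-async}"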

Additional context

These are some logs I found relevant (the long JSON lines are truncated in this capture).

Mesos container [21sa65q9wwiq],thread [365150], has likely malformed mesos task id [], ignoring
cri (21sa65q9wwiq): Performing lookup
cri_async (21sa65q9wwiq): Starting synchronous lookup
cri (21sa65q9wwiq): Status from ContainerStatus: (an error occurred when try to find container "21sa65q9wwiq": not found)
cri (21sa65q9wwiq): id is neither a container nor a pod sandbox: an error occurred when try to find container "21sa65q9wwiq": not found
cri (21sa65q9wwiq): Failed to get metadata, returning successful=false
cri_async (21sa65q9wwiq): Source callback result=2
notify_new_container (21sa65q9wwiq): created CONTAINER_JSON event, queuing to inspector
adding container [21sa65q9wwiq] user 0
adding container [21sa65q9wwiq] group: 0
Parsing Container JSON={"container":{"Mounts":[],"User":"<NA>","cni_json":"","cpu_period":100000,"cpu_quota":0,"cpu_shares":1024,"cpuset_cpu_count":0,"created_time":19390065,
4831f6","image":"","imagedigest":"","imageid":"","imagerepo":"","imagetag":"","ip":"0.0.0.0","is_pod_sandbox":false,"labels":null,"lookup_state":2,"memory_limit":0,"metadata_
gs":[],"privileged":false,"swap_limit":0,"type":7}}
Filtering container event for failed lookup of 21sa65q9wwiq (but calling callbacks anyway)
identify_category (365151) (runc:[1:CHILD]): initial process for container, assigning CAT_CONTAINER
adding container [21sa65q9wwiq] user 0
adding container [21sa65q9wwiq] group: 0
adding container [21sa65q9wwiq] user 0
adding container [21sa65q9wwiq] group: 0

(NOTE: process tree is as follows.)
365130 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 21sa65q9wwiq...
365151 \_ /pause
...
Namnamseo added the kind/bug label on Apr 2, 2024

alacuku (Member) commented Apr 2, 2024

Hi @Namnamseo, the latest Falco release ships a new component called k8s-metacollector, which was developed for exactly such use cases. It reduces the cases where pod metadata is missing.

Here you can find the docs on how to install it using the falco chart: https://github.com/falcosecurity/charts/tree/master/charts/falco#k8s-metacollector
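
In short, enabling it through the chart is along these lines (see the linked docs for the exact values):

# Deploys the k8s-metacollector and enables the k8smeta plugin
# (per the chart docs linked above; verify the flag names there).
helm upgrade --install falco falcosecurity/falco \
  --namespace falco --create-namespace \
  --set collectors.kubernetes.enabled=true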

Namnamseo (Author) commented Apr 2, 2024

Right, I've seen those. A standalone metadata collector would really improve overall stability.

I only need the pod labels, so I was wondering if this can be done with just the container runtime integration.

incertum (Contributor) commented

@Namnamseo once Falco 0.38.0 is out (very soon), it would be interesting to see whether the container runtime socket info extraction works better, since we improved it a bit. And as @alacuku stated, you also have the option to use the new k8s plugin.

poiana (Contributor) commented Aug 13, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

poiana (Contributor) commented Sep 12, 2024

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten
