Kubernetes pod labels are sometimes missing #1775

Open
Namnamseo opened this issue Apr 2, 2024 · 5 comments
Labels: kind/bug, lifecycle/rotten

Comments

Namnamseo commented Apr 2, 2024

Describe the bug

Hi!
I'm using Falco to monitor specific syscalls from Kubernetes pods on a GKE cluster.

It seemed to work well at first, but I've noticed that some events had incomplete fields.
These events:

  • do not have k8s.pod.labels (shows up as <NA>)
  • do not have k8s.pod.label[some.valid/label] (shows up as <NA>)
  • however, have:
    • k8s.ns.name
    • k8s.pod.cni.json
    • k8s.pod.name
    • container.id
    • container.name
    • container.image.repository
    • container.image.tag

Upon inspecting the logs, I think a pod sandbox status query sometimes fails, and the container then stays in that failed state.

cri (21sa65q9wwiq): Performing lookup
cri_async (21sa65q9wwiq): Starting synchronous lookup
cri (21sa65q9wwiq): Status from ContainerStatus: (an error occurred when try to find container "21sa65q9wwiq": not found)
cri (21sa65q9wwiq): id is neither a container nor a pod sandbox: an error occurred when try to find container "21sa65q9wwiq": not found
cri (21sa65q9wwiq): Failed to get metadata, returning successful=false
cri_async (21sa65q9wwiq): Source callback result=2
notify_new_container (21sa65q9wwiq): created CONTAINER_JSON event, queuing to inspector
...
(and then this line repeats:)
Checking IP address of container ad8d174831f6 with incomplete metadata (in context of a6fc074bf49c; state=2)
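
For context, the flow these logs suggest is roughly the following. This is my simplified sketch of the two-step lookup, with stand-in types and illustrative names, not the actual falcosecurity/libs source:

#include <string>

// Stand-in types for a sketch; the real code lives in falcosecurity/libs
// and talks to the CRI runtime over gRPC.
enum class lookup_state { NONE = 0, SUCCESSFUL = 1, FAILED = 2 };

struct rpc_result {
    bool ok;
    std::string error; // e.g. "not found"
};

// Hypothetical wrappers standing in for the CRI ContainerStatus and
// PodSandboxStatus calls (stubbed to always fail, as in my logs).
static rpc_result container_status(const std::string &) { return {false, "not found"}; }
static rpc_result pod_sandbox_status(const std::string &) { return {false, "not found"}; }

// Try the id as a container first; if that fails, fall back to treating
// it as a pod sandbox (e.g. the pause container).
lookup_state lookup_metadata(const std::string &id)
{
    rpc_result status = container_status(id);
    if (status.ok) {
        return lookup_state::SUCCESSFUL; // "Source callback result=1"
    }

    rpc_result status_pod = pod_sandbox_status(id);
    if (status_pod.ok) {
        return lookup_state::SUCCESSFUL;
    }

    // "id is neither a container nor a pod sandbox": the lookup is
    // recorded as failed ("result=2" / "state=2" in the logs above),
    // so the pod labels are never filled in.
    return lookup_state::FAILED;
}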

One thing I noticed: when I restart the Falco pod on that node, the labels are parsed fine.

My weak guesses (after quickly skimming through what I've seen) are:

  • a timing issue? (the pod sandbox had just been created; maybe we query its status too soon?)
  • cache misbehavior? (the pod sandbox later becomes queryable, but we keep insisting on our first impression that it is "neither a container nor a pod sandbox"? see the sketch after this list)
  • a whole other issue in containerd?
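
To make the second guess concrete, here is a purely hypothetical sketch (not the actual libs cache) of how a cache could pin the first failed result:

#include <string>
#include <unordered_map>

enum class lookup_state { NONE = 0, SUCCESSFUL = 1, FAILED = 2 };

// Purely hypothetical sketch of the suspected misbehavior: once a lookup
// is recorded as FAILED, later events for the same container id hit the
// cached entry and never trigger a retry, even after the pod sandbox has
// become queryable.
class container_cache {
public:
    bool needs_lookup(const std::string &id) const
    {
        // Only ids we have never seen get a fresh lookup; a FAILED entry
        // short-circuits here and keeps its "first impression" forever.
        return m_state.find(id) == m_state.end();
    }

    void record(const std::string &id, lookup_state s) { m_state[id] = s; }

private:
    std::unordered_map<std::string, lookup_state> m_state;
};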

One subtle issue: in these two log lines,

cri (21sa65q9wwiq): Status from ContainerStatus: (an error occurred when try to find container "21sa65q9wwiq": not found)
cri (21sa65q9wwiq): id is neither a container nor a pod sandbox: an error occurred when try to find container "21sa65q9wwiq": not found

the latter should explain why the PodSandboxStatus call (not ContainerStatus) failed, but there was a bug in libs v0.14.3: it logged the status variable defined earlier instead of status_pod.
This seems to have been fixed in the extensive refactoring since.
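
Roughly, the bug looked like this (my illustrative sketch; apart from the names status and status_pod, nothing here is the actual libs source):

#include <cstdio>
#include <string>

// Minimal stand-in for a gRPC status object.
struct grpc_status { std::string error_message; };

void log_sandbox_failure(const grpc_status &status,      // from ContainerStatus
                         const grpc_status &status_pod)  // from PodSandboxStatus
{
    // Buggy (v0.14.3): reuses the earlier ContainerStatus error, so the
    // message can never explain why PodSandboxStatus failed.
    std::printf("id is neither a container nor a pod sandbox: %s\n",
                status.error_message.c_str());

    // Fixed: reports the PodSandboxStatus error instead.
    std::printf("id is neither a container nor a pod sandbox: %s\n",
                status_pod.error_message.c_str());
}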

How to reproduce it

Sorry, I couldn't reproduce this consistently. It occurs from time to time with no discernible pattern.

Expected behaviour

k8s.pod.labels and k8s.pod.label[some.valid/label] are always filled.

Also, the log should look like this. (This is the log when everything is normal and the above fields are filled.)

cri (b4jswe6vtwr9): Performing lookup
cri_async (b4jswe6vtwr9): Starting synchronous lookup
cri (b4jswe6vtwr9): Status from ContainerStatus: (an error occurred when try to find container "b4jswe6vtwr9": not found)
cri_async (b4jswe6vtwr9): Source callback result=1  
identify_category (131398) (pause): initial process for container, assigning CAT_CONTAINER   
adding container [b4jswe6vtwr9] group: 65535

Environment

  • Falco version: 0.37.1
  • System info:
{
  "machine": "x86_64",
  "nodename": "falco-6kh8r",
  "release": "5.15.0-1049-gke",
  "sysname": "Linux",
  "version": "#54-Ubuntu SMP Thu Jan 18 02:57:35 UTC 2024"
}
  • Cloud provider or hardware configuration
    • GKE, Kubernetes v1.27.11-gke.1202000
    • $ ctr version says 1.7.12-0ubuntu0~22.04.1~gke1
  • OS: Ubuntu 22.04.3 LTS
  • Kernel: 5.15.0-1049-gke
  • Installation method: falcosecurity/falco Helm chart (of version 4.2.3)
    • additional options: --disable-cri-async
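
For completeness, this is roughly how I pass that option through the chart (from memory; if I recall correctly the chart exposes an extra.args value for additional Falco flags):

# Roughly my install command; exact values trimmed.
helm upgrade --install falco falcosecurity/falco \
  --version 4.2.3 \
  --namespace falco \
  --set "extra.args={--disable-cri-async}"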

Additional context

These are some logs I found relevant (the long JSON lines are truncated in this capture).

Mesos container [21sa65q9wwiq],thread [365150], has likely malformed mesos task id [], ignoring
cri (21sa65q9wwiq): Performing lookup
cri_async (21sa65q9wwiq): Starting synchronous lookup
cri (21sa65q9wwiq): Status from ContainerStatus: (an error occurred when try to find container "21sa65q9wwiq": not found)
cri (21sa65q9wwiq): id is neither a container nor a pod sandbox: an error occurred when try to find container "21sa65q9wwiq": not found
cri (21sa65q9wwiq): Failed to get metadata, returning successful=false
cri_async (21sa65q9wwiq): Source callback result=2
notify_new_container (21sa65q9wwiq): created CONTAINER_JSON event, queuing to inspector
adding container [21sa65q9wwiq] user 0
adding container [21sa65q9wwiq] group: 0
Parsing Container JSON={"container":{"Mounts":[],"User":"<NA>","cni_json":"","cpu_period":100000,"cpu_quota":0,"cpu_shares":1024,"cpuset_cpu_count":0,"created_time":19390065,
4831f6","image":"","imagedigest":"","imageid":"","imagerepo":"","imagetag":"","ip":"0.0.0.0","is_pod_sandbox":false,"labels":null,"lookup_state":2,"memory_limit":0,"metadata_
gs":[],"privileged":false,"swap_limit":0,"type":7}}
Filtering container event for failed lookup of 21sa65q9wwiq (but calling callbacks anyway)
identify_category (365151) (runc:[1:CHILD]): initial process for container, assigning CAT_CONTAINER
adding container [21sa65q9wwiq] user 0
adding container [21sa65q9wwiq] group: 0
adding container [21sa65q9wwiq] user 0
adding container [21sa65q9wwiq] group: 0

(NOTE: process tree is as follows.)
365130 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 21sa65q9wwiq...
365151 \_ /pause
...
Namnamseo added the kind/bug label on Apr 2, 2024

alacuku (Member) commented Apr 2, 2024

Hi @Namnamseo, the latest Falco release ships a new component called k8s-metacollector, which was developed for exactly such use cases. It reduces the cases where pod metadata is missing.

Here you can find the docs on how to install it using the falco chart: https://github.com/falcosecurity/charts/tree/master/charts/falco#k8s-metacollector
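
In short, enabling it through the chart is along these lines (see the linked docs for the exact values):

# Deploys the k8s-metacollector and enables the k8smeta plugin
# (per the chart docs linked above; verify the flag names there).
helm upgrade --install falco falcosecurity/falco \
  --namespace falco --create-namespace \
  --set collectors.kubernetes.enabled=true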

Namnamseo (Author) commented Apr 2, 2024

Right, I've seen those. A standalone metadata collector would really improve overall stability.

I only need the pod labels, so I was wondering if this can be done with just the container runtime integration.

incertum (Contributor) commented

@Namnamseo once Falco 0.38.0 is out (very soon), it would be interesting to see whether the container runtime socket info extraction works better, since we improved it a bit. And as @alacuku stated, you also have the option to use the new k8s plugin.

poiana (Contributor) commented Aug 13, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

poiana (Contributor) commented Sep 12, 2024

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten
