
Error scanning namespace workloads if there are batch jobs running on it #18

Open
saholo21 opened this issue Aug 31, 2023 · 10 comments

@saholo21

saholo21 commented Aug 31, 2023

I am trying to scan a cluster that has different kinds of workloads (deployments, pods, statefulsets, batch jobs, etc.). However, when the scan finishes I always get the same error: "jobs.batch not found. The following table may be incomplete due to errors detected during the run." The table returns only a single row, covering the kube-system namespace, and none of the other workloads, which amount to more than 300. I believe this happens because some jobs are running when the scan starts but finish during the scan (as they are meant to do), and the plugin interprets this as an error. Is there any workaround for this problem?

Input = kubectl dds

Output =
error: [jobs.batch "job1" not found, jobs.batch "job2" not found, jobs.batch "job3" not found, jobs.batch "job4" not found]
Warning: The following table may be incomplete due to errors detected during the run
NAMESPACE     TYPE        NAME       STATUS
kube-system   daemonset   aws-node   mounted

@rothgar
Contributor

rothgar commented Sep 1, 2023

Do you have a yaml example of the workload you're running?

@saholo21
Author

saholo21 commented Sep 4, 2023

No, I don't have access to the batch jobs' YAML. Is there any possibility of running kubectl dds only for certain types of workloads? I.e., run it only for deployments, then only for statefulsets, and so on, to avoid the batch jobs scanning error.

@rothgar
Contributor

rothgar commented Sep 5, 2023

That might be difficult to implement because of the way it works: it scans all pods and then looks up each pod's parent. It doesn't have a way to start with deployments and work its way down to the pods.

If I implemented this, what types of flags would you want? --scan-resource=deployment or --skip=job? It would get complicated to add both options, and I would need something that could be the default behavior, e.g. --scan-type=all, but either way I still have to scan all pods in the cluster and inspect what owns them.
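
For readers following along, here is a minimal sketch of the scan order described above, assuming client-go. This is not the plugin's actual code, and the skip map only illustrates where a hypothetical --skip=job flag could hook in:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig the same way kubectl does (illustrative setup).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Hypothetical --skip=job: owner kinds that would not be followed.
	skip := map[string]bool{"Job": true}

	// The scan starts from pods in every namespace, then walks up to each owner.
	pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		for _, owner := range pod.OwnerReferences {
			if skip[owner.Kind] {
				continue // owner type skipped, but the pod itself was still listed
			}
			fmt.Printf("%s/%s is owned by %s %s\n",
				pod.Namespace, pod.Name, owner.Kind, owner.Name)
		}
	}
}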

@saholo21
Author

saholo21 commented Sep 5, 2023

Understood. The type of flag that would fit best for this case would be --skip=job, because that's the only workload type I'm having issues with. However, do you know what could be happening? I mean, some jobs are running but then finish during the scan, as they are meant to do, and the plugin detects this as an error. Is that expected behavior? Thanks for answering.

@rothgar
Contributor

rothgar commented Sep 6, 2023

I'm not too sure what would be causing it without being able to replicate the problem or see the job spec with something like kubectl get job job1 --output yaml.

What version of Kubernetes are you using?

@saholo21
Author

saholo21 commented Sep 7, 2023

I was able to get one of the job workloads that's throwing the error.
I am using Kubernetes version 1.23.
Let me know if that helps.

apiVersion: batch/v1
kind: Job
metadata:
  creationTimestamp: "2023-09-05T11:55:32Z"
  generation: 1
  labels:
    controller-uid: 80fef74c-a01f-4059-b345-d9238c974bec
    job-name: populate-analytic-data-aws-28231914
  name: populate-analytic-data-aws-28231914
  namespace: default
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: CronJob
    name: populate-analytic-data-aws
    uid: 4bb57997-3256-4197-b36d-3172c50732a8
  resourceVersion: "1177585793"
  uid: 80fef74c-a01f-4059-b345-d9238c974bec
spec:
  activeDeadlineSeconds: 10000
  backoffLimit: 3
  completionMode: NonIndexed
  completions: 1
  parallelism: 1
  selector:
    matchLabels:
      controller-uid: 80fef74c-a01f-4059-b345-d9238c974bec
  suspend: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        controller-uid: 80fef74c-a01f-4059-b345-d9238c974bec
        job-name: populate-analytic-data-aws-28231914
    spec:
      containers:
      - args:
        - --botName
        - populate-analytic-data
        - --cassandra
        - cassandra-traffic-04.internal.company.com,cassandra-traffic-02.internal.company.com,cassandra-traffic-03.internal.company.com
        - --keyspace
        - traffic
        - --threads
        - "4"
        - --env
        - staging
        env:
        - name: ENV
          value: staging
        - name: log_level
          value: DEBUG
        image: 111111111111.dkr.ecr.us-east-1.amazonaws.com/populate-analytic-data:4.53-reporting
        imagePullPolicy: IfNotPresent
        name: docker
        resources:
          limits:
            cpu: 450m
            memory: 2000Mi
          requests:
            cpu: 250m
            memory: 1400Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  conditions:
  - lastProbeTime: "2023-09-05T14:42:12Z"
    lastTransitionTime: "2023-09-05T14:42:12Z"
    message: Job was active longer than specified deadline
    reason: DeadlineExceeded
    status: "True"
    type: Failed
  failed: 1
  startTime: "2023-09-05T11:55:32Z"

@saholo21
Author

Hi @rothgar, is there any update on this?

@rothgar
Contributor

rothgar commented Sep 12, 2023

Thank you for the example. I'm sorry I haven't been able to test this yet. I'm preparing for some work travel and conference talks, plus other priorities at work.

@saholo21
Author

Hi @rothgar. Just a quick question to confirm something: when the error message only lists some jobs and the final warning says "The following table may be incomplete due to errors detected during the run", does that mean the result may be incomplete only because those jobs were not scanned (so it isn't known whether they mount docker.sock), or could the error with the jobs have stopped the scanning of the other workloads (deployments, daemonsets, statefulsets, etc.)?

@rothgar
Contributor

rothgar commented Sep 15, 2023

It should continue with other jobs and workload types. It doesn't exit the app. It appends the error and continues.
https://github.com/aws-containers/kubectl-detector-for-docker-socket/blob/main/main.go#L270-L273
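
For what it's worth, here is a sketch of how that spot could tolerate the race, assuming client-go and the apimachinery errors package. It is illustrative only, not the plugin's actual code or a committed fix:

package scanner

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// lookupOwningJob is a hypothetical helper: it resolves a pod's owning Job and
// treats NotFound as "the Job already finished and was cleaned up" instead of
// an error worth appending to the report.
func lookupOwningJob(ctx context.Context, clientset kubernetes.Interface,
	namespace, name string) (*batchv1.Job, error) {

	job, err := clientset.BatchV1().Jobs(namespace).Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// The Job completed (or its CronJob cleaned it up) between the pod
		// listing and this lookup; skip it rather than reporting an error.
		return nil, nil
	}
	if err != nil {
		return nil, err // other errors are still surfaced as before
	}
	return job, nil
}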
