Pending status for cow-job #202
Comments
Have you figured it out? I'm having the same problem: it seems like the pods aren't being assigned to the virtual-kubelets, despite ensuring the virtual kubelets have both labels:
Hi, yeah, you need to change the nodeSelector in your cow-job config file to one of the existing nodes (e.g., slurm-qpod3-cn01-cpn).
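For reference, a minimal sketch of what that change could look like in examples/cow.yaml, assuming the SlurmJob spec takes an ordinary Kubernetes nodeSelector map; the label key and node name below are placeholders taken from this thread, not values from the repo:

    # examples/cow.yaml (fragment) -- only the scheduling-related part shown.
    # Replace the value with a label that actually exists on one of your
    # virtual-kubelet nodes (check with: kubectl get nodes --show-labels).
    spec:
      nodeSelector:
        kubernetes.io/hostname: slurm-qpod3-cn01-cpn

After editing, re-apply the manifest (kubectl apply -f examples/cow.yaml) and check where the pod landed with kubectl get pod cow-job -o wide.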
@pisarukv I really appreciate the response. I assume you're referring to the "virtual-kubelet" node? The slurmjob's pod still isn't being assigned to any node, even after adding the suggested nodeSelector. If it's not too much trouble, would you mind showing me the output for the following commands? Thanks again.
Yes, I'm referring to the virtual-kubelet nodes.
Many thanks! I think my issue may stem from the fact that I was running the k8s master and the slurm master (with slurmctld) on the same node. I've set up a separate test environment apart from our dev environment and got it working. Thanks again!
I was trying to run the cow-job after setting up the environment with the following commands:
vagrant up && vagrant ssh k8s-master
kubectl apply -f examples/cow.yaml
but when I run
kubectl get pods
my cow-job is "Pending":
NAME                           READY   STATUS    RESTARTS   AGE
cow-job                        0/1     Pending   0          13s
wlm-operator-ffddd8795-lz98t   1/1     Running   0          16m
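For anyone hitting the same Pending state, a generic way to see why the scheduler is not placing the pod (standard kubectl, not specific to this operator) is to read the pod's events and compare them with the node labels the job selects on:

    kubectl describe pod cow-job      # the Events section reports why scheduling failed
    kubectl get nodes --show-labels   # confirm which labels the virtual-kubelet nodes carry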