What happened:
Queue A shares some GPU and CPU resources with Queue B, and there are no resources left for Queue A's jobs. However, resource reclaiming did not take effect. Checking the volcano-scheduler logs, the jobs that need resources never reach the node-predicating and pod-evicting stages. There is a bug in reclaim.go that causes this unexpected behavior.
What you expected to happen:
Reclaim happens normally.
How to reproduce it (as minimally and precisely as possible):
The cause of this bug is here, in reclaim.go:
// Found "high" priority task to reclaim othersiftasks, found:=preemptorTasks[job.UID]; !found||tasks.Empty() {
continue
} else {
task=tasks.Pop().(*api.TaskInfo)
}
if!ssn.Allocatable(queue, task) {
klog.V(3).Infof("Queue <%s> is overused when considering task <%s>, ignore it.", queue.Name, task.Name)
continue
}
iferr:=ssn.PrePredicateFn(task); err!=nil {
klog.V(3).Infof("PrePredicate for task %s/%s failed for: %v", task.Namespace, task.Name, err)
continue
}
When ssn.Allocatable() returns false, or the preemptor's tasks are empty, the code continues immediately. Suppose the queue has multiple jobs, the first job has no tasks, and the other jobs do have preemptor tasks: the check for the first job fails, and the remaining jobs are skipped as well. This is how the bug happens.
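To make the control flow concrete, below is a minimal, self-contained sketch of the pattern (not the actual Volcano code; the job/queue types, names, and values are hypothetical). It models an outer loop that pops one queue and one job per iteration and only pushes the queue back on the success path, so hitting continue for an empty first job silently drops the remaining jobs of that queue.

// Sketch of the described control-flow problem, under the assumption that
// the queue is only re-pushed after a successful reclaim attempt.
package main

import "fmt"

type job struct {
    name  string
    tasks int // pending preemptor tasks; 0 models "tasks.Empty()"
}

func main() {
    // One queue with three jobs; only the first job has no preemptor tasks.
    queues := [][]job{{{"job-a", 0}, {"job-b", 2}, {"job-c", 1}}}

    for len(queues) > 0 {
        // Pop a queue.
        q := queues[0]
        queues = queues[1:]

        // Pop the "highest priority" job from that queue.
        j := q[0]
        q = q[1:]

        if j.tasks == 0 {
            // BUG: continuing here without re-pushing the rest of the queue
            // drops job-b and job-c, so reclaim never runs for them.
            // A possible fix is to push the remaining jobs back first:
            // queues = append(queues, q)
            continue
        }

        fmt.Printf("would try to reclaim resources for %s\n", j.name)
        if len(q) > 0 {
            queues = append(queues, q) // success path re-pushes the queue
        }
    }
}

Running this prints nothing, even though job-b and job-c still have preemptor tasks; re-pushing the remaining queue before the continue (as noted in the comment) would let them be processed.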
Anything else we need to know?:
Environment:
Volcano Version: latest version
Kubernetes version (use kubectl version): v1.22.5-tke.8