What happened:
Queue A shares some GPU and CPU resources with Queue B, and there are no resources left for Queue A's jobs. However, resource reclaiming did not take effect. Checking the volcano-scheduler logs, the jobs that need resources never reach the node-predicating and pod-evicting stages. There is a bug in reclaim.go that causes this unexpected behavior.
What you expected to happen:
Reclaim happens normally.
How to reproduce it (as minimally and precisely as possible):
The cause of this bug is here, in reclaim.go:
// Found "high" priority task to reclaim othersiftasks, found:=preemptorTasks[job.UID]; !found||tasks.Empty() {
continue
} else {
task=tasks.Pop().(*api.TaskInfo)
}
if!ssn.Allocatable(queue, task) {
klog.V(3).Infof("Queue <%s> is overused when considering task <%s>, ignore it.", queue.Name, task.Name)
continue
}
iferr:=ssn.PrePredicateFn(task); err!=nil {
klog.V(3).Infof("PrePredicate for task %s/%s failed for: %v", task.Namespace, task.Name, err)
continue
}
When ssn.Allocatable() returns false, or the preemptor's tasks are empty, the code continues immediately. Suppose the queue has multiple jobs, the first job has no tasks, and the other jobs do have preemptor tasks: the check for the first job fails, and the remaining jobs are skipped as well. This is how the bug happens.
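To make the control flow concrete, below is a minimal, self-contained sketch of the pattern (not the actual Volcano code; the job/queue types, names, and values are hypothetical). It models an outer loop that pops one queue and one job per iteration and only pushes the queue back on the success path, so hitting continue for an empty first job silently drops the remaining jobs of that queue.

// Sketch of the described control-flow problem, under the assumption that
// the queue is only re-pushed after a successful reclaim attempt.
package main

import "fmt"

type job struct {
    name  string
    tasks int // pending preemptor tasks; 0 models "tasks.Empty()"
}

func main() {
    // One queue with three jobs; only the first job has no preemptor tasks.
    queues := [][]job{{{"job-a", 0}, {"job-b", 2}, {"job-c", 1}}}

    for len(queues) > 0 {
        // Pop a queue.
        q := queues[0]
        queues = queues[1:]

        // Pop the "highest priority" job from that queue.
        j := q[0]
        q = q[1:]

        if j.tasks == 0 {
            // BUG: continuing here without re-pushing the rest of the queue
            // drops job-b and job-c, so reclaim never runs for them.
            // A possible fix is to push the remaining jobs back first:
            // queues = append(queues, q)
            continue
        }

        fmt.Printf("would try to reclaim resources for %s\n", j.name)
        if len(q) > 0 {
            queues = append(queues, q) // success path re-pushes the queue
        }
    }
}

Running this prints nothing, even though job-b and job-c still have preemptor tasks; re-pushing the remaining queue before the continue (as noted in the comment) would let them be processed.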
Anything else we need to know?:
Environment:
Volcano Version: latest version
Kubernetes version (use kubectl version): v1.22.5-tke.8