Reclaim action has a queue-push bug that may make resource reclaiming ineffective #3003

Open
RamezesDong opened this issue Jul 27, 2023 · 2 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@RamezesDong

What happened:
Queue A shares some GPU and CPU resources with Queue B, and no resources are left for Queue A's jobs. Reclaim should free some, but it has no effect. I checked the volcano-scheduler logs: the jobs that need resources never reach the node-predicate and pod-eviction stages. A bug in reclaim.go causes this.

What you expected to happen:
Reclaim happens normally.

How to reproduce it (as minimally and precisely as possible):
The cause of the bug is in this code:

// Found "high" priority task to reclaim others
if tasks, found := preemptorTasks[job.UID]; !found || tasks.Empty() {
	continue
} else {
	task = tasks.Pop().(*api.TaskInfo)
}

if !ssn.Allocatable(queue, task) {
	klog.V(3).Infof("Queue <%s> is overused when considering task <%s>, ignore it.", queue.Name, task.Name)
	continue
}

if err := ssn.PrePredicateFn(task); err != nil {
	klog.V(3).Infof("PrePredicate for task %s/%s failed for: %v", task.Namespace, task.Name, err)
	continue
}

When ssn.Allocatable() returns false or the preemptor's task list is empty, the loop continues immediately. Suppose the queue has multiple jobs: the first job has no tasks and the others have preemptor tasks. The check fails for the first job, and the queue's other jobs are skipped. This is how the bug happens.
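To make the skipped-jobs behavior concrete, here is a minimal, self-contained sketch. It does not use Volcano's real types: `queue`, `job`, `reclaim`, `newQueues`, and the `repush` flag are all hypothetical names invented for illustration. It also assumes, going by the issue title, that the early `continue` leaves the loop iteration before the queue is put back for another pass. With `repush` set to false the sketch mimics the reported behavior; with true it shows one possible remedy, which is not necessarily what the eventual fix does.

```go
package main

import "fmt"

// job is a hypothetical stand-in for a scheduler job with pending preemptor tasks.
type job struct {
	name  string
	tasks []string
}

// queue is a hypothetical stand-in for a scheduler queue holding pending jobs.
type queue struct {
	name string
	jobs []job
}

// reclaim returns the tasks that would reach the predicate/evict stage.
// repush=false mimics the reported bug: a job with no tasks triggers an early
// `continue` and the queue is never put back, so its remaining jobs are skipped.
// repush=true puts the queue back before continuing (an assumed remedy).
func reclaim(queues []*queue, repush bool) []string {
	var reached []string
	for len(queues) > 0 {
		q := queues[0]
		queues = queues[1:]
		if len(q.jobs) == 0 {
			continue // queue is drained, nothing left to consider
		}
		j := q.jobs[0]
		q.jobs = q.jobs[1:]
		if len(j.tasks) == 0 {
			if repush {
				queues = append(queues, q) // give the queue's other jobs a turn
			}
			continue
		}
		reached = append(reached, j.name+"/"+j.tasks[0])
		j.tasks = j.tasks[1:]
		if len(j.tasks) > 0 {
			q.jobs = append(q.jobs, j) // job still has tasks, keep it
		}
		queues = append(queues, q)
	}
	return reached
}

// newQueues builds the scenario from the report: the first job has no tasks,
// the second job has preemptor tasks.
func newQueues() []*queue {
	return []*queue{{
		name: "queue-a",
		jobs: []job{
			{name: "job-1"},
			{name: "job-2", tasks: []string{"task-1", "task-2"}},
		},
	}}
}

func main() {
	fmt.Println("without re-push:", reclaim(newQueues(), false))
	fmt.Println("with re-push:   ", reclaim(newQueues(), true))
}
```

In this sketch the run without re-push reaches no tasks at all, while the run with re-push reaches both tasks of job-2, which matches the symptom described above.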

Anything else we need to know?:

Environment:

  • Volcano Version: latest version
  • Kubernetes version (use kubectl version): v1.22.5-tke.8
@RamezesDong RamezesDong added the kind/bug Categorizes issue or PR as related to a bug. label Jul 27, 2023
@RamezesDong
Author

This issue can be assigned to me.

@RamezesDong
Author

fix #3004