Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: not remove podgroup uid will cause topology annotation to be useless #3711

Merged
merged 2 commits into from
Sep 14, 2024

Conversation

JesseStutler
Copy link
Member

Background

fix #3469.

In topology plugin affinityCheck, jobNamePrefix is podGroup.Name + '-', the podgroup's name created by vc-controller contains uid, but created pod is {vcjob.Name}+{pod.Name}+{index}, vcjob's name does not contain uid, so the topology annotation now is useless.

func affinityCheck(job *api.JobInfo, affinity [][]string) error {
if job == nil || affinity == nil {
return fmt.Errorf("empty input, job: %v, affinity: %v", job, affinity)
}
var taskNumber = len(job.Tasks)
var taskRef = make(map[string]bool, taskNumber)
var jobNamePrefix = job.Name + "-"
for _, task := range job.Tasks {
// the full task name looks like "${job name}-${task name}-${index}",
// so we can trim the jobNamePrefix and the indexSuffix to get the short task name.
tmpTaskName := task.Name[:strings.LastIndex(task.Name, "-")]
tmpTaskName = strings.TrimPrefix(tmpTaskName, jobNamePrefix)
if _, exist := taskRef[tmpTaskName]; !exist {
taskRef[tmpTaskName] = true
}
}
for _, aff := range affinity {
affTasks := make(map[string]bool, len(aff))
for _, task := range aff {
if len(task) == 0 {
continue
}
if _, exist := taskRef[task]; !exist {
return fmt.Errorf("task %s do not exist in job <%s/%s>", task, job.Namespace, job.Name)
}
if _, exist := affTasks[task]; exist {
return fmt.Errorf("task %s is duplicated in job <%s/%s>", task, job.Namespace, job.Name)
}
affTasks[task] = true
}
}
return nil
}

pgName := job.Name + "-" + string(job.UID)

podName := fmt.Sprintf(jobhelpers.PodNameFmt, job.Name, name, i)

Solution

use var jobNamePrefix = job.Name[:len(job.Name)-uidLength] instead, remove uid

Verification

Create a vcjob, has 2 replicas ps-task and 1 worker-task, ps-task and worker contains affinity, but ps-task and ps-task contains anti.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test
  annotations:
    volcano.sh/task-topology-affinity: "ps-task,worker-task"
    volcano.sh/task-topology-anti-affinity: "ps-task"
spec:
  schedulerName: volcano
  minAvailable: 3
  tasks:
    - replicas: 2
      name: "ps-task"
      template:
        spec:
          containers:
            - image: alpine
              command: ["/bin/sh", "-c", "sleep 3600"]
              imagePullPolicy: IfNotPresent
              name: running
              resources:
                requests:
                  cpu: "1"
          restartPolicy: OnFailure
    - replicas: 1
      name: "worker-task"
      template:
        spec:
          containers:
            - image: alpine
              command: ["/bin/sh", "-c", "sleep 3600"]
              imagePullPolicy: IfNotPresent
              name: running
              resources:
                requests:
                  cpu: "1"
          restartPolicy: OnFailure

The worker-task and ps-task are scheduled onto the same pod, but the other ps-task is on different node.
image

@volcano-sh-bot
Copy link
Contributor

Welcome @JesseStutler!

It looks like this is your first PR to volcano-sh/volcano.

Thank you, and welcome to Volcano. 😃

@volcano-sh-bot volcano-sh-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Sep 9, 2024
Copy link
Member

@hwdef hwdef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some UT

@lowang-bh
Copy link
Member

we should also consider the scene when only scheduler is used, without vc-controller.

@JesseStutler
Copy link
Member Author

we should also consider the scene when only scheduler is used, without vc-controller.

@lowang-bh So when there is no vc-controller, podgroup may be manually created by user? Any suggestion about how to distinguish it?

@lowang-bh
Copy link
Member

we should also consider the scene when only scheduler is used, without vc-controller.

@lowang-bh So when there is no vc-controller, podgroup may be manually created by user? Any suggestion about how to distinguish it?

Manually created the podgroup which is assigned in pod annotations. I think we can distinguish it via weather the pg name exist in pod's annotations.

@Monokaix
Copy link
Member

volcano version before v1.5 use vcjob name as the pg name,we should also consider the compatibility when upgrade from old version,detail:#3652

@@ -247,7 +252,7 @@ func affinityCheck(job *api.JobInfo, affinity [][]string) error {

var taskNumber = len(job.Tasks)
var taskRef = make(map[string]bool, taskNumber)
var jobNamePrefix = job.Name + "-"
var jobNamePrefix = job.Name[:len(job.Name)-uidLength]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When upgrade from old version before v1.5,old format pg name has no uid,the index can be out of range and panic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Now the taskName is task.Taskrole, which is assigned by volcano.sh/task-spec annotation, no need to remove the prefix and suffix to get the taskName now. This is compatible with upgrade issues from older volcano.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's good.

@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 10, 2024
@JesseStutler
Copy link
Member Author

Please add some UT

@hwdef Already add UTs, please review again, thanks~

@JesseStutler
Copy link
Member Author

JesseStutler commented Sep 10, 2024

we should also consider the scene when only scheduler is used, without vc-controller.

@lowang-bh So when there is no vc-controller, podgroup may be manually created by user? Any suggestion about how to distinguish it?

Manually created the podgroup which is assigned in pod annotations. I think we can distinguish it via weather the pg name exist in pod's annotations.

@lowang-bh After discussing with @Monokaix, we agree that we should use task-spec annotation to directly get the task name from pod(the annotation is added by vc-controller), not remove prefix or suffix, otherwise removing prefix or suffix will introduce some compatibility issues. In your situation, when there is no vc-controller, users should directly add task-spec annotation by themselves.

@Monokaix
Copy link
Member

/lgtm
Please also cherry-pick to release-1.9 and release-1.10 after it's merged.

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Sep 10, 2024
@hwdef
Copy link
Member

hwdef commented Sep 11, 2024

/ok-to-test

@volcano-sh-bot volcano-sh-bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Sep 11, 2024
Copy link
Member

@william-wang william-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 12, 2024
@volcano-sh-bot volcano-sh-bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 12, 2024
@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: william-wang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Monokaix
Copy link
Member

/lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

vcjob的task名称包含 - 时, task-topolopy 插件会误判而失效。
6 participants