Duplicate jobs in 'active jobs'. #3668

Open
gcpmendez opened this issue Jul 9, 2024 · 1 comment
Comments

@gcpmendez

In the Jobs -> Active Jobs tab, every job is listed twice, once per OnDemand cluster. Both cluster definitions use the same Slurm configuration, and Slurm itself is configured with a single cluster:

$ _cpu1r
$ sacctmgr show cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
     teide      10.0.22.24         6817 10240         1                                                                                           normal    

On the OnDemand host, the two cluster configurations are:

$ ssh [email protected]
$ cd /etc/ood/config/clusters.d
$ cat anaga.yml
---
v2:
  metadata:
    title: "Anaga"
  login:
    host: "10.5.22.101"
  job:
    adapter: "slurm"
    cluster: "teide"
    bin: "/usr/bin"
    conf: "/etc/slurm/slurm.conf"
    #bin_overrides:
      # sbatch: "/usr/local/bin/sbatch"
      # squeue: "/usr/bin/squeue"
      # scontrol: "/usr/bin/scontrol"
      # scancel: ""
    copy_enviornment: false
    partitions: ["gpu"]
  batch_connect:
    basic:
      script_wrapper: |
        ml purge
        %s
      set_host: "host=$(hostname -A | awk '{print $1}')"
    vnc:
      script_wrapper: |
        ml purge
        ml load TurboVNC
        #export PATH="/usr/local/turbovnc/bin:$PATH"
        #export WEBSOCKIFY_CMD="/usr/local/websockify/run"
        %s
      set_host: "host=$(hostname -A | awk '{print $1}')"
$ cat teide.yml
---
v2:
  metadata:
    title: "Teide"
  login:
    host: "10.5.22.100"
  job:
    adapter: "slurm"
    cluster: "teide"
    bin: "/usr/bin"
    conf: "/etc/slurm/slurm.conf"
    #bin_overrides:
      # sbatch: "/usr/local/bin/sbatch"
      # squeue: "/usr/bin/squeue"
      # scontrol: "/usr/bin/scontrol"
      # scancel: ""
    copy_enviornment: false
  batch_connect:
    basic:
      script_wrapper: |
        ml purge
        %s
      set_host: "host=$(hostname -A | awk '{print $1}')"
    vnc:
      script_wrapper: |
        ml purge
        ml load TurboVNC
        #export PATH="/usr/local/turbovnc/bin:$PATH"
        #export WEBSOCKIFY_CMD="/usr/local/websockify/run"
        %s
      set_host: "host=$(hostname -A | awk '{print $1}')"

According to this thread, https://discourse.openondemand.org/t/configure-partitions-as-clusters/701/2, we could try writing an initializer to filter the jobs.

Ideally, jobs would be filtered by partition: jobs in the "gpu" partition would be assigned to the "anaga" cluster, and all other jobs to the "teide" cluster.

$ _cpu1r
$ scontrol show partition | grep PartitionName
PartitionName=main
PartitionName=batch
PartitionName=express
PartitionName=long
PartitionName=gpu
PartitionName=fatnodes
PartitionName=ondemand
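
The partition-to-cluster rule described above can be sketched as a small shell helper. This is only an illustration of the intended mapping, not an OnDemand initializer: the function names `cluster_for_partition` and `label_jobs` are hypothetical, and the sketch assumes "gpu" is the only partition that belongs to "anaga".

```shell
#!/bin/sh
# Hypothetical helper (not part of OnDemand or Slurm): map a Slurm partition
# name to the OnDemand cluster under which its jobs should be listed.
cluster_for_partition() {
  case "$1" in
    gpu) echo "anaga" ;;   # assumption: only the "gpu" partition is Anaga
    *)   echo "teide" ;;   # every other partition is shown under Teide
  esac
}

# Read "<jobid> <partition>" lines on stdin and prefix each with its cluster.
label_jobs() {
  while read -r jobid partition; do
    [ -n "$jobid" ] || continue   # skip blank lines
    printf '%s %s %s\n' "$(cluster_for_partition "$partition")" "$jobid" "$partition"
  done
}
```

On a node with the Slurm client tools, `squeue -h -o '%i %P' | label_jobs` would print each job with the cluster it should appear under, which is a quick way to check the expected split before attempting any filtering inside OnDemand.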

Any help is welcome on how to show only the jobs that belong to each virtual cluster when Slurm is configured with a single cluster.
Thanks in advance!

@osc-bot osc-bot added this to the Backlog milestone Jul 9, 2024
@johrstrom
Contributor

I'm not sure what the solution is here. Sure, you can define an initializer to filter based on the cluster if you don't already have one, but if you've defined two clusters, OnDemand will act as if they're actually two clusters.

I'm not aware of the virtual cluster pattern here where teide is actually just a partition on anaga (or vice versa), but I guess I'd ask if you actually need the two separate cluster definitions.
