Add support for external slurmdbd in Slurm configuration (aws#2595)
Use ExternalSlurmdbd to render AccountingStorageHost when defined.

Add unit test case to cover external slurmdbd scenario.

Adapt other unit tests for Slurm configuration rendering when Slurm
Accounting is used.

---

Signed-off-by: Jacopo De Amicis <[email protected]>
jdeamicis committed Jan 29, 2024
1 parent dac1ac3 commit 97d4ca8
Showing 13 changed files with 76 additions and 6 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -14,6 +14,7 @@ This file is used to list changes made in each version of the AWS ParallelCluste
- Critical Update for Intel oneAPI DPC++/C++ Compiler: 2023.2.1
- Critical Update for Intel Fortran Compiler & Intel Fortran Compiler Classic: 2023.2.1
- Add possibility to choose between Open and Closed Source Nvidia Drivers when building an AMI, through the ```['cluster']['nvidia']['kernel_open']``` cookbook node attribute.
- Add support for external slurmdbd in Slurm cluster configuration.

**CHANGES**
- Upgrade Slurm to 23.11.3 (from 23.02.7).
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
{# Adding comments at the beginning of each line is a trick to indent the template without affecting the output #}
# slurm_parallelcluster.conf is managed by the pcluster processes.
# Do not modify.
# Please use CustomSlurmSettings in the ParallelCluster configuration file to add user-specific slurm configuration
@@ -12,12 +13,16 @@ SelectTypeParameters=CR_CPU_Memory
{% else %}
SelectTypeParameters=CR_CPU
{% endif %}
-{% if scaling_config.Database.Uri is defined %}
-AccountingStorageType=accounting_storage/slurmdbd
-AccountingStorageHost={{ head_node_config.head_node_hostname }}
-AccountingStoragePort=6819
-AccountingStorageUser={{ slurmdbd_user }}
-JobAcctGatherType=jobacct_gather/cgroup
+{% if scaling_config.Database.Uri is defined or scaling_config.ExternalSlurmdbd != None %}
+{# #}AccountingStorageType=accounting_storage/slurmdbd
+{# #}{% if scaling_config.Database.Uri is defined %}
+{# #}AccountingStorageHost={{ head_node_config.head_node_hostname }}
+{# #}{% elif scaling_config.ExternalSlurmdbd != None %}
+{# #}AccountingStorageHost={{ scaling_config.ExternalSlurmdbd }}
+{# #}{% endif %}
+{# #}AccountingStoragePort=6819
+{# #}AccountingStorageUser={{ slurmdbd_user }}
+{# #}JobAcctGatherType=jobacct_gather/cgroup
{% endif %}

{% for queue in queues %}
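
The branch logic in the hunk above can be restated in plain Python to make the precedence explicit. This is only an illustrative sketch, not code from the commit: the function name `render_accounting_block` and its parameters are hypothetical, and an undefined `Database.Uri` is modelled as `None`.

```python
from typing import List, Optional


def render_accounting_block(
    database_uri: Optional[str],
    external_slurmdbd: Optional[str],
    head_node_hostname: str,
    slurmdbd_user: str,
) -> List[str]:
    """Return the accounting lines the template branch would emit (illustrative only)."""
    # Neither an internal database nor an external slurmdbd: no accounting block at all.
    if database_uri is None and external_slurmdbd is None:
        return []
    # Database.Uri is checked first, mirroring the if/elif order in the template,
    # so the head node hostname wins when both settings are present.
    host = head_node_hostname if database_uri is not None else external_slurmdbd
    return [
        "AccountingStorageType=accounting_storage/slurmdbd",
        f"AccountingStorageHost={host}",
        "AccountingStoragePort=6819",
        f"AccountingStorageUser={slurmdbd_user}",
        "JobAcctGatherType=jobacct_gather/cgroup",
    ]
```

With only `external_slurmdbd` set, the host line points at the external daemon; with `database_uri` set, it points at the head node, as it did before this change.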
9 changes: 9 additions & 0 deletions test/unit/slurm/test_slurm_config_generator.py
@@ -128,6 +128,15 @@ def test_generate_slurm_config_files_memory_scheduling(
],
id="Case with Slurm Accounting passing DatabaseName",
),
pytest.param(
"sample_input_externaldbd.yaml",
# Here we don't care about the include file for the slurmdbd.conf, because slurmdbd is not going
# to be launched on the PC cluster (even if our current recipes may still generate it empty).
[
"slurm_parallelcluster_externaldbd.conf",
],
id="Case with Slurmdbd daemon external to the cluster",
),
],
)
def test_generate_slurm_config_files_slurm_accounting(mocker, test_datadir, tmpdir, input_config, expected_outputs):
Original file line number Diff line number Diff line change
@@ -87,3 +87,4 @@ Scheduling:
ScaledownIdletime: 10
EnableMemoryBasedScheduling: false
Database: null
ExternalSlurmdbd: null
Original file line number Diff line number Diff line change
@@ -89,3 +89,4 @@ Scheduling:
ScaledownIdletime: 10
EnableMemoryBasedScheduling: true
Database: null
ExternalSlurmdbd: null
Original file line number Diff line number Diff line change
@@ -86,3 +86,4 @@ Scheduling:
SlurmSettings:
ScaledownIdletime: 10
Database: null
ExternalSlurmdbd: null
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
# slurm_parallelcluster.conf is managed by the pcluster processes.
# Do not modify.
# Please use CustomSlurmSettings in the ParallelCluster configuration file to add user-specific slurm configuration
# options

SlurmctldHost=ip-1-0-0-0(ip.1.0.0.0)
SuspendTime=600
ResumeTimeout=1600
SelectTypeParameters=CR_CPU
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=test.slurmdbd.host
AccountingStoragePort=6819
AccountingStorageUser=slurm
JobAcctGatherType=jobacct_gather/cgroup

include <DIR>/pcluster/slurm_parallelcluster_efa_partition.conf

SuspendExcNodes=efa-st-efa-c5n-[1-1]
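
One way to sanity-check a rendered file such as the expected output above is to parse its `Key=Value` lines and assert on the accounting settings. The sketch below is not part of the commit or of the project's test suite; `parse_slurm_conf` is a hypothetical helper, and the sample text is copied from the expected output shown here.

```python
from typing import Dict


def parse_slurm_conf(text: str) -> Dict[str, str]:
    """Collect Key=Value settings, skipping blank lines, comments and include directives."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line.lower().startswith("include "):
            continue
        # Split on the first "=" only; values are kept verbatim.
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


RENDERED = """\
# slurm_parallelcluster.conf is managed by the pcluster processes.
SlurmctldHost=ip-1-0-0-0(ip.1.0.0.0)
SelectTypeParameters=CR_CPU
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=test.slurmdbd.host
AccountingStoragePort=6819
AccountingStorageUser=slurm
JobAcctGatherType=jobacct_gather/cgroup

include <DIR>/pcluster/slurm_parallelcluster_efa_partition.conf
"""

settings = parse_slurm_conf(RENDERED)
```

Here `settings["AccountingStorageHost"]` holds the external host taken from `ExternalSlurmdbd` in the sample input, rather than the head node hostname.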
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# slurm_parallelcluster_slurmdbd.conf is managed by the pcluster processes.
# Do not modify.
# Please add user-specific slurmdbd configuration options in slurmdbd.conf
Original file line number Diff line number Diff line change
@@ -24,3 +24,4 @@ Scheduling:
SlurmSettings:
ScaledownIdletime: 10
Database: null
ExternalSlurmdbd: null
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
Scheduling:
SlurmQueues:
- CapacityType: ONDEMAND
ComputeResources:
- DisableSimultaneousMultithreading: true
Efa:
Enabled: true
GdrSupport: false
InstanceType: c5n.18xlarge
MaxCount: 5
MinCount: 1
Name: efa-c5n
SpotPrice: null
StaticNodePriority: 1
DynamicNodePriority: 1000
ComputeSettings: null
CustomActions: null
Iam:
AdditionalIamPolicies: []
InstanceRole: null
S3Access: null
Name: efa
Scheduler: slurm
SlurmSettings:
ScaledownIdletime: 10
ExternalSlurmdbd: test.slurmdbd.host
Database: null
Original file line number Diff line number Diff line change
@@ -28,3 +28,4 @@ Scheduling:
UserName: test_admin
PasswordSecretArn: arn:aws:secretsmanager:us-east-1:111111111111:secret:Secret-xxxxxxxx-xxxxx
DatabaseName: null
ExternalSlurmdbd: null
Original file line number Diff line number Diff line change
@@ -28,3 +28,4 @@ Scheduling:
UserName: test_admin
PasswordSecretArn: arn:aws:secretsmanager:us-east-1:111111111111:secret:Secret-xxxxxxxx-xxxxx
DatabaseName: test_database
ExternalSlurmdbd: null
Original file line number Diff line number Diff line change
@@ -386,3 +386,4 @@ Scheduling:
QueueUpdateStrategy: COMPUTE_FLEET_STOP
ScaledownIdletime: -1
Database: null
ExternalSlurmdbd: null
