Merge branch 'slurm-24.05' into 24.05.ug-before-reduce-patches
itkovian committed Dec 13, 2024
2 parents b5ef918 + bd17c8d commit 425a197
Showing 37 changed files with 1,045 additions and 719 deletions.
4 changes: 2 additions & 2 deletions META
@@ -7,8 +7,8 @@
Name: slurm
Major: 24
Minor: 05
Micro: 4
Version: 24.05.4
Micro: 5
Version: 24.05.5
Release: 1

##
22 changes: 22 additions & 0 deletions NEWS
@@ -1,6 +1,9 @@
This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and administrators.

* Changes in Slurm 24.05.6
==========================

* Changes in Slurm 24.05.5
==========================
-- Fix issue signaling cron jobs resulting in unintended requeues.
@@ -14,6 +17,25 @@ documents those changes that are of interest to users and administrators.
removal of a dynamic node.
-- gpu/nvml - Attempt loading libnvidia-ml.so.1 as a fallback for failure in
loading libnvidia-ml.so.
-- slurmrestd - Fix populating non-required object fields of objects as '{}' in
JSON/YAML instead of 'null' causing compiled OpenAPI clients to reject
the response to 'GET /slurm/v0.0.40/jobs' due to validation failure of
'.jobs[].job_resources'.
-- Fix sstat/sattach protocol errors for steps on higher version slurmd's
(regressions since 20.11.0rc1 and 16.05.1rc1 respectively).
-- slurmd - Avoid a crash when starting slurmd version 24.05 with
SlurmdSpoolDir files that have been upgraded to a newer major version of
Slurm. Log warnings instead.
-- Fix race condition in stepmgr step completion handling.
-- Fix slurmctld segfault with stepmgr and MpiParams when running a job array.
-- Fix requeued jobs keeping their priority until the decay thread runs.
-- slurmctld - Fix crash and possible split brain issue if the
backup controller handles an scontrol reconfigure while in control
before the primary resumes operation.
-- Fix stepmgr not getting dynamic node addrs from the controller.
-- stepmgr - Avoid "Unexpected missing socket" errors.
-- Fix `scontrol show steps` with dynamic stepmgr.
-- Support IPv6 in configless mode.

* Changes in Slurm 24.05.4
==========================
2 changes: 1 addition & 1 deletion debian/changelog
@@ -1,4 +1,4 @@
slurm-smd (24.05.4-1) UNRELEASED; urgency=medium
slurm-smd (24.05.5-1) UNRELEASED; urgency=medium

* Initial release.

92 changes: 82 additions & 10 deletions doc/html/containers.shtml
@@ -78,7 +78,11 @@ job or any given plugin).</li>

<h2 id="prereq">Prerequisites<a class="slurm_link" href="#prereq"></a></h2>
<p>The host kernel must be configured to allow user land containers:</p>
<pre>$ sudo sysctl -w kernel.unprivileged_userns_clone=1</pre>
<pre>
sudo sysctl -w kernel.unprivileged_userns_clone=1
sudo sysctl -w kernel.apparmor_restrict_unprivileged_unconfined=0
sudo sysctl -w kernel.apparmor_restrict_unprivileged_userns=0
</pre>
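These sysctl changes do not persist across reboots. One way to make them permanent (the file name is an assumption; any drop-in file under /etc/sysctl.d works) is a drop-in like:

```
# /etc/sysctl.d/90-slurm-containers.conf
kernel.unprivileged_userns_clone=1
kernel.apparmor_restrict_unprivileged_unconfined=0
kernel.apparmor_restrict_unprivileged_userns=0
```

Running `sudo sysctl --system` then reloads all drop-in files without a reboot.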

<p>Docker also provides a tool to verify the kernel configuration:
<pre>$ dockerd-rootless-setuptool.sh check --force
@@ -353,6 +357,62 @@ exit $rc
</pre>
</p>

<h3 id="multiple-runtimes">Handling multiple runtimes
<a class="slurm_link" href="#multiple-runtimes"></a>
</h3>

<p>If you wish to accommodate multiple runtimes in your environment,
you can do so with a bit of extra setup. This section outlines one
possible approach:</p>

<ol>
<li>Create a generic oci.conf that calls a wrapper script
<pre>
IgnoreFileConfigJson=true
RunTimeRun="/opt/slurm-oci/run %b %m %u %U %n %j %s %t %@"
RunTimeKill="kill -s SIGTERM %p"
RunTimeDelete="kill -s SIGKILL %p"
</pre>
</li>
<li>Create the wrapper script to check for user-specific run configuration
(e.g., /opt/slurm-oci/run)
<pre>
#!/bin/bash
if [[ -e ~/.slurm-oci-run ]]; then
~/.slurm-oci-run "$@"
else
/opt/slurm-oci/slurm-oci-run-default "$@"
fi
</pre>
</li>
<li>Create a generic run configuration to use as the default
(e.g., /opt/slurm-oci/slurm-oci-run-default)
<pre>
#!/bin/bash --login
# Parse
CONTAINER="$1"
SPOOL_DIR="$2"
USER_NAME="$3"
USER_ID="$4"
NODE_NAME="$5"
JOB_ID="$6"
STEP_ID="$7"
TASK_ID="$8"
shift 8 # subsequent arguments are the command to run in the container
# Run
apptainer run --bind /var/spool --containall "$CONTAINER" "$@"
</pre>
</li>
<li>Add executable permissions to both scripts
<pre>chmod +x /opt/slurm-oci/run /opt/slurm-oci/slurm-oci-run-default</pre>
</li>
</ol>

<p>Once this is done, users may create a script at '~/.slurm-oci-run' if
they wish to customize the container run process, such as using a different
container runtime. Users should model this file after the default
'/opt/slurm-oci/slurm-oci-run-default'.</p>
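For illustration, a user-level '~/.slurm-oci-run' that swaps in a different runtime might look like the sketch below. It keeps the same positional interface as the default script; the podman invocation and the dry-run echo are assumptions for demonstration, not part of the Slurm distribution:

```shell
#!/bin/bash
# Hypothetical ~/.slurm-oci-run sketch. Arguments arrive in the same
# order the oci.conf pattern passes them: %b %m %u %U %n %j %s %t %@.
# Instead of executing a runtime, this prints the command it would run.
slurm_oci_run() {
    local container="$1"   # %b - OCI bundle path
    shift 8                # drop the eight Slurm-provided fields
    # Remaining arguments are the command to run in the container.
    echo "podman run --rm --network=host $container $*"
}

slurm_oci_run /tmp/bundle /var/spool alice 1000 node1 42 0 0 hostname
# -> podman run --rm --network=host /tmp/bundle hostname
```

A real version would `exec` the runtime instead of echoing the command.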

<h2 id="testing">Testing OCI runtime outside of Slurm
<a class="slurm_link" href="#testing"></a>
</h2>
@@ -458,11 +518,16 @@ scrun being isolated from the network and not being able to communicate with
the Slurm controller. The container is run by Slurm on the compute nodes which
makes having Docker setup a network isolation layer ineffective for the
container.</li>
<li><pre>docker exec</pre> command is not supported.</li>
<li><pre>docker compose</pre> command is not supported.</li>
<li><pre>docker pause</pre> command is not supported.</li>
<li><pre>docker unpause</pre> command is not supported.</li>
<li><pre>docker swarm</pre> command is not supported.</li>
<li><code>docker exec</code> command is not supported.</li>
<li><code>docker swarm</code> command is not supported.</li>
<li><code>docker compose</code>/<code>docker-compose</code> command is not
supported.</li>
<li><code>docker pause</code> command is not supported.</li>
<li><code>docker unpause</code> command is not supported.</li>
<li><code>docker</code> commands are not supported inside of containers.</li>
<li><a href="https://docs.docker.com/reference/api/engine/">Docker API</a> is
not supported inside of containers.</li>
</ol>

<h3>Setup procedure</h3>
@@ -580,9 +645,16 @@ configuration.</li>
<li>All containers must use
<a href="https://github.com/containers/podman/blob/main/docs/tutorials/basic_networking.md">
host networking</a></li>
<li><pre>podman exec</pre> command is not supported.</li>
<li><pre>podman kube</pre> command is not supported.</li>
<li><pre>podman pod</pre> command is not supported.</li>
<li><code>podman exec</code> command is not supported.</li>
<li><code>podman-compose</code> command is not supported, as it is only
    partially implemented. Some compositions may work, but each container
    may be run on a different node. All containers must use
    <code>network_mode: host</code> networking.</li>
<li><code>podman kube</code> command is not supported.</li>
<li><code>podman pod</code> command is not supported.</li>
<li><code>podman farm</code> command is not supported.</li>
<li><code>podman</code> commands are not supported inside of containers.</li>
<li>Podman REST API is not supported inside of containers.</li>
</ol>

<h3>Setup procedure</h3>
@@ -875,6 +947,6 @@ Overview slides of Sarus are

<hr size=4 width="100%">

<p style="text-align:center;">Last modified 08 October 2024</p>
<p style="text-align:center;">Last modified 27 November 2024</p>

<!--#include virtual="footer.txt"-->
8 changes: 4 additions & 4 deletions doc/html/faq.shtml
@@ -1231,9 +1231,9 @@ that node may be rendered unusable, but no other harm will result.</p>

<p><a id="clock"><b>Do I need to maintain synchronized
clocks on the cluster?</b></a><br>
In general, yes. Having inconsistent clocks may cause nodes to
be unusable. Slurm log files should contain references to
expired credentials. For example:</p>
In general, yes. Having inconsistent clocks may cause nodes to be unusable and
generate errors in Slurm log files regarding expired credentials. For example:
</p>
<pre>
error: Munge decode failed: Expired credential
ENCODED: Wed May 12 12:34:56 2008
@@ -2438,6 +2438,6 @@ dset TV::parallel_configs {
}
!-->

<p style="text-align:center;">Last modified 07 November 2024</p>
<p style="text-align:center;">Last modified 19 November 2024</p>

<!--#include virtual="footer.txt"-->
118 changes: 61 additions & 57 deletions doc/html/quickstart_admin.shtml
@@ -11,9 +11,8 @@
<ul>
<li><a href="#prereqs">Installing Prerequisites</a></li>
<li><a href="#rpmbuild">Building RPMs</a></li>
<li><a href="#rpms">Installing RPMs</a></li>
<li><a href="#debuild">Building Debian Packages</a></li>
<li><a href="#debinstall">Installing Debian Packages</a></li>
<li><a href="#pkg_install">Installing Packages</a></li>
<li><a href="#manual_build">Building Manually</a></li>
</ul>
</li>
@@ -208,28 +207,6 @@ Some macro definitions that may be used in building Slurm include:
%with_munge "--with-munge=/opt/munge"
</pre>

<h3 id="rpms">RPMs Installed<a class="slurm_link" href="#rpms"></a></h3>

<p>The RPMs needed on the head node, compute nodes, and slurmdbd node can vary
by configuration, but here is a suggested starting point:
<ul>
<li>Head Node (where the slurmctld daemon runs),<br>
Compute and Login Nodes
<ul>
<li>slurm</li>
<li>slurm-perlapi</li>
<li>slurm-slurmctld (only on the head node)</li>
<li>slurm-slurmd (only on the compute nodes)</li>
</ul>
</li>
<li>SlurmDBD Node
<ul>
<li>slurm</li>
<li>slurm-slurmdbd</li>
</ul>
</li>
</ul>

<h3 id="debuild">Building Debian Packages
<a class="slurm_link" href="#debuild"></a>
</h3>
@@ -258,40 +235,67 @@ the packages:</p>

<p>The packages will be in the parent directory after debuild completes.</p>

<h3 id="debinstall">Installing Debian Packages
<a class="slurm_link" href="#debinstall"></a>
<h3 id="pkg_install">Installing Packages
<a class="slurm_link" href="#pkg_install"></a>
</h3>

<p>The packages needed on the head node, compute nodes, and slurmdbd node can
vary site to site, but this is a good starting point:</p>
<ul>
<li>SlurmDBD Node
<ul>
<li>slurm-smd</li>
<li>slurm-smd-slurmdbd</li>
</ul>
</li>
<li>Head Node (slurmctld node)
<ul>
<li>slurm-smd</li>
<li>slurm-smd-slurmctld</li>
<li>slurm-smd-client</li>
</ul>
</li>
<li>Compute Nodes (slurmd node)
<ul>
<li>slurm-smd</li>
<li>slurm-smd-slurmd</li>
<li>slurm-smd-client</li>
</ul>
</li>
<li>Login Nodes
<ul>
<li>slurm-smd</li>
<li>slurm-smd-client</li>
</ul>
</li>
</ul>
<p>The following packages are recommended to achieve basic functionality for the
different <a href="#nodes">node types</a>. Other packages may be added to enable
optional functionality:</p>

<table class="tlist">
<tbody>
<tr>
<td id="rpms"><strong>RPM name</strong></td>
<td id="debinstall"><strong>DEB name</strong></td>
<td><a href="#login">Login</a></td>
<td><a href="#ctld">Controller</a></td>
<td><a href="#compute">Compute</a></td>
<td><a href="#dbd">DBD</a></td>
</tr>
<tr>
<td><code>slurm</code></td>
<td><code>slurm-smd</code></td>
<td><b>X</b></td>
<td><b>X</b></td>
<td><b>X</b></td>
<td><b>X</b></td>
</tr>
<tr>
<td><code>slurm-perlapi</code></td>
<td><code>slurm-smd-client</code></td>
<td><b>X</b></td>
<td><b>X</b></td>
<td><b>X</b></td>
<td></td>
</tr>
<tr>
<td><code>slurm-slurmctld</code></td>
<td><code>slurm-smd-slurmctld</code></td>
<td></td>
<td><b>X</b></td>
<td></td>
<td></td>
</tr>
<tr>
<td><code>slurm-slurmd</code></td>
<td><code>slurm-smd-slurmd</code></td>
<td></td>
<td></td>
<td><b>X</b></td>
<td></td>
</tr>
<tr>
<td><code>slurm-slurmdbd</code></td>
<td><code>slurm-smd-slurmdbd</code></td>
<td></td>
<td></td>
<td></td>
<td><b>X</b></td>
</tr>
</tbody>
</table>
<br>
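As a sketch, a minimal install for each node type might look like the following. This assumes the packages built above are available to the package manager; the file names, versions, and architecture are placeholders, and sites should substitute their own:

```shell
# RPM-based systems (controller node shown)
dnf install slurm slurm-perlapi slurm-slurmctld

# DEB-based systems (compute node shown; paths are placeholders)
apt install ./slurm-smd_24.05.5-1_amd64.deb \
            ./slurm-smd-client_24.05.5-1_amd64.deb \
            ./slurm-smd-slurmd_24.05.5-1_amd64.deb
```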

<h3 id="manual_build">Building Manually
<a class="slurm_link" href="#manual_build"></a>
@@ -833,6 +837,6 @@ cd /usr/ports/sysutils/slurm-wlm && make install
typical compute nodes. Installing from source allows the user to enable
options such as mysql and gui tools via a configuration menu.</p>

<p style="text-align:center;">Last modified 31 October 2024</p>
<p style="text-align:center;">Last modified 14 November 2024</p>

<!--#include virtual="footer.txt"-->
2 changes: 1 addition & 1 deletion doc/html/related_software.shtml
@@ -173,7 +173,7 @@ time as performed with this tool:
<a href="elasticsearch.html">jobcomp/elasticsearch</a>, and
<a href="jobcomp_kafka.html">jobcomp/kafka</a>) parse and/or
serialize JSON format data. These plugins and slurmrestd are designed to
make use of the <b>JSON-C library (&gt;= v1.12.0)</b> for this purpose.
make use of the <b>JSON-C library (&gt;= v0.15)</b> for this purpose.
Instructions for the build are as follows:</p>
<pre>
git clone --depth 1 --single-branch -b json-c-0.15-20200726 https://github.com/json-c/json-c.git json-c