<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta name="generator" content="pandoc" />
<meta name="author" content="April 18, 2024" />
<title>Savio intermediate training: Savio tips and tricks – making the most of the Slurm scheduler and of Mamba/Conda environments</title>
<style type="text/css">
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
div.columns{display: flex; gap: min(4vw, 1.5em);}
div.column{flex: auto; overflow-x: auto;}
div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
ul.task-list{list-style: none;}
ul.task-list li input[type="checkbox"] {
width: 0.8em;
margin: 0 0.8em 0.2em -1.6em;
vertical-align: middle;
}
pre > code.sourceCode { white-space: pre; position: relative; }
pre > code.sourceCode > span { display: inline-block; line-height: 1.25; }
pre > code.sourceCode > span:empty { height: 1.2em; }
.sourceCode { overflow: visible; }
code.sourceCode > span { color: inherit; text-decoration: inherit; }
div.sourceCode { margin: 1em 0; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
pre > code.sourceCode { white-space: pre-wrap; }
pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
}
pre.numberSource code
{ counter-reset: source-line 0; }
pre.numberSource code > span
{ position: relative; left: -4em; counter-increment: source-line; }
pre.numberSource code > span > a:first-child::before
{ content: counter(source-line);
position: relative; left: -1em; text-align: right; vertical-align: baseline;
border: none; display: inline-block;
-webkit-touch-callout: none; -webkit-user-select: none;
-khtml-user-select: none; -moz-user-select: none;
-ms-user-select: none; user-select: none;
padding: 0 4px; width: 4em;
color: #aaaaaa;
}
pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; }
div.sourceCode
{ }
@media screen {
pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }
}
code span.al { color: #ff0000; font-weight: bold; } /* Alert */
code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code span.at { color: #7d9029; } /* Attribute */
code span.bn { color: #40a070; } /* BaseN */
code span.bu { color: #008000; } /* BuiltIn */
code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code span.ch { color: #4070a0; } /* Char */
code span.cn { color: #880000; } /* Constant */
code span.co { color: #60a0b0; font-style: italic; } /* Comment */
code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code span.do { color: #ba2121; font-style: italic; } /* Documentation */
code span.dt { color: #902000; } /* DataType */
code span.dv { color: #40a070; } /* DecVal */
code span.er { color: #ff0000; font-weight: bold; } /* Error */
code span.ex { } /* Extension */
code span.fl { color: #40a070; } /* Float */
code span.fu { color: #06287e; } /* Function */
code span.im { color: #008000; font-weight: bold; } /* Import */
code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
code span.kw { color: #007020; font-weight: bold; } /* Keyword */
code span.op { color: #666666; } /* Operator */
code span.ot { color: #007020; } /* Other */
code span.pp { color: #bc7a00; } /* Preprocessor */
code span.sc { color: #4070a0; } /* SpecialChar */
code span.ss { color: #bb6688; } /* SpecialString */
code span.st { color: #4070a0; } /* String */
code span.va { color: #19177c; } /* Variable */
code span.vs { color: #4070a0; } /* VerbatimString */
code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
.display.math{display: block; text-align: center; margin: 0.5rem auto;}
</style>
<link rel="stylesheet" type="text/css" media="screen, projection, print"
href="https://www.w3.org/Talks/Tools/Slidy2/styles/slidy.css" />
<script src="https://www.w3.org/Talks/Tools/Slidy2/scripts/slidy.js"
charset="utf-8" type="text/javascript"></script>
</head>
<body>
<div class="slide titlepage">
<h1 class="title">Savio intermediate training: Savio tips and tricks –
making the most of the Slurm scheduler and of Mamba/Conda
environments</h1>
<p class="author">
April 18, 2024
</p>
<p class="date">Chris Paciorek and Jeffrey Jacob</p>
</div>
<div id="upcoming-events-and-hiring" class="slide section level1">
<h1>Upcoming events and hiring</h1>
<ul>
<li><p>Cybersecurity for Researchers</p>
<ul>
<li>Tuesday, October 22, 2024 at 1 pm via Zoom</li>
<li>This brown bag session will focus on secure campus tools and
services that Research IT and Berkeley IT offer to researchers, tips on
navigating campus security processes, and cybersecurity best practices
for keeping your research and research subjects safe.</li>
<li>In partnership with the UC Berkeley Information Security Office and
Industry Alliances Office.</li>
<li>Check our <a
href="https://research-it.berkeley.edu/events-trainings/upcoming-events-trainings">Events
& Training page</a> for more information about this and other
upcoming events.</li>
</ul></li>
<li><p>We offer platforms and services for researchers working with <a
href="https://docs-research-it.berkeley.edu/services/srdc/">sensitive
data</a>.</p></li>
<li><p>Get paid to develop your skills in research data and
computing!</p>
<ul>
<li>Berkeley Research Computing is hiring several graduate student
Domain Consultants for flexible appointments, 10% to 25% effort (4-10
hours/week).</li>
<li>Email your cover letter and CV to: [email protected].</li>
</ul></li>
</ul>
</div>
<div id="introduction" class="slide section level1">
<h1>Introduction</h1>
<p>We’ll do this in part as a demonstration. We encourage you to login
to your account and try out the various examples yourself as we go
through them.</p>
<p>Much of this material is based on the extensive Savio documentation we
have prepared and continue to update, available at <a
href="https://docs-research-it.berkeley.edu/services/high-performance-computing/">https://docs-research-it.berkeley.edu/services/high-performance-computing/</a>.</p>
<p>The materials for this tutorial are available using git at the short
URL (<a href="https://tinyurl.com/brc-apr24">tinyurl.com/brc-apr24</a>),
the GitHub URL (<a
href="https://github.com/ucb-rit/savio-training-slurm-conda-spring-2024">https://github.com/ucb-rit/savio-training-slurm-conda-spring-2024</a>),
or simply as a <a
href="https://github.com/ucb-rit/savio-training-slurm-conda-spring-2024/archive/main.zip">zip
file</a>.</p>
</div>
<div id="how-to-get-additional-help" class="slide section level1">
<h1>How to get additional help</h1>
<ul>
<li>For technical issues and questions about using Savio:
<ul>
<li>[email protected]</li>
</ul></li>
<li>For questions about computing resources in general, including cloud
computing:
<ul>
<li>[email protected] or [email protected]</li>
<li>office hours: Wed. 1:30-3:00 and Thur. 9:30-11:00 <a
href="https://research-it.berkeley.edu/programs/berkeley-research-computing/research-computing-consulting">on
Zoom</a></li>
</ul></li>
<li>For questions about data management (including HIPAA-protected
data):
<ul>
<li>[email protected]</li>
<li>office hours: Wed. 1:30-3:00 and Thur. 9:30-11:00 <a
href="https://research-it.berkeley.edu/programs/berkeley-research-computing/research-computing-consulting">on
Zoom</a></li>
</ul></li>
<li>Status & Service Updates
<ul>
<li>The best way to stay informed about the status of Research IT
services is the front page of the Research IT website. If you are
having issues or are unsure whether one of our services is down, check
there first before sending us a ticket.</li>
</ul></li>
</ul>
</div>
<div id="outline" class="slide section level1">
<h1>Outline</h1>
<p>This training session will cover the following topics:</p>
<ul>
<li>Slurm tips and tricks
<ul>
<li>Associations: Accounts, partitions and queues</li>
<li>Requesting specific resources, including GPUs</li>
<li>Diagnosing Slurm submission errors</li>
<li>Understanding the queue and getting jobs to start faster</li>
<li>Using Slurm flags for parallelization</li>
<li>Using MPI and troubleshooting problems</li>
<li>Diagnosing job run-time errors</li>
</ul></li>
<li>Working with Conda/Mamba environments
<ul>
<li>Introduction and Conda vs. Mamba</li>
<li>Creating and isolating environments</li>
<li>Disk space and Conda</li>
<li>Jupyter kernels</li>
</ul></li>
</ul>
</div>
<div id="slurm-scheduler" class="slide section level1">
<h1>Slurm scheduler</h1>
<p>All computations are done by submitting jobs to the scheduling
software that manages jobs on the cluster, called Slurm.</p>
<p>Why is this necessary? Without a scheduler, your jobs would be
slowed down by other people’s jobs running on the same node, and there
would be no way for everyone to share Savio fairly.</p>
<p>Savio uses Slurm to:</p>
<ol style="list-style-type: decimal">
<li>Allocate access to resources (compute nodes) for users’ jobs</li>
<li>Start and monitor jobs on allocated resources</li>
<li>Manage the queue of pending jobs</li>
</ol>
<center>
<img src="savio_diagram.jpeg">
</center>
</div>
<div id="submitting-jobs-accounts-and-partitions"
class="slide section level1">
<h1>Submitting jobs: accounts and partitions</h1>
<p>Generally request:</p>
<ul>
<li>project account (FCA, condo, etc.)</li>
<li>partition (type of node)</li>
</ul>
<p>You can see what accounts you have access to and which partitions
within those accounts as follows:</p>
<pre><code>sacctmgr -p show associations user=SAVIO_USERNAME</code></pre>
<p>Here’s an example of the output for a user who has access to an FCA
and a condo.</p>
<pre><code>Cluster|Account|User|Partition|Share|Priority|GrpJobs|GrpTRES|GrpSubmit|GrpWall|GrpTRESMins|MaxJobs|MaxTRES|MaxTRESPerNode|MaxSubmit|MaxWall|MaxTRESMins|QOS|Def QOS|GrpTRESRunMins|
brc|ucb|paciorek|ood-inter|1|||||||||||||ood_interactive|ood_interactive||
brc|fc_paciorek|paciorek|savio4_gpu|1|||||||||||||a5k_gpu4_normal,savio_lowprio|a5k_gpu4_normal||
brc|fc_paciorek|paciorek|savio4_htc|1|||||||||||||savio_debug,savio_normal|savio_normal||
brc|fc_paciorek|paciorek|savio3_gpu|1|||||||||||||a40_gpu3_normal,gtx2080_gpu3_normal,savio_lowprio,v100_gpu3_normal|gtx2080_gpu3_normal||
brc|fc_paciorek|paciorek|savio3_htc|1|||||||||||||savio_debug,savio_normal|savio_normal||
brc|fc_paciorek|paciorek|savio3_bigmem|1|||||||||||||savio_debug,savio_normal|savio_normal||
brc|fc_paciorek|paciorek|savio3|1|||||||||||||savio_debug,savio_normal|savio_normal||
brc|fc_paciorek|paciorek|savio2_1080ti|1|||||||||||||savio_debug,savio_normal|savio_normal||
brc|fc_paciorek|paciorek|savio2_knl|1|||||||||||||savio_debug,savio_normal|savio_normal||
brc|fc_paciorek|paciorek|savio2_gpu|1|||||||||||||savio_debug,savio_normal|savio_normal||
brc|fc_paciorek|paciorek|savio2_htc|1|||||||||||||savio_debug,savio_long,savio_normal|savio_normal||
brc|fc_paciorek|paciorek|savio2_bigmem|1|||||||||||||savio_debug,savio_normal|savio_normal||
brc|fc_paciorek|paciorek|savio2|1|||||||||||||savio_debug,savio_normal|savio_normal||
brc|fc_paciorek|paciorek|savio|1|||||||||||||savio_debug,savio_normal|savio_normal||
brc|fc_paciorek|paciorek|savio_bigmem|1|||||||||||||savio_debug,savio_normal|savio_normal||
brc|co_stat|paciorek|savio3_gpu|1|||||||||||||savio_lowprio|savio_lowprio||
brc|co_stat|paciorek|savio4_gpu|1|||||||||||||savio_lowprio|savio_lowprio||
brc|co_stat|paciorek|savio4_htc|1|||||||||||||savio_lowprio|savio_lowprio||
brc|co_stat|paciorek|savio3_htc|1|||||||||||||savio_lowprio|savio_lowprio||
brc|co_stat|paciorek|savio3_bigmem|1|||||||||||||savio_lowprio|savio_lowprio||
brc|co_stat|paciorek|savio3|1|||||||||||||savio_lowprio|savio_lowprio||
brc|co_stat|paciorek|savio2_1080ti|1|||||||||||||savio_lowprio|savio_lowprio||
brc|co_stat|paciorek|savio2_knl|1|||||||||||||savio_lowprio|savio_lowprio||
brc|co_stat|paciorek|savio2_bigmem|1|||||||||||||savio_lowprio|savio_lowprio||
brc|co_stat|paciorek|savio2_gpu|1|||||||||||||savio_lowprio,stat_gpu2_normal|stat_gpu2_normal||
brc|co_stat|paciorek|savio2_htc|1|||||||||||||savio_lowprio|savio_lowprio||
brc|co_stat|paciorek|savio|1|||||||||||||savio_lowprio|savio_lowprio||
brc|co_stat|paciorek|savio_bigmem|1|||||||||||||savio_lowprio|savio_lowprio||
brc|co_stat|paciorek|savio2|1|||||||||||||savio_lowprio,stat_savio2_normal|stat_savio2_normal||</code></pre>
<p>If you are part of a condo, you’ll notice that you have
<em>low-priority</em> access to certain partitions. For example, I am
part of the statistics condo <em>co_stat</em>, which owns some ‘savio2’
and ‘savio2_gpu’ nodes, so I have normal access to those, but I can
also burst beyond the condo and use other partitions at low
priority.</p>
<p>In contrast, through my FCA, I have access to most partitions at
normal priority, but not all of them…</p>
<pre><code>[paciorek@ln002 ~]$ srun -p savio3_xlmem -A co_stat -t 5:00 --pty bash
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified</code></pre>
</div>
<div id="submitting-a-batch-job" class="slide section level1">
<h1>Submitting a batch job</h1>
<p>Let’s see how to submit a simple job. If your job will only use the
resources on a single node, you can do the following.</p>
<p>Here’s an example job script that I’ll run.</p>
<div class="sourceCode" id="cb4"><pre
class="sourceCode bash"><code class="sourceCode bash"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co">#!/bin/bash</span></span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a><span class="co"># Job name:</span></span>
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --job-name=test</span></span>
<span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a><span class="co">#</span></span>
<span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a><span class="co"># Account:</span></span>
<span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --account=fc_paciorek</span></span>
<span id="cb4-7"><a href="#cb4-7" aria-hidden="true" tabindex="-1"></a><span class="co">#</span></span>
<span id="cb4-8"><a href="#cb4-8" aria-hidden="true" tabindex="-1"></a><span class="co"># Partition:</span></span>
<span id="cb4-9"><a href="#cb4-9" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --partition=savio3_htc</span></span>
<span id="cb4-10"><a href="#cb4-10" aria-hidden="true" tabindex="-1"></a><span class="co">#</span></span>
<span id="cb4-11"><a href="#cb4-11" aria-hidden="true" tabindex="-1"></a><span class="co"># Cores:</span></span>
<span id="cb4-12"><a href="#cb4-12" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --cpus-per-task=2</span></span>
<span id="cb4-13"><a href="#cb4-13" aria-hidden="true" tabindex="-1"></a><span class="co">#</span></span>
<span id="cb4-14"><a href="#cb4-14" aria-hidden="true" tabindex="-1"></a><span class="co"># Wall clock limit (2 minutes here):</span></span>
<span id="cb4-15"><a href="#cb4-15" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --time=00:02:00</span></span>
<span id="cb4-16"><a href="#cb4-16" aria-hidden="true" tabindex="-1"></a><span class="co">#</span></span>
<span id="cb4-17"><a href="#cb4-17" aria-hidden="true" tabindex="-1"></a><span class="co">## Command(s) to run:</span></span>
<span id="cb4-18"><a href="#cb4-18" aria-hidden="true" tabindex="-1"></a><span class="ex">module</span> load python/3.10.10 </span>
<span id="cb4-19"><a href="#cb4-19" aria-hidden="true" tabindex="-1"></a><span class="ex">python</span> calc.py <span class="op">>&</span> calc.out</span></code></pre></div>
<p>Note: The number of cores and nodes requested default to 1.</p>
<p>Tip: It’s generally a good idea to specify module versions explicitly
for reproducibility. Default versions will change over time.</p>
</div>
<div id="monitoring-jobs" class="slide section level1">
<h1>Monitoring jobs</h1>
<p>Now let’s submit and monitor the job:</p>
<pre><code>sbatch test.sh
squeue -j <JOB_ID>
wwall -j <JOB_ID></code></pre>
<p>You can also login to the node where you are running and use commands
like <code>top</code>, <code>free</code>, and <code>ps</code>:</p>
<pre><code>srun --jobid=<JOB_ID> --pty /bin/bash</code></pre>
<p>After a job has completed (or been terminated/cancelled), you can
review the maximum memory used (and other information) via the sacct
command.</p>
<pre><code>sacct -j <JOB_ID> --format=JobID,JobName,MaxRSS,Elapsed</code></pre>
<p>MaxRSS shows the maximum amount of memory the job used, in
kilobytes.</p>
</div>
<div id="specific-resources-cpus-cores" class="slide section level1">
<h1>Specific resources: CPUs (cores)</h1>
<p><strong>Per-core allocations</strong>: For partitions whose names end
in <code>_htc</code> or <code>_gpu</code>, jobs are scheduled (and
charged) per core. The default is one core.</p>
<p><strong>Per-node allocations</strong>: For other partitions, jobs are
given exclusive access to entire node(s) (and your account is charged
for all of the cores on the node(s)).</p>
<p>In a few partitions, the number of cores differs between machines in
the partition.</p>
<ul>
<li><p>E.g., <a
href="https://docs-research-it.berkeley.edu/services/high-performance-computing/user-guide/hardware-config/">in
<code>savio3</code>, some nodes have 40 cores and some have 32
cores</a>.</p></li>
<li><p>To request <a
href="https://docs-research-it.berkeley.edu/services/high-performance-computing/user-guide/running-your-jobs/scheduler-config/">particular
‘features’</a>, you can use <code>-C</code>, e.g.,</p>
<pre><code>srun -p savio3 -C savio3_c40 -A ac_scsguest --pty -t 5:00 bash # 40 cores
srun -p savio3 -C savio3 -A ac_scsguest --pty -t 5:00 bash # 32 cores</code></pre></li>
</ul>
</div>
<div id="specific-resources-memory-ram" class="slide section level1">
<h1>Specific resources: Memory (RAM)</h1>
<p>You generally should not request a particular amount of memory:</p>
<ul>
<li>full-node allocations can automatically use all the memory</li>
<li>per-core allocations are given memory proportional to the <a
href="https://docs-research-it.berkeley.edu/services/high-performance-computing/user-guide/hardware-config/">number
of cores</a>.
<ul>
<li>to get more memory, request the number of cores equivalent to the
memory you need (see the sketch below).</li>
</ul></li>
<li>In some partitions (<code>savio4_htc</code>,
<code>savio3_gpu</code>), the amount of CPU memory per node varies. (See
previous slide about ‘constraints’.)</li>
</ul>
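<p>For example, here’s a minimal sketch of requesting extra cores on a
per-core partition purely to get a larger memory share (the account
name is hypothetical; the exact memory per core depends on the
partition’s hardware config):</p>
<pre><code># each core comes with a fixed share of the node's memory,
# so requesting 4 cores gives roughly 4x the per-core memory
srun -A fc_foo -p savio3_htc -c 4 -t 30:00 --pty bash</code></pre>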
</div>
<div id="specific-resources-gpus" class="slide section level1">
<h1>Specific resources: GPUs</h1>
<p>GPU technology is advancing fast. As a result, it’s hard to maintain
a large, homogeneous pool of <a
href="https://docs-research-it.berkeley.edu/services/high-performance-computing/user-guide/hardware-config">GPU
nodes</a>.</p>
<ul>
<li><code>savio2_gpu</code> has (old) K80 GPUs.</li>
<li><code>savio3_gpu</code> has GTX2080TI, TITAN RTX, V100, and A40
nodes.</li>
<li><code>savio4_gpu</code> has A5000 nodes.</li>
</ul>
<p>Required submission info:</p>
<ul>
<li>Request the number of GPUs.</li>
<li>Request a <a
href="https://docs-research-it.berkeley.edu/services/high-performance-computing/user-guide/running-your-jobs/submitting-jobs/#gpu-jobs">fixed
number of CPUs for each GPU you need</a>.</li>
<li>Request the GPU type (required for <code>savio3_gpu</code> and
<code>savio4_gpu</code>).</li>
</ul>
<p>For example:</p>
<pre><code>sbatch -A fc_foo -p savio3_gpu --gres=gpu:GTX2080TI:1 -c 2 -t 60:00 job.sh
sbatch -A fc_foo -p savio3_gpu --gres=gpu:A40:2 -c 16 -t 60:00 job.sh</code></pre>
<p><code>CUDA_VISIBLE_DEVICES</code> will be set to <code>0,....</code>
(i.e., “internal” numbering within the job).</p>
</div>
<div id="submission-problems---obvious-failures"
class="slide section level1">
<h1>Submission problems - obvious failures</h1>
<ul>
<li><p>Submitting to an account/partition/QoS you don’t have access to
(“Invalid account or account/partition combination specified”).</p></li>
<li><p>FCA is exhausted (“This user/account pair does not have enough
service units”): If you’d like to see how much of an FCA has been
used:</p>
<pre><code>check_usage.sh -a fc_bands</code></pre></li>
</ul>
</div>
<div id="submission-problems---non-obvious-failures"
class="slide section level1">
<h1>Submission problems - non-obvious failures</h1>
<p>Frustratingly, some submissions can simply hang. They will never
start but do not give an error message.</p>
<ul>
<li>Time limit too long (e.g., more than 3 hours in the
<code>savio_debug</code> queue, or more than 72 hours for an FCA job in
<code>savio_normal</code>):</li>
</ul>
<pre><code>[paciorek@ln002 ~]$ srun -A ac_scsguest -t 74:00:00 -p savio3_htc --pty bash
[paciorek@ln002 ~]$ squeue -u paciorek -o "%.7i %.12P %.20j %.8u %.2t %.5C %.5D %.12M %.12l %.14r %.8p %.20q %.12b %.20R"
JOBID PARTITION NAME USER ST CPUS NODES TIME TIME_LIMIT REASON PRIORITY QOS TRES_PER_NOD NODELIST(REASON)
1809333 savio3_htc bash paciorek PD 1 1 0:00 3-02:00:00 QOSMaxWallDura 0.000034 savio_normal N/A (QOSMaxWallDurationP</code></pre>
<ul>
<li>Too many nodes requested:</li>
</ul>
<pre><code>[paciorek@ln002 ~]$ srun -A fc_paciorek -p savio4_htc -N 40 --pty -t 5:00 bash
[paciorek@ln002 ~]$ squeue -u paciorek -o "%.7i %.12P %.20j %.8u %.2t %.5C %.5D %.12M %.12l %.14r %.8p %.20q %.12b %.20R"
JOBID PARTITION NAME USER ST CPUS NODES TIME TIME_LIMIT REASON PRIORITY QOS TRES_PER_NOD NODELIST(REASON)
1809334 savio4_htc bash paciorek PD 40 40 0:00 5:00 QOSMaxNodePerJ 0.000085 savio_normal N/A (QOSMaxNodePerJobLim</code></pre>
<ul>
<li>GPU jobs not requesting sufficient CPUs:</li>
</ul>
<pre><code>[paciorek@ln002 ~]$ srun -A fc_paciorek -p savio4_gpu -c 2 --gres=gpu:A5000:1 --pty -t 5:00 bash
[paciorek@ln002 ~]$ squeue -u paciorek -o "%.7i %.12P %.20j %.8u %.2t %.5C %.5D %.12M %.12l %.14r %.8p %.20q %.12b %.20R"
JOBID PARTITION NAME USER ST CPUS NODES TIME TIME_LIMIT REASON PRIORITY QOS TRES_PER_NOD NODELIST(REASON)
1809335 savio4_gpu bash paciorek PD 2 1 0:00 5:00 QOSMinCpuNotSa 0.000108 a5k_gpu4_normal gres:gpu:A50 (QOSMinCpuNotSatisfi</code></pre>
<ul>
<li>Invalid or missing GPU type:</li>
</ul>
<pre><code>[paciorek@ln002 ~]$ srun -A fc_paciorek -p savio4_gpu -c 4 --gres=gpu:1 --pty -t 5:00 bash
[paciorek@ln002 ~]$ squeue -u paciorek -o "%.7i %.12P %.20j %.8u %.2t %.5C %.5D %.12M %.12l %.14r %.8p %.20q %.12b %.20R"
JOBID PARTITION NAME USER ST CPUS NODES TIME TIME_LIMIT REASON PRIORITY QOS TRES_PER_NOD NODELIST(REASON)
1809336 savio4_gpu bash paciorek PD 4 1 0:00 5:00 QOSMinGRES 0.000108 a5k_gpu4_normal gres:gpu:1 (QOSMinGRES)</code></pre>
</div>
<div id="monitoring-jobs-the-job-queue-and-overall-usage"
class="slide section level1">
<h1>Monitoring jobs, the job queue, and overall usage</h1>
<p>The basic command for seeing what is running on the system is
<code>squeue</code>:</p>
<pre><code>squeue
squeue -u $USER
squeue -A co_stat</code></pre>
<p>To see what nodes are available in a given partition:</p>
<pre><code>sinfo -p savio3
sinfo -p savio2_gpu</code></pre>
<p>For more information on cores, QoS, and additional (e.g., GPU)
resources, here’s some syntax:</p>
<pre><code>squeue -o "%.7i %.12P %.20j %.8u %.2t %.5C %.5D %.12M %.12l %.14r %.8p %.20q %.12b %.20R"</code></pre>
</div>
<div id="waiting-in-the-queue" class="slide section level1">
<h1>Waiting in the queue</h1>
<p>Tools to diagnose queueing situations:</p>
<ul>
<li>Our <code>sq</code> tool, which wraps <code>squeue</code>.</li>
<li><code>sinfo -p savio3_htc</code></li>
<li><code>squeue</code>
<ul>
<li><code>--state=PD</code> may be a helpful flag.</li>
</ul></li>
</ul>
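<p>For example, to gauge how backed up a partition is (using
<code>savio3_htc</code> just as an illustration):</p>
<pre><code>sinfo -p savio3_htc             # how many nodes are idle vs. allocated?
squeue -p savio3_htc --state=PD # how many jobs are already waiting?</code></pre>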
<p><a
href="https://docs-research-it.berkeley.edu/services/high-performance-computing/user-guide/running-your-jobs/why-job-not-run/">Reasons
your job might sit in the queue</a>:</p>
<ul>
<li>The partition may be fully occupied (<code>Priority</code>,
<code>Resources</code>).</li>
<li>Your condo may be fully utilizing its purchased resources
(<code>QOSGrpCpuLimit</code>, <code>QOSGrpNodeLimit</code>).</li>
<li>The total number of FCA jobs in small partitions may be at its limit
(<code>QOSGrpCpuLimit</code>, <code>QOSGrpNodeLimit</code>).</li>
<li>Slurm’s fair share policy will prioritize less-active FCA groups
(and less-active users) (<code>Priority</code>).</li>
<li>FCA jobs have lower priority than condo jobs
(<code>Priority</code>).</li>
<li>Your time limit may overlap with a scheduled downtime
(<code>ReqNodeNotAvail, Reserved for Maintenance</code>).</li>
</ul>
<p>Let’s experiment with submitting jobs to heavily-used partitions and
see what the queue looks like.</p>
</div>
<div id="how-the-queue-works" class="slide section level1">
<h1>How the queue works</h1>
<ul>
<li><p>Fairshare</p>
<ul>
<li>Condo jobs get top priority and will go to the top of the queue.
<ul>
<li>Users within a condo will be prioritized inversely to recent
usage.</li>
</ul></li>
<li>FCAs (and then users within FCAs) prioritized inversely to recent
usage (see the <code>PRIORITY</code> column of
<code>squeue</code>).</li>
</ul></li>
<li><p>Backfilling</p>
<ul>
<li>Slurm uses <a
href="https://docs-research-it.berkeley.edu/services/high-performance-computing/user-guide/running-your-jobs/why-job-not-run/">“backfilling”</a>
to try to fit in lower-priority jobs that won’t delay higher-priority
jobs.</li>
</ul></li>
</ul>
<center>
<img src="scheduler_cartoon.jpg">
</center>
</div>
<div id="how-the-queue-works-condos" class="slide section level1">
<h1>How the queue works (condos)</h1>
<ul>
<li>A condo’s usage, aggregated over all the condo’s users, is limited
to at most the number of nodes purchased by the condo at any given
time.</li>
<li>Additional jobs will be queued until usage drops below that limit.
<ul>
<li>The pending jobs will be ordered based on the Slurm Fairshare
priority, with users with less recent usage prioritized.</li>
</ul></li>
<li>Sometimes a condo job may not start immediately even if the condo’s
usage is below its allocation:
<ul>
<li>Because the partition is fully used, across all condo and FCA users
of the given partition.</li>
<li>This can occur when a condo has not been fully used and FCA jobs
have filled up the partition during that period of limited usage.</li>
<li>Condo jobs are prioritized over FCA jobs in the queue and will start
as soon as resources become available.</li>
<li>Usually any lag in starting condo jobs under this circumstance is
limited.</li>
</ul></li>
</ul>
</div>
<div id="how-the-queue-works-fcas" class="slide section level1">
<h1>How the queue works (FCAs)</h1>
<ul>
<li>Jobs start when they reach the top of the queue and resources become
available as running jobs finish.</li>
<li>The queue is ordered based on the Slurm Fairshare priority
(specifically the Fair Tree algorithm).</li>
<li>The primary influence on this priority is the overall recent usage
by all users in the same FCA as the user submitting the job.</li>
<li>Jobs from multiple users within an FCA are then influenced by their
individual recent usage.</li>
<li>In more detail, usage at the FCA level (summed across all
partitions) is ordered across all FCAs,
<ul>
<li>Priority for a given job depends inversely on that recent usage
(based on the FCA the job is using).</li>
<li>Similarly, amongst users within an FCA, usage is ordered amongst
those users, such that for a given partition, a user with lower recent
usage in that partition will have higher priority than one with higher
recent usage.</li>
</ul></li>
</ul>
</div>
<div id="when-will-my-job-start" class="slide section level1">
<h1>When will my job start?</h1>
<p><code>sq</code> provides a user-friendly way to understand why your
job isn’t running yet, or to check the status of a finished or failed job.</p>
<pre><code># should be loaded by default, but if it isn't:
# module load sq
# sq -h # for help with `sq`
sq</code></pre>
<pre><code>Showing results for user paciorek
Currently 0 running jobs and 1 pending job (most recent job first):
+---------|------|-------------|-----------|--------------|------|---------|-----------+
| Job ID | Name | Account | Nodes | QOS | Time | State | Reason |
+---------|------|-------------|-----------|--------------|------|---------|-----------+
| 7510375 | test | fc_paciorek | 1x savio2 | savio_normal | 0:00 | PENDING | Resources |
+---------|------|-------------|-----------|--------------|------|---------|-----------+
7510375:
This job is scheduled to run after 21 higher priority jobs.
Estimated start time: N/A
To get scheduled sooner, you can try reducing wall clock time as appropriate.
Recent jobs (most recent job first):
+---------|------|-------------|-----------|----------|---------------------|-----------+
| Job ID | Name | Account | Nodes | Elapsed | End | State |
+---------|------|-------------|-----------|----------|---------------------|-----------+
| 7509474 | test | fc_paciorek | 1x savio2 | 00:00:16 | 2021-02-09 23:47:45 | COMPLETED |
+---------|------|-------------|-----------|----------|---------------------|-----------+
7509474:
- This job ran for a very short amount of time (0:00:16). You may want to check that the output was correct or if it exited because of a problem.</code></pre>
<p>To see another user’s jobs:</p>
<pre><code>sq -u paciorek</code></pre>
<p>The <code>-a</code> flag shows current and past jobs together, the
<code>-q</code> flag suppresses messages about job issues, and the
<code>-n</code> flag sets the limit on the number of jobs to show in the
output (default = 8).</p>
<pre><code>sq -u paciorek -aq -n 10</code></pre>
<pre><code>Showing results for user paciorek
Recent jobs (most recent job first):
+-----------|------|-------------|-----------|------------|---------------------|-----------+
| Job ID | Name | Account | Nodes | Elapsed | End | State |
+-----------|------|-------------|-----------|------------|---------------------|-----------+
| 7487633.1 | ray | co_stat | 1x | 1-20:19:03 | Unknown | RUNNING |
| 7487633.0 | ray | co_stat | 1x | 1-20:19:08 | Unknown | RUNNING |
| 7487633 | test | co_stat | 2x savio2 | 1-20:19:12 | Unknown | RUNNING |
| 7487879 | bash | ac_scsguest | 1x savio | 00:00:27 | 2021-02-08 14:54:19 | COMPLETED |
| 7487633.2 | bash | co_stat | 2x | 00:00:34 | 2021-02-08 14:53:38 | FAILED |
| 7487515 | test | co_stat | 2x savio2 | 00:04:53 | 2021-02-08 14:22:17 | CANCELLED |
| 7487515.1 | ray | co_stat | 1x | 00:00:06 | 2021-02-08 14:17:39 | FAILED |
| 7487515.0 | ray | co_stat | 1x | 00:00:05 | 2021-02-08 14:17:33 | FAILED |
| 7473988 | test | co_stat | 2x savio2 | 3-00:00:16 | 2021-02-08 13:33:40 | TIMEOUT |
| 7473989 | test | ac_scsguest | 2x savio | 2-22:30:11 | 2021-02-08 11:47:54 | CANCELLED |
+-----------|------|-------------|-----------|------------|---------------------|-----------+</code></pre>
</div>
<div id="getting-your-job-to-start-faster" class="slide section level1">
<h1>Getting your job to start faster</h1>
<ul>
<li>Reduce the time limit.</li>
<li>Request fewer nodes or cores.</li>
<li>Find a less-used partition (using <code>sinfo</code>).</li>
<li>Submit to a condo instead of an FCA (if you’re in both) for higher
priority.</li>
<li>Submit to an FCA instead of a condo (if you’re in both) if condo is
full.</li>
</ul>
</div>
<div id="parallelization" class="slide section level1">
<h1>Parallelization</h1>
<p>Some flavors of parallelization:</p>
<ul>
<li>single node only:
<ul>
<li>threaded code (e.g., <code>openMP</code>, <code>TBB</code>)</li>
<li>threaded linear algebra in Python/numpy, R, Julia, etc. (uses
<code>openMP</code> or <code>MKL</code>), e.g., our <code>test.sh</code>
example earlier</li>
</ul></li>
<li>one or more nodes:
<ul>
<li>parallel loops, parallel maps in Python, R, etc. (usually one Linux
process per worker)
<ul>
<li>Python: <code>dask</code>, <code>ray</code>,
<code>ipyparallel</code> packages</li>
<li>R: <code>future</code>, <code>parallel</code>, <code>foreach</code>
packages</li>
</ul></li>
<li>MPI (message-passing)</li>
<li><a
href="https://docs-research-it.berkeley.edu/services/high-performance-computing/user-guide/running-your-jobs/gnu-parallel/">GNU
parallel</a>: parallelize independent tasks</li>
</ul></li>
</ul>
<p>Various other executables (e.g., in bioinformatics, computational
chemistry, computational fluid mechanics, etc.) use one or more of
these approaches internally.</p>
</div>
<div id="parallelization-considerations" class="slide section level1">
<h1>Parallelization considerations</h1>
<p>Rules-of-thumb:</p>
<ul>
<li>Often one core per process (i.e., “worker”)
<ul>
<li>Multiple cores per process for threaded code</li>
<li>Avoid having multiple processes per core</li>
</ul></li>
<li>One or more computational units per worker</li>
</ul>
<p>Confusingly, “task” could mean “worker” in the context of MPI or
“computational unit” more generally.</p>
<p>Important:</p>
<ul>
<li>Is the executable you’re using written so as to use
parallelization?</li>
<li>What does the user need to specify?
<ul>
<li>Sometimes multi-core, single-node parallelization will occur without
user specification.</li>
</ul></li>
</ul>
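<p>As one illustration of the “avoid multiple processes per core”
rule-of-thumb: if you run many single-core workers using software that
also does implicit threading (e.g., threaded linear algebra), you may
want to limit each worker to one thread. A sketch, assuming
openMP-based threading:</p>
<pre><code># one thread per worker process, so the workers don't oversubscribe cores
export OMP_NUM_THREADS=1</code></pre>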
</div>
<div id="slurm-flags" class="slide section level1">
<h1>Slurm flags:</h1>
<ul>
<li><code>--cpus-per-task</code> (<code>-c</code>): number of cores for
each task</li>
<li><code>--ntasks</code> (<code>-n</code>): total number of tasks</li>
<li><code>--ntasks-per-node</code>: number of tasks on each node</li>
<li><code>--nodes</code> (<code>-N</code>): the number of nodes to
use</li>
</ul>
<p>Based on the flags, Slurm will set various shell environment
variables your code can use to configure parallelization, e.g.,
<code>SLURM_NTASKS</code>, <code>SLURM_CPUS_PER_TASK</code>,
<code>SLURM_NODELIST</code>, <code>SLURM_NNODES</code>.</p>
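<p>For instance, inside a job script submitted with
<code>--cpus-per-task=8</code>, one might configure threading based on
what Slurm actually allocated (a sketch, assuming openMP-based
threading):</p>
<pre><code>export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK  # use all 8 requested cores
echo "$SLURM_NTASKS task(s) across $SLURM_NNODES node(s)"</code></pre>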
<p>We generally refer to “cores” rather than “CPUs” as modern CPUs have
multiple computational cores that can each carry out independent
work.</p>
</div>
<div id="cpus-per-task-vs.-ntasks" class="slide section level1">
<h1><code>cpus-per-task</code> vs. <code>ntasks</code></h1>
<p>In some cases one can either use <code>--cpus-per-task</code> or
<code>--ntasks</code> (or <code>--ntasks-per-node</code>) to get
multiple cores on a single node.</p>
<p>Caveats:</p>
<ul>
<li>Can’t use <code>--cpus-per-task</code> to get cores on multiple
nodes.</li>
<li><code>--ntasks</code> does not guarantee cores all on a single node
(but <code>--ntasks-per-node</code> does).</li>
<li>Need to use <code>--ntasks</code> (or
<code>--ntasks-per-node</code>) for MPI jobs.</li>
<li>Need to specify both cpus and tasks for hybrid jobs with multiple
threaded processes (e.g., MPI+openMP or GNU parallel+openMP).</li>
</ul>
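<p>A sketch contrasting the two flags for an 8-core request (account
name hypothetical, as in earlier examples):</p>
<pre><code># 8 cores for one task, guaranteed on a single node (threaded code):
sbatch -A fc_foo -p savio3_htc -c 8 -t 30:00 job.sh
# 8 tasks, possibly spread across nodes (MPI or one worker per task):
sbatch -A fc_foo -p savio3_htc -n 8 -t 30:00 job.sh</code></pre>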
</div>
<div id="examples" class="slide section level1">
<h1>Examples</h1>
<p>Some common paradigms are:</p>
<ul>
<li>one node, many cores
<ul>
<li>openMP/threaded jobs - one task, <em>c</em> cores for the task</li>
<li>Python/R/GNU parallel - <em>n</em> tasks, one per core at any given
time, often more computational units than tasks</li>
</ul></li>
<li>many nodes, many cores
<ul>
<li>MPI jobs that use one core per task for each of <em>n</em> tasks,
spread across multiple nodes</li>
<li>Python/R/GNU parallel - <em>n</em> tasks, one per core at any given
time, often more computational units than tasks</li>
</ul></li>
<li>hybrid jobs that use <em>c</em> cores for each of <em>n</em> tasks
<ul>
<li>e.g., MPI+threaded code</li>
</ul></li>
</ul>
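<p>As a sketch of the hybrid case, here’s how the flags and environment
variables fit together in a job script (the program name is
hypothetical):</p>
<pre><code>#SBATCH --ntasks=4
#SBATCH --cpus-per-task=5
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK  # 5 threads per MPI task
mpirun ./hybrid_app                          # 4 MPI tasks, 20 cores total</code></pre>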
<p>We have lots more <a
href="https://docs-research-it.berkeley.edu/services/high-performance-computing/user-guide/running-your-jobs/scheduler-examples">examples
of job submission scripts</a> for different kinds of parallelization
(multi-node (MPI), multi-core (openMP), hybrid, etc.).</p>
</div>
<div id="mpi-and-slurm" class="slide section level1">
<h1>MPI and Slurm</h1>
<p>Slurm’s “ntasks” corresponds to the number of MPI tasks.</p>
<p>MPI knows about the Slurm job specification.</p>
<p>So you don’t need to specify <code>-np</code> or
<code>--machinefile</code> with <code>mpirun/mpiexec</code>.</p>
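<p>So a minimal MPI job script can look like this (a sketch; the
program name is hypothetical, and the module versions are the ones
shown on the next slide):</p>
<pre><code>#SBATCH --ntasks=8
module load gcc/11.3.0 openmpi/5.0.0-ucx
mpirun ./my_mpi_app  # no -np needed: mpirun picks up the 8 tasks from Slurm</code></pre>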
</div>
<div id="mpi-troubleshooting" class="slide section level1">
<h1>MPI troubleshooting</h1>
<p>It’s not uncommon to get MPI run-time errors on Savio that can be
hard to decipher, particularly when running on multiple nodes.</p>
<ul>
<li>Load the compiler module (e.g., <code>gcc</code>,
<code>intel</code>), then load the compiler-specific MPI module (e.g.,
<code>openmpi</code>)</li>
<li>The MPI version used to compile code should be the same as used to
run the code.</li>
<li>The MPI version used inside an Apptainer/Singularity container
should be the same as the module loaded on the system.</li>
<li>Use MPI+UCX for MPI jobs on <code>savio4_htc</code> for efficiency
(<code>module load gcc/11.3.0 openmpi/5.0.0-ucx</code>)</li>
</ul>
<p>If you troubleshoot based on the above items and are still stuck,
please contact us.</p>
</div>
<div id="using-multiple-gpus" class="slide section level1">
<h1>Using multiple GPUs</h1>
<ul>
<li>Is your code set up to use multiple GPUs?</li>
<li><code>CUDA_VISIBLE_DEVICES</code> will be set when your job
starts.</li>
<li>With PyTorch, you will refer to the GPUs indexed starting with
0.</li>
</ul>
<pre><code>import torch
gpu0 = torch.device("cuda:0")
gpu1 = torch.device("cuda:1")
x = torch.rand(100)
x0 = x.to(gpu0)
x1 = x.to(gpu1)</code></pre>
</div>
<div id="parallelizing-independent-computations"
class="slide section level1">
<h1>Parallelizing independent computations</h1>
<p>You may have many serial jobs to run. It may be more cost-effective
and/or simply easier to manage if you collect those jobs together and
run them across multiple cores on one or more nodes.</p>
<p>Here are some options:</p>
<ul>
<li>using <a
href="https://docs-research-it.berkeley.edu/services/high-performance-computing/user-guide/running-your-jobs/gnu-parallel/">GNU
parallel</a> to run many computational tasks (e.g., thousands of
simulations, scanning tens of thousands of parameter values, etc.) as
part of single Savio job submission</li>
<li>using <a
href="https://berkeley-scf.github.io/tutorial-parallelization">single-node
or multi-node parallelism</a> in Python, R, Julia, MATLAB, etc.
<ul>
<li>parallel R tools such as <em>future</em>, <em>foreach</em>,
<em>parLapply</em>, and <em>mclapply</em></li>
<li>parallel Python tools such as <em>ipyparallel</em>, <em>Dask</em>,
and <em>ray</em></li>
<li>parallel functionality in MATLAB through <em>parfor</em></li>
</ul></li>
</ul>
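<p>A minimal GNU parallel sketch (the module name and
<code>sim.py</code>, a script taking one parameter value as its
argument, are assumptions):</p>
<pre><code>module load gnu-parallel
# run one task per allocated core, cycling through 1000 parameter values
parallel -j $SLURM_CPUS_ON_NODE python sim.py ::: {1..1000}</code></pre>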
</div>
<div id="troubleshooting-failed-or-misbehaving-jobs"
class="slide section level1">
<h1>Troubleshooting failed or misbehaving jobs</h1>
<ul>
<li>Look at the software’s log/output files and Slurm’s job/error files
(<code>slurm-<JOB_ID>.out</code>,
<code>slurm-<JOB_ID>.err</code>)</li>
<li>Use <code>sacct</code> to look at result of failed jobs (memory use,
time limit, error codes):
<ul>
<li><code>sacct -j <JOB_ID> --format=JobID,JobName,MaxRSS,Elapsed</code></li>
<li><code>sacct -u <USER> -S 2024-04-04 --format User,JobID,JobName,Partition,Account,AllocCPUS,State,MaxRSS,ExitCode,Submit,Start,End,Elapsed,Timelimit,NodeList</code></li>
</ul></li>
<li>Possible hardware failures – use <code>sacct</code> to see if
repeated failures occur on particular node(s)
<ul>
<li>Specify nodes with <code>-w</code> or exclude with
<code>-x</code>.</li>
</ul></li>
<li>Run your code interactively via <code>srun</code></li>
<li>Run multi-node jobs on a single node to check for communication
issues or issues with modules on additional nodes</li>
<li>Contact us if you’re stuck.</li>
</ul>
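<p>For example, to check whether repeated failures cluster on a
particular node and then steer around it (the node name is
hypothetical):</p>
<pre><code>sacct -u $USER -S 2024-04-04 --format JobID,State,ExitCode,NodeList
sbatch -x n0123.savio3 job.sh  # exclude the suspect node</code></pre>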
</div>
</body>
</html>