Merge pull request #583 from stdweird/course_docs
add shinx cluster and pilot phase details
hajgato authored Oct 23, 2023
2 parents 4f7fd95 + 4fa6d29 commit 6cd434f
Showing 2 changed files with 124 additions and 1 deletion.
123 changes: 123 additions & 0 deletions mkdocs/docs/HPC/only/gent/2023/shinx.md
@@ -0,0 +1,123 @@
# New Tier-2 cluster: `shinx`

In October 2023, a new pilot cluster was added to the HPC-UGent Tier-2 infrastructure: `shinx`.

This page provides some important information regarding this cluster, and how it differs from the clusters
it is replacing (`swalot` and `victini`).

If you have any questions on using `shinx`, you can [contact the {{ hpcteam }}]({{ hpc_support_url }}).

For software installation requests, please use the [request form](https://www.ugent.be/hpc/en/support/software-installation-request).

---

## `shinx`: generic CPU cluster

`shinx` is a new CPU-only cluster.

It replaces `swalot`, which will be retired on **Wednesday 01 November 2023**,
and `victini`, which will be retired on **Monday 05 February 2024**.

It is primarily for regular CPU compute use.

This cluster consists of 48 workernodes, each with:

* 2x 96-core AMD EPYC 9654 processors (Genoa @ 2.4 GHz);
* ~360 GiB of RAM;
* 400 GB of local disk space;
* NDR-200 InfiniBand interconnect;
* RHEL9 as operating system.

To start using this cluster from a terminal session, first run:
```
module swap cluster/shinx
```
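
To confirm which cluster module is active after the swap, a quick sanity check (not a required step) is to list your loaded modules:

```
# the list of loaded modules should now include cluster/shinx
module list
```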

You can also start (interactive) sessions on `shinx` using the [HPC-UGent web portal](../../../web_portal.md).

### Differences compared to `swalot` and `victini`

#### CPUs

The most important difference between `shinx` and `swalot`/`victini` workernodes is in the CPUs:
while `swalot` and `victini` workernodes featured *Intel* CPUs, `shinx` workernodes have `AMD Genoa` CPUs.

Although software that was built on a `swalot` or `victini` workernode with compiler options that enable architecture-specific
optimizations (like GCC's `-march=native`, or the Intel compiler's `-xHost`) might still run on
a `shinx` workernode, it is recommended to recompile the software to benefit from the support for
`AVX-512` vector instructions (which is missing on `swalot`).
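
As a rough illustration (with `myprog.c` as a hypothetical source file to rebuild), you could check on a `shinx` workernode which AVX-512 extensions are advertised and then recompile with architecture-specific optimizations enabled:

```
# check which AVX-512 extensions the CPUs of this node advertise
grep -o 'avx512[a-z_]*' /proc/cpuinfo | sort -u

# rebuild with architecture-specific optimizations (GCC example)
gcc -O2 -march=native -o myprog myprog.c
```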

#### Cluster size

The `shinx` cluster is significantly bigger than `swalot` and `victini` in total number of cores and in number of cores per workernode,
but not in number of workernodes. In particular, reconsider whether requesting all cores of a workernode via `ppn=all` is still what you want.

The amount of available memory per core is `1.9 GiB`, which is lower than on the `swalot` nodes (`6.2 GiB` per core)
and the `victini` nodes (`2.5 GiB` per core).
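
As an illustration, a minimal job script sketch that requests an explicit core count instead of `ppn=all`; the `#PBS` directives follow the usual PBS-style syntax, `myprog` is a hypothetical application, and the core count and walltime should be adjusted to your own workload:

```
#!/bin/bash
#PBS -l nodes=1:ppn=96
#PBS -l walltime=12:00:00
# 96 cores is half a shinx workernode (2x 96 cores per node);
# at ~1.9 GiB per core this corresponds to roughly 180 GiB of memory

cd $PBS_O_WORKDIR
./myprog
```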


### Comparison with `doduo`

As `doduo` is currently the largest CPU cluster of the UGent Tier-2 infrastructure, and it is also based on `AMD EPYC` CPUs,
we would like to point out that, roughly speaking, one `shinx` node is equivalent to two `doduo` nodes.

Although software that was built on a `doduo` workernode with compiler options that enable architecture-specific
optimizations (like GCC's `-march=native`, or the Intel compiler's `-xHost`) might still run on
a `shinx` workernode, it is recommended to recompile the software to benefit from the support for
`AVX-512` vector instructions (which is missing on `doduo`).

### Other remarks

* Possible issues with `OpenMP` thread pinning: we have seen, especially on the Tier-1 `dodrio` cluster, that in certain cases
thread pinning is invoked where it is not expected. A typical symptom is that all started processes are pinned to a single core.
You can try to mitigate this yourself by setting `export OMP_PROC_BIND=false`, but please always report the issue when it occurs,
so we can keep track of this problem. It is not recommended to set this workaround globally; only apply it to the specific tools
that are affected (see the sketch below).
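
A minimal sketch of applying this workaround only to an affected tool (here a hypothetical `affected_tool`), instead of exporting it for your whole session:

```
# set OMP_PROC_BIND only for this single command
OMP_PROC_BIND=false ./affected_tool

# avoid putting `export OMP_PROC_BIND=false` in e.g. ~/.bashrc
```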


---

## `shinx` pilot phase (23/10/2023 - 20/05/2024)

As usual with any pilot phase, you need to be a member of the `gpilot` group. To start using this cluster, run:

```
module swap cluster/.shinx
```
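
A quick way to check whether your account is already a member of the `gpilot` group (a sketch using standard tooling):

```
# list the groups your account belongs to; gpilot should be among them
groups | tr ' ' '\n' | grep -x gpilot
```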

Because the delivery time of the InfiniBand network is very long, we only expect to have all material by the end of February 2024.
However, all workernodes will already be delivered in the week of 20 October 2023.

As such, we will have an extended pilot phase in 3 stages:

### Stage 0: 23/10/2023-17/11/2023

* Minimal cluster to test software and nodes
* Only 2 or 3 nodes available
* FDR or EDR InfiniBand network
* EL8 OS

* Retirement of the `swalot` cluster (as of 01 November 2023)
* Racking of stage 1 nodes

### Stage 1: 17/11/2023-01/03/2024

* 2/3 of the full cluster size
* 32 nodes
* EDR InfiniBand
* EL8 OS

* Retirement of `victini` (as of 05 February 2024)
* Racking of last 16 nodes
* Installation of the NDR/NDR-200 InfiniBand network

### Stage 2: 01/03/2024-20/05/2024

* Full size cluster
* 48 nodes
* NDR-200 InfiniBand
* EL9 OS

* We expect to plan a full Tier-2 downtime in May 2024 to clean up, refactor and renew the core networks
(Ethernet and InfiniBand) and some core services. It makes no sense to put `shinx` in production before
that period, and the testing of the `EL9` operating system will also take some time.
2 changes: 1 addition & 1 deletion mkdocs/extra/gent.yml
@@ -76,7 +76,7 @@ othercluster: donphan
hpcinfo: [email protected]
hpcusersml: [email protected]
hpcannounceml: [email protected]
operatingsystem: and RHEL 8.8 (accelgor, doduo, donphan, gallade, joltik, skitty, swalot, victini)
operatingsystem: RHEL 8.8 (accelgor, doduo, donphan, gallade, joltik, skitty, swalot, victini)
operatingsystembase: Red Hat Enterprise Linux
# Special for perfexpert tutorial
mpirun: vsc-mympirun
