From bf69b07303736030fadaf0b0c54eeeae7393a4da Mon Sep 17 00:00:00 2001
From: Wouter Depypere
Date: Thu, 24 Oct 2024 12:24:41 +0200
Subject: [PATCH] port ugent infra page to user docs

---
 mkdocs/docs/HPC/index.md          |  2 +-
 mkdocs/docs/HPC/infrastructure.md | 92 +++++++++++++++++++++++++++++++
 2 files changed, 93 insertions(+), 1 deletion(-)
 create mode 100644 mkdocs/docs/HPC/infrastructure.md

diff --git a/mkdocs/docs/HPC/index.md b/mkdocs/docs/HPC/index.md
index b3b187c6186..2d97d292b7b 100644
--- a/mkdocs/docs/HPC/index.md
+++ b/mkdocs/docs/HPC/index.md
@@ -20,7 +20,7 @@ Use the OS dropdown in the top bar to switch to a different operating system.
 {%- if site == 'Gent' %}
 - [Recording of HPC-UGent intro](https://www.ugent.be/hpc/en/training/introhpcugent-recording)
 - [Linux Tutorial](linux-tutorial/index.md)
-- [Hardware overview](https://www.ugent.be/hpc/en/infrastructure)
+- [Hardware overview](infrastructure.md)
 - [Available software](./only/gent/available_software/index.md)
 - [Migration of cluster and login nodes to RHEL9 (starting Sept'24)](rhel9.md)
 {%- endif %}
diff --git a/mkdocs/docs/HPC/infrastructure.md b/mkdocs/docs/HPC/infrastructure.md
new file mode 100644
index 00000000000..a1db179e385
--- /dev/null
+++ b/mkdocs/docs/HPC/infrastructure.md
@@ -0,0 +1,92 @@
+# Infrastructure
+
+## Tier-2 clusters of Ghent University
+
+The Stevin computing infrastructure consists of several Tier-2 clusters
+hosted in the S10 datacenter of Ghent University.
+
+This infrastructure is co-financed by FWO and the Department of Economy,
+Science and Innovation (EWI).
+
+## Tier-2 login nodes
+
+Log in to the HPC-UGent Tier-2 infrastructure using SSH via `login.hpc.ugent.be`.
+
+## Tier-2 compute clusters
+
+### CPU clusters
+
+The HPC-UGent Tier-2 infrastructure currently includes several standard
+CPU-only clusters of different generations (listed from old to new).
+
+For basic information on using these clusters, see our
+[documentation](running_batch_jobs.md).
+
+| ***cluster name*** | ***# nodes*** | ***Processor architecture*** | ***Usable memory/node*** | ***Local diskspace/node*** | ***Interconnect*** | ***Operating system*** |
+| --- | --- | --- | --- | --- | --- | --- |
+| skitty | 72 | 2x 18-core Intel Xeon Gold 6140 (Skylake @ 2.3 GHz) | 177 GiB | 1 TB & 240 GB SSD | EDR InfiniBand | RHEL 9 |
+| doduo (default cluster) | 128 | 2x 48-core AMD EPYC 7552 (Rome @ 2.2 GHz) | 250 GiB | 180 GB SSD | HDR-100 InfiniBand | RHEL 8 |
+| gallade (*) | 16 | 2x 64-core AMD EPYC 7773X (Milan-X @ 2.2 GHz) | 940 GiB | 1.5 TB NVMe | HDR-100 InfiniBand | RHEL 9 |
+| shinx | 48 | 2x 96-core AMD EPYC 9654 (Genoa @ 2.4 GHz) | 370 GiB | 500 GB NVMe | NDR-200 InfiniBand | RHEL 9 |
+
+(*) also see this [extra information](donphan-gallade.md#gallade-large-memory-cluster)
+
+### Interactive debug cluster
+
+A special-purpose interactive debug cluster is available,
+where you should always be able to get a job running quickly,
+**without waiting in the queue**.
+
+Intended usage is mainly interactive work,
+either via an interactive job or using the [HPC-UGent web portal](web_portal.md).
+
+This cluster is heavily oversubscribed, so jobs may
+run slower when the cluster is used more heavily.
+
+Strict limits are in place per user:
+
+* max. 5 jobs in the queue
+* max. 3 jobs running
+* max. 8 cores and 27 GB of memory in total for running jobs
+
+For more information, see our [documentation](interactive_gent.md).
+
+| ***cluster name*** | ***# nodes*** | ***Processor architecture & GPUs*** | ***Usable memory/node*** | ***Local diskspace/node*** | ***Interconnect*** | ***Operating system*** |
+| --- | --- | --- | --- | --- | --- | --- |
+| donphan (*) | 16 | 2x 18-core Intel Xeon Gold 6240 (Cascade Lake @ 2.6 GHz) + 1x shared NVIDIA Ampere A2 GPU (16 GB GPU memory) | 738 GiB | 1.6 TB NVMe | HDR-100 InfiniBand | RHEL 8 |
+
+(*) also see this [extra information](donphan-gallade.md#donphan-debuginteractive-cluster)
+
+### GPU clusters
+
+GPU clusters are available in the HPC-UGent Tier-2 infrastructure,
+with different generations of NVIDIA GPUs.
+
+These are well suited for specific workloads, with software that
+can leverage the GPU resources (like TensorFlow, PyTorch, GROMACS, AlphaFold, etc.).
+
+For more information on using these clusters, see our documentation.
+
+| ***cluster name*** | ***# nodes*** | ***Processor architecture & GPUs*** | ***Usable memory/node*** | ***Local diskspace/node*** | ***Interconnect*** | ***Operating system*** |
+| --- | --- | --- | --- | --- | --- | --- |
+| joltik | 10 | 2x 16-core Intel Xeon Gold 6242 (Cascade Lake @ 2.8 GHz) + 4x NVIDIA Volta V100 GPUs (32 GB GPU memory) | 256 GiB | 800 GB SSD | double EDR InfiniBand | RHEL 9 |
+| accelgor | 9 | 2x 24-core AMD EPYC 7413 (Milan @ 2.2 GHz) + 4x NVIDIA Ampere A100 GPUs (80 GB GPU memory) | 500 GiB | 180 GB SSD | HDR InfiniBand | RHEL 8 |
+
+## Tier-2 shared storage
+
+| ***Filesystem name*** | ***Intended usage*** | ***Total storage space*** | ***Personal storage space*** | ***VO storage space (^)*** |
+| --- | --- | --- | --- | --- |
+| $VSC_HOME | Home directory, entry point to the system | 90 TB | 3 GB (fixed) | (none) |
+| $VSC_DATA | Long-term storage of large data files | 1.9 PB | 25 GB (fixed) | 250 GB |
+| $VSC_SCRATCH | Temporary fast storage of 'live' data for calculations | 1.7 PB | 25 GB (fixed) | 250 GB |
+| $VSC_SCRATCH_ARCANINE | Temporary very fast storage of 'live' data for calculations (recommended for very I/O-intensive jobs) | 70 TB NVMe | (none) | upon request |
+
+(^) Storage space for a group of users (Virtual Organisation, or VO for short) can be
+increased significantly on request. For more information, see our
+[documentation](running_jobs_with_input_output_data.md#virtual-organisations).
+
+## Infrastructure status
+
+[Check the system status](https://www.ugent.be/hpc/en/infrastructure/status)
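
For reviewers who want to try out what the new page documents, here is a minimal sketch. The `login.hpc.ugent.be` endpoint and the donphan debug cluster come from the page itself; the `vsc40000` account id is a placeholder, and the `module swap cluster/...` plus `qsub -I` steps assume the usual HPC-UGent cluster-module and Torque-style submission conventions rather than anything this patch specifies.

```shell
# SSH to the Tier-2 login nodes (vsc40000 is a placeholder VSC account id)
ssh vsc40000@login.hpc.ugent.be

# Sketch, assuming the usual cluster-module convention: switch to the
# interactive debug cluster and start a small interactive job that stays
# within the per-user limits listed on the page
module swap cluster/donphan
qsub -I -l nodes=1:ppn=1 -l walltime=01:00:00
```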
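
Similarly, the filesystems in the "Tier-2 shared storage" table are exposed as environment variables on the clusters; a quick check from a login node might look like this (`du -sh` is just an illustrative way to see usage, not something the page prescribes):

```shell
# Show where the shared storage locations from the table point to
echo "$VSC_HOME" "$VSC_DATA" "$VSC_SCRATCH"

# Report how much of your personal data quota is currently in use
du -sh "$VSC_DATA"
```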