Migrate nebari slurm docs (#535)
Co-authored-by: Vinicius D. Cerutti <[email protected]>
aktech and viniciusdc authored Oct 29, 2024
1 parent 2188a1d commit e6db343
Showing 19 changed files with 1,180 additions and 8 deletions.
2 changes: 1 addition & 1 deletion docs/docs/references/RELEASE.md
@@ -1312,7 +1312,7 @@ Explicit user facing changes:

- `qhub deploy -c qhub-config.yaml` no longer prompts unsupported argument for `load_config_file`.
- Minor changes on the Step-by-Step walkthrough on the docs.
- Revamp of README.md to make it concise and highlight QHub HPC.
- Revamp of README.md to make it concise and highlight Nebari Slurm.

### Breaking changes

Binary file added docs/nebari-slurm/_static/images/architecture.png
99 changes: 99 additions & 0 deletions docs/nebari-slurm/benchmark.md
@@ -0,0 +1,99 @@
# Benchmarking

There are many factors that go into HPC performance. Aside from the
obvious CPU performance, network and storage are equally important in
a distributed context.

## Storage

[fio](https://fio.readthedocs.io/en/latest/fio_doc.html) is a powerful
tool for benchmarking filesystems. Measuring maximum performance,
especially on extremely high performance distributed filesystems, can
be tricky and requires research on how to use the tool effectively;
it often takes multiple nodes and threads reading and writing in
parallel to saturate such filesystems. Even so, the commands below
should provide a good ballpark of performance.

Substitute `<directory>` with the filesystem that you want to
test. `df -h` is a great way to see where each drive is
mounted. `fio` needs permission to read and write in the given
directory.
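
As a quick sanity check before running the benchmarks below, you can
confirm the mount point and write permissions first (a minimal sketch
using the same `<directory>` placeholder):

```shell
# Confirm which filesystem backs the target directory
df -h <directory>

# Confirm the current user can create and remove files there
touch <directory>/fio-permission-test && rm <directory>/fio-permission-test
```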

The throughput benchmarks below use large (1m) sequential transfers,
while the IOPs (input/output operations per second) benchmarks use
small (4K) random transfers.

### Maximum Write Throughput

```shell
fio --ioengine=sync --direct=0 \
--fsync_on_close=1 --randrepeat=0 --nrfiles=1 --name=seqwrite --rw=write \
--bs=1m --size=20G --end_fsync=1 --fallocate=none --overwrite=0 --numjobs=1 \
--directory=<directory> --loops=10
```

### Maximum Write IOPs

```shell
fio --ioengine=sync --direct=0 \
--fsync_on_close=1 --randrepeat=0 --nrfiles=1 --name=randwrite --rw=randwrite \
--bs=4K --size=1G --end_fsync=1 --fallocate=none --overwrite=0 --numjobs=80 \
--sync=1 --directory=<directory> --loops=10
```

### Maximum Read Throughput

```shell
fio --ioengine=sync --direct=0 \
--fsync_on_close=1 --randrepeat=0 --nrfiles=1 --name=seqread --rw=read \
--bs=1m --size=240G --end_fsync=1 --fallocate=none --overwrite=0 --numjobs=1 \
--directory=<directory> --invalidate=1 --loops=10
```

### Maximum Read IOPs

```shell
fio --ioengine=sync --direct=0 \
--fsync_on_close=1 --randrepeat=0 --nrfiles=1 --name=randread --rw=randread \
--bs=4K --size=1G --end_fsync=1 --fallocate=none --overwrite=0 --numjobs=20 \
--sync=1 --invalidate=1 --directory=<directory> --loops=10
```
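
For scripted comparisons between filesystems, `fio` can also emit
machine-readable output. The sketch below assumes `jq` is installed
and that your fio version reports aggregate write bandwidth under
`jobs[0].write.bw` (in KiB/s); field names can vary between fio
versions:

```shell
# Re-run the sequential write benchmark, capturing JSON instead of text output
fio --ioengine=sync --direct=0 \
--fsync_on_close=1 --randrepeat=0 --nrfiles=1 --name=seqwrite --rw=write \
--bs=1m --size=20G --end_fsync=1 --fallocate=none --overwrite=0 --numjobs=1 \
--directory=<directory> --loops=10 \
--output-format=json --output=seqwrite.json

# Extract the write bandwidth (KiB/s); adjust the path if your fio version differs
jq '.jobs[0].write.bw' seqwrite.json
```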

## Network

To test network latency and bandwidth you need a source machine
`<src>` and a destination machine `<dest>`. With iperf the server
listens on a single port, by default `5201`.

### Bandwidth

Start a server on the `<dest>` machine:

```shell
iperf3 -s
```
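
Before starting the client, it can be worth confirming that the
server is listening and that the port is reachable from `<src>`
(assuming `ss` and `nc` are available on the machines):

```shell
# On <dest>: confirm iperf3 is listening on its default port
ss -ltn | grep 5201

# On <src>: confirm port 5201 is reachable through any firewalls
nc -zv <dest> 5201
```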

Now, on the `<src>` machine, run the client against the `<dest>` address:

```shell
iperf3 -c <ip address>
```

This measures the bandwidth of the link between the nodes from
`<src>` to `<dest>`. If you are using a provider where your connection
has very different upload and download speeds, you will see very
different results depending on the direction. Add the `-R` flag to
the client to test the other direction.
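
A single TCP stream does not always saturate a fast link. If the
numbers look low, try multiple parallel streams over a longer test;
the flags below are standard `iperf3` options, shown here as a sketch:

```shell
# 8 parallel streams for 30 seconds, <src> to <dest>
iperf3 -c <dest> -P 8 -t 30

# The same test in the reverse direction
iperf3 -c <dest> -P 8 -t 30 -R
```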

### Latency

[ping](https://linux.die.net/man/8/ping) is a great way to watch the
latency between `<src>` and `<dest>`.

From the `<src>` machine run:

```shell
ping -4 <dest> -c 10
```

Keep in mind that ping reports the bi-directional (round-trip) time,
so dividing by 2 gives a rough estimate of the one-way latency.
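
If you want a single number to record, the average round-trip time
can be pulled from the ping summary and halved. A small sketch,
assuming the usual Linux summary line (`rtt min/avg/max/mdev = ...`):

```shell
# Average RTT divided by 2 gives an approximate one-way latency (ms)
ping -4 -c 10 <dest> | awk -F'/' '/^rtt/ {printf "~%.3f ms one-way\n", $5/2}'
```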
79 changes: 79 additions & 0 deletions docs/nebari-slurm/comparison.md
@@ -0,0 +1,79 @@
# Nebari and Nebari Slurm Comparison

At a high level, Nebari is focused on a Kubernetes and
container-based deployment of all of its components. A
container-based deployment brings better security and easier scaling
of components and compute nodes.

Nebari Slurm is focused on bringing many of the same features to a
bare-metal installation, allowing users to take full advantage of
their hardware for performance. These installations also tend to be
easier to manage and debug when issues arise (traditional Linux
sysadmin experience works well here). Due to this approach Nebari
Slurm does not use containers; it schedules workflows and compute via
[Slurm](https://slurm.schedmd.com/documentation.html) and keeps
[services
available](https://www.freedesktop.org/wiki/Software/systemd/) with
systemd.
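
Because everything runs directly on the hosts, a Nebari Slurm
deployment can be inspected with standard Slurm and systemd
tooling. A hypothetical example (the exact service names depend on
your configuration):

```shell
# List Slurm partitions and node states
sinfo

# Show queued and running jobs
squeue

# Check one of the long-running services managed by systemd
systemctl status jupyterhub
```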

Questions to help determine which solution may be best for you:

1. Are you deploying to the cloud, e.g. AWS, GCP, Azure, or Digital Ocean?

Nebari is likely your best option. The auto-scaling of Nebari compute
allows for cost-effective usage of the cloud while taking advantage
of a managed Kubernetes.

2. Are you deploying to a bare-metal cluster?

Nebari Slurm may be your best option, since deployment does not
require the complexity of managing a Kubernetes cluster. If you do
have a DevOps or IT team to help manage Kubernetes on bare metal,
Nebari could be a great option, but be advised that managing
Kubernetes comes with quite a lot of complexity that the cloud
providers otherwise handle for you.

3. Are you concerned about absolute best performance?

Nebari Slurm is likely your best option. Note that by absolute
performance we mean software that fully takes advantage of your
network's InfiniBand hardware, uses MPI, and uses SIMD
instructions. Few users fall into this camp, and it should rarely be
the reason to choose Nebari Slurm (unless you know why you are making
this choice).

## Feature Matrix

| Core                                               | Nebari                              | Nebari Slurm      |
| -------------------------------------------------- | ----------------------------------- | ----------------- |
| Scheduler                                           | Kubernetes                          | systemd and Slurm |
| User Isolation                                      | Containers (cgroups and namespaces) | Slurm (cgroups)   |
| Auto-scaling compute nodes                          | X                                   |                   |
| Cost-efficient compute support (Spot/Preemptible)   | X                                   |                   |
| Static compute nodes                                |                                     | X                 |

| User Services                      | Nebari | Nebari Slurm |
| ---------------------------------- | ---- | -------- |
| Dask Gateway | X | X |
| JupyterHub | X | X |
| JupyterHub-ssh | X | X |
| CDSDashboards | X | X |
| Conda-Store environment management | X | X |
| ipyparallel | | X |
| Native MPI support | | X |

| Core Services                                                  | Nebari | Nebari Slurm |
| ------------------------------------------------------------- | ---- | -------- |
| Monitoring via Grafana and Prometheus | X | X |
| Auth integration (OAuth2, OpenID, LDAP, Kerberos)              | X      | X            |
| Role based authorization on JupyterHub, Grafana, Dask-Gateway | X | X |
| Configurable user groups | X | X |
| Shared folders for each user's group | X | X |
| Traefik proxy | X | X |
| Automated Let's Encrypt and manual TLS certificates | X | X |
| Forward authentication ensuring all endpoints authenticated | X | |
| Backups via Restic | | X |

| Integrations | Nebari | Nebari Slurm |
| ------------ | ---- | -------- |
| ClearML | X | |
| Prefect | X | |
| Bodo | | X |