Migrate nebari slurm docs (#535)
Co-authored-by: Vinicius D. Cerutti <[email protected]>
aktech and viniciusdc authored Oct 29, 2024
1 parent 2188a1d commit e6db343
Showing 19 changed files with 1,180 additions and 8 deletions.
2 changes: 1 addition & 1 deletion docs/docs/references/RELEASE.md
@@ -1312,7 +1312,7 @@ Explicit user facing changes:

- `qhub deploy -c qhub-config.yaml` no longer prompts unsupported argument for `load_config_file`.
- Minor changes on the Step-by-Step walkthrough on the docs.
- Revamp of README.md to make it concise and highlight QHub HPC.
- Revamp of README.md to make it concise and highlight Nebari Slurm.

### Breaking changes

Binary file added docs/nebari-slurm/_static/images/architecture.png
99 changes: 99 additions & 0 deletions docs/nebari-slurm/benchmark.md
@@ -0,0 +1,99 @@
# Benchmarking

There are many factors that go into HPC performance. Aside from the
obvious CPU performance, network and storage are equally important in
a distributed context.

## Storage

[fio](https://fio.readthedocs.io/en/latest/fio_doc.html) is a powerful
tool for benchmarking filesystems. Measuring maximum performance,
especially on extremely high performance distributed filesystems, can
be tricky and requires research on how to use the tool effectively;
it often takes multiple nodes and threads reading and writing in
parallel to saturate such filesystems. Even so, the commands below
should provide a good ballpark of performance.

Substitute `<directory>` with the filesystem that you want to
test. `df -h` is a great way to see where each drive is
mounted. `fio` needs permission to read and write in the given
directory.
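
As a quick sanity check before running the benchmarks below, you can
confirm the mount point and write permissions first (a minimal sketch
using the same `<directory>` placeholder):

```shell
# Confirm which filesystem backs the target directory
df -h <directory>

# Confirm the current user can create and remove files there
touch <directory>/fio-permission-test && rm <directory>/fio-permission-test
```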

The throughput benchmarks below use large (1m) sequential transfers,
while the IOPs (input/output operations per second) benchmarks use
small (4K) random transfers.

### Maximum Write Throughput

```shell
fio --ioengine=sync --direct=0 \
--fsync_on_close=1 --randrepeat=0 --nrfiles=1 --name=seqwrite --rw=write \
--bs=1m --size=20G --end_fsync=1 --fallocate=none --overwrite=0 --numjobs=1 \
--directory=<directory> --loops=10
```

### Maximum Write IOPs

```shell
fio --ioengine=sync --direct=0 \
--fsync_on_close=1 --randrepeat=0 --nrfiles=1 --name=randwrite --rw=randwrite \
--bs=4K --size=1G --end_fsync=1 --fallocate=none --overwrite=0 --numjobs=80 \
--sync=1 --directory=<directory> --loops=10
```

### Maximum Read Throughput

```shell
fio --ioengine=sync --direct=0 \
--fsync_on_close=1 --randrepeat=0 --nrfiles=1 --name=seqread --rw=read \
--bs=1m --size=240G --end_fsync=1 --fallocate=none --overwrite=0 --numjobs=1 \
--directory=<directory> --invalidate=1 --loops=10
```

### Maximum Read IOPs

```shell
fio --ioengine=sync --direct=0 \
--fsync_on_close=1 --randrepeat=0 --nrfiles=1 --name=randread --rw=randread \
--bs=4K --size=1G --end_fsync=1 --fallocate=none --overwrite=0 --numjobs=20 \
--sync=1 --invalidate=1 --directory=<directory> --loops=10
```
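
For scripted comparisons between filesystems, `fio` can also emit
machine-readable output. The sketch below assumes `jq` is installed
and that your fio version reports aggregate write bandwidth under
`jobs[0].write.bw` (in KiB/s); field names can vary between fio
versions:

```shell
# Re-run the sequential write benchmark, capturing JSON instead of text output
fio --ioengine=sync --direct=0 \
--fsync_on_close=1 --randrepeat=0 --nrfiles=1 --name=seqwrite --rw=write \
--bs=1m --size=20G --end_fsync=1 --fallocate=none --overwrite=0 --numjobs=1 \
--directory=<directory> --loops=10 \
--output-format=json --output=seqwrite.json

# Extract the write bandwidth (KiB/s); adjust the path if your fio version differs
jq '.jobs[0].write.bw' seqwrite.json
```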

## Network

To test network latency and bandwidth you need a source machine
`<src>` and a destination machine `<dest>`. With iperf the server
listens on a single port, by default `5201`.

### Bandwidth

Start a server on the `<dest>` machine:

```shell
iperf3 -s
```
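
Before starting the client, it can be worth confirming that the
server is listening and that the port is reachable from `<src>`
(assuming `ss` and `nc` are available on the machines):

```shell
# On <dest>: confirm iperf3 is listening on its default port
ss -ltn | grep 5201

# On <src>: confirm port 5201 is reachable through any firewalls
nc -zv <dest> 5201
```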

Now, on the `<src>` machine, run the client against the `<dest>` address:

```shell
iperf3 -c <ip address>
```

This measures the bandwidth of the link between the nodes from
`<src>` to `<dest>`. If you are using a provider where your connection
has very different upload and download speeds, you will see very
different results depending on the direction. Add the `-R` flag to
the client to test the other direction.
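
A single TCP stream does not always saturate a fast link. If the
numbers look low, try multiple parallel streams over a longer test;
the flags below are standard `iperf3` options, shown here as a sketch:

```shell
# 8 parallel streams for 30 seconds, <src> to <dest>
iperf3 -c <dest> -P 8 -t 30

# The same test in the reverse direction
iperf3 -c <dest> -P 8 -t 30 -R
```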

### Latency

[ping](https://linux.die.net/man/8/ping) is a great way to watch the
latency between `<src>` and `<dest>`.

From the `<src>` machine run:

```shell
ping -4 <dest> -c 10
```

Keep in mind that ping reports the bi-directional (round-trip) time,
so dividing by 2 gives a rough estimate of the one-way latency.
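
If you want a single number to record, the average round-trip time
can be pulled from the ping summary and halved. A small sketch,
assuming the usual Linux summary line (`rtt min/avg/max/mdev = ...`):

```shell
# Average RTT divided by 2 gives an approximate one-way latency (ms)
ping -4 -c 10 <dest> | awk -F'/' '/^rtt/ {printf "~%.3f ms one-way\n", $5/2}'
```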
79 changes: 79 additions & 0 deletions docs/nebari-slurm/comparison.md
@@ -0,0 +1,79 @@
# Nebari and Nebari Slurm Comparison

At a high level, Nebari is focused on a Kubernetes and
container-based deployment of all of its components. A
container-based deployment brings better security and easier scaling
of components and compute nodes.

Nebari Slurm is focused on bringing many of the same features to a
bare-metal installation, allowing users to take full advantage of
their hardware for performance. These installations also tend to be
easier to manage and debug when issues arise (traditional Linux
sysadmin experience works well here). Due to this approach Nebari
Slurm does not use containers; it schedules workflows and compute via
[Slurm](https://slurm.schedmd.com/documentation.html) and keeps
[services
available](https://www.freedesktop.org/wiki/Software/systemd/) with
systemd.
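
Because everything runs directly on the hosts, a Nebari Slurm
deployment can be inspected with standard Slurm and systemd
tooling. A hypothetical example (the exact service names depend on
your configuration):

```shell
# List Slurm partitions and node states
sinfo

# Show queued and running jobs
squeue

# Check one of the long-running services managed by systemd
systemctl status jupyterhub
```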

Questions to help determine which solution may be best for you:

1. Are you deploying to the cloud, e.g. AWS, GCP, Azure, or Digital Ocean?

Nebari is likely your best option. The auto-scaling of Nebari compute
allows for cost-effective usage of the cloud while taking advantage
of a managed Kubernetes.

2. Are you deploying to a bare-metal cluster?

Nebari Slurm may be your best option, since deployment does not
require the complexity of managing a Kubernetes cluster. If you do
have a DevOps or IT team to help manage Kubernetes on bare metal,
Nebari could be a great option, but be advised that managing
Kubernetes comes with quite a lot of complexity that the cloud
providers otherwise handle for you.

3. Are you concerned about absolute best performance?

Nebari Slurm is likely your best option. Note that by absolute
performance we mean software that fully takes advantage of your
network's InfiniBand hardware, uses MPI, and uses SIMD
instructions. Few users fall into this camp, and it should rarely be
the reason to choose Nebari Slurm (unless you know why you are making
this choice).

## Feature Matrix

| Core                                               | Nebari                              | Nebari Slurm      |
| -------------------------------------------------- | ----------------------------------- | ----------------- |
| Scheduler                                           | Kubernetes                          | systemd and Slurm |
| User Isolation                                      | Containers (cgroups and namespaces) | Slurm (cgroups)   |
| Auto-scaling compute nodes                          | X                                   |                   |
| Cost-efficient compute support (Spot/Preemptible)   | X                                   |                   |
| Static compute nodes                                |                                     | X                 |

| User Services                      | Nebari | Nebari Slurm |
| ---------------------------------- | ---- | -------- |
| Dask Gateway | X | X |
| JupyterHub | X | X |
| JupyterHub-ssh | X | X |
| CDSDashboards | X | X |
| Conda-Store environment management | X | X |
| ipyparallel | | X |
| Native MPI support | | X |

| Core Services                                                  | Nebari | Nebari Slurm |
| ------------------------------------------------------------- | ---- | -------- |
| Monitoring via Grafana and Prometheus | X | X |
| Auth integration (OAuth2, OpenID, LDAP, Kerberos)              | X      | X            |
| Role based authorization on JupyterHub, Grafana, Dask-Gateway | X | X |
| Configurable user groups | X | X |
| Shared folders for each user's group | X | X |
| Traefik proxy | X | X |
| Automated Let's Encrypt and manual TLS certificates | X | X |
| Forward authentication ensuring all endpoints authenticated | X | |
| Backups via Restic | | X |

| Integrations | Nebari | Nebari Slurm |
| ------------ | ---- | -------- |
| ClearML | X | |
| Prefect | X | |
| Bodo | | X |