Commit 1b4e6ed

Merge branch 'main' into 536-eks-cluster-encryption

joneszc authored Oct 31, 2024
2 parents 7e9b31c + 7a7a491

Showing 22 changed files with 1,260 additions and 18 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -36,7 +36,7 @@ repos:

# Misc...
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.6.0
rev: v5.0.0
# ref: https://github.com/pre-commit/pre-commit-hooks#hooks-available
hooks:
# Autoformat: Makes sure files end in a newline and only a newline
5 changes: 4 additions & 1 deletion docs/docs/how-tos/nebari-gcp.md
@@ -66,7 +66,10 @@ management.

If it's your first time creating a service account, please follow
[these detailed instructions](https://cloud.google.com/iam/docs/creating-managing-service-accounts) to create a Google Service Account with the following roles attached:
"roles/editor", "roles/resourcemanager.projectIamAdmin" and "roles/container.admin".
- [`roles/editor`](https://cloud.google.com/iam/docs/understanding-roles#editor)
- [`roles/resourcemanager.projectIamAdmin`](https://cloud.google.com/iam/docs/understanding-roles#resourcemanager.projectIamAdmin)
- [`roles/container.admin`](https://cloud.google.com/iam/docs/understanding-roles#container.admin)
- [`roles/storage.admin`](https://cloud.google.com/iam/docs/understanding-roles#storage.admin)
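
As a rough sketch of granting these roles with the `gcloud` CLI (assuming it is installed and authenticated; `<PROJECT_ID>` and `<SA_NAME>` are placeholders rather than values from this guide):

```bash
# Hypothetical placeholders: <PROJECT_ID> and <SA_NAME> are not defined in this guide
gcloud iam service-accounts create <SA_NAME> --project <PROJECT_ID>

# Grant each of the roles listed above to the new service account
for role in roles/editor roles/resourcemanager.projectIamAdmin roles/container.admin roles/storage.admin; do
  gcloud projects add-iam-policy-binding <PROJECT_ID> \
    --member "serviceAccount:<SA_NAME>@<PROJECT_ID>.iam.gserviceaccount.com" \
    --role "$role"
done
```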

For more information about roles and permissions, see the
[Google Cloud Platform IAM documentation](https://cloud.google.com/iam/docs/choose-predefined-roles). Remember to check the active project before creating resources, especially if
14 changes: 14 additions & 0 deletions docs/docs/how-tos/nebari-local.md
@@ -159,6 +159,20 @@ security:
tag: sha-b4a2d1e
```

### Increase fs watches

Depending on your host system, you may need to increase the `fs.inotify.max_user_watches` and
`fs.inotify.max_user_instances` kernel parameters if you see the error "too many open files" in the logs of
a failing pod.

```bash
sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512
```

See the [kind troubleshooting
docs](https://kind.sigs.k8s.io/docs/user/known-issues/#pod-errors-due-to-too-many-open-files) for more information.
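
The `sysctl` commands above only last until the next reboot. As a minimal sketch for persisting the settings, assuming a host with a standard `/etc/sysctl.d/` directory:

```bash
# Persist the inotify limits across reboots (the file name is arbitrary)
cat <<'EOF' | sudo tee /etc/sysctl.d/99-inotify.conf
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 512
EOF
sudo sysctl --system   # reload settings from all sysctl configuration files
```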

## Deploying Nebari

With the `nebari-config.yaml` configuration file now created, Nebari can be deployed for the first time with:
71 changes: 62 additions & 9 deletions docs/docs/references/RELEASE.md
@@ -9,11 +9,64 @@ This file is copied to nebari-dev/nebari-docs using a GitHub Action. -->

---

### Release 2024.7.1 - August 8, 2024
## Release 2024.9.1 - September 27, 2024

> WARNING: This release changes how group directories are mounted in JupyterLab pods: only groups with specific permissions will have their directories mounted. If you rely on custom group mounts, we strongly recommend running `nebari upgrade` before updating. This will prompt you to confirm how Nebari should handle your groups—either keep them mounted or allow unmounting. **No data will be lost**, and you can reverse this anytime.
### What's Changed

- Fix: KeyValueDict error when deploying to existing infrastructure by @oftheaxe in https://github.com/nebari-dev/nebari/pull/2560
- Remove unused AWS terraform modules by @marcelovilla in https://github.com/nebari-dev/nebari/pull/2623
- Upgrade Hashicorp Vault action by @aktech in https://github.com/nebari-dev/nebari/pull/2616
- Pass `oauth_no_confirm=True` to jhub-apps by @krassowski in https://github.com/nebari-dev/nebari/pull/2631
- Use Rook Ceph for Jupyterhub and Conda Store drives by @Adam-D-Lewis in https://github.com/nebari-dev/nebari/pull/2541
- Fix typo in guided init by @marcelovilla in https://github.com/nebari-dev/nebari/pull/2635
- Action var tests off by @BrianCashProf in https://github.com/nebari-dev/nebari/pull/2632
- add a "moved" block to account for refactored terraform code without deleting/recreating NFS disks by @Adam-D-Lewis in https://github.com/nebari-dev/nebari/pull/2639
- Use Helm Chart for JupyterHub 5.1.0 by @krassowski in https://github.com/nebari-dev/nebari/pull/2661
- Add a how to test section to PR template by @marcelovilla in https://github.com/nebari-dev/nebari/pull/2659
- Support disallowed nebari config changes by @Adam-D-Lewis in https://github.com/nebari-dev/nebari/pull/2660
- Fix converted init command in guided init by @marcelovilla in https://github.com/nebari-dev/nebari/pull/2666
- Add initial uptime metrics by @dcmcand in https://github.com/nebari-dev/nebari/pull/2609
- Refactor and extend Playwright tests by @viniciusdc in https://github.com/nebari-dev/nebari/pull/2644
- Remove Cypress remaining tests/files by @viniciusdc in https://github.com/nebari-dev/nebari/pull/2672
- refactor jupyterhub user token retrieval within pytest by @viniciusdc in https://github.com/nebari-dev/nebari/pull/2645
- add moved block to account for terraform changes on AWS only by @Adam-D-Lewis in https://github.com/nebari-dev/nebari/pull/2673
- Refactor shared group mounting using RBAC by @viniciusdc in https://github.com/nebari-dev/nebari/pull/2593
- Dashboard fix usage report by @kenafoster in https://github.com/nebari-dev/nebari/pull/2671
- only capture stdout not stdout+stderr when capture_output=True by @Adam-D-Lewis in https://github.com/nebari-dev/nebari/pull/2704
- revert breaking change to azure deployment test by @Adam-D-Lewis in https://github.com/nebari-dev/nebari/pull/2706
- Refactor GitOps approach prompt flow in guided init by @marcelovilla in https://github.com/nebari-dev/nebari/pull/2269
- template the kustomization.yaml file by @dcmcand in https://github.com/nebari-dev/nebari/pull/2667
- Fix auto-provisioned GitHub repo description after guided init by @marcelovilla in https://github.com/nebari-dev/nebari/pull/2708
- Add amazon_web_services configuration option to specify EKS cluster api server endpoint access setting by @joneszc in https://github.com/nebari-dev/nebari/pull/2618
- Use Google Auth and Cloud Python APIs instead of `gcloud` CLI by @swastik959 in https://github.com/nebari-dev/nebari/pull/2083
- fix broken links in README.md, SECURITY.md, and CONTRIBUTING.md by @blakerosenthal in https://github.com/nebari-dev/nebari/pull/2720
- add test for changing dicts and lists by @Adam-D-Lewis in https://github.com/nebari-dev/nebari/pull/2724
- 2024.9.1 upgrade notes by @Adam-D-Lewis in https://github.com/nebari-dev/nebari/pull/2726
- Add Support for AWS Launch Template Configuration by @viniciusdc in https://github.com/nebari-dev/nebari/pull/2668
- Run terraform init before running terraform show by @marcelovilla in https://github.com/nebari-dev/nebari/pull/2734
- Release Process Checklist Updates by @Adam-D-Lewis in https://github.com/nebari-dev/nebari/pull/2727
- Test implicit aiohttp's TCP to HTTP connector change by @viniciusdc in https://github.com/nebari-dev/nebari/pull/2741
- remove comments by @Adam-D-Lewis in https://github.com/nebari-dev/nebari/pull/2743
- Deploy Rook Ceph Helm only when Ceph FS Needed by @kenafoster in https://github.com/nebari-dev/nebari/pull/2742
- fix group mounting paths by @viniciusdc in https://github.com/nebari-dev/nebari/pull/2738
- Add compatibility prompt and notes for shared group mounting by @viniciusdc in https://github.com/nebari-dev/nebari/pull/2739

### New Contributors

- @oftheaxe made their first contribution in https://github.com/nebari-dev/nebari/pull/2560
- @joneszc made their first contribution in https://github.com/nebari-dev/nebari/pull/2618
- @swastik959 made their first contribution in https://github.com/nebari-dev/nebari/pull/2083
- @blakerosenthal made their first contribution in https://github.com/nebari-dev/nebari/pull/2720

**Full Changelog**: https://github.com/nebari-dev/nebari/compare/2024.7.1...2024.9.1

## Release 2024.7.1 - August 8, 2024

> NOTE: Support for Digital Ocean deployments using CLI commands and related Terraform modules is being deprecated. Although Digital Ocean will no longer be directly supported in future releases, you can still deploy to Digital Ocean infrastructure using the current `existing` deployment option.
## What's Changed
### What's Changed

- Enable authentication by default in jupyter-server by @krassowski in https://github.com/nebari-dev/nebari/pull/2288
- remove dns sleep by @Adam-D-Lewis in https://github.com/nebari-dev/nebari/pull/2550
@@ -35,14 +88,14 @@ This file is copied to nebari-dev/nebari-docs using a GitHub Action. -->
- Move codespell config to pyproject.toml only by @Adam-D-Lewis in https://github.com/nebari-dev/nebari/pull/2611
- Add `depends_on` for bucket encryption by @viniciusdc in https://github.com/nebari-dev/nebari/pull/2615

## New Contributors
### New Contributors

- @BrianCashProf made their first contribution in https://github.com/nebari-dev/nebari/pull/2569
- @yarikoptic made their first contribution in https://github.com/nebari-dev/nebari/pull/2583

**Full Changelog**: https://github.com/nebari-dev/nebari/compare/2024.6.1...2024.7.1

### Release 2024.6.1 - June 26, 2024
## Release 2024.6.1 - June 26, 2024

> NOTE: This release includes an upgrade to the `kube-prometheus-stack` Helm chart, resulting in a newer version of Grafana. When upgrading your Nebari cluster, you will be prompted to have Nebari update some CRDs and delete a DaemonSet on your behalf. If you prefer, you can also run the commands yourself, which will be shown to you. If you have any custom dashboards, you'll also need to back them up by [exporting them as JSON](https://grafana.com/docs/grafana/latest/dashboards/share-dashboards-panels/#export-a-dashboard-as-json), so you can [import them](https://grafana.com/docs/grafana/latest/dashboards/build-dashboards/import-dashboards/#import-a-dashboard) after upgrading.
@@ -83,9 +136,9 @@ This file is copied to nebari-dev/nebari-docs using a GitHub Action. -->

**Full Changelog**: https://github.com/nebari-dev/nebari/compare/2024.5.1...2024.6.1

### Release 2024.5.1 - May 13, 2024
## Release 2024.5.1 - May 13, 2024

## What's Changed
### What's Changed

- make userscheduler run on general node group by @Adam-D-Lewis in https://github.com/nebari-dev/nebari/pull/2415
- Upgrade to Pydantic V2 by @Adam-D-Lewis in https://github.com/nebari-dev/nebari/pull/2348
@@ -323,7 +376,7 @@ command and follow the instructions
- paginator for boto3 ec2 instance types by @sblair-metrostar in https://github.com/nebari-dev/nebari/pull/1923
- Update README.md -- fix typo. by @teoliphant in https://github.com/nebari-dev/nebari/pull/1925
- Add more unit tests, add cleanup step for Digital Ocean integration test by @iameskild in https://github.com/nebari-dev/nebari/pull/1910
- Add cleanup step for AWS integration test, ensure diable_prompt is passed through by @iameskild in https://github.com/nebari-dev/nebari/pull/1921
- Add cleanup step for AWS integration test, ensure disable_prompt is passed through by @iameskild in https://github.com/nebari-dev/nebari/pull/1921
- K8s 1.25 + More Improvements by @Adam-D-Lewis in https://github.com/nebari-dev/nebari/pull/1856
- adding lifecycle ignore to eks node group by @sblair-metrostar in https://github.com/nebari-dev/nebari/pull/1905
- nebari init unit tests by @sblair-metrostar in https://github.com/nebari-dev/nebari/pull/1931
@@ -471,7 +524,7 @@ This is a hot-fix release that resolves an issue whereby users in the `analyst`
- improve CLI tests by @pmeier in https://github.com/nebari-dev/nebari/pull/1710
- Fix Existing dashboards by @Adam-D-Lewis in https://github.com/nebari-dev/nebari/pull/1723
- Fix dashboards by @Adam-D-Lewis in https://github.com/nebari-dev/nebari/pull/1727
- Typo in the conda_store key by @costrouc in https://github.com/nebari-dev/nebari/pull/1740
- Typo in the conda-store - conda_store key by @costrouc in https://github.com/nebari-dev/nebari/pull/1740
- use -V (upper case) for --version short form by @pmeier in https://github.com/nebari-dev/nebari/pull/1720
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in https://github.com/nebari-dev/nebari/pull/1692
- improve pytest configuration by @pmeier in https://github.com/nebari-dev/nebari/pull/1700
@@ -1312,7 +1365,7 @@ Explicit user facing changes:

- `qhub deploy -c qhub-config.yaml` no longer prompts unsupported argument for `load_config_file`.
- Minor changes on the Step-by-Step walkthrough on the docs.
- Revamp of README.md to make it concise and highlight QHub HPC.
- Revamp of README.md to make it concise and highlight Nebari Slurm.

### Breaking changes

Binary file added docs/nebari-slurm/_static/images/architecture.png
99 changes: 99 additions & 0 deletions docs/nebari-slurm/benchmark.md
@@ -0,0 +1,99 @@
# Benchmarking

There are many factors that go into HPC performance. Aside from raw
CPU performance, network and storage are equally important in a
distributed context.

## Storage

[fio](https://fio.readthedocs.io/en/latest/fio_doc.html) is a powerful
tool for benchmarking filesystems. Measuring maximum performance,
especially on extremely high-performance distributed filesystems, can
be tricky and often requires multiple nodes and threads reading and
writing at once, along with some research on how to use the tool
effectively. Even so, the commands below should provide a good
ballpark of performance.

Substitute `<directory>` with the filesystem that you want to
test; `df -h` is a great way to see where each drive is
mounted. `fio` needs permission to read and write in the given
directory. The commands below measure both throughput and IOPS
(input/output operations per second).

### Maximum Write Throughput

```shell
fio --ioengine=sync --direct=0 \
--fsync_on_close=1 --randrepeat=0 --nrfiles=1 --name=seqwrite --rw=write \
--bs=1m --size=20G --end_fsync=1 --fallocate=none --overwrite=0 --numjobs=1 \
--directory=<directory> --loops=10
```

### Maximum Write IOPs

```shell
fio --ioengine=sync --direct=0 \
--fsync_on_close=1 --randrepeat=0 --nrfiles=1 --name=randwrite --rw=randwrite \
--bs=4K --size=1G --end_fsync=1 --fallocate=none --overwrite=0 --numjobs=80 \
--sync=1 --directory=<directory> --loops=10
```

### Maximum Read Throughput

```shell
fio --ioengine=sync --direct=0 \
--fsync_on_close=1 --randrepeat=0 --nrfiles=1 --name=seqread --rw=read \
--bs=1m --size=240G --end_fsync=1 --fallocate=none --overwrite=0 --numjobs=1 \
--directory=<directory> --invalidate=1 --loops=10
```

### Maximum Read IOPs

```shell
fio --ioengine=sync --direct=0 \
--fsync_on_close=1 --randrepeat=0 --nrfiles=1 --name=randread --rw=randread \
--bs=4K --size=1G --end_fsync=1 --fallocate=none --overwrite=0 --numjobs=20 \
--sync=1 --invalidate=1 --directory=<directory> --loops=10
```

## Network

To test network latency and bandwidth you need a source (`<src>`) and a
destination (`<dest>`) machine. By default, `iperf3` listens on port
`5201` on the server side.

### Bandwidth

Start a server on a given `<dest>`

```shell
iperf3 -s
```

Now, on the `<src>` machine, run:

```shell
iperf3 -c <ip address>
```

This measures the bandwidth of the link from `<src>` to `<dest>`. If
you are using a provider whose network has very different upload and
download speeds, you will see very different results depending on
direction. Add the `-R` flag to the client to test the other
direction.
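
For example, a short sketch of exercising both directions (the `-R` and `-P` flags are standard iperf3 client options; adjust the parallel stream count as needed):

```shell
iperf3 -c <dest>        # measure <src> -> <dest>
iperf3 -c <dest> -R     # reverse mode: measure <dest> -> <src>
iperf3 -c <dest> -P 4   # optional: four parallel streams
```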

### Latency

[ping](https://linux.die.net/man/8/ping) is a great way to watch the
latency between `<src>` and `<dest>`.

From the `<src>` machine, run:

```shell
ping -4 <dest> -c 10
```

Keep in mind that ping reports the round-trip (bi-directional) time,
so dividing by two gives a rough estimate of the one-way latency.
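
As a small sketch (assuming a `ping` whose summary line reports `min/avg/max`, as GNU and BSD ping do), you can extract the average round-trip time and halve it:

```shell
# Print an approximate one-way latency from the ping summary line
ping -4 <dest> -c 10 \
  | awk -F'/' '/^(rtt|round-trip)/ {printf "approx one-way latency: %.3f ms\n", $5/2}'
```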
79 changes: 79 additions & 0 deletions docs/nebari-slurm/comparison.md
@@ -0,0 +1,79 @@
# Nebari and Nebari Slurm Comparison

At a high level, QHub is focused on a Kubernetes- and container-based
deployment of all of its components. A container-based deployment
brings better security and easier scaling of components and compute
nodes.

QHub-HPC is focused on bringing many of the same features to a
bare-metal installation, allowing users to take full advantage of
their hardware for performance. These installations also tend to be
easier to manage and debug when issues arise (traditional Linux
sys-admin experience works well here). Due to this approach, QHub-HPC
does not use containers; instead it schedules compute via
[Slurm](https://slurm.schedmd.com/documentation.html) and keeps
[services
available](https://www.freedesktop.org/wiki/Software/systemd/) with
systemd.

Questions to help determine which solution may be best for you:

1. Are you deploying to the cloud e.g. AWS, GCP, Azure, or Digital Ocean?

QHub is likely your best option. The auto-scaling of QHub compute
allows for cost-effective usage of the cloud while taking advantage of
a managed Kubernetes.

2. Are you deploying to a bare metal cluster?

QHub-HPC may be your best option, since deployment does not require
the complexity of managing a Kubernetes cluster. If you do have a
DevOps or IT team to help manage Kubernetes on bare metal, QHub could
be a great option, but be advised that managing Kubernetes comes with
quite a lot of complexity, which the cloud providers otherwise handle
for you.

3. Are you concerned about absolute best performance?

QHub-HPC is likely your best option. Note that by absolute
performance we mean software that fully takes advantage of your
network's InfiniBand hardware, MPI, and SIMD instructions. Few users
fall into this camp, and it should rarely be the reason to choose
QHub-HPC (unless you know why you are making this choice).

## Feature Matrix

| Core | QHub | QHub-HPC |
| ------------------------------------------------ | ----------------------------------- | ----------------- |
| Scheduler | Kubernetes | SystemD and Slurm |
| User Isolation | Containers (cgroups and namespaces) | Slurm (cgroups) |
| Auto-scaling compute nodes | X | |
| Cost-efficient compute support (Spot/Preemptible) | X | |
| Static compute nodes | | X |

| User Services | QHub | QHub-HPC |
| ---------------------------------- | ---- | -------- |
| Dask Gateway | X | X |
| JupyterHub | X | X |
| JupyterHub-ssh | X | X |
| CDSDashboards | X | X |
| Conda-Store environment management | X | X |
| ipyparallel | | X |
| Native MPI support | | X |

| Core Services | QHub | QHub-HPC |
| ------------------------------------------------------------- | ---- | -------- |
| Monitoring via Grafana and Prometheus | X | X |
| Auth integration (OAuth2, OpenID, LDAP, Kerberos) | X | X |
| Role based authorization on JupyterHub, Grafana, Dask-Gateway | X | X |
| Configurable user groups | X | X |
| Shared folders for each user's group | X | X |
| Traefik proxy | X | X |
| Automated Let's Encrypt and manual TLS certificates | X | X |
| Forward authentication ensuring all endpoints authenticated | X | |
| Backups via Restic | | X |

| Integrations | QHub | QHub-HPC |
| ------------ | ---- | -------- |
| ClearML | X | |
| Prefect | X | |
| Bodo | | X |