Cleaning up metalc #221

Merged: 12 commits, Mar 3, 2021
README.md (14 changes: 5 additions & 9 deletions)
@@ -1,17 +1,17 @@
# Metal Cluster

-This is our repo for everything involving our bare-metal Kubernetes cluster. At this moment,
+This is our repo for everything related to the now-deprecated "flock" Kubernetes cluster. At this moment,
it also contains documentation for setting up Kubernetes, JupyterHub, and BinderHub on Google Cloud.
The folder [docs](./docs) contains all documentation related to these topics.
-The root of this repository contains files relating to the set-up of the bare-metal cluster,
+The folder [flock-archive](./flock-archive) contains files relating to the set-up of the flock cluster.

## Table of Contents
This is the table of contents for the [documentation](./docs) folder, organized by topic.

### Bare-Metal
-1. [Bare Metal Cluster Setup](./docs/Bare-Metal/baremetal.md) has what you should read first about
-the cluster. The file gives an overview of the cluster set-up, including networking, publishing services,
-instructions on adding nodes, and useful resources.
+1. [Bare Metal Cluster Setup](./docs/Bare-Metal/baremetal.md) has reading material on nearly every
+aspect of the flock cluster. The file gives an overview of the cluster set-up, including networking, publishing services,
+instructions on adding nodes, and useful resources.

#### Concepts
1. [RAID.md](./docs/Bare-Metal/concepts/RAID.md) describes the purpose and different levels of RAID.
@@ -36,8 +36,6 @@ setting up JupyterHub on a virtual machine. It contains solutions to the problems
when installing JupyterHub through the [jupyterhub-deploy-teaching](https://github.com/mechmotum/jupyterhub-deploy-teaching)
repository. We keep this as a reference for those who might encounter the same problems in the future.

-
-
### JupyterHub on GCloud
This section teaches how to set up and configure JupyterHub on Google Cloud.

@@ -57,5 +55,3 @@ on a Kubernetes cluster on Google Cloud.
We created a development [cluster of VMs](./dev-env) using Vagrant that is useful
for testing changes without putting them on the main cluster. This section contains files
and instructions for implementing the test cluster.
-
-
Binary file removed: docs/.JupyterBareMetalWithLVM.md.swo
Binary file removed: docs/.JupyterBareMetalWithLVM.md.swp
docs/Bare-Metal/concepts/networking.md (3 changes: 3 additions & 0 deletions)
@@ -1,4 +1,7 @@
# Networking
+
+NOTE: This file is outdated for the [current galaxy cluster](https://github.com/LibreTexts/galaxy-control-repo/tree/production/router-configs).
+
*Relevant files: `/etc/netplan/`* *Summary: https://netplan.io/examples*

We use [netplan](https://netplan.io/) to configure networking on rooster.
docs/Bare-Metal/concepts/nginx.md (2 changes: 2 additions & 0 deletions)
@@ -1,5 +1,7 @@
# NGINX

+NOTE: This file is outdated for the [current galaxy cluster](https://github.com/LibreTexts/galaxy-control-repo/tree/production/router-configs).
+
*Relevant files: `/etc/nginx`*
*Summary: http://nginx.org/en/docs/http/load_balancing.html*

docs/Bare-Metal/login.md (5 changes: 1 addition & 4 deletions)
@@ -1,10 +1,7 @@
# Login
This JupyterHub serves LibreTexts instructors and their students, as well as UC Davis faculty, staff, and students.

-## Request an account
-If you are a LibreTexts or UC Davis student, please request an account by sending your
-Google OAuth enabled email to <email>. Your email address must have a Google Account
-that can be used with Google OAuth, like `@gmail.com` or `@ucdavis.edu`.
+If you are a UC Davis student, access to this JupyterHub is already granted. Just log in with your school email.

## Getting started with Jupyter
[Jupyter](https://jupyter.org/index.html) is an environment where you can
docs/Bare-Metal/troubleshooting/README.md (12 changes: 7 additions & 5 deletions)
@@ -1,13 +1,15 @@
# Common Practices For Troubleshooting

+NOTE: This file and others in this folder may be outdated for the [current galaxy cluster](https://github.com/LibreTexts/galaxy-control-repo/tree/production/kubernetes/).
+
If the troubleshooting that you are doing for a particular problem is ineffective, it is likely due to one of the following reasons:

1. You may be looking at symptoms unrelated to the problem.
* Fixing this is a matter of becoming better acquainted with the system. The following are some resources that you may want to review:
-1. [Bare-Metal](https://github.com/LibreTexts/metalc/blob/docs/Troubleshooting-Summary/docs/Bare-Metal/baremetal.md)
-2. [BinderHub](https://github.com/LibreTexts/metalc/blob/docs/Troubleshooting-Summary/docs/Binder-on-GCloud/01-BinderHub.md)
-3. [Maintenance](https://github.com/LibreTexts/metalc/blob/docs/Troubleshooting-Summary/docs/maintenance-tasks.md)
-4. [Adding new packages to the Dockerfile](https://github.com/LibreTexts/default-env/tree/master/rich-default)
+1. Documentation in galaxy-control-repo
+2. [BinderHub](/docs/Binder-on-GCloud/01-BinderHub.md)
+3. [Maintenance](/docs/maintenance-tasks.md)
+4. [Adding new packages to the Dockerfile](https://github.com/LibreTexts/default-env/)
2. Not fully understanding how to change the system, such as the inputs, outputs, the environment, etc.
* This is similar to the previous point; make sure you have good knowledge of the system you are looking at. Visit any of the above links, and feel free
to check out any other documentation. The following may potentially be relevant:
@@ -16,7 +18,7 @@
3. [BinderHub FAQ](https://mybinder.readthedocs.io/en/latest/faq.html)
3. Assuming that the problem you are facing is the same as one you have previously dealt with because the symptoms are the same.
* If you are dealing with similar symptoms, definitely look into how you have previously handled the issue, or how solutions have been documented
-[here](https://github.com/LibreTexts/metalc/tree/docs/Troubleshooting-Summary/docs/Bare-Metal/troubleshooting). Howevever, if you have tried everything you
+[here](/docs/Bare-Metal/troubleshooting/). However, if you have tried everything you
have done before, it might be worth trying something different. A good example of this kind of dilemma is [this issue](https://github.com/LibreTexts/metalc/blob/master/docs/Bare-Metal/troubleshooting/KubeadmCert.md)
we had.

docs/maintenance-tasks.md (67 changes: 41 additions & 26 deletions)
@@ -2,48 +2,63 @@

This document lists all tasks that should be done regularly.

## Security Update on Rooster

* Frequency: weekly
* Command: `sudo unattended-upgrade -d`

Rooster is the only server with a public network connection, so it is essential to keep its system up to date. We use [unattended-upgrade](https://github.com/mvo5/unattended-upgrades) to upgrade packages safely. To minimize impact on the cluster, check the following before running the command:

1. Run `sudo unattended-upgrade -d --dry-run` to make sure that it will upgrade without error.
2. Check `kubectl get pods -n jhub` to see if there are a lot of people using the cluster. Try to upgrade when no one is there.

Additionally, there is a cron job on rooster that runs `sudo unattended-upgrade -d --dry-run` and sends out a weekly email on Friday. Do `sudo crontab -e` to edit the cron job. If you wish to change its code, the script is located at `/home/spicy/metalc-configurations/cronjob/weekly-security-update`; it is a Python script that uses pipes to run shell commands.
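
For reference, here is the manual sequence assembled from the commands above (nothing new is introduced; run on rooster):

```bash
# Pre-checks before the weekly security upgrade:
sudo unattended-upgrade -d --dry-run   # confirm the upgrade would run without error
kubectl get pods -n jhub               # make sure few users are active
# If both look fine, run the real upgrade:
sudo unattended-upgrade -d
```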

-## Scrub Checks on Hen (ZFS)
+## Scrub Checks on Blackhole (ZFS)

* Frequency: monthly
* Command: `sudo zpool scrub nest`

-To execute manually, you must first ssh into hen from rooster with the command `ssh hen`. Scrub checks the file system's integrity, and repairs any issues that it finds. After the scrub is finished, it is good to also run `zpool status` to check if there is anything wrong.

-This command is run on a cronjob, so there should be no need for manual intervention. The cronjob runs at 8:00AM the first day of every month. Rooster will also send out an email at 8:10AM on the same day with the results. If the scrub is fine, the email should be titled `[All clear] Hen monthly ZFS report`. If the title says `[POTENTIAL ZFS ISSUE]` instead, there may be something wrong with the disk, and the email contains the `zpool status` output which you can use to debug disk issues. Details on how the cronjob is set up are in the private configuration repo, under `cronjob/monthly-zfs-report.py`.
+To execute manually, you must first ssh into blackhole.
+Scrub checks the file system's integrity, and repairs
+any issues that it finds. After the scrub is finished,
+it is good to also run `zpool status` to check if there
+is anything wrong.

+This command is run on a cronjob. The cronjob runs at 8:00AM
+the first day of every month. Gravity also sends out an
+email at 8:10AM on the same day with the results. If the
+scrub is fine, the email is titled
+`[All clear] Hen monthly ZFS report`. If the title
+says `[POTENTIAL ZFS ISSUE]` instead, there may be
+something wrong with the disk, and the email contains
+the `zpool status` output which can be used to debug disk
+issues. Details on how the cronjob was set up are in the
+private configuration repo, under `cronjob/monthly-zfs-report.py`.
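
For reference, a sketch of the manual scrub sequence assembled from the commands above:

```bash
ssh blackhole           # the "nest" pool lives on blackhole
sudo zpool scrub nest   # start the scrub
sudo zpool status       # after it finishes, check for errors or repairs
```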

## Cluster control plane upgrade

-The Kubernetes control plane should be upgraded regularly. There is a cronjob sending out a triyearly reminder (Jan, May, Sept 1st of every year) reminding you to do the upgrade. (The cronjob can be found in the private configuration repo, under `cronjob/cluster-upgrade-reminder.sh`.)

-This must be done at least once a year, otherwise the Kubernetes certificates may expire, which will [break the entire cluster](https://github.com/LibreTexts/metalc/blob/master/docs/Bare-Metal/troubleshooting/KubeadmCert.md#more-complex-solution-renewing-kubeadm-certificates). The email sent out should contain the date on which the certificates expire. If you apply routine upgrades, certificates should be renewed automatically and this should not be an issue. If the certificates do expire, follow that guide instead of this one. If you need to renew the certificates without upgrading the cluster, follow [the guide to renew certificates without upgrading](#renew-certificates-without-upgrade) instead, but this should not be an excuse to put off cluster upgrades indefinitely.
+The Kubernetes control plane should be upgraded regularly.
+There is a cronjob that sends out a reminder three times a
+year (Jan 1, May 1, and Sept 1) to do the upgrade. (The
+cronjob can be found in galaxy-control-repo.)

+This must be done at least once a year, otherwise the
+Kubernetes certificates may expire, which will
+[break the entire cluster](/docs/Bare-Metal/troubleshooting/KubeadmCert.md#more-complex-solution-renewing-kubeadm-certificates).
+The email sent out contains the date on which the
+certificates expire. If you apply routine upgrades,
+certificates should be renewed automatically and this
+should not be an issue. If the certificates do expire,
+follow that guide instead of this one. If you need to
+renew the certificates without upgrading the cluster,
+follow [the guide to renew certificates without upgrading](#renew-certificates-without-upgrade)
+instead, but this should not be an excuse to put off
+cluster upgrades indefinitely.

### Upgrading the control plane

-tl;dr: Follow https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/ and upgrade the control plane, update apt packages on the nodes, then copy `/etc/kubernetes/admin.conf` on chick0 into `/home/spicy/.kube/config` on rooster.
+tl;dr: Follow https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/ and upgrade the control planes, update apt packages on the nodes, then copy `/etc/kubernetes/admin.conf` on a nebula into `/home/milky/.kube/config` on gravity.

Note: This will take a while and will cause some downtime. Be sure to notify users beforehand.

-1. Follow the instructions in the [official cluster upgrade guide](https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/). We only have one control plane node (chick0), so follow the "Upgrade the first control plane node" section on chick0 and ignore "Upgrade additional control plane nodes". Also follow "Upgrade worker nodes" on all other chicks. This is also a good time to [upgrade all packages on the chicks](https://github.com/LibreTexts/metalc/blob/447a459bacfbc6a29d80229e7df2f2bfb953cd7a/docs/updating-ubuntu-kubelet.md) as well, since the chicks are cordoned during the upgrade.
-2. Once you have verified the cluster is working using `kubectl get nodes` on rooster, copy over the newer admin certificate/key.
-   1. SSH into chick0, then do `sudo cp /etc/kubernetes/admin.conf /home/spicy/.kube/config`. Also `chown spicy:spicy /home/spicy/.kube/config` to make it readable to us.
-   2. Go back into rooster and do `scp chick0:.kube/config ~/.kube/config` to copy the file onto rooster.
-   3. Verify `kubectl` works on both rooster and chick0 by running any kubectl command (such as `kubectl get nodes`).
+1. Follow the instructions in the [official cluster upgrade guide](https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/). We have multiple control plane nodes (nebulas), so follow both the "Upgrade the first control plane node" and "Upgrade additional control plane nodes" sections. Also follow "Upgrade worker nodes" on all other worker nodes. This is also a good time to [upgrade all packages on the stars](https://github.com/LibreTexts/metalc/blob/447a459bacfbc6a29d80229e7df2f2bfb953cd7a/docs/updating-ubuntu-kubelet.md) as well, since the stars are cordoned during the upgrade.
+2. Once you have verified the cluster is working using `kubectl get nodes` on gravity/quantum, copy over the newer admin certificate/key.
+   1. SSH into a nebula, then do `sudo cp /etc/kubernetes/admin.conf /home/milky/.kube/config`. Also `chown milky:milky /home/milky/.kube/config` to make it readable to us.
+   2. Go back into gravity and do `scp nebula{1,5}:.kube/config ~/.kube/config` to copy the file onto gravity.
+   3. Verify `kubectl` works on both gravity and the nebulas by running any kubectl command (such as `kubectl get nodes`).
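
A condensed sketch of step 2, assuming `nebula1` is the nebula you SSH into (any control plane node works; the commands are the ones listed above):

```bash
# On nebula1: copy the refreshed admin kubeconfig and make it readable.
sudo cp /etc/kubernetes/admin.conf /home/milky/.kube/config
sudo chown milky:milky /home/milky/.kube/config

# Back on gravity: pull the file over and sanity-check kubectl on both hosts.
scp nebula1:.kube/config ~/.kube/config
kubectl get nodes
```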

### Renew certificates without upgrade

Sometimes you want to renew the certificates without doing a proper upgrade and causing downtime. In that case, do the following:

-1. Follow [the official guide](https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/#manual-certificate-renewal). tl;dr: Just run `sudo kubeadm alpha certs renew` on chick0.
+1. Follow [the official guide](https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/#manual-certificate-renewal). tl;dr: Just run `sudo kubeadm alpha certs renew` on all of the nebulas.
2. Follow the same step 2 as [Upgrading the control plane](#upgrading-the-control-plane).
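
A minimal sketch of the renewal, assuming the `certs renew all` and `certs check-expiration` subcommands from the linked guide (newer kubeadm releases drop the `alpha` prefix):

```bash
# On each nebula:
sudo kubeadm alpha certs renew all          # renew every certificate at once
sudo kubeadm alpha certs check-expiration   # confirm the new expiry dates
```
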
docs/updating-ubuntu-kubernetes.md (21 changes: 18 additions & 3 deletions)
@@ -1,10 +1,19 @@
# Updating Ubuntu and Kubernetes

-This document lists the procedure for updating Ubuntu and Kubernetes on the chick nodes.
+Note: This document is mostly outdated for the [current galaxy cluster](https://github.com/LibreTexts/galaxy-control-repo/#upgrading-the-kubernetes-cluster).
+
+This document lists the procedure for updating Ubuntu and
+Kubernetes on the chick nodes.

## Checking Software Versions on the Nodes

-You can check the versions of kubernetes, Ubuntu and the kernel as well as the status of each node by executing the command `kubectl get nodes -o wide` from rooster. When you do Kubernetes upgrades, make sure that you do not upgrade more than one minor version at a time. For example, if the cluster is at verison 1.19 and the latest available version is 1.21, you should first upgrade everything to 1.20, then 1.21.
+You can check the versions of Kubernetes, Ubuntu, and the
+kernel, as well as the status of each node, by executing the
+command `kubectl get nodes -o wide` from rooster. When you
+do Kubernetes upgrades, make sure that you do not upgrade
+more than one minor version at a time. For example, if the
+cluster is at version 1.19 and the latest available version
+is 1.21, you should first upgrade everything to 1.20, then 1.21.

## Preparing to Update

@@ -16,7 +25,13 @@

## Updating Kubernetes

-The official documentation for upgrading Kubernetes is available [here](https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/). You will first have to upgrade kubeadm and then use it to upgrade kubelet and kubectl. The process is pretty straightforward if you follow the official documentation. Also check out `maintenance-tasks.md` for more information about cluster upgrades and nuances.
+The official documentation for upgrading Kubernetes is available
+[here](https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/).
+You will first have to upgrade kubeadm and then use it to
+upgrade kubelet and kubectl. The process is pretty straightforward
+if you follow the official documentation. Also check out
+`maintenance-tasks.md` for more information about cluster
+upgrades and nuances.
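
As an illustrative sketch (the package versions below are placeholders; take the real target from `kubeadm upgrade plan`), one minor-version step on a control plane node looks roughly like this:

```bash
# Upgrade kubeadm first, then the control plane, then kubelet/kubectl.
sudo apt-mark unhold kubeadm
sudo apt-get update && sudo apt-get install -y kubeadm=1.20.x-00   # placeholder version
sudo apt-mark hold kubeadm

sudo kubeadm upgrade plan            # lists the versions you can upgrade to
sudo kubeadm upgrade apply v1.20.x   # placeholder; use a version from the plan

sudo apt-mark unhold kubelet kubectl
sudo apt-get install -y kubelet=1.20.x-00 kubectl=1.20.x-00        # placeholders
sudo apt-mark hold kubelet kubectl
sudo systemctl daemon-reload && sudo systemctl restart kubelet
```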

## Updating Ubuntu

19 files renamed without changes.