Merge branch 'CompileSlurm' of github.com:cfe-lab/Kive into CompileSlurm
Richard Liang committed Nov 25, 2023
2 parents fb029bf + 924cf38 commit 8fdeeee
Showing 3 changed files with 98 additions and 15 deletions.
107 changes: 94 additions & 13 deletions cluster-setup/README.md
@@ -4,7 +4,28 @@ This directory contains code and instructions for setting up a multi-host comput

## Deployment to Octomore

This procedure, as of November 22, 2023, looks like the following.

### Before you wipe the old machine

Make sure your backups are in order. System backups are typically kept using `rsnapshot`,
and a backup of the Kive PostgreSQL database is kept using `barman`. For example,
on our production server, these are kept on a NAS mounted at `/media/dragonite`.

Preserve copies of your system's `/etc/passwd`, `/etc/group`, and `/etc/shadow`. This
information will be used to populate the new system with the same users and groups
from the old system.
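
As a minimal sketch (assuming the NAS backup mount at `/media/dragonite` is a reasonable
place to stash these copies; the destination directory is only an example):

```
# Stash copies of the account databases somewhere that will survive the wipe.
sudo mkdir -p /media/dragonite/old-system
sudo cp -a /etc/passwd /etc/group /etc/shadow /media/dragonite/old-system/
```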

Create a dump of the Kive PostgreSQL database using `pg_dumpall`. As the upgrade may
involve moving to a newer version of PostgreSQL, we likely can't migrate from the Barman
backups; thus we must do it the "old-fashioned" way.
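
A rough sketch, assuming the same illustrative destination as above:

```
# Dump all databases, roles, and tablespaces as the postgres superuser,
# then move the dump somewhere that will survive the reinstall.
sudo -u postgres pg_dumpall -f /tmp/kive_pg_dumpall.sql
sudo mv /tmp/kive_pg_dumpall.sql /media/dragonite/old-system/
```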

Preserve a copy of `/etc/kive/kive_apache.conf`. This file contains the database
password used by Kive (via `apache2`) to access PostgreSQL. (You can also just preserve
this password and discard the file; the file should be present in the old system's
`rsnapshot` backups anyway if needed later.)

### Install Ubuntu and do basic network setup on the head node

First, manually install Ubuntu Jammy on the head node using an Ubuntu live USB drive.
- Create a user with username `ubuntu` when prompted during installation. This will be
@@ -19,35 +40,61 @@ This sets up the root user's SSH key and `/etc/hosts`, and installs Ansible on t
Now that Ansible is available on the head node, most of the rest of the procedure will be done
using Ansible playbooks defined in the [deployment] directory.

#### Prepare Ansible configuration

Go to the `deployment/group_vars` directory and create an `all.yaml` file from the
`octomore_template.yaml` file by copying and filling in some details.

> For the passwords, it probably makes sense to use a password generator of some form.
> However, for `kive_db_password` it makes sense to plug in the password you preserved
> from the old system, as this is the password that will be used when we restore
> the database from old backups.
Then go to `deployment/` and create an `ansible.cfg` from one of the provided templates,
probably `ansible_octomore.cfg`. These files will be necessary for Ansible to work.
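
Concretely, that amounts to something like the following (run from the
`cluster-setup/deployment` directory; the values in `all.yaml` are filled in by hand afterwards):

```
cd cluster-setup/deployment
cp group_vars/octomore_template.yaml group_vars/all.yaml
cp ansible_octomore.cfg ansible.cfg
# Now edit group_vars/all.yaml and fill in the passwords and other details,
# e.g. set kive_db_password to the password preserved from the old system.
```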

> Note: all playbooks should be run using `sudo`!
#### General preliminary setup

The first thing to do with Ansible is to run the `octomore_preliminary_setup.yaml`
playbook. Find the `/dev/disk/by-id/` entry that corresponds to the 10GB volume on the system
and put the *basename* (i.e. the name of the soft link in the directory without the
`/dev/disk/by-id/` part of the path) into `group_vars/all.yml` as the lone entry in the
`data_physical_volumes` list. (Or, if you wish to use several volumes combined into
one logical volume, put all their names in this list.) This sets up the `/data` partition,
prepares some other system stuff on the head node, and configures the internal-facing networking.
With this in place, the playbook should set up an `ext4` volume at `/data` on the drive
you specified.
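
For example (the exact device names will differ; use whatever `/dev/disk/by-id/` entry
points at your data volume):

```
# Identify the data volume and note the basename of its by-id link.
ls -l /dev/disk/by-id/

# After adding that basename to data_physical_volumes in group_vars/all.yml,
# run the playbook (with sudo, as noted above).
sudo ansible-playbook octomore_preliminary_setup.yaml
```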

#### Set up your backup drive

Next, set up a backup drive for your system. A sample of how this was done for Octomore
using all the leftover drives from the old system is detailed in `create_backup_filesystem.yaml`.
On another server you might use a NAS-based backup solution instead. The goal in the end
is to have a backup drive mounted at `/media/backup`.
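
Assuming `create_backup_filesystem.yaml` is run like the other playbooks, that step looks
roughly like:

```
sudo ansible-playbook create_backup_filesystem.yaml
# Sanity check: the backup filesystem should now be mounted.
df -h /media/backup
```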

### Install Ubuntu on the compute nodes

At this point, go back into the server room and install Ubuntu Jammy on the compute nodes.
These machines only have one hard drive, and their ethernet should automatically be set up
by default (the head node provides NAT and DHCP), so this should be a very straightforward
installation. Again, create a user with username `ubuntu` to be the bootstrap user.

Now, upload the contents of [cloud-init/worker] to each compute node, along with the SSH
public key generated by the root user on the head node during the running of
`head_configuration.bash`. Then, run `worker_configuration.bash`, which will install
the necessary packages and set up the necessary SSH access for the node to be used with Ansible.
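
A sketch of that upload step, run as root on the head node (the hostname `worker1` and the
key path are assumptions; use your compute node's address and whichever key
`head_configuration.bash` generated):

```
# Copy the worker bootstrap files and the head node's root SSH public key
# to the compute node's bootstrap user, then run the configuration script there.
# (Running it with sudo is an assumption; it installs packages, so it likely needs root.)
scp -r cloud-init/worker/* ~/.ssh/id_rsa.pub ubuntu@worker1:
ssh -t ubuntu@worker1 'sudo bash worker_configuration.bash'
```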

### Annoying detour: reassign the bootstrap user's UID and GID

At this point, you can run `reassign_bootstrap_user_uid.yaml`, which is necessary because
the `ubuntu` bootstrap user on each machine has a UID and GID that overlap with
a user account that will later be imported onto these machines. You may need to create a *second*
bootstrap user to do this, as running the playbook as `ubuntu` may fail because the user
is currently being used (even if you use `sudo`).
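
A sketch of one way to do that (the `setupadmin` username is just an example):

```
# On each machine, create a throwaway admin user to connect as:
sudo adduser setupadmin
sudo usermod -aG sudo setupadmin

# Then run the playbook from the head node, connecting as that user
# instead of ubuntu (its SSH access must be set up like ubuntu's was):
sudo ansible-playbook reassign_bootstrap_user_uid.yaml -u setupadmin
```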

### Import users and groups from the old system

The next playbook to run imports users from the old system. First, a YAML file must be prepared
using `export_users_and_groups.py` from the old system's `/etc/shadow`, `/etc/passwd`, and
@@ -58,8 +105,42 @@ using `export_users_and_groups.py` from the old system's `/etc/shadow`, `/etc/pa
This will import user accounts into the head node. (These will later be synchronized to the
compute node as part of a subsequent playbook.)

### Install Kive

With all of that table-setting in place, the main playbook to run is `kive_setup.yml`. This is
the "main" playbook, and will take longer to run.

### Restore the Kive database

At this point, you should have a fresh, "empty" server. You can now restore the Kive database
from the database dump you made earlier on the old system.

First, restore the Kive data folders from the old backups. On our prod and dev
clusters this folder was `/data/kive`; use `rsync -avz` to copy this information
into place on your new server.
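
For example (the source path is purely illustrative; point it at wherever your old
`/data/kive` ended up in the backups):

```
# Copy the Kive data folders back into place, preserving ownership and permissions.
sudo rsync -avz /media/backup/old-system/data/kive/ /data/kive/
```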

With these in place, you can now restore the PostgreSQL database. First,
shut down `apache2` and `postgresql`:

```
sudo systemctl stop apache2
sudo systemctl stop postgresql@14-main
```

Next you can restore the data using `psql` as the `postgres` user:

```
sudo su -l postgres
psql -f [dumped file from the old system] postgres
```

(In the `psql` command we specify the `postgres` database. The dump itself switches
databases as it restores, but `psql` still needs an initial database to connect to,
or else it will complain.)

At this point, the database will have been restored with its old settings, including the
old database password. If you didn't already plug that password into your Ansible
configuration (i.e. as `kive_db_password` in `group_vars/all.yaml`), you should now put the
password preserved from the old system into `/etc/kive/kive_apache.conf`.

[cloud-init/head]: ./cloud-init/head
[cloud-init/worker]: ./cloud-init/worker
2 changes: 2 additions & 0 deletions cluster-setup/deployment/group_vars/default_template.yml
@@ -77,3 +77,5 @@ kive_httpd_group: kive
copied_groups:
- kive
- sudo

default_shell: /usr/bin/bash
@@ -3,7 +3,7 @@
- name: create a single partition on each of the physical volumes
  loop: "{{ data_physical_volumes }}"
  community.general.parted:
    device: "/dev/disk/by-id/{{ item }}"
    number: 1
    state: present
    label: gpt
@@ -16,7 +16,7 @@
- name: append names to the list
  loop: "{{ data_physical_volumes }}"
  set_fact:
    data_partition_names: "{{ data_partition_names + ['/dev/disk/by-id/' ~ item ~ '-part1'] }}"

- name: create a volume group out of the data partitions
  lvg:
