From 924cf38cfc0fbb379e5d09fb1ef941bc1c12f3fb Mon Sep 17 00:00:00 2001
From: rhliang
Date: Wed, 22 Nov 2023 16:49:39 -0800
Subject: [PATCH] Fleshed out the "manual" deployment procedure, i.e. the instructions for a human to follow in README.md.
---
 cluster-setup/README.md | 107 +++++++++++++++---
 .../group_vars/default_template.yml | 2 +
 .../create_data_filesystem/tasks/main.yaml | 4 +-
 3 files changed, 98 insertions(+), 15 deletions(-)

diff --git a/cluster-setup/README.md b/cluster-setup/README.md
index a54a2b3a..1b2c0280 100644
--- a/cluster-setup/README.md
+++ b/cluster-setup/README.md
@@ -4,7 +4,28 @@ This directory contains code and instructions for setting up a multi-host comput

 ## Deployment to Octomore

-This procedure, as of October 26, 2023, looks like the following.
+This procedure, as of November 22, 2023, looks like the following.
+
+### Before you wipe the old machine
+
+Make sure your backups are in order. System backups are typically kept using `rsnapshot`,
+and a backup of the Kive PostgreSQL database is kept using `barman`. For example,
+on our production server, these are kept on a NAS mounted at `/media/dragonite`.
+
+Preserve copies of your system's `/etc/passwd`, `/etc/group`, and `/etc/shadow`. This
+information will be used to populate the new system with the same users and groups
+from the old system.
+
+Create a dump of the Kive PostgreSQL database using `pg_dumpall`. As the upgrade may
+involve moving to a newer version of PostgreSQL, we likely can't use the Barman
+backups to migrate from; thus we must do it the "old-fashioned" way.
+
+Preserve a copy of `/etc/kive/kive_apache.conf`. This file contains the database
+password used by Kive (via `apache2`) to access PostgreSQL. (You can also just preserve
+this password and discard the file; the file should be present in the old system's
+`rsnapshot` backups anyway if needed later.)
+
+### Install Ubuntu and do basic network setup on the head node

 First, manually install Ubuntu Jammy on the head node using an Ubuntu live USB drive.
 - Create a user with username `ubuntu` when prompted during installation. This will be
@@ -19,35 +40,61 @@ This sets up the root user's SSH key and `/etc/hosts`, and installs Ansible on t
 Now that Ansible is available on the root node, most of the rest of the procedure will be
 done using Ansible playbooks defined in the [deployment] directory.

+#### Prepare Ansible configuration
+
 Go to the `deployment/group_vars` directory and create an `all.yaml` file from the
-`octomore_template.yaml` file by copying and filling in some details. Then go to
-`deployment/` and create an `ansible.cfg` from one of the provided templates, probably
-`ansible_octomore.cfg`. These files will be necessary for Ansible to work.
+`octomore_template.yaml` file by copying and filling in some details.
+
+> For the passwords, it probably makes sense to use a password generator of some form.
+> However, for `kive_db_password` it makes sense to plug in the password you preserved
+> from the old system, as this is the password that will be used when we restore
+> the database from old backups.
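+
+As a rough illustration (the variable names below are just the ones discussed in this
+document, and the values are placeholders; the template you copy from is the authoritative
+list of what needs filling in), the finished `all.yaml` might contain entries along these lines:
+
+```
+# Database password used by Kive via apache2; reuse the value preserved from the old system.
+kive_db_password: "password-from-the-old-system"
+
+# Basename(s) of the /dev/disk/by-id/ entries for the data volume(s); see the
+# "General preliminary setup" section below.
+data_physical_volumes:
+  - "example-disk-id-basename"
+```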
+
+Then go to `deployment/` and create an `ansible.cfg` from one of the provided templates,
+probably `ansible_octomore.cfg`. These files will be necessary for Ansible to work.

 > Note: all playbooks should be run using `sudo`!

+#### General preliminary setup
+
 The first thing to do with Ansible is to run the `octomore_preliminary_setup.yaml`
-playbook. Find the `/dev/sd*` entry that corresponds to the 10GB volume on the system
-and put its name into `group_vars/all.yml` as the lone entry in the
+playbook. Find the `/dev/disk/by-id/` entry that corresponds to the 10GB volume on the system
+and put the *basename* (i.e. the name of the soft link in that directory, without the
+`/dev/disk/by-id/` part of the path) into `group_vars/all.yaml` as the lone entry in the
 `data_physical_volumes` list. (Or, if you wish to use several volumes combined into one
 logical volume, put all their names in this list.) This sets up the `/data` partition,
 prepares some other system stuff on the head node, and configures the internal-facing
 networking. With this in place, the playbook should set up an `ext4` volume at `/data` on
 the drive you specified.

-At this point, go back into the server room and install Ubuntu Jammy on the compute node.
-This machine only has one hard drive, and its ethernet should automatically be set up
+#### Set up your backup drive
+
+Next, set up a backup drive for your system. A sample of how this was done for Octomore,
+using all the leftover drives from the old system, is detailed in `create_backup_filesystem.yaml`.
+On another server you might use a NAS-based backup solution instead. The end goal is to
+have a backup drive mounted at `/media/backup`.
+
+### Install Ubuntu on the compute nodes
+
+At this point, go back into the server room and install Ubuntu Jammy on the compute nodes.
+These machines each have only one hard drive, and their ethernet should automatically be set up
 by default (the head node provides NAT and DHCP), so this should be a very straightforward
 installation. Again, create a user with username `ubuntu` to be the bootstrap user.

-Now, upload the contents of [cloud-init/worker] to the compute node, along with the SSH
+Now, upload the contents of [cloud-init/worker] to each compute node, along with the SSH
 public key generated by the root user on the head node during the running of
 `head_configuration.bash`. Then, run `worker_configuration.bash`, which will install
-the necessary packages and set up the necessary SSH access for this node to be used with Ansible.
+the necessary packages and set up the SSH access needed for the node to be used with Ansible.
+
+### Annoying detour: reassign the bootstrap user's UID and GID

 At this point, you can run `reassign_bootstrap_user_uid.yaml`, which is necessary because
 the `ubuntu` bootstrap user on both machines has a UID and GID that overlaps with
-a user account that will later be imported into this machine.
+a user account that will later be imported into this machine. You may need to create a *second*
+bootstrap user to do this, as running the playbook as `ubuntu` may fail because that account
+is in use while the playbook runs (even if you use `sudo`).
+
+### Import users and groups from the old system

 The next playbook to run imports users from the old system. First, a YAML file must be prepared
 using `export_users_and_groups.py` from the old system's `/etc/shadow`, `/etc/passwd`, and
@@ -58,8 +105,42 @@ using `export_users_and_groups.py` from the old system's `/etc/shadow`, `/etc/pa
 This will import user accounts into the head node. (These will later be synchronized to the
 compute node as part of a subsequent playbook.)

-With all of that table-setting in place, the main playbook to run is `kive_setup.yml`.
-This is the point at which I'm currently at, as Kive isn't running yet after running this script.
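+To spot-check the import, you can compare a few accounts against the copies of `/etc/passwd`
+and `/etc/group` preserved from the old system (the username here is only an example):
+
+```
+getent passwd some_old_user   # UID and GID should match the old /etc/passwd entry
+id some_old_user              # group memberships should match the old /etc/group
+```
+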
+### Install Kive
+
+With all of that table-setting in place, you can now run `kive_setup.yml`. This is the "main"
+playbook, and it will take longer to run than the previous ones.
+
+### Restore the Kive database
+
+At this point, you should have a fresh, "empty" server. You can now restore the Kive database
+from the database dump you made earlier on the old system.
+
+First, restore the Kive data folders from the old backups. On our prod and dev
+clusters these lived under `/data/kive`; use `rsync -avz` to copy them into
+place on your new server.
+
+With these in place, you can now restore the PostgreSQL database. First,
+shut down `apache2` and `postgresql`:
+
+```
+sudo systemctl stop apache2
+sudo systemctl stop postgresql@14-main
+```
+
+Next, you can restore the data using `psql` as the `postgres` user:
+
+```
+sudo su -l postgres
+psql -f [dumped file from the old system] postgres
+```
+
+(In the `psql` command we connect to the `postgres` database; the dump itself recreates and
+switches to each database it restores, but `psql` needs an existing database to connect to.)
+
+At this point, the database will have been restored to its old state. If you didn't
+already plug the old database password into your Ansible configuration (i.e. into
+`kive_db_password` in `group_vars/all.yaml`), you should now put the password preserved
+from the old system into `/etc/kive/kive_apache.conf`.

 [cloud-init/head]: ./cloud-init/head
 [cloud-init/worker]: ./cloud-init/worker
diff --git a/cluster-setup/deployment/group_vars/default_template.yml b/cluster-setup/deployment/group_vars/default_template.yml
index 87f99ccc..699e85a8 100644
--- a/cluster-setup/deployment/group_vars/default_template.yml
+++ b/cluster-setup/deployment/group_vars/default_template.yml
@@ -77,3 +77,5 @@ kive_httpd_group: kive
 copied_groups:
   - kive
   - sudo
+
+default_shell: /usr/bin/bash
diff --git a/cluster-setup/deployment/roles/create_data_filesystem/tasks/main.yaml b/cluster-setup/deployment/roles/create_data_filesystem/tasks/main.yaml
index 9d7b90a0..2b2ba430 100644
--- a/cluster-setup/deployment/roles/create_data_filesystem/tasks/main.yaml
+++ b/cluster-setup/deployment/roles/create_data_filesystem/tasks/main.yaml
@@ -3,7 +3,7 @@
 - name: create a single partition on each of the physical volumes
   loop: "{{ data_physical_volumes }}"
   community.general.parted:
-    device: "{{ item }}"
+    device: "/dev/disk/by-id/{{ item }}"
     number: 1
     state: present
     label: gpt
@@ -16,7 +16,7 @@
 - name: append names to the list
   loop: "{{ data_physical_volumes }}"
   set_fact:
-    data_partition_names: "{{ data_partition_names + [item ~ '1'] }}"
+    data_partition_names: "{{ data_partition_names + ['/dev/disk/by-id/' ~ item ~ '-part1'] }}"

 - name: create a volume group out of the data partitions
   lvg:
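
A note on the `/dev/disk/by-id/` naming used in the changes above: unlike `/dev/sdX` names,
which can change from boot to boot, the `by-id` symlinks are stable identifiers for each drive,
which makes them safer to record in `group_vars/all.yaml`. One way to find the basename to put
in `data_physical_volumes` (standard utilities; the IDs on your system will differ):

```
lsblk -o NAME,SIZE,MODEL,SERIAL   # identify the intended drive by its size/model/serial
ls -l /dev/disk/by-id/            # each entry is a symlink to ../../sdX; use the link's name
```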