Merge branch 'CompileSlurm' of github.com:cfe-lab/Kive into CompileSlurm
Richard Liang committed Nov 25, 2023
2 parents fb029bf + 924cf38 commit 8fdeeee
Showing 3 changed files with 98 additions and 15 deletions.
107 changes: 94 additions & 13 deletions cluster-setup/README.md
@@ -4,7 +4,28 @@ This directory contains code and instructions for setting up a multi-host comput

## Deployment to Octomore

This procedure, as of November 22, 2023, looks like the following.

### Before you wipe the old machine

Make sure your backups are in order. System backups are typically kept using `rsnapshot`,
and a backup of the Kive PostgreSQL database is kept using `barman`. For example,
on our production server, these are kept on a NAS mounted at `/media/dragonite`.

Preserve copies of your system's `/etc/passwd`, `/etc/group`, and `/etc/shadow`. This
information will be used to populate the new system with the same users and groups
from the old system.
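
As a minimal sketch (assuming the NAS backup mount at `/media/dragonite` is a reasonable
place to stash these copies; the destination directory is only an example):

```
# Stash copies of the account databases somewhere that will survive the wipe.
sudo mkdir -p /media/dragonite/old-system
sudo cp -a /etc/passwd /etc/group /etc/shadow /media/dragonite/old-system/
```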

Create a dump of the Kive PostgreSQL database using `pg_dumpall`. As the upgrade may
involve moving to a newer version of PostgreSQL, we likely can't migrate from the Barman
backups; thus we must do it the "old-fashioned" way.
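
A rough sketch, assuming the same illustrative destination as above:

```
# Dump all databases, roles, and tablespaces as the postgres superuser,
# then move the dump somewhere that will survive the reinstall.
sudo -u postgres pg_dumpall -f /tmp/kive_pg_dumpall.sql
sudo mv /tmp/kive_pg_dumpall.sql /media/dragonite/old-system/
```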

Preserve a copy of `/etc/kive/kive_apache.conf`. This file contains the database
password used by Kive (via `apache2`) to access PostgreSQL. (You can also just preserve
this password and discard the file; the file should be present in the old system's
`rsnapshot` backups anyway if needed later.)

### Install Ubuntu and do basic network setup on the head node

First, manually install Ubuntu Jammy on the head node using an Ubuntu live USB drive.
- Create a user with username `ubuntu` when prompted during installation. This will be
@@ -19,35 +40,61 @@ This sets up the root user's SSH key and `/etc/hosts`, and installs Ansible on t
Now that Ansible is available on the head node, most of the rest of the procedure will be done
using Ansible playbooks defined in the [deployment] directory.

#### Prepare Ansible configuration

Go to the `deployment/group_vars` directory and create an `all.yaml` file from the
`octomore_template.yaml` file by copying and filling in some details.

> For the passwords, it probably makes sense to use a password generator of some form.
> However, for `kive_db_password` it makes sense to plug in the password you preserved
> from the old system, as this is the password that will be used when we restore
> the database from old backups.
Then go to `deployment/` and create an `ansible.cfg` from one of the provided templates,
probably `ansible_octomore.cfg`. These files will be necessary for Ansible to work.
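
Concretely, that amounts to something like the following (run from the
`cluster-setup/deployment` directory; the values in `all.yaml` are filled in by hand afterwards):

```
cd cluster-setup/deployment
cp group_vars/octomore_template.yaml group_vars/all.yaml
cp ansible_octomore.cfg ansible.cfg
# Now edit group_vars/all.yaml and fill in the passwords and other details,
# e.g. set kive_db_password to the password preserved from the old system.
```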

> Note: all playbooks should be run using `sudo`!
#### General preliminary setup

The first thing to do with Ansible is to run the `octomore_preliminary_setup.yaml`
playbook. Find the `/dev/disk/by-id/` entry that corresponds to the 10GB volume on the system
and put the *basename* (i.e. the name of the soft link in the directory without the
`/dev/disk/by-id/` part of the path) into `group_vars/all.yml` as the lone entry in the
`data_physical_volumes` list. (Or, if you wish to use several volumes combined into
one logical volume, put all their names in this list.) This sets up the `/data` partition,
prepares some other system stuff on the head node, and configures the internal-facing networking.
With this in place, the playbook should set up an `ext4` volume at `/data` on the drive
you specified.
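
For example (the exact device names will differ; use whatever `/dev/disk/by-id/` entry
points at your data volume):

```
# Identify the data volume and note the basename of its by-id link.
ls -l /dev/disk/by-id/

# After adding that basename to data_physical_volumes in group_vars/all.yml,
# run the playbook (with sudo, as noted above).
sudo ansible-playbook octomore_preliminary_setup.yaml
```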

#### Set up your backup drive

Next, set up a backup drive for your system. A sample of how this was done for Octomore
using all the leftover drives from the old system is detailed in `create_backup_filesystem.yaml`.
On another server you might use a NAS-based backup solution instead. The goal in the end
is to have a backup drive mounted at `/media/backup`.
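
Assuming `create_backup_filesystem.yaml` is run like the other playbooks, that step looks
roughly like:

```
sudo ansible-playbook create_backup_filesystem.yaml
# Sanity check: the backup filesystem should now be mounted.
df -h /media/backup
```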

### Install Ubuntu on the compute nodes

At this point, go back into the server room and install Ubuntu Jammy on the compute nodes.
These machines only have one hard drive, and their ethernet should automatically be set up
by default (the head node provides NAT and DHCP), so this should be a very straightforward
installation. Again, create a user with username `ubuntu` to be the bootstrap user.

Now, upload the contents of [cloud-init/worker] to each compute node, along with the SSH
public key generated by the root user on the head node during the running of
`head_configuration.bash`. Then, run `worker_configuration.bash`, which will install
the necessary packages and set up the necessary SSH access for the node to be used with Ansible.
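
A sketch of that upload step, run as root on the head node (the hostname `worker1` and the
key path are assumptions; use your compute node's address and whichever key
`head_configuration.bash` generated):

```
# Copy the worker bootstrap files and the head node's root SSH public key
# to the compute node's bootstrap user, then run the configuration script there.
# (Running it with sudo is an assumption; it installs packages, so it likely needs root.)
scp -r cloud-init/worker/* ~/.ssh/id_rsa.pub ubuntu@worker1:
ssh -t ubuntu@worker1 'sudo bash worker_configuration.bash'
```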

### Annoying detour: reassign the bootstrap user's UID and GID

At this point, you can run `reassign_bootstrap_user_uid.yaml`, which is necessary because
the `ubuntu` bootstrap user on each machine has a UID and GID that overlap with
a user account that will later be imported onto these machines. You may need to create a *second*
bootstrap user to do this, as running the playbook as `ubuntu` may fail because the user
is currently being used (even if you use `sudo`).
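
A sketch of one way to do that (the `setupadmin` username is just an example):

```
# On each machine, create a throwaway admin user to connect as:
sudo adduser setupadmin
sudo usermod -aG sudo setupadmin

# Then run the playbook from the head node, connecting as that user
# instead of ubuntu (its SSH access must be set up like ubuntu's was):
sudo ansible-playbook reassign_bootstrap_user_uid.yaml -u setupadmin
```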

### Import users and groups from the old system

The next playbook to run imports users from the old system. First, a YAML file must be prepared
using `export_users_and_groups.py` from the old system's `/etc/shadow`, `/etc/passwd`, and
@@ -58,8 +105,42 @@ using `export_users_and_groups.py` from the old system's `/etc/shadow`, `/etc/pa
This will import user accounts into the head node. (These will later be synchronized to the
compute node as part of a subsequent playbook.)

### Install Kive

With all of that table-setting in place, the main playbook to run is `kive_setup.yml`. This is
the "main" playbook, and will take longer to run.

### Restore the Kive database

At this point, you should have a fresh, "empty" server. You can now restore the Kive database
from the database dump you made earlier on the old system.

First, restore the Kive data folders from the old backups. On our prod and dev
clusters this folder was `/data/kive`; use `rsync -avz` to copy this information
into place on your new server.
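
For example (the source path is purely illustrative; point it at wherever your old
`/data/kive` ended up in the backups):

```
# Copy the Kive data folders back into place, preserving ownership and permissions.
sudo rsync -avz /media/backup/old-system/data/kive/ /data/kive/
```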

With these in place, you can now restore the PostgreSQL database. First,
shut down `apache2` and `postgresql`:

```
sudo systemctl stop apache2
sudo systemctl stop postgresql@14-main
```

Next you can restore the data using `psql` as the `postgres` user:

```
sudo su -l postgres
psql -f [dumped file from the old system] postgres
```

(In the `psql` command we specify the `postgres` database. The dump itself switches
databases as it restores, but `psql` still needs an initial database to connect to,
or else it will complain.)

At this point, the database will have been restored with its old settings, including the
old database password. If you didn't already plug that password into your Ansible
configuration (i.e. as `kive_db_password` in `group_vars/all.yaml`), you should now put the
password preserved from the old system into `/etc/kive/kive_apache.conf`.

[cloud-init/head]: ./cloud-init/head
[cloud-init/worker]: ./cloud-init/worker
2 changes: 2 additions & 0 deletions cluster-setup/deployment/group_vars/default_template.yml
@@ -77,3 +77,5 @@ kive_httpd_group: kive
copied_groups:
- kive
- sudo

default_shell: /usr/bin/bash
@@ -3,7 +3,7 @@
- name: create a single partition on each of the physical volumes
  loop: "{{ data_physical_volumes }}"
  community.general.parted:
    device: "/dev/disk/by-id/{{ item }}"
    number: 1
    state: present
    label: gpt
@@ -16,7 +16,7 @@
- name: append names to the list
  loop: "{{ data_physical_volumes }}"
  set_fact:
    data_partition_names: "{{ data_partition_names + ['/dev/disk/by-id/' ~ item ~ '-part1'] }}"

- name: create a volume group out of the data partitions
  lvg:
