Restoring or rebuilding from a backup still has a lot of manual steps and some pain points. Here are some problems I ran into during the current rebuild:
Some network drive mounts depend on others, so all of them should be declared as `present` first and only then mounted.
The symbolic link from `/home` to `/data/home` breaks in check mode, because `/data` hasn't been mounted. Force it?
PostgreSQL tools have a version conflict. Should we uninstall the old version of PostgreSQL? The workaround is to specify the full path, like `/usr/pgsql-12/bin/psql`.
The PostgreSQL service needs to be restarted sooner. If the configuration hasn't been reloaded, then the Django migrations can't run. Maybe move the migrations to a handler?
SSL configuration isn't included for Apache.
Firewall configuration is missing the Slurm ports.
Stop purge service timers.
Stop rsnapshot timer.
Purge all container run folders, because we don't back them up.
Starting the munge service fails the first time.
We're not using weights or `MemSpecLimit` in the Slurm node config.
`slurm reconfigure` fails on the head node, possibly because the `slurmctld` service has shut down.
The barman configuration may be incomplete. Try `barman check kive` as the barman user. `ssh` connections from barman to postgres and from postgres to barman need to accept the SSH host keys the first time. You also need to force a WAL switch the first time. Run as the barman user: `barman switch-wal --force --archive kive`.
Several steps in the Ansible playbooks fail the first time they run, or fail when they run in check mode.
NFS exports from the head node aren't included in the Ansible playbooks.
Document the rebuild process, including Kickstart, Ansible, database, and file system.
Document how to debug the DHCP configuration for compute nodes.
Consider moving database and file system backups to a third server, so they can be accessed directly from the target server during a rebuild.
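For the PostgreSQL version conflict above, the full-path workaround could be wrapped in a small helper so scripts don't hard-code the path everywhere. This is only a sketch: `PGSQL_BIN` and `pg_tool` are hypothetical names, and the only detail taken from the rebuild is the `/usr/pgsql-12/bin` directory.

```shell
#!/bin/sh
# Sketch: resolve a PostgreSQL client tool to the versioned install first.
# PGSQL_BIN and pg_tool are hypothetical names; /usr/pgsql-12/bin is the
# directory from the full-path workaround.
PGSQL_BIN="${PGSQL_BIN:-/usr/pgsql-12/bin}"

pg_tool() {
    tool="$1"
    if [ -x "$PGSQL_BIN/$tool" ]; then
        # Prefer the versioned binary so an older install on PATH is ignored.
        printf '%s\n' "$PGSQL_BIN/$tool"
    else
        # Fall back to whatever PATH provides; returns non-zero if missing.
        command -v "$tool"
    fi
}
```

A call such as `"$(pg_tool psql)" --version` would then show which client is actually picked up.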
Some of those items will probably get split into separate issues.
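As a starting point for the barman item, the two first-run commands could be collected in one script. This is a rough sketch, not a tested procedure: it assumes it runs as root on the barman host and that `sudo -u barman` is how we switch to the barman user. The `run` and `barman_first_run` helpers and the `DRY_RUN` variable are hypothetical. It does not cover accepting the SSH host keys.

```shell
#!/bin/sh
# Sketch of the first-time barman steps from the list above.
# Assumptions (not from the rebuild notes): runs as root on the barman host,
# and "sudo -u barman" switches to the barman user.
# With DRY_RUN=1 (the default here) each command is printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

barman_first_run() {
    server="$1"
    # Force a WAL switch so the first WAL file gets archived.
    run sudo -u barman barman switch-wal --force --archive "$server"
    # Then check the server configuration.
    run sudo -u barman barman check "$server"
}

barman_first_run kive
```

Running it with `DRY_RUN=0` would execute the commands instead of printing them.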