Restore / rebuild server #1138

Open
1 of 18 tasks
donkirkby opened this issue Feb 18, 2021 · 0 comments

Restoring or rebuilding from a backup still has a lot of manual steps and some pain points. Here are some problems I ran into during the current rebuild:

  • Some network drive mounts depend on others, so they should all be configured with state present first, then mounted in dependency order.
  • The symbolic link from /home to /data/home breaks in check mode, because /data hasn't been mounted. Force it?
  • PostgreSQL tools have a version conflict. Should we uninstall the old version of PostgreSQL? The workaround is to specify the full path, like /usr/pgsql-12/bin/psql.
  • The PostgreSQL service needs to be restarted sooner. If the configuration hasn't been reloaded, then the Django migrations can't run. Maybe move the migrations to a handler?
  • SSL configuration isn't included for apache.
  • Firewall configuration is missing slurm ports.
  • Stop purge service timers.
  • Stop rsnapshot timer.
  • Purge all container run folders, because we don't back them up.
  • Starting the munge service fails the first time.
  • We're not using weights or MemSpecLimit in the slurm node config.
  • slurm reconfigure fails on the head node, possibly because the slurmctld service has shut down.
  • barman configuration is incomplete? Try barman check kive as the barman user. The SSH connections from barman to postgres and from postgres to barman each need to accept the host key the first time. You also need to force a WAL switch the first time. Run as the barman user: barman switch-wal --force --archive kive.
  • There are several steps in the ansible script that fail the first time they run, or fail when they are in check mode.
  • NFS exports from the head node aren't included in the ansible playbooks.
  • Document the rebuild process, including kickstart, ansible, database, and file system.
  • Document how to debug the DHCP configuration for compute nodes.
  • Consider moving database and file system backups to a third server, so they can be accessed directly from the target server during a rebuild.
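
For the first item, the "present first, then mounted" idea can be sketched in Python as a two-pass plan. This is only an illustration, not part of the existing playbooks: the function name, the dependency mapping, and the example paths are all hypothetical, and ordering the mounts with a topological sort is my assumption about how the dependencies could be resolved.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def two_pass_mount_plan(dependencies):
    """Return (mount_point, state) steps for a two-pass mount.

    `dependencies` maps each mount point to the set of mount points that
    must be mounted before it. Pass one marks every mount as "present"
    (configured but not mounted); pass two mounts them in dependency order.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    present = [(mount, "present") for mount in order]
    mounted = [(mount, "mounted") for mount in order]
    return present + mounted


# Hypothetical layout: placeholder paths, not the real server's mounts.
example = {
    "/data": set(),
    "/data/archive": {"/data"},
    "/data/archive/raw": {"/data/archive"},
}

for mount, state in two_pass_mount_plan(example):
    print(f"{state:8} {mount}")
```

Each step could then be turned into one mount task, so a dependent mount is never attempted before the mount it relies on.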

Some of those items will probably get split into separate issues.
