
Create a backup restore cron on staging #812

Merged
merged 39 commits from restore-backup-cron into infra-improvements on Jan 5, 2024

Conversation

euanmillar (Collaborator)

No description provided.

uses: mathieudutour/github-tag-action@…
with:
  github_token: ${{ secrets.GITHUB_TOKEN }}
  tag_prefix: ${{ github.event.repository.name }}-
euanmillar (Collaborator, Author) commented:
Essentially country config repos are tagged like this: opencrvs-farajaland-v1.3.2
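
For context, here is a minimal sketch of how the step above could sit inside a release workflow that produces tags such as opencrvs-farajaland-v1.3.2. The workflow trigger, job layout, checkout step, and the action version are assumptions, not taken from this PR:

name: Release
on:
  push:
    branches: [master]

jobs:
  tag:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      # Creates tags prefixed with the repository name, e.g. opencrvs-farajaland-v1.3.2
      - uses: mathieudutour/github-tag-action@v6.1 # version is a placeholder assumption
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          tag_prefix: ${{ github.event.repository.name }}-

With tag_prefix set to the repository name, every release tag carries the country config repo's name, as described above.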

@rikukissa force-pushed the infra-improvements branch 4 times, most recently from cb4524e to 1c51bbb on December 18, 2023 at 12:51
@rikukissa merged commit 7fe32bb into infra-improvements on Jan 5, 2024
4 checks passed
rikukissa added a commit that referenced this pull request Jan 24, 2024
…ates (#789)

* fix conflicts

* add amends from cdpi-living-lab repository

* setup pem file

* fix merge conflict

* add libsodium to dev dependencies

* configure provisioning and deployment script so that any user with privilege escalation access can provision the host machine

* compress and encrypt backup directories before sending to backup server

* supply backup password to backup cronjob

* supply backup encryption passphrase from github secrets

* hide openhim-console by default

* hide openhim-api by default

* Modularise playbook tasks, use only one playbook for all deployment sizes (#798)

* split playbooks to different task modules, use only one playbook for all deployment sizes

* update provisioning pipeline

* try initialising the provision pipeline by adding a temporary push trigger

* setup ssh key before trying to provision

* add known hosts file

* do not try to mount cryptfs partition to /data if it's already mounted

* add filebeat so logs can be accessed, monitored by kibana

* fix kibana address

* Setup new alerts: SSH login, error in backup logs, available disk space in data partition

* add ansible task for creating user accounts for maintainers with 2FA login enabled

* add new alerts for log alerts and ssh alerts

* pass initial metabase SQL file to metabase as a config file so deployment doesn't have to touch the /data directory

* temporarily allow root login again until we set up deployment users

* add port to port forwarding container names so multiple ports can be opened from one container

* Changes to environment provisioning script and log file handling

* remove vagrant files

* remove references to sudo password

sudo operations should only be performed by humans, as they grant root-level access. Automated users should have the required permissions set by provisioning playbooks.

* remove VPN mentions for now

* remove elastalert slack alert environment variable as it's not referenced anywhere

* remove extra environment variables from deploy script call

* remove proxy config from backup script

* generate BACKUP_ENCRYPTION_PASSPHRASE for all github environments

* make log files accessible to the application group so SSH_USER can read and write to them

* remove node version matrices from new pipelines

* add separate inventory files for all environments

* make docker manager1 reference dynamic

* Combine country config compose files to base deployment compose files, include replica compose files in environment-specific compose files (#808)

* Production VPN (#809)

* add initial wireguard server setup

* move vpn to QA server

* remove unused HOSTNAME parameter

* fix a bug in environment creator script, make sure secrets are never committed

* add development environment to provisioning scripts

* add development machine to inventory

* remove unnecessary PEM setup step

* always use the same ansible variables

* fix ansible variable reference

* remove global ansible user setting

* add back missing dockerhub username

* disable SSH root login if provisioning is not done as root

* convert inventory files to yml so ssh keys and users can be directly defined in them

* add Tahmid's public key

* fix inventory file reference

* add development to machines that can be deployed to

* fix known hosts mechanism in deployment pipelines

* make environment selection in deploy.sh dynamic

* volume mount metabase init file as docker has a file size limit of 500kb for config files

* copy the whole project directory to the server

* send core compose files to the server

* fix common file paths

* fix environment compose file

* use absolute paths in the compose file

* add debug log

* remove deploy log file temporarily

* remove matrices from deployment pipelines

* add debug log

* debug github action

* fix deploy pipeline syntax

* add variables to debug step

* make debugging an option

* fix pipeline syntax

* just a commit to make pipeline update on github

* more syntax fixes

* more syntax fixes

* more syntax fixes

* only define overlay net in the main deploy docker compose so that it stays attachable (see the compose sketch after this commit message)

* remove files from the target server's infrastructure directory if those files no longer exist in the repo

* fix deploy path

* do a docker login as part of deployment

* only volume link minio admin's config to the container so it won't write anything new to the source code directory

* remove container names as docker swarm does not support them

* fix path for elasticsearch config

* change the clear data script so that it doesn't touch /data directory directly. This helps us restrict deployment user's access to data

* add missing env variables

* do not use interactive shell

* stop debug mode from starting if it's not explicitly enabled

* add development to seed pipeline

* add pipeline for clearing an environment

* rename pipeline

* temporarily add a push trigger to clear environment

* Revert "temporarily adda a push trigger to clear environment"

This reverts commit 882c432.

* fix reset script file reference, reuse clear-environment pipeline in deploy pipeline

* run clearing through ssh

* add missing ssh secrets

* fix pipeline reference in deploy script

* make clear-environment reusable

* debug why no reset

* add migration run to clear-environment pipeline

* remove data clearing from deploy script

* try without conditionals

* try with a true string

* use singlequotes

* update staging server fingerprint

* add output for reset step

* fix syntax

* change staging IP

* fix pexpect reference

* remove pexpect completely

* remove python3-docker module as we do not have any ansible docker commands

* try again with the module as it's needed for logging in to docker

* run provisioning tasks through qa

* add jump host

* update known hosts once more

* add more logging

* update qa fingerprint

* lower timeout limits

* restart ssh as root

* change ssh restart method for ubuntu 23

* make a 1-1 mapping between GitHub environments and deployed environments. Demo should have its own GitHub environment and not use production

* add back docker login

* make it possible to pass SSH args to deploy script

* fix

* make it possible to supply additional ssh parameters for clear script

* updates to create environment script

* configure jump host for production

* update production ssh fingerprint

* make production a 2-server deployment

* add missing jump host definition for docker-workers

* ignore VPN and other allowed addresses in fail2ban

* update staging and prod docker compose files

* fix jinja template

* configure rsync to not change file permissions

* add debug

* remove -a from rsync so it doesn't try to change permissions

* add wireguard data partition, ensure files in deployment directory are owned by application group

* make setting ownership recursive

* set read permissions for others in /opt/opencrvs so docker users can read the files

* increase fail2ban limits

* attach traefik to vpn network

* make ssh user configurable for port-forwarding script

* update wg-easy

* update wg-easy

* fix cert resolver for vpn

* use github container registry and latest version for wg-easy

* pass wireguard password variable through deployment pipeline

* pass all github deployment environment variables to docker swarm deployment

* move environment variables to the right function

* make a separate function that reads and supplies the env variables

* remove KNOWN_HOSTS from env variables

* remove more variables, fix escape

* make sure KNOWN_HOSTS won't leak to deploy step

* remove debug logging

* only set traefik to vpn network on QA where Wireguard server is

* add validation to make sure all environment variables are set

* download core compose files before validating environment variables

* fix curl urls when downloading core compose files

* remove default latest value from country config version

* fix country config version variable not going to docker compose files

* fix compose env file order

* fix environment variable filtering

* add pipeline for resetting user's 2FA

* fix name of the pipeline

* trick github into showing the new pipeline

* fetch repo first

* use jump host

* add debug step

* remove unnecessary matrix definition

* remove debugging code

* use docker config instead of volume mounts where possible

* add read and execute rights for others to the deployment directory as sometimes users inside docker containers do not match the host machine users

* create a jump user for QA, allow defining multiple ssh keys for users

* do not add 2factor for jump users

* use new jump user in inventory files as well

* set infobip environment variables as optional, add missing required environment variables to environment creator script

* add support for 1-infinite replicas

* add missing network

* add missing export to VERSION variable

* remove demo deployment configuration for now

* Create a backup restore cron on staging (#812)

* Create a backup restore cron on staging (see the cron task sketch after this commit message)

* allow a label to be passed to the script for snapshot usage

* Updated release action

* Add approval step to production deploys

* Add Riku's username to prod deploys

* add separate config flag for provisioning, indicating whether the server should back up its data somewhere else or periodically restore data

* add configuration so that QA can allow connections through the provision user to other machines

* create playbook for backup servers and the connection between app servers and backups

* add tags

* add tag to workflow

* add task to ensure ssh dir exists for backup user

* create home directory for backup

* ensure backup task is always applied for root's crontab

* add default value for periodic_restore_from_backup

* make it possible to deploy production with current infrastructure

* Revert "make it possible to deploy production with current infrastructure"

This reverts commit 36edf30.

* fix wait hosts definition for migrations

* make production a qa environment temporarily

* add shell for backup user so rsync works

* explicitly define which user is the one running crontab, ensure that user's key gets to backup server

* ensure .ssh directory exists for crontab user

* get user home directories dynamically

* add missing tags

* add become

* fix file path

* define backup machine in staging config as well

* remove condition from fetch

* always create public key from private key

* use hardcoded file name for the public key

* fix syntax

* make staging a QA environment so it reflects production

* separate backup downloading and restoring into two different scripts, use the production server's encryption key on the machine that restores the backup (staging)

* fix an issue with a running OpenHIM while we restore backup

When I cleared the database and then restored data, the restore process failed if the running OpenHIM process had written new documents in the meantime (see the sketch of one possible mitigation after this commit message).

* restart minio after restoring data

---------

Co-authored-by: Riku Rouvila <[email protected]>

* fix snapshot script restore reference

* remove openhim base config

* remove WIREGUARD_ADMIN_PASSWORD reference from production deployment pipelines

* remove authorized_keys file

* add debug logging for clear all data script

* define REPLICAS variable before validating it

* fix syntax error in clear script

* automate updating branches on release

* switch back to previous traefik port definition

https://github.com/opencrvs/opencrvs-farajaland/pull/789/files/7a034732d3f38cfdb00d919f470bb7e48d587cdd#r1449976486

* rename 2factor to two_factor

* add default true value for two_factor

* [OCRVS-6437] Forward Elastalert emails through country config (#851)

* forward Elastalert emails to the country config's new /email endpoint and send them onward from there

* add NOTIFICATION_TRANSPORT variable to deployments scripts

* fix deployment

* move dotenv to normal deps

* add back removed environment variable

* fix email route definition

* make default route ignore the /email path

* add missing environment variables for dev environment

* [OCRVS-6350] Disable root (#849)

* disable root login completely

* stop users from using 'su'

* only disable root login if ansible user being used is not root

* add history timestamps for user terminal history (#848)

* add playbook for ubuntu to update security patches automatically (#846)

* fix staging + prod key access to backup server

* update prod & staging jump keys

* fix manager hostname reference

* add a mechanism for defining additional SSH public keys that can login to the provisioning user

---------

Co-authored-by: naftis <[email protected]>
Co-authored-by: Riku Rouvila <[email protected]>
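
To make the 'Create a backup restore cron on staging' item above concrete, here is a minimal Ansible sketch of what such a scheduled restore could look like. The script paths, log file, schedule, and user are assumptions, not taken from this PR:

# Hypothetical Ansible task: schedule a nightly job on staging that first downloads the
# latest backup and then restores it. All paths and the schedule below are assumptions.
- name: Schedule nightly backup restore on staging
  ansible.builtin.cron:
    name: "opencrvs nightly backup restore"
    user: root
    minute: "0"
    hour: "3"
    job: >-
      /opt/opencrvs/infrastructure/backups/download.sh
      && /opt/opencrvs/infrastructure/backups/restore.sh
      >> /var/log/opencrvs-restore.log 2>&1

Running the job from root's crontab and splitting download from restore mirrors the commits above ('ensure backup task is always applied for root's crontab', 'separate backup downloading and restoring into two different scripts').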
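
The note about OpenHIM writing new documents mid-restore implies the service has to stay quiet while data is restored. One way such a mitigation could look, sketched as Ansible tasks; the service name, script path, and the scale-down approach itself are assumptions rather than the fix actually used in this PR:

# Hypothetical sketch: pause OpenHIM while the backup is restored so it cannot write new
# documents during the restore, then bring it back up afterwards.
- name: Scale OpenHIM core down before restoring
  ansible.builtin.command: docker service scale opencrvs_openhim-core=0

- name: Restore the downloaded backup
  ansible.builtin.command: /opt/opencrvs/infrastructure/backups/restore.sh

- name: Scale OpenHIM core back up
  ansible.builtin.command: docker service scale opencrvs_openhim-core=1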
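
For the 'only define overlay net in the main deploy docker compose so that it stays attachable' item, a minimal compose sketch; the network and service names and the image tag are assumptions:

# Hypothetical excerpt of the main deploy compose file: the overlay network is declared
# only here, with attachable: true, so other compose files and one-off containers can
# join it instead of re-declaring it.
version: "3.8"

networks:
  opencrvs_overlay_net:
    driver: overlay
    attachable: true

services:
  traefik:
    image: traefik:v2.10
    networks:
      - opencrvs_overlay_net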
@rikukissa deleted the restore-backup-cron branch on May 7, 2024 at 12:01