Teaching record system - TRS Disaster Recovery testing
This is an accompaniment to the Teacher Services Disaster Recovery testing documentation and Teacher Services Disaster Recovery documentation.
See also the Data recovery testing document.
Prerequisites
We use the pentest environment for disaster recovery testing. Unlike production, the pentest environment does not have scheduled backups, so a backup must be created manually before testing.
- Create a backup of the postgres server using the Backup database to Azure storage action:
  - Environment: pentest
  - Backup file name: (leave blank)
  - Database server name: s189t01-trs-pt-pg
- Once complete, view the Backup database summary and copy the backup filename
Scenario 1: Loss of database server
Delete the postgres database instance
- Log on to Azure Portal and delete postgres server s189t01-trs-pt-pg on environment pentest
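For repeated test runs, the portal deletion above could also be scripted. The following is a minimal sketch, assuming the Azure CLI is installed and logged in to the s189-teacher-services-cloud-test subscription. The function names are hypothetical, and the resource group is taken from the resource ID quoted in the Terraform error message in Option 2 below. The guard exists so that a mistyped server name cannot delete anything other than the pentest server.

```shell
# Sketch only: not part of the documented process.
PENTEST_SERVER="s189t01-trs-pt-pg"
PENTEST_RG="s189t01-trs-pt-rg"   # assumption: taken from the resource ID in the Terraform error

# Succeeds only when the requested server is exactly the pentest server.
is_pentest_server() {
  [ "$1" = "$PENTEST_SERVER" ]
}

# Deletes the server with the Azure CLI, but only after the guard passes.
delete_pentest_server() {
  local server="$1"
  if ! is_pentest_server "$server"; then
    echo "refusing to delete $server: not the pentest server" >&2
    return 1
  fi
  az postgres flexible-server delete \
    --resource-group "$PENTEST_RG" --name "$server" --yes
}
```

Calling `delete_pentest_server s189t01-trs-pt-pg` would then issue the delete without an interactive prompt, so note the time of deletion for the restore step.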
Start the incident process (if not already in progress)
- Skip this step for DR testing
Freeze pipeline
- Skip this step for DR testing - there are no active pipelines merging into the pentest environment
Enable maintenance mode
- Skip this step - TRS does not have a maintenance mode
Recreate the lost postgres database server
Option 1. Recover from Azure backups
- Run the Recover deleted postgres database workflow:
  - Environment to restore: pentest
  - Restore to production: false
  - Restore point in time: This should be a point in time after the backup in the Prerequisites section was created but before the database was deleted
  - Deleted postgres server: s189t01-trs-pt-pg
  Note: a point-in-time restore can only be run 10 minutes after the database has been deleted (this is an Azure constraint), but the point in time itself must be before the database was deleted.
  Note also: the GitHub action will complete, but it only triggers a request to Azure to restore the server, which will take an unspecified amount of time.
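The timing rules in the notes above are easy to get wrong under pressure, so they can be checked before running the workflow. This is a sketch under stated assumptions: it relies on GNU `date` (the `-d` option), it reads "10 minutes after" as the earliest time the workflow can be started, and the function names and the ISO 8601 UTC timestamp format are illustrative rather than something the workflow is known to require.

```shell
# Sketch only: helper names and timestamp format are assumptions.

# earliest_workflow_run DELETED_AT
# Prints the time 10 minutes after the server was deleted.
earliest_workflow_run() {
  local deleted_epoch
  deleted_epoch=$(date -u -d "$1" +%s) || return 1
  date -u -d "@$((deleted_epoch + 600))" +%Y-%m-%dT%H:%M:%SZ
}

# valid_restore_point BACKUP_AT RESTORE_AT DELETED_AT
# Succeeds when the restore point is after the backup and before the deletion.
valid_restore_point() {
  local backup restore deleted
  backup=$(date -u -d "$1" +%s) || return 1
  restore=$(date -u -d "$2" +%s) || return 1
  deleted=$(date -u -d "$3" +%s) || return 1
  [ "$backup" -lt "$restore" ] && [ "$restore" -lt "$deleted" ]
}
```

For a server deleted at 2024-05-01T10:00:00Z, `earliest_workflow_run 2024-05-01T10:00:00Z` prints `2024-05-01T10:10:00Z`.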
Option 2. Recreate via terraform and restore from scheduled offline backup
- Check there aren't any diagnostic settings on Azure Portal
  - Subscription: s189-teacher-services-cloud-test
  - Resource group: s189t01-trs-pt-rg
  - Database: s189t01-trs-pt-pg-ptr
- Ignore the terraform changes as these are just to allow deployment to continue while in maintenance mode - TRS does not have a maintenance mode
- Run the Build and deploy workflow against the main branch
- If the following error occurs:
  Error: A resource with the ID "/subscriptions/***/resourceGroups/s189t01-trs-pt-rg/providers/Microsoft.DBforPostgreSQL/flexibleServers/s189t01-trs-pt-pg/databases/trs_pentest" already exists - to be managed via Terraform this resource needs to be imported into the State. Please see the resource documentation for "azurerm_postgresql_flexible_server_database" for more information.
  You will have to go into Azure Portal, delete the trs_pentest database and retry step 3 (the Build and deploy workflow)
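As an alternative to the portal, the leftover database could be removed with the Azure CLI. The sketch below assumes the CLI is logged in to the right subscription and that the deploy log has been saved or piped as text. The function names are hypothetical. The marker string is copied from the error message above, and the resource names come from the resource ID in that message.

```shell
# Sketch only: not part of the documented process.
TF_ERROR_MARKER='already exists - to be managed via Terraform this resource needs to be imported into the State'

# Reads a deploy log on stdin; succeeds when it contains the "already exists" error.
has_import_error() {
  grep -Fq "$TF_ERROR_MARKER"
}

# Deletes the trs_pentest database so the Build and deploy workflow can recreate it.
delete_leftover_database() {
  az postgres flexible-server db delete \
    --resource-group "s189t01-trs-pt-rg" \
    --server-name "s189t01-trs-pt-pg" \
    --database-name "trs_pentest" \
    --yes
}
```

A possible use is `has_import_error < deploy.log && delete_leftover_database`, where `deploy.log` is a hypothetical file holding the workflow output, followed by a rerun of the Build and deploy workflow.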
Restore the data from previous backup in Azure storage
- Ignore this if you used Option 1 in the previous section.
- Run the Restore database from Azure storage workflow:
  - Environment to restore: pentest
  - Restore to production: false
  - Name of the backup file: Backup file created in the Prerequisites section
Validate app
- See Data recovery testing document for steps
Disable maintenance mode
- Skip this step - TRS does not have a maintenance mode
Unfreeze pipeline
- Skip this step for DR testing - there are no active pipelines merging into the pentest environment
Scenario 2: Loss of data
Stop the service as soon as possible
- Connect to the pentest environment as in Connecting to environments:
  - Subscription: s189-teacher-services-cloud-test
  - Resource group: s189t01-tsc-pt-rg
  - Cluster: s189t01-tsc-platform-test-aks
  - Namespace: development
  e.g.:
    az account set --subscription s189-teacher-services-cloud-test
    az aks get-credentials --overwrite-existing -g s189t01-tsc-pt-rg --name s189t01-tsc-platform-test-aks
    kubectl get pods -n development --insecure-skip-tls-verify
    kubectl -n development get deployments
    kubectl -n development scale deployment trs-pentest-ui-xxxxxxxxxx-xxxxx --replicas 0
    kubectl -n development scale deployment trs-pentest-worker-xxxxxxxxxx-xxxxx --replicas 0
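The two scale commands above can be wrapped in a small helper so the same code stops and later restarts the service. This is a sketch, not the team's tooling: the function name is hypothetical, and the deployment names must be taken from the output of `kubectl -n development get deployments`.

```shell
# Sketch only: not part of the documented process.
NAMESPACE="development"

# scale_deployments REPLICAS NAME...
# Scales each named deployment in the namespace to the given replica count.
scale_deployments() {
  local replicas="$1"
  shift
  case "$replicas" in
    ''|*[!0-9]*)
      echo "replicas must be a non-negative integer" >&2
      return 1
      ;;
  esac
  local name
  for name in "$@"; do
    kubectl -n "$NAMESPACE" scale deployment "$name" --replicas "$replicas" || return 1
  done
}
```

Stopping the service would then be `scale_deployments 0` followed by the UI and worker deployment names.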
Start the incident process (if not already in progress)
- Skip this step for DR testing
Freeze pipeline
- Skip this step for DR testing - there are no active pipelines merging into the pentest environment
Enable maintenance mode
- Skip this step - TRS does not have a maintenance mode
Consider backing up the database
- Skip this step for DR testing
Restore postgres database
- Run the Restore database from point in time to new database server workflow
  - Environment to restore: pentest
  - Restore to production: false
  - Restore point in time: This should be a point in time after the backup in the Prerequisites section was created
  - Name of the new postgres server: (any name as long as it's different to existing server names)
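Before running the workflow, the chosen server name can be checked against the servers that already exist. A minimal sketch, assuming the Azure CLI is logged in and that the new server is created in the s189t01-trs-pt-rg resource group. The function name is hypothetical, and the check only covers that one resource group.

```shell
# Sketch only: not part of the documented process.
RESOURCE_GROUP="s189t01-trs-pt-rg"   # assumption: where the restored server is created

# server_name_available NAME
# Succeeds when no existing flexible server in the resource group has that name.
# Returns 2 when the server list could not be fetched.
server_name_available() {
  local names
  names=$(az postgres flexible-server list \
    --resource-group "$RESOURCE_GROUP" --query "[].name" --output tsv) || return 2
  ! printf '%s\n' "$names" | grep -Fxq "$1"
}
```

The match is on the whole name, so a candidate that merely extends an existing name with a suffix is still reported as available.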
Upload restored database to Azure storage
- Run the Backup database to Azure storage workflow
  - Environment to backup: pentest
  - Backup file name: (leave blank)
  - Database server name: Name of the server created in previous step
- Once complete, view the Backup database summary and copy the backup filename
Validate data
- Skip this step for DR testing - make does not work on Windows machines
Restore data into the live server
Restart applications
- Connect to the pentest environment as in Connecting to environments:
  - Subscription: s189-teacher-services-cloud-test
  - Resource group: s189t01-tsc-pt-rg
  - Cluster: s189t01-tsc-platform-test-aks
  - Namespace: development
  e.g.:
    az account set --subscription s189-teacher-services-cloud-test
    az aks get-credentials --overwrite-existing -g s189t01-tsc-pt-rg --name s189t01-tsc-platform-test-aks
    kubectl get pods -n development --insecure-skip-tls-verify
    kubectl -n development get deployments
    kubectl -n development scale deployment trs-pentest-ui-xxxxxxxxxx-xxxxx --replicas 1
    kubectl -n development scale deployment trs-pentest-worker-xxxxxxxxxx-xxxxx --replicas 1
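Scaling back up returns before the pods are actually ready, so it can help to wait for the rollout before moving on to validation. The sketch below uses `kubectl rollout status`. The function name and the timeout value in the usage note are illustrative assumptions.

```shell
# Sketch only: not part of the documented process.
NAMESPACE="development"

# wait_for_rollout TIMEOUT NAME...
# Blocks until each named deployment reports a completed rollout, or fails on timeout.
wait_for_rollout() {
  local timeout="$1"
  shift
  local name
  for name in "$@"; do
    kubectl -n "$NAMESPACE" rollout status "deployment/$name" --timeout "$timeout" || return 1
  done
}
```

For example, `wait_for_rollout 120s` followed by the UI and worker deployment names waits up to two minutes per deployment.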
Validate app
- See Data recovery testing document for steps
Disable maintenance mode
- Skip this step - TRS does not have a maintenance mode
Unfreeze pipeline
- Skip this step for DR testing - there are no active pipelines merging into the pentest environment
Tidy up
- In Azure Portal, delete the server created in the "Restore postgres database" step above