Teacher Services Cloud - Disaster Recovery testing
This document covers the Disaster Recovery testing procedure for applications hosted on the Teacher Services AKS clusters. It is based on the scenarios detailed in the Disaster recovery document.
Prerequisites
- Identified environment for the test, e.g. qa, staging or test
- Identified scenario(s) that are to be tested
- Repository workflows, which should use the existing DFE github-actions, to:
  - Deploy the selected environment
  - Back up the postgres database to Azure storage [required for scenario 1]
  - Recover a deleted postgres database server [required for scenario 1]
  - Restore the database from Azure storage [required for scenario 1]
  - Restore the database from a point in time to a new database server [required for scenario 2]
- Repository workflows to enable and disable the maintenance page
  - See https://github.com/DFE-Digital/teacher-services-cloud/blob/main/documentation/maintenance-page.md
  - Confirm that the workflows exist for the environment selected for the test
- An app URL that reports the SHA of the currently deployed Docker image. This can be part of the healthcheck, e.g. https://github.com/sdglhm/okcomputer/blob/master/lib/ok_computer/built_in_checks/app_version_check.rb
- Identify the technical and non-technical stakeholders who will participate in the test, based on the Teacher services list
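The app URL that reports the image SHA is useful after a recovery to confirm that the expected build is running. The following is a minimal sketch of such a check, not part of the existing tooling. It assumes the healthcheck returns plain text or JSON containing a git-style hexadecimal SHA somewhere in the body; the function names `fetch_body` and `reports_expected_sha` are hypothetical, and the real response format depends on how each service exposes its version.

```python
import re
import urllib.request

# Assumption: the SHA appears in the body as a 7 to 40 character hex token.
HEX_TOKEN = re.compile(r"\b[0-9a-f]{7,40}\b")


def fetch_body(url: str, timeout: float = 10.0) -> str:
    """Fetch the healthcheck URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def reports_expected_sha(body: str, expected_sha: str) -> bool:
    """Return True if the body contains a hex token matching expected_sha.

    Short and full SHAs are treated as matching when one is a prefix
    of the other, so either side may be abbreviated.
    """
    expected = expected_sha.strip().lower()
    if len(expected) < 7:
        raise ValueError("expected_sha must be at least 7 characters")
    for token in HEX_TOKEN.findall(body.lower()):
        if token.startswith(expected) or expected.startswith(token):
            return True
    return False
```

In a test run, the body would come from `fetch_body` called with the service's own healthcheck URL, and `expected_sha` from the commit that the deploy workflow reported.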
Documentation requirements
Copy the template DR testing document, which will serve as the record of the scenarios run, the time taken and any issues encountered.
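If it helps to capture timings consistently before transferring them to the DR testing document, they can be recorded in a small structure like the one below. This is only an illustrative sketch: the `ScenarioRecord` class and its fields are hypothetical and do not reflect the layout of the actual template.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ScenarioRecord:
    """Hypothetical record of one DR scenario run."""
    name: str
    started_at: datetime
    finished_at: Optional[datetime] = None
    issues: List[str] = field(default_factory=list)

    def duration(self) -> timedelta:
        """Time taken, available once the scenario has finished."""
        if self.finished_at is None:
            raise ValueError("scenario has not finished yet")
        return self.finished_at - self.started_at

    def summary(self) -> str:
        """One-line summary suitable for pasting into the report."""
        minutes = int(self.duration().total_seconds() // 60)
        status = f"{len(self.issues)} issue(s)" if self.issues else "no issues"
        return f"{self.name}: {minutes} min, {status}"
```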
Initial set-up
Participants must have access to GitHub and the relevant repositories.
Schedule a virtual meeting in which the test will take place
- Use Teams or Slack
- Invite the relevant stakeholders
Regularly provide updates on the service Slack channel to keep product owners abreast of developments.
Scenario 1: Loss of database instance
See DR scenario 1.
Delete the postgres database instance
Note that a backup must already exist on Azure storage before you start this step. If there is none, create one before continuing.
- Delete the existing postgres database
  - Delete it manually via the Azure portal: https://portal.azure.com/#browse/Microsoft.DBforPostgreSQL%2FflexibleServers
  - Confirm that it has been deleted
Follow the disaster recovery instructions.
Scenario 2: Loss of data
See DR scenario 2.
Delete data from the postgres database instance
Make a note of the time at which this step starts, because the restore point must be earlier than the moment any data is deleted.
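One way to avoid ambiguity when noting the time is to record it in UTC and derive the restore point from it. The sketch below assumes a small safety margin is subtracted from the recorded start time; the function name `restore_point_before` and the 5-minute default are illustrative assumptions, not values taken from the disaster recovery instructions.

```python
from datetime import datetime, timedelta, timezone


def restore_point_before(step_started_at: datetime,
                         safety_margin_minutes: int = 5) -> str:
    """Return a UTC timestamp that is safely before the deletion step.

    step_started_at must be timezone-aware so that local times are
    converted to UTC correctly. The margin is an assumed buffer.
    """
    if step_started_at.tzinfo is None:
        raise ValueError("step_started_at must be timezone-aware")
    if safety_margin_minutes < 0:
        raise ValueError("safety_margin_minutes must not be negative")
    restore_point = (step_started_at.astimezone(timezone.utc)
                     - timedelta(minutes=safety_margin_minutes))
    return restore_point.strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    # Record the start of the step just before deleting anything.
    started = datetime.now(timezone.utc)
    print("Step started:", started.strftime("%Y-%m-%dT%H:%M:%SZ"))
    print("Restore point:", restore_point_before(started))
```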
- Delete a table manually
  - Connect via konduit and delete the table
  - It must be possible to confirm that the data has been deleted, whether within the app, through error messages being logged, the app crashing, or users observing inconsistent content.
Follow the disaster recovery instructions.
Post scenario steps
Documentation requirements
- Complete the DR testing document and save it in the DR test Reports folder
- Update the service's entry on the infra team SharePoint service list with the DR date and status (success/fail)
Post DR test review
- Review the just-completed DR test and raise Trello cards for any process improvements.
- Review the contact list in the Teacher services list