Teacher Services Cloud - Disaster recovery
The systems are built with resilience in mind, but they may still fail in different ways and cause an incident.
This document covers the most critical scenarios and should be used in case of an incident. The scenarios should be tested regularly by following the Disaster recovery testing document.
Scenario 1: Loss of database server
In this scenario, the Azure Postgres flexible server and the database it contains have been completely lost.
There are two main options for recovery.
- Recover the deleted server from the Azure backups. These can be used to recover a dropped Azure Database for PostgreSQL flexible server resource within five days from the time of server deletion. Note that Microsoft do not guarantee this will work as there are other factors involved. See https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/how-to-restore-dropped-server
- Recreate the Postgres server via Terraform, and then restore from the nightly database backups taken by a scheduled GitHub workflow. These backups are stored in Azure storage accounts and kept for 7 days.
Option 1 should be attempted first, as it can recover to a point very close to the server loss, minimising any potential data loss. Option 2 should be used if option 1 fails.
The objectives are:
- Recover the deleted server from the Azure backups (option 1)
- Recreate the lost postgres database server (option 2)
- Restore data from nightly backup stored in Azure (option 2)
Start the incident process (if not already in progress)
Follow the incident playbook and contact the relevant stakeholders as described in create-an-incident-slack-channel-and-inform-the-stakeholders-comms-lead.
Freeze pipeline
Alert developers that no one should merge to main.
- In GitHub settings, a user with repo admin privileges should update the Branch protection rules and set required PR approvers to 6 (a CLI sketch is below)
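If you prefer the command line, a minimal sketch using the GitHub CLI, assuming admin rights and classic branch protection rules on main (the owner/repo placeholder needs replacing):

```shell
# Hypothetical repo path: substitute the affected repository.
# Raising the required approver count effectively blocks merges.
gh api -X PATCH \
  "repos/<owner>/<repo>/branches/main/protection/required_pull_request_reviews" \
  -F required_approving_review_count=6
```

The same call with the count set back to 1 can be used when unfreezing the pipeline.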
Enable maintenance mode
Run the Enable maintenance or Set maintenance mode workflow for the service and environment affected.
The maintenance page message can be updated at any time during the incident.
e.g. https://claim-additional-payments-for-teaching-test-web.test.teacherservices.cloud will now display the maintenance page and
https://claim-additional-payments-for-teaching-temp.test.teacherservices.cloud will display the application.
Note that the available temp route can be seen on the completed maintenance workflow summary view in GitHub.
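A quick way to confirm the switch, using the example hostnames above (the exact maintenance page content may vary by service):

```shell
# The main route should now return the maintenance page...
curl -s https://claim-additional-payments-for-teaching-test-web.test.teacherservices.cloud | grep -i maintenance
# ...while the temp route should still return the application (HTTP 200 expected)
curl -s -o /dev/null -w "%{http_code}\n" https://claim-additional-payments-for-teaching-temp.test.teacherservices.cloud
```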
Recreate the lost postgres database server
Option 1. Recover from Azure backups
Run the restore-deleted-postgres workflow to recreate the missing postgres database.
- provide the environment, the name of the server to be restored, and the restore point in time in UTC, e.g. 2024-07-24T06:00:00. This should be at least 10 minutes after the server was deleted. A CLI sketch for triggering the workflow is below.
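The workflow can be run from the GitHub Actions UI, or from the command line. A sketch using the GitHub CLI; the workflow file name and input names here are assumptions and must be checked against the actual workflow:

```shell
# Hypothetical workflow file and input names: verify before running
gh workflow run restore-deleted-postgres.yml \
  -f environment=test \
  -f server-name=s189t01-ittms-stg-pg \
  -f restore-time="2024-07-24T06:00:00"
```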
Option 2. Recreate via terraform and restore from scheduled offline backup
Run the deploy workflow to recreate the missing postgres database as detailed below.
- Check for and delete any postgres diagnostic settings remaining for the deleted instance in https://portal.azure.com/#view/Microsoft_Azure_Monitoring/AzureMonitoringBrowseBlade/~/diagnosticsLogs, as the later deploy to rebuild postgres will fail if they remain. e.g. search using subscription s189-teacher-services-cloud-test and resource group s189t01-ittms-stg-pg and look for enabled Diagnostic settings. An Azure CLI alternative is sketched below.
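If you prefer the Azure CLI to the portal, a minimal sketch, assuming you can build the deleted server's resource ID from the subscription, resource group and server name:

```shell
# List any diagnostic settings still attached to the deleted server's resource ID
az monitor diagnostic-settings list --resource \
  "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.DBforPostgreSQL/flexibleServers/<server-name>"

# Delete a remaining setting by name
az monitor diagnostic-settings delete --name <setting-name> --resource \
  "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.DBforPostgreSQL/flexibleServers/<server-name>"
```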
As the maintenance page has been enabled, you will need to:
- create a branch from main
- update the terraform application config as per: configure-terraform-to-keep-deploying-the-application
- push the branch to GitHub (no need to create a PR)
- run the deploy workflow using your branch
Note that the deploy workflow may fail on steps after the postgres server creation, e.g. smoke tests or database migrations. This is expected while the maintenance page is enabled. You can confirm the server is available via a healthcheck URL that checks the database status (if your service has one), or via the Azure portal. The healthcheck URL will need to use the temp route, as sketched below.
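For example, a minimal check against the temp route, assuming your service exposes a /healthcheck endpoint (hostname and path are illustrative):

```shell
# Expect a healthy response that includes the database status
curl -s https://claim-additional-payments-for-teaching-temp.test.teacherservices.cloud/healthcheck
```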
Restore the data from previous backup in Azure storage
This step isn't required if you used the restore-deleted-postgres workflow (option 1) in the previous step. Otherwise, run the Restore database from Azure storage workflow.
Validate app
Confirm the app is working and can see the restored data. The app is available on the temporary ingress URL.
e.g. https://claim-additional-payments-for-teaching-temp.test.teacherservices.cloud will display the application.
You may also want to check any healthcheck URLs (e.g. /healthcheck), admin interfaces, API requests, etc.
Disable maintenance mode
Run the Disable maintenance or Set maintenance mode workflow for the service and environment affected.
Unfreeze pipeline
Alert developers that merge to main is allowed.
- In GitHub settings, update the Branch protection rules and set required PR approvers back to 1
Scenario 2: Loss of data
In the case of data loss or corruption, we need to recover the data as soon as possible in order to resume normal service.
The application database is an Azure flexible postgres server. This server has point-in-time restore (PTR) with a resolution of 1 second, available for any point between 5 minutes and 7 days in the past. PTR restores the live server to a point in time on a new copy of the server; it does not change the live server in any way. Once the new server is available, it can be accessed using konduit.sh to check previous data, and data can then be recovered to the original server.
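The restore workflow used later in this scenario drives this Azure capability. For reference, the underlying operation is roughly equivalent to this Azure CLI sketch (all names are illustrative):

```shell
# Creates <new-server> as a copy of <live-server> at the given UTC time.
# The live server itself is not modified.
az postgres flexible-server restore \
  --resource-group <resource-group> \
  --name <new-server> \
  --source-server <live-server> \
  --restore-time "2024-07-24T06:00:00Z"
```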
The objectives are:
- Create a separate new postgres database server
- Restore data from the current live database to the new postgres database server from a particular point in time
- Restore data into the live database from the new PTR server
Stop the service as soon as possible
If the service is available, even in a degraded mode, there is a risk users may make edits and corrupt the data even further, or access data they should not have access to. To prevent this, stop the web app and/or workers as soon as possible. This can be done using the kubectl scale command.
e.g. [update namespace and deployment names as required]
```shell
kubectl -n bat-staging get deployments
kubectl -n bat-staging scale deployment itt-mentor-services-staging --replicas 0
kubectl -n bat-staging scale deployment itt-mentor-services-staging-worker --replicas 0
```
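To confirm the service has stopped, a quick check (namespace from the example above):

```shell
# No web or worker pods should remain running
kubectl -n bat-staging get pods
```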
Note: You can enable maintenance mode first; however, it is still recommended to scale down the web and worker apps to prevent any side effects from occurring.
Start the incident process (if not already in progress)
Follow the incident playbook and contact the relevant stakeholders as described in create-an-incident-slack-channel-and-inform-the-stakeholders-comms-lead.
Freeze pipeline
Alert developers that no one should merge to main.
- In GitHub settings, a user with repo admin privileges should update the Branch protection rules and set required PR approvers to 6 (see the CLI sketch in scenario 1)
Enable maintenance mode
Run the Enable maintenance or Set maintenance mode workflow for the service and environment affected.
The maintenance page message can be updated at any time during the incident.
Note that the available temp route can be seen on the completed maintenance workflow summary view in GitHub.
Consider backing up the database
If users have entered data or new users have signed up, we may need to keep this data for reconciliation later on. Use the Backup database to Azure storage workflow to save a copy of the flawed database. Use a specific name to identify the backup file later on.
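As a sketch, triggering this from the command line might look like the following; the workflow file name and input names are assumptions to verify against the repo:

```shell
# Hypothetical workflow file and input names: verify before running
gh workflow run backup-database.yml \
  -f environment=staging \
  -f backup-file-name=incident-2024-07-24-pre-restore
```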
Restore postgres database
Run the Restore database from point in time to new database server workflow using a time before the data was deleted. If you need to rerun the workflow, it may fail if the new server was already created. Override the new server name to work around the issue.
Important: when you record the time, note which timezone you are using, and convert it to UTC before using it in the workflow. This matters especially during BST (British Summer Time), when UK local time is one hour ahead of UTC.
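For example, a conversion sketch using GNU date (Linux; the flags differ on macOS):

```shell
# Interpret 14:30 UK local time on the incident date, print it in UTC
TZ=Europe/London date -u -d "2024-07-24 14:30" +%Y-%m-%dT%H:%M:%S
# => 2024-07-24T13:30:00
```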
Upload restored database to Azure storage
Use the Backup database to Azure storage workflow and choose the restored server as input. Use a specific name to identify the backup file later on.
Validate data
It may be necessary to connect to the PTR postgres server for troubleshooting, before deciding on a full restore or otherwise. For instance, the PTR restore may have to be rerun with a different date/time.
To connect to the PTR postgres copy using psql via konduit:
- Install konduit.sh locally using the make command
- Run:
```shell
bin/konduit.sh -x -n <namespace-of-deployment> -s <name-of-ptr-server> <name-of-deployment> -- psql
# e.g.
bin/konduit.sh -x -n tra-test -s s189t01-ittms-stg-pg-ptr itt-mentor-services-staging -- psql
```
To connect to the existing live postgres server for comparison:
- Run:
```shell
bin/konduit.sh -x <name-of-deployment> -- psql
# e.g.
bin/konduit.sh -x itt-mentor-services-staging -- psql
```
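To compare the two servers without an interactive session, a sketch assuming konduit.sh passes extra arguments through to psql (the table name is illustrative; query whatever matters to your service):

```shell
# Row count on the PTR copy...
bin/konduit.sh -x -n tra-test -s s189t01-ittms-stg-pg-ptr itt-mentor-services-staging -- psql -c "select count(*) from users;"
# ...and on the live server, for comparison
bin/konduit.sh -x itt-mentor-services-staging -- psql -c "select count(*) from users;"
```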
Restore data into the live server
To perform a complete restore of the live server from the PTR copy, use the Restore database from Azure storage workflow and choose the backup file created above to restore to the live postgres server.
Restart applications
e.g. [update namespace, deployment names and replicas as required]
```shell
kubectl -n bat-staging get deployments
kubectl -n bat-staging scale deployment itt-mentor-services-staging --replicas 2
kubectl -n bat-staging scale deployment itt-mentor-services-staging-worker --replicas 1
```
Validate app
Confirm the app is working and can see the restored data. If the maintenance page is enabled, the app is available on the temporary ingress URL.
You may also want to check any healthcheck URLs (e.g. /healthcheck), admin interfaces, API requests, etc.
Disable maintenance mode
Run the Disable maintenance or Set maintenance mode workflow for the service and environment affected.
Unfreeze pipeline
Alert developers that merge to main is allowed.
- In GitHub settings, update the Branch protection rules and set required PR approvers back to 1
Tidy up
If a PTR was run, the database copy server should be deleted.
If this document is being followed as part of a DR test, then complete the DR test post-scenario steps.
Post DR review
- Schedule an incident retro meeting with all the stakeholders
- Review the incident and fill in the incident report
- Raise Trello cards for any process improvements