Teaching record system - TRS Disaster Recovery testing

This document accompanies the Teacher Services Disaster Recovery testing documentation and the Teacher Services Disaster Recovery documentation.

See also the Data recovery testing document.

Prerequisites

We use the pentest environment for disaster recovery testing. Unlike production, the pentest environment does not have scheduled backups, so a backup must be created manually before testing.

  • Create a backup of the postgres server using the Backup database to Azure storage workflow:
    • Environment: pentest
    • Backup file name: (leave blank)
    • Database server name: s189t01-trs-pt-pg
  • Once complete, view the Backup database summary and copy the backup filename

Scenario 1: Loss of database server

Delete the postgres database instance

  • Log onto Azure Portal and delete postgres server s189t01-trs-pt-pg on environment pentest
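For repeated test runs, the portal deletion above could also be scripted with the Azure CLI. The following is a minimal sketch and not part of the documented process. It assumes the server lives in resource group s189t01-trs-pt-rg (the group named under Option 2 of this scenario), that `az login` has already been run, and that the test subscription is selected. The function name `delete_pentest_server` is hypothetical. It prints a UTC timestamp first, because the restore point chosen later has to be earlier than the deletion.

```shell
# Hypothetical helper: record the time, then delete the pentest postgres server.
# Assumes `az login` has been run and the resource group is s189t01-trs-pt-rg.
delete_pentest_server() {
  # Print the current UTC time; the restore point must be earlier than this.
  date -u +"Deleting at %Y-%m-%dT%H:%M:%SZ"
  az postgres flexible-server delete \
    --resource-group s189t01-trs-pt-rg \
    --name s189t01-trs-pt-pg \
    --yes
}
```

Calling `delete_pentest_server` runs the deletion without a confirmation prompt because of `--yes`, so double-check the server name before using it.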

Start the incident process (if not already in progress)

  • Skip this step for DR testing

Freeze pipeline

  • Skip this step for DR testing - there are no active pipelines merging into the pentest environment

Enable maintenance mode

  • Skip this step - TRS does not have a maintenance mode

Recreate the lost postgres database server

Option 1. Recover from Azure backups

  • Run the Recover deleted postgres database workflow:

    • Environment to restore: pentest
    • Restore to production: false
    • Restore point in time: a point in time after the backup in the Prerequisites section was created but before the database was deleted
    • Deleted postgres server: s189t01-trs-pt-pg

    Note: a point-in-time restore can only be started once at least 10 minutes have passed since the database was deleted (this appears to be an Azure restriction). The restore point itself must still be a time before the database was deleted.

    Note also: when the GitHub action completes, it has only sent a request to Azure to restore the server. The restore itself takes an unspecified amount of time.
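Because the workflow finishes before Azure has restored the server, it can help to poll the server state rather than refresh the portal. This is a rough sketch under assumptions: the resource group is taken to be s189t01-trs-pt-rg, the function name and its two arguments are made up for illustration, and the restored server is assumed to report the state `Ready` once it is usable.

```shell
# Hypothetical helper: poll Azure until the restored server reports "Ready".
# Arguments: max attempts (default 60), seconds between attempts (default 30).
wait_for_server_ready() {
  max_attempts="${1:-60}"
  interval="${2:-30}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # An error (for example, server not found yet) is treated as "not ready".
    state="$(az postgres flexible-server show \
      --resource-group s189t01-trs-pt-rg \
      --name s189t01-trs-pt-pg \
      --query state --output tsv 2>/dev/null || true)"
    if [ "$state" = "Ready" ]; then
      echo "Server is ready"
      return 0
    fi
    echo "Attempt $attempt: state is '${state:-unknown}', waiting ${interval}s"
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  echo "Server not ready after $max_attempts attempts" >&2
  return 1
}
```

For example, `wait_for_server_ready 40 30` would check every 30 seconds for up to 20 minutes.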

Option 2. Recreate via terraform and restore from scheduled offline backup

  1. Check there aren't any diagnostic settings on Azure Portal

    • Subscription: s189-teacher-services-cloud-test
    • Resource group: s189t01-trs-pt-rg
    • Database: s189t01-trs-pt-pg-ptr
  2. Ignore the terraform changes: they only allow deployment to continue while in maintenance mode, and TRS does not have a maintenance mode

  3. Run the Build and deploy workflow against main branch

  4. If the following error occurs:

    Error: A resource with the ID "/subscriptions/***/resourceGroups/s189t01-trs-pt-rg/providers/Microsoft.DBforPostgreSQL/flexibleServers/s189t01-trs-pt-pg/databases/trs_pentest" already exists - to be managed via Terraform this resource needs to be imported into the State. Please see the resource documentation for "azurerm_postgresql_flexible_server_database" for more information.
    

    Go into Azure Portal, delete the trs_pentest database, then retry step 3
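If the "already exists" error in step 4 appears, the leftover database could also be removed from the command line instead of the portal. A minimal sketch, using the resource group, server and database names that appear in the error message above; the function name is hypothetical and an authenticated `az` session on the test subscription is assumed.

```shell
# Hypothetical helper: delete the leftover trs_pentest database so that
# terraform can recreate it on the next Build and deploy run.
delete_leftover_database() {
  az postgres flexible-server db delete \
    --resource-group s189t01-trs-pt-rg \
    --server-name s189t01-trs-pt-pg \
    --database-name trs_pentest \
    --yes
}
```

After running `delete_leftover_database`, retry step 3 as described.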

Restore the data from previous backup in Azure storage

  • Skip this step if you used Option 1 in the previous section.
  • Run the Restore database from Azure storage workflow:
    • Environment to restore: pentest
    • Restore to production: false
    • Name of the backup file: the backup file created in the Prerequisites section

Validate app

Disable maintenance mode

  • Skip this step - TRS does not have a maintenance mode

Unfreeze pipeline

  • Skip this step for DR testing - there are no active pipelines merging into the pentest environment

Scenario 2: Loss of data

Stop the service as soon as possible

  • Connect to the pentest environment as in Connecting to environments:

    • Subscription: s189-teacher-services-cloud-test
    • Resource group: s189t01-tsc-pt-rg
    • Cluster: s189t01-tsc-platform-test-aks
    • Namespace: development

    e.g.:

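    # Select the test subscription and fetch credentials for the cluster
    az account set --subscription s189-teacher-services-cloud-test
    az aks get-credentials --overwrite-existing -g s189t01-tsc-pt-rg --name s189t01-tsc-platform-test-aks
    # Confirm the connection works by listing the pods
    kubectl get pods -n development --insecure-skip-tls-verify

    # List the deployments, then scale the UI and worker down to zero replicas
    # (replace the placeholder names with the names shown in the list)
    kubectl -n development get deployments
    kubectl -n development scale deployment trs-pentest-ui-xxxxxxxxxx-xxxxx --replicas 0
    kubectl -n development scale deployment trs-pentest-worker-xxxxxxxxxx-xxxxx --replicas 0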
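To avoid copying deployment names by hand, the scale commands above could be wrapped in a small helper. This is only a sketch: the function name is hypothetical, and it assumes every TRS pentest deployment name starts with `trs-pentest`, as the placeholders above suggest. It passes the `deployment.apps/<name>` values returned by `kubectl get -o name` straight to `kubectl scale`.

```shell
# Hypothetical helper: scale every deployment whose name starts with a prefix.
# Arguments: replica count, name prefix (default "trs-pentest", an assumption).
scale_trs_deployments() {
  replicas="$1"
  prefix="${2:-trs-pentest}"
  # `-o name` prints one "deployment.apps/<name>" entry per line.
  kubectl -n development get deployments -o name |
    grep "/$prefix" |
    while read -r deployment; do
      kubectl -n development scale "$deployment" --replicas "$replicas"
    done
}
```

`scale_trs_deployments 0` stops the service; the same function with `1` could be used later when restarting the applications.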
    

Start the incident process (if not already in progress)

  • Skip this step for DR testing

Freeze pipeline

  • Skip this step for DR testing - there are no active pipelines merging into the pentest environment

Enable maintenance mode

  • Skip this step - TRS does not have a maintenance mode

Consider backing up the database

  • Skip this step for DR testing

Restore postgres database

  • Run the Restore database from point in time to new database server workflow
    • Environment to restore: pentest
    • Restore to production: false
    • Restore point in time: a point in time after the backup in the Prerequisites section was created
    • Name of the new postgres server: (any name as long as it's different to existing server names)
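Before starting the workflow, it may be worth sanity-checking the chosen restore point against the time the backup was taken. The sketch below is illustrative only: it assumes timestamps are written as ISO 8601 UTC strings (the format the workflow actually expects is not stated here), it relies on GNU `date` (so it will not work with the BSD `date` shipped on macOS), and the function name is hypothetical.

```shell
# Hypothetical helper: succeed only if the candidate restore point is strictly
# after the backup time and not later than "now".
# Arguments: candidate, backup time, optional "now" override (ISO 8601 UTC).
is_valid_restore_point() {
  # Convert each timestamp to seconds since the epoch (GNU date syntax).
  candidate="$(date -u -d "$1" +%s)" || return 2
  backup="$(date -u -d "$2" +%s)" || return 2
  now="$(date -u -d "${3:-now}" +%s)" || return 2
  [ "$candidate" -gt "$backup" ] && [ "$candidate" -le "$now" ]
}
```

For example, `is_valid_restore_point "2024-05-01T10:30:00Z" "2024-05-01T10:00:00Z"` succeeds as long as the first timestamp is not in the future. The dates are made up for illustration.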

Upload restored database to Azure storage

  • Run the Backup database to Azure storage workflow
    • Environment to backup: pentest
    • Backup file name: (leave blank)
    • Database server name: name of the server created in the previous step
  • Once complete, view the Backup database summary and copy the backup filename

Validate data

  • Skip this step for DR testing - make does not work on Windows machines

Restore data into the live server

Restart applications

  • Connect to the pentest environment as in Connecting to environments:

    • Subscription: s189-teacher-services-cloud-test
    • Resource group: s189t01-tsc-pt-rg
    • Cluster: s189t01-tsc-platform-test-aks
    • Namespace: development

    e.g.:

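    # Select the test subscription and fetch credentials for the cluster
    az account set --subscription s189-teacher-services-cloud-test
    az aks get-credentials --overwrite-existing -g s189t01-tsc-pt-rg --name s189t01-tsc-platform-test-aks
    # Confirm the connection works by listing the pods
    kubectl get pods -n development --insecure-skip-tls-verify

    # List the deployments, then scale the UI and worker back up to one replica
    # (replace the placeholder names with the names shown in the list)
    kubectl -n development get deployments
    kubectl -n development scale deployment trs-pentest-ui-xxxxxxxxxx-xxxxx --replicas 1
    kubectl -n development scale deployment trs-pentest-worker-xxxxxxxxxx-xxxxx --replicas 1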
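After scaling back up, `kubectl rollout status` can confirm that the pods have actually started before moving on to validation. A minimal sketch: the function name and the `ROLLOUT_TIMEOUT` variable are hypothetical, and the deployment names must be supplied by the caller from the `get deployments` output.

```shell
# Hypothetical helper: wait for each named deployment to finish rolling out.
# ROLLOUT_TIMEOUT is a made-up variable for this sketch (default 300s).
wait_for_rollouts() {
  for deployment in "$@"; do
    # Stop at the first deployment that fails or times out.
    kubectl -n development rollout status "deployment/$deployment" \
      --timeout="${ROLLOUT_TIMEOUT:-300s}" || return 1
  done
}
```

It would be called with the real deployment names, for example `wait_for_rollouts <ui-deployment> <worker-deployment>`.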
    

Validate app

Disable maintenance mode

  • Skip this step - TRS does not have a maintenance mode

Unfreeze pipeline

  • Skip this step for DR testing - there are no active pipelines merging into the pentest environment

Tidy up

  • In Azure Portal, delete the server created in the "Restore postgres database" step above
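If the tidy-up is scripted rather than done in the portal, a guard against deleting the wrong server is worthwhile. The following sketch assumes the temporary server was created in resource group s189t01-trs-pt-rg, which is not confirmed above; the function name is hypothetical. It refuses to run against the original pentest server name or an empty name.

```shell
# Hypothetical helper: delete the temporary restore server, refusing to touch
# the original pentest server. The resource group is an assumption.
delete_temporary_server() {
  server_name="$1"
  if [ -z "$server_name" ] || [ "$server_name" = "s189t01-trs-pt-pg" ]; then
    echo "Refusing to delete '$server_name'" >&2
    return 1
  fi
  az postgres flexible-server delete \
    --resource-group s189t01-trs-pt-rg \
    --name "$server_name" \
    --yes
}
```

The argument is the name entered as "Name of the new postgres server" when the restore workflow was run.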