Teacher Services Cloud - Production checklist

For the service to be ready for end users, it must be reliable, performant and sustainable.

Multiple replicas

By default the template deploys only 1 replica for each kubernetes deployment. This is not sufficient for production: if the container is unavailable, there is no other replica to serve requests. It may be unavailable because of high usage, or simply because the cluster is moving the container to another node, which happens whenever the cluster version is updated.

Use at least 2 replicas or as many as required by performance testing.
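As a sketch, assuming the application module exposes a replicas variable in the environment tfvars.json (the exact variable name depends on the template):

```json
{
  "replicas": 2
}
```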

Database plan

The template deploys a default plan for postgres and redis.

It may be sufficient for the test environments, but it may not offer enough CPU, memory or network bandwidth for production. Performance testing will help determine the right plans.

Note that for redis, the azure_family, azure_sku_name and azure_capacity variables must all be changed jointly. Check the terraform redis documentation for the allowed values.
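For example, moving redis to a Premium plan might look like this in the environment tfvars.json; the values follow the azurerm provider's conventions, so check the documentation for the exact plan you need:

```json
{
  "azure_family": "P",
  "azure_sku_name": "Premium",
  "azure_capacity": 1
}
```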

High availability

Each Azure region provides multiple availability zones. The kubernetes cluster is deployed across 3 zones so that if one fails, the workload continues on the other 2.

The same should be applied to database clusters. For postgres, set azure_enable_high_availability to true. For redis, use a Premium plan.

Note the cost is doubled for postgres, and much higher for redis, so this should be used carefully.
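For postgres, the toggle is a single variable in the environment tfvars.json (for redis, high availability comes from selecting a Premium plan instead):

```json
{
  "azure_enable_high_availability": true
}
```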

Performance testing

Simulate user traffic load to determine the right number of instances and the database plan. The tests should cover the most typical user journeys. We recommend k6 as it can be deployed to the cluster to minimise latency. Check the example in teacher pay calculator.

If time is short or user traffic is expected to be low, make sure to monitor application and database usage after launch and every time a significant new feature is released, and be ready to scale up.

Postgres backups to Azure storage

Azure postgres provides automatic backups with a 7-day retention period. They can be restored from a point in time to a new database server.

In case there is a major issue and the above doesn't work, we strongly suggest taking an additional backup every night and storing it in Azure storage. Set the azure_enable_backup_storage variable to true to create the storage account, then create a workflow using the backup-postgres github action and schedule it nightly.
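A nightly workflow might look like the sketch below. The action reference and its inputs are illustrative assumptions, not the actual interface; check the real backup-postgres action definition before copying this.

```yaml
# Sketch only: the backup-postgres action path and inputs are
# assumptions; adapt them to the actual action definition.
name: Backup postgres to Azure storage
on:
  schedule:
    - cron: "0 2 * * *" # nightly at 02:00 UTC
  workflow_dispatch:    # allow manual runs
jobs:
  backup:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Backup the production database
        uses: ./.github/actions/backup-postgres # hypothetical local action path
        with:
          environment: production
```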

Logging

Container logs are available temporarily in the cluster. To store the logs, all applications should ship logs to Logit.io. The Teacher services UK account stores all the data in the UK region.

Set enable_logit to true to ship the logs. Logs must be sent as JSON, normally using the standard logging libraries for the language.
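For example, a single JSON log line of the kind Logit.io can index; the field names are illustrative and would typically be produced by the language's structured logging library:

```json
{"@timestamp": "2024-05-01T12:00:00Z", "level": "INFO", "message": "GET /healthcheck 200", "service": "example-app"}
```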

Developers need to request access to Logit.io to visualise the logs.

Monitoring

StatusCake

StatusCake is the most essential monitoring tool: if it alerts, it means users cannot access the site. Use the terraform module to set up monitoring.

Ask the infra team for help with these steps:

  • Create the dev team contact group if necessary. Add the team email, developer emails and phone numbers if desired.
  • Get the dev team contact group id from the URL
  • Obtain an existing API key or request a new one. Ideally there should be one per service or at least one per area.
  • Create a secret "STATUSCAKE-API-TOKEN" in the "inf" keyvault, with the API key as value. The statuscake provider is configured to get the token from module.infrastructure_secrets.map.STATUSCAKE-API-TOKEN.
  • Fill in enable_monitoring, external_url, statuscake_contact_groups and content_matchers variables in the environment tfvars.json file. The content_matchers is an optional variable that can be added to ensure not just uptime but also correctness of content returned. Example:
    "enable_monitoring" : true,
    "external_url": "https://calculate-teacher-pay.education.gov.uk/healthcheck",
    "statuscake_contact_groups": [195955],
    "content_matchers": [
      {
        "matcher": "CONTAINS_STRING",
        "content": "create a jobseeker account"
      }
    ]
    
  • For production, add the infra team contact group id: 282453

Upptime

We provide a status page of all services in Teacher services. It uses Github actions to ping the running websites roughly every 5 minutes and produce a dashboard for external users.

When a website is offline, it shows the error on the dashboard, sends an alert to the infra Slack channel and records an incident as a Github issue. The team can post comments on the issue to send incident updates.

Request write access to the repository and edit upptimerc.yml directly (without a PR) to add your production website.

Postgres and redis

We use Azure monitor to define alerts on postgres and redis metrics. Alerts are sent via email, using a monitoring action group. The new_service template includes the make action-group command to automate this task; ask the infra team to set it up. By default it alerts the infrastructure team, but any email address or distribution list (preferred) may be used.

Set azure_enable_monitoring to true to enable logging, monitoring and alerting.

Front door

Set azure_enable_monitoring to true in the domains/infrastructure module to enable logging on front door. It is verbose and costly and should not be used by default (check with the infra team). But it can be extremely useful for troubleshooting.

Pods

Pod CPU, memory, restarts and other metrics are monitored using prometheus. To enable it, follow these steps:

  • Create a webhook slack app in the Teacher services cloud Slack app or reuse one if it has the desired channel
  • If using a new webhook, create a secret in the Teacher services cloud keyvault (s189t01-tsc-ts-kv or s189p01-tsc-pd-kv). It must be named SLACK-WEBHOOK-XXX where XXX is a service like ATT or an area like CPD.
  • If using a new webhook, add the secret name to alertmanager_slack_receiver_list
  • Enable alerting on each deployment you want to monitor by adding to alertable_apps, each entry is: "namespace/deployment": { "receiver": "RECEIVER"}, such as:
    "bat-production/itt-mentor-services-sandbox": {
        "receiver": "SLACK_WEBHOOK_ITTMS"
      },
    
    If the receiver is not specified, SLACK_WEBHOOK_GENERIC will be used to alert the infra channel.
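Putting the steps above together, the tfvars fragment might look like the sketch below; the exact shape of the alertmanager_slack_receiver_list entries is an assumption:

```json
{
  "alertmanager_slack_receiver_list": ["SLACK-WEBHOOK-ATT"],
  "alertable_apps": {
    "bat-production/itt-mentor-services-sandbox": {
      "receiver": "SLACK_WEBHOOK_ITTMS"
    }
  }
}
```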

Custom prometheus monitoring

If any of the deployments serves custom prometheus metrics on a /metrics endpoint, you can enable scraping for that deployment.

Workflows

Get notified of workflow failures on a Slack channel.

Custom domain

The default web application domain in production is teacherservices.cloud, and the application domain is <application_name>.teacherservices.cloud. It should not be used by end users; rather, we normally create a subdomain of either education.gov.uk or service.gov.uk. Here is the process:

If an apex domain is used, make sure to configure StatusCake SSL monitoring as the certificate must be regenerated manually every 180 days.

Caching

The custom domains are implemented using the Azure front door CDN. It provides simple caching of HTTP requests by path. For instance, rails apps usually cache assets (javascripts, CSS, fonts...) under the /assets path.

CDN Caching makes requests faster for users and reduces the load on the application. Use the environment_domains module cached_paths variable to cache all the paths as required.
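For example, a typical rails app might cache its static paths like this in tfvars ("/packs" is an illustrative second path, not a requirement):

```json
{
  "cached_paths": ["/assets", "/packs"]
}
```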

Redirects

It is possible to support multiple domains and subdomains, and create a redirect between them to catch more user traffic. For instance:

Use the environment_domains module redirect_rules variable.
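The exact shape of redirect_rules depends on the module; a hypothetical map from source domain to destination domain might look like:

```json
{
  "redirect_rules": {
    "www.calculate-teacher-pay.education.gov.uk": "calculate-teacher-pay.education.gov.uk"
  }
}
```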

Rate limiting

A global rate limit of requests allowed per source IP address should be set for each environment.

The limit can be set over a 1 minute or a 5 minute interval; unless there is a good reason, set it over 5 minutes.

This will require some discussion with the app team, and the request profile may not be well understood to start with. If so, a relatively high limit can be set initially e.g. 1000+ requests per 5 minute interval.

Use the environment_domains module rate limit terraform.

Rate limit rules can be added as follows:

  1. Set rate_limit_max to the maximum number of requests in a 5 minute period.
  • This creates a block rule that limits any source IP exceeding var.rate_limit_max in a 5 minute period.
  2. Set aks_allow to true.
  • If the service receives a high number of requests originating from other services in our AKS clusters, setting aks_allow allows all traffic from AKS. This is only required if a general rate_limit rule is in place.
  3. Set block_ip to true.
  • This creates a block rule that blocks all traffic from a particular source IP. It is created disabled, with a dummy IP address, so that it can be updated quickly if an IP needs to be blocked.
  4. Create custom rules using rate_limit.
  • Custom rules not covered by the above can be added to the rate_limit list.
"rate_limit": [
      {
        "agent": "all",
        "priority": 100,
        "duration": 5,
        "limit": 1000,
        "selector": "Host",
        "operator": "GreaterThanOrEqual",
        "match_values": "0"
      }
    ]
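The first three options can be combined in the environment tfvars.json; the variable names are the ones listed above:

```json
{
  "rate_limit_max": 1000,
  "aks_allow": true,
  "block_ip": true
}
```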

Pin all versions

The infrastructure code should pin the versions of all components to avoid receiving different versions. The build must be predictable between environments and over time. We should upgrade versions frequently, but only when it is desired and fully tested.

Components with versions:

  • Base docker image: pin language version (e.g. ruby 3.3.0) and Alpine version (e.g. alpine-3.20)
  • Terraform (in application, domains infrastructure and environment_domains)
  • Terraform providers (azure, kubernetes, StatusCake)
  • Postgres
  • Redis
  • Terraform modules: the TERRAFORM_MODULES_TAG variable should point at either main, testing or stable according to the terraform modules release process
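For the base docker image, pinning both versions means using a fully qualified tag (the tag below is illustrative):

```dockerfile
# Pin the language version AND the OS version; never use a floating
# tag such as ruby:alpine, which changes underneath you.
FROM ruby:3.3.0-alpine3.20
```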

Maintenance window

Azure applies patches and minor updates to postgres and redis. Since this may cause a minor disruption, use the azure_maintenance_window and azure_patch_schedule variables to set them to a convenient time, when the service receives less traffic.

Note the postgres patches will always be applied first to environments where the maintenance window is not set.
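A sketch of the two variables, assuming they mirror the azurerm provider's maintenance_window and patch_schedule blocks (day_of_week 0 = Sunday for postgres; field names may differ in the module):

```json
{
  "azure_maintenance_window": {
    "day_of_week": 0,
    "start_hour": 3,
    "start_minute": 0
  },
  "azure_patch_schedule": [
    {
      "day_of_week": "Sunday",
      "start_hour_utc": 3
    }
  ]
}
```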

Service offering

The new service template uses the default "Teacher services cloud" value for the Product tag. This tag is used to identify the service in the Azure finance reporting. Each service must register a new service offering and product and replace "Teacher services cloud" with the right name so that Azure costs are allocated accordingly.

Maintenance page

Optional but recommended for user facing services. See Maintenance page for more details.

Lock critical resources

Add a lock to critical Azure resources to protect against accidental deletion. We currently create locks for two types of resources. (Members of the s189-teacher-services-cloud-ResLock Admin Entra ID group (infra team) can manage locks.)

  1. Production database servers.
  • Open the resource in the Azure portal
  • Settings > Locks > + Add > Lock name: Delete, Lock type: Delete > OK
  2. DNS zones. We lock the SOA record in the zone as this prevents zone deletion while still allowing records to be added, deleted and updated. Currently, this type of lock can only be added via PowerShell, but it can be removed via PowerShell or the portal.
Connect-AzAccount
New-AzResourceLock -LockLevel "CanNotDelete" -LockName "s189p01-<SERVICE_SHORT>-lock" -ResourceName "<DNS_ZONE_NAME>/@" -ResourceType "Microsoft.Network/DNSZones/SOA" -ResourceGroupName "<DOMAINS_RESOURCE_GROUP_NAME>"
e.g.
New-AzResourceLock -LockLevel "CanNotDelete" -LockName "s189p01-att-lock" -ResourceName "apply-for-teacher-training.education.gov.uk/@" -ResourceType "Microsoft.Network/DNSZones/SOA" -ResourceGroupName "s189p01-applydomains-rg"

Build image security scanning

We use SNYK scanning to check build images for vulnerabilities.

This is enabled by passing a valid SNYK-TOKEN to the build-and-deploy github action.

Secrets

We keep application secrets in Azure key vault. There is always a risk of an attack or a mistake leading to a leak, especially when using public repositories. If an incident happens, it is important to rotate all the secrets as soon as possible.

We want to minimise the time to recovery and help team members rotate the secrets, especially when they are not familiar with them. Secrets are not stored in the Github repository and have neither comments nor git commit history. We recommend keeping an exhaustive list of all secrets, preferably in Sharepoint and not in the public repository.

Document for each secret:

  • Environment variable name
  • What it is used for
  • How to generate or request a new secret