Incident playbook
Once an incident happens
1. Form an incident response team
Self-organise to appoint:
- the Incident comms lead (responsible for communications and alerts to users and stakeholders)
- the Incident tech lead (responsible for the technical direction of the fix and for keeping the comms lead informed)
- the Incident support lead (responsible for monitoring Zendesk and alerting the team to any changes in the queues or in the severity of users' experience).
Who will usually be in these lead roles:
- The comms lead will typically be the delivery manager of the main affected service. If that person is not available, the product manager could take on this role, or a delivery manager from another service.
- The tech lead will typically be the tech lead of the main affected service, or another developer from the team.
- The support lead will typically be assigned by the support team.
For a cross-service incident, try to appoint whoever is available. The comms lead would usually be the programme delivery manager or an experienced delivery manager from one of the service teams. The tech lead would usually be the lead developer or a tech lead from one of the service teams.
If an incident is confirmed, it is best to declare the incident as soon as possible. You can always update who is in these roles later.
2. Triage the incident (any incident lead)
The person declaring the incident should attempt to triage it by assigning a priority (P1, P2 or P3).
Note that for Google BigQuery the incident should not be higher than a P2.
You can always update this later if needed.
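The BigQuery rule above can be sketched as a small helper. This is only an illustration: the playbook does not define the criteria for P1, P2 or P3, so the function takes a priority that a person has already proposed and applies the cap. The `Priority` enum, the `triage` function and the `"google-bigquery"` service label are hypothetical names, not part of IncidentBot or any existing tooling.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Incident priorities; a lower number means a more severe incident."""
    P1 = 1
    P2 = 2
    P3 = 3


# Assumption: services are identified by a plain string label.
# The only cap stated in the playbook is P2 for Google BigQuery.
PRIORITY_CAPS = {"google-bigquery": Priority.P2}


def triage(proposed: Priority, service: str) -> Priority:
    """Return the proposed priority, lowered to the service's cap if needed."""
    cap = PRIORITY_CAPS.get(service)
    if cap is not None and proposed < cap:
        # The proposed priority is more severe than the service allows.
        return cap
    return proposed
```

For example, `triage(Priority.P1, "google-bigquery")` returns `Priority.P2`, while a P1 proposed for any other service is left unchanged.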
3. Create an incident Slack channel (any incident lead)
Initiate the Slack IncidentBot by sending the message /incident open in your service Slack channel or #teacher-services-infra.
Complete the details in the IncidentBot template, and press Enter, which will automatically create a dedicated Slack channel for the incident.
Add vital people as soon as you can to the Slack channel. The comms lead can focus on adding any others later.
4. Focus on fixing the incident (usually tech lead)
The tech lead should be focused on either fixing the incident themselves or giving direction to other developers to do so.
If the incident is not of a technical nature, for example, as the result of a data breach, others might lead on resolving the incident.
5. Determine who to contact and how (comms lead)
Determine who needs to be contacted, based on the incident priority and affected services.
Use the contacts from the Teacher services list and your service’s contact list if you have one.
It may be best to include critical user groups like lead providers, or other external organisations we work with. Make sure to include programme delivery managers, deputy directors and service owners in case of a P1 incident.
You need to decide how to contact them as well. You might need to use email, Teams, Slack or a combination to make sure you reach everyone. You can also add them to the Slack incident channel.
6. Identify any other services involved (any leads)
The leads should identify:
- any upstream services (both inside and outside DfE) which could be contributing to the issue
- any downstream services (both inside and outside DfE) likely to be affected by the issue
After doing this, you might review the priority of the incident. It may be necessary to contact the upstream or downstream service teams so they can raise incidents in those services. This is the responsibility of the comms lead.
7. Start the incident report (comms lead)
Create the incident report using the template in SharePoint:
- Create a running incident report from the template
- Rename the created file to include today’s date and save it as a new file in the Incident reports folder
8. Decide whether to contact users about an incident (all leads)
Contact your users if:
- The incident will negatively impact them for a prolonged period of time
- It can pre-empt a high volume of support tickets
Informing users about incidents is generally considered best practice, but should be decided on a case-by-case basis with the product and service managers.
The Teacher Services team maintains a publicly available service status dashboard, generated by our Upptime instance on GitHub.
The updates are managed via GitHub Actions and issues on the teacher-services-upptime repository on GitHub. If a service’s automatic health check is failing continuously, an issue will be created within 5 minutes of the failure occurring and the dashboard will start reporting a service issue.
To update the dashboard:
- Navigate to the appropriate incident issue on the GitHub issues page
- Add a comment to the issue
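If you prefer to add the comment from a script rather than the GitHub web page, the steps above could be sketched with the GitHub REST API endpoint for creating an issue comment. This is a minimal sketch under assumptions: the `owner` value, the issue number and the token are placeholders you must supply yourself, and the helper name `build_comment_request` is hypothetical. The function only builds the request; nothing is sent until you pass it to `urllib.request.urlopen`.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_comment_request(owner, repo, issue_number, body, token):
    """Build (but do not send) a POST request that comments on an issue."""
    url = f"{GITHUB_API}/repos/{owner}/{repo}/issues/{issue_number}/comments"
    payload = json.dumps({"body": body}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values for illustration only; replace them with the real
# organisation, the incident issue number and a token that can comment.
request = build_comment_request(
    owner="example-org",
    repo="teacher-services-upptime",
    issue_number=1,
    body="We are investigating the issue.",
    token="YOUR_TOKEN",
)
# To send the comment: urllib.request.urlopen(request)
```

Building the request separately from sending it lets you check the URL and message before anything is published on the issue.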
9. If the incident requires invocation of Disaster Recovery procedures
For TS-hosted services, follow the TS Cloud Disaster Recovery procedure and any other specific documentation for the service.
For Google BigQuery, follow the TS Analytics Cloud Disaster Recovery procedure.
Provide updates during the incident (usually comms lead)
Keep all conversations and status updates about the incident on the dedicated Slack incident channel.
Use the Slack IncidentBot /incident update command to update:
- Description
- Priority
- Leads
Update stakeholders on the Slack incident channel regularly until the incident has been resolved. This might be every hour for high-risk incidents, and at least every time something changes. Keep your team updated too by linking them to the incident channel.
You might also need to send emails or Teams messages to make sure all contacts are updated.
Close and finish reporting on the incident
- Update the running incident report
- Close the incident using the /incident close command in Slack
- Confirm that the incident has been automatically resolved on the service status dashboard (it may take 5 minutes to update)
- If this was a P1 incident, it needs to be reported as a Major Incident to the central DfE team. See Reporting a Major incident.
Review the incident to try and prevent it reoccurring
- Hold an incident and lessons-learned review, following a blameless post-mortem culture, so your service can improve.
- Write up an incident review with recommendations.
- The report introduction should be written in plain English, avoiding technical jargon whenever possible.
- Publish the incident review to the Incident reports folder in SharePoint.
- Report on the incident as part of the A3 report to the Teacher Services Board.
- If this was a P1, update the previously created Major incident report with any lessons learnt. See Reporting a Major incident.
You might want to consider doing a retro for any services involved in the incident as well, to ensure all lessons are learnt. You could also share your findings on the wider Teacher Services Slack channel or in Show and tell to help others better address or avoid incidents in future.