Incident playbook
Once an incident happens
1. Form an incident response team
Self-organise to appoint:
- the Incident comms lead (responsible for communications and alerts to users and stakeholders)
- the Incident tech lead (responsible for technical direction and communicating to the comms lead)
- the Incident support lead (responsible for monitoring Zendesk and alerting the team to any changes in the queues or in the severity of users' experience)
Notes:
- The comms lead will typically be the delivery manager of the main affected service. If that person is not available, the product manager could take on this role, or a DM from another service.
- The tech lead will typically be the tech lead of the main affected service, or another developer from the team.
- The support lead will typically be assigned by the support team.
2. Triage the incident (all incident leads)
The incident leads should triage the incident by assigning it a priority (P1, P2 or P3).
3. Identify any other services involved (tech lead)
After triaging the issue, the tech lead should identify:
- any upstream services (both inside and outside DfE) which could be contributing to the issue
- any downstream services (both inside and outside DfE) likely to be affected by the issue, raising incidents where needed
4. Create an incident Slack channel and inform the stakeholders (comms lead)
- Start the Slack IncidentBot by typing /incident open in the message box of the service Slack channel or #teacher-services-infra, then press Enter.
- Complete the details in the IncidentBot template and press Enter. This automatically creates a dedicated Slack channel for the incident.
- Determine who needs to be contacted, based on the incident priority and affected services. Use the contacts from the Teacher services list and the incident contact list if you have one for your service. It may include critical user groups like lead providers. Make sure to include PDM, SRO, DD in case of a P1 incident.
- Invite the appropriate people from the contact lists to the incident channel.
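The contact rule in this step can be sketched in code. This is a minimal illustration only, not part of IncidentBot or any DfE tooling: the function name, the Priority enum and the example contact names are all hypothetical. It also assumes that everyone on both contact lists is invited, whereas in practice you would filter by affected service. The only rule taken from the playbook is that PDM, SRO and DD are added for a P1.

```python
from enum import Enum


class Priority(Enum):
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"


# Roles the playbook says must be included for every P1 incident.
P1_ESCALATION_ROLES = ["PDM", "SRO", "DD"]


def contacts_to_invite(priority, service_contacts, teacher_services_contacts):
    """Return a de-duplicated invite list, keeping first-seen order.

    Hypothetical helper: both contact lists are plain lists of names or roles.
    """
    invitees = list(service_contacts) + list(teacher_services_contacts)
    if priority is Priority.P1:
        invitees += P1_ESCALATION_ROLES
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(invitees))
```

For example, calling contacts_to_invite(Priority.P1, ["lead-provider"], ["PDM"]) adds SRO and DD without listing PDM twice.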
5. Provide a service update to users outside DfE (comms lead)
The Teacher Services team maintains a publicly available service status dashboard. During an incident, the comms lead needs to explain what’s happening to users outside DfE. The comms lead will need a GitHub account to do this, or can delegate updates to a colleague who has one.
The updates are managed via GitHub Actions and issues on the teacher-services-upptime repository on GitHub. If a service’s automatic health check is failing continuously, an issue will be created within 5 minutes of the failure occurring and the dashboard will start reporting a service issue.
To update the dashboard:
- Navigate to the appropriate incident issue on the GitHub issues page
- Add a comment to the issue
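The playbook expects the comment to be added through the GitHub website. If you ever need to script it instead, the sketch below builds the equivalent request for the GitHub REST API endpoint that creates an issue comment. It is an illustration under assumptions: the function name is hypothetical, and the repository owner, issue number and access token are placeholders you must supply. The function only builds the request and does not send it.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_issue_comment_request(owner, repo, issue_number, message, token):
    """Build (but do not send) a request that adds a comment to an issue.

    owner, issue_number and token are placeholders supplied by the caller.
    """
    url = f"{GITHUB_API}/repos/{owner}/{repo}/issues/{issue_number}/comments"
    payload = json.dumps({"body": message}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

To send the comment, pass the returned object to urllib.request.urlopen. Write the message for users outside DfE, in plain English.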
6. Start the incident report (any incident lead)
Create the incident report using the template in SharePoint:
- Create a running Incident Report using this template
- Rename the created file to include today’s date and save as a new file in the Incident reports folder
7. Decide whether to contact users about an incident (support lead)
Contact your users if:
- The incident will negatively impact them for a prolonged period of time
- Contacting them can pre-empt a high volume of support tickets
Informing users about incidents is generally considered best practice, but should be decided on a case-by-case basis with the product and service managers.
8. If the incident requires invocation of Disaster Recovery procedures
Follow the Disaster Recovery procedure
While the incident is in progress (all incident leads)
Keep all conversations and status updates about the incident on the dedicated Slack incident channel.
Use these incident stages in your Slack updates:
- Incident has occurred
- Incident is being assessed
- Incident is being fixed
- Incident is resolved
Use the Slack IncidentBot /incident update command to update:
- Description
- Priority
- Leads
Provide regular updates every 60 minutes (comms lead)
Update stakeholders on the Slack incident channel every 60 minutes, until the incident has been resolved. Ensure they receive the alert even if their Slack alert notifications are turned off, and check in with them face-to-face (once back in the office).
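The incident stages and the 60-minute cadence can be combined into a consistent update message. The sketch below is one possible way to do that and is not an IncidentBot feature: Stage, next_update_due and format_update are hypothetical names, and the message wording is an assumption. The stage labels and the 60-minute interval come from this playbook.

```python
from datetime import datetime, timedelta
from enum import Enum


class Stage(Enum):
    OCCURRED = "Incident has occurred"
    ASSESSING = "Incident is being assessed"
    FIXING = "Incident is being fixed"
    RESOLVED = "Incident is resolved"


# Stakeholders are updated every 60 minutes until the incident is resolved.
UPDATE_INTERVAL = timedelta(minutes=60)


def next_update_due(last_update, stage):
    """Return when the next update is due, or None once resolved."""
    if stage is Stage.RESOLVED:
        return None
    return last_update + UPDATE_INTERVAL


def format_update(stage, summary, sent_at):
    """Build the text of a Slack update that leads with the incident stage."""
    lines = [f"{stage.value}: {summary}"]
    due = next_update_due(sent_at, stage)
    if due is not None:
        lines.append(f"Next update due by {due:%H:%M}.")
    return "\n".join(lines)
```

Leading each message with the stage lets stakeholders who join the channel late see the current state without reading the whole thread.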
Once the incident is resolved
- Update the running incident report
- Close the incident using the /incident close command in Slack
- Confirm that the incident has been automatically resolved on the service status dashboard (it may take 5 minutes to update)
- If this was a P1 incident, it needs to be reported as a Major Incident to the central DfE team. See Reporting a Major incident.
Incident review
- Hold an incident and lessons learned review, following a blameless post-mortem culture, so your service can improve.
- Write up an incident review with recommendations.
- The report introduction should be written in plain English, avoiding technical jargon whenever possible.
- Publish the incident review to the Incident reports folder in SharePoint.
- Report on the incident as part of the A3 report to the Teacher Services Board.
- If this was a P1, update the previously created Major incident report with any lessons learnt. See Reporting a Major incident