How to categorise technical incidents
The incident tech lead triages and categorises the incident to decide how to respond.
Ask:
- What’s the urgency and why?
- What’s the impact on our users and services?
- What’s the extent of the issue and which services and users are affected?
- Are there any security or data privacy implications?
Use the answers to:
- assign the incident to one of the categories below
- direct the necessary response.
The categories are fluid: an incident can increase or decrease in priority following further diagnosis or resolution work.
Priority 1 (P1) - High (H)
Description
Highest and most serious level of incident where one or more of the following factors apply:
- 60-100% of users affected.
- Damage to reputation of service is likely to be high.
- Support staff are mostly or completely unable to resolve tickets because the service is down.
- Personally identifiable information (PII) or other sensitive data is at risk.
Examples:
- Total outage of ‘Apply for teacher training’ at any time.
- Total outage of ‘Publish postgraduate teacher training courses’ during peak time (July to October).
- Primary search filter on ‘Find postgraduate teacher training courses’ going down. This includes location search and subject selection.
- Users of ‘Register trainee teachers’ are able to view PII for trainees from other institutions.
Stakeholder groups to inform
Team:
- Team
- Service support staff
- Architecture team if there are security/privacy implications
Senior stakeholders:
- Service Owner
- Deputy Service Owners
Priority 2 (P2) - Medium (M)
Description
Next highest level of incident where two or more of the following factors apply:
- 20-59% of users affected.
- Damage to reputation of service is likely to be moderate.
- Damage caused by incident increases moderately over time.
- Support staff are in some cases unable to resolve tickets for users.
This incident level requires the full attention of the incident responders during business hours and takes priority over other non-emergency work.
Examples:
- Total outage of ‘Publish postgraduate teacher training courses’ during off-peak time.
- The bulk publishing function is down.
Stakeholder groups to inform
Team:
- Team
- Service support staff
Senior stakeholders:
- Service Owner
- Deputy Service Owners
Priority 3 (P3) - Low (L)
Description
Lowest level of incident where two or more of the following factors apply:
- Less than 20% of users affected.
- Damage to reputation of service is likely to be minimal.
- Damage caused by incident only marginally increases over time.
- Support staff are mostly able to resolve tickets for users. Support is largely unaffected, or moderately affected in some cases.
During business hours, incident responders give this incident level priority over regular work. Higher-priority incidents take precedence over incidents at this level.
Examples:
- Some minor functionality or internal tools are broken, such as one of the secondary filters.
- The support console is down.
Stakeholder groups to inform
Team:
- Team
- Service support staff
Senior stakeholders:
- Deputy Service Owners
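The category descriptions and stakeholder lists above can be sketched as a small decision helper. This is a minimal illustration, not an agreed tool: the names `Assessment`, `Level`, `categorise` and `stakeholders` are hypothetical, and the ratings are judgement calls made by the incident tech lead. Two assumptions go beyond the guidance: an incident that meets neither the P1 nor the P2 threshold is treated as P3, and percentages between 59 and 60 are treated as P2.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    LOW = "low"
    MODERATE = "moderate"
    HIGH = "high"


@dataclass
class Assessment:
    """Hypothetical record of the triage answers; field names are illustrative."""
    percent_users_affected: float   # 0 to 100
    reputation_damage: Level
    damage_growth: Level            # how quickly damage increases over time
    support_disruption: Level       # HIGH = mostly or completely unable to resolve tickets
    sensitive_data_at_risk: bool = False


def categorise(a: Assessment) -> str:
    """Return 'P1', 'P2' or 'P3' based on the factors in the descriptions."""
    # P1 applies when one or more of its factors is true.
    p1_factors = [
        a.percent_users_affected >= 60,
        a.reputation_damage is Level.HIGH,
        a.support_disruption is Level.HIGH,
        a.sensitive_data_at_risk,
    ]
    if any(p1_factors):
        return "P1"

    # P2 applies when two or more of its factors are true.
    # Assumption: values between 59 and 60 count as P2.
    p2_factors = [
        20 <= a.percent_users_affected < 60,
        a.reputation_damage is Level.MODERATE,
        a.damage_growth is Level.MODERATE,
        a.support_disruption is Level.MODERATE,
    ]
    if sum(p2_factors) >= 2:
        return "P2"

    # Assumption: anything that is neither P1 nor P2 falls back to P3.
    return "P3"


def stakeholders(category: str, security_implications: bool = False) -> list[str]:
    """List the stakeholder groups to inform for a category."""
    groups = ["Team", "Service support staff"]
    if category == "P1" and security_implications:
        groups.append("Architecture team")
    if category in ("P1", "P2"):
        groups.append("Service Owner")
    groups.append("Deputy Service Owners")
    return groups
```

For example, `categorise(Assessment(30, Level.MODERATE, Level.LOW, Level.LOW))` returns `"P2"`, because two P2 factors apply (20-59% of users affected and moderate reputational damage). Re-run the check whenever further diagnosis changes the assessment, since an incident can move between categories.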