Debugging Distributed Systems
A systematic workflow for debugging incidents in distributed systems, based on Google’s approach to incident response.
DETECT
Receiving the Alert
- What is the severity of the issue?
TRIAGE
Evaluating the Alert
- Should I create an incident?
- Should I escalate to a more severe incident?
- Should I close as inactionable?
- Is it local or global? Can it become global?
INVESTIGATE
Isolating the Error
- Is there a spike in errors?
- Is there a spike in latency?
- Has there been a change in demand? (QPS)
- What is the error? Check the logs and group them by error type.
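Grouping log lines by error type quickly surfaces the dominant failure mode. A minimal sketch, assuming a simple `ERROR <Type> <message>` log format (the format and error names here are illustrative, not from the source):

```python
# Hypothetical sketch: tally log lines by error type to find the dominant
# failure mode. The log format and error names are assumptions.
from collections import Counter
import re

def group_errors(log_lines):
    """Count occurrences of each error type across a batch of log lines."""
    counts = Counter()
    for line in log_lines:
        match = re.search(r"ERROR\s+(\w+)", line)
        if match:
            counts[match.group(1)] += 1
    return counts

logs = [
    "2024-05-01T12:00:01 ERROR DeadlineExceeded rpc to backend timed out",
    "2024-05-01T12:00:02 ERROR DeadlineExceeded rpc to backend timed out",
    "2024-05-01T12:00:03 ERROR Unavailable backend connection refused",
]
print(group_errors(logs).most_common())
# → [('DeadlineExceeded', 2), ('Unavailable', 1)]
```

In practice this aggregation is usually done by a logging backend, but the grouping step is the same: one error type dominating the count is a strong isolation signal.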
Determine whether it is a service issue or a dependency issue
- What are the problematic dependencies?
Validate that the service is healthy
- What are the SLOs?
- Will they be exceeded?
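One way to answer "will the SLO be exceeded?" is to check how much of the error budget the incident has already spent. A minimal sketch, assuming an availability SLO measured as good events over total events (the numbers are illustrative):

```python
# Hypothetical sketch: fraction of the error budget still unspent for an
# availability SLO. A negative result means the SLO is already breached.
def budget_remaining(slo_target, good_events, total_events):
    """Return the unspent fraction of the error budget."""
    allowed_errors = (1 - slo_target) * total_events
    actual_errors = total_events - good_events
    return 1 - actual_errors / allowed_errors

# A 99.9% SLO over 1,000,000 requests allows 1,000 errors;
# 400 errors so far leaves 60% of the budget.
remaining = budget_remaining(0.999, good_events=999_600, total_events=1_000_000)
print(f"{remaining:.0%}")
```

Projecting the current error rate forward over the remaining SLO window then tells you whether the budget will run out before the incident is mitigated.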
Determine what changed around the service
- Any production changes? (Rollouts, configs, data push, experiments)
- Is there a spike in demand?
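A demand spike can be flagged by comparing current QPS against a recent baseline. A minimal sketch, where the 2x threshold and the sample values are assumptions, not figures from the source:

```python
# Hypothetical sketch: flag a demand spike by comparing current QPS to the
# average of recent samples. The 2x threshold is an assumption.
from statistics import mean

def is_demand_spike(qps_history, current_qps, threshold=2.0):
    """Return True if current QPS exceeds threshold times the recent average."""
    baseline = mean(qps_history)
    return current_qps > threshold * baseline

history = [1200, 1150, 1250, 1180]  # recent per-minute QPS samples
print(is_demand_spike(history, current_qps=4800))  # well above 2x baseline
```

Real monitoring systems use smoother baselines (e.g. time-of-day seasonality), but the question being answered is the same: did demand change, or did the service change?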
MITIGATE
Mitigate the issue
- What mitigations should I take?
- How confident am I that this is the right approach?
Validate that the service is healthy
- Have the mitigations fixed the issue?
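Answering "have the mitigations fixed the issue?" means comparing the error rate before and after the change against the SLO target. A minimal sketch, with illustrative counts and a hypothetical 0.1% target:

```python
# Hypothetical sketch: confirm a mitigation worked by checking that the
# error rate both dropped and returned below the target. Values are
# illustrative assumptions.
def error_rate(errors, total):
    """Errors as a fraction of total requests (0.0 for an empty window)."""
    return errors / total if total else 0.0

def mitigation_effective(rate_before, rate_after, target_rate=0.001):
    """True if the error rate dropped and is back under the target."""
    return rate_after < rate_before and rate_after <= target_rate

before = error_rate(500, 10_000)  # 5% during the incident
after = error_rate(5, 10_000)     # 0.05% after the mitigation
print(mitigation_effective(before, after))  # → True
```

The "dropped" condition matters: an error rate that was always under the target tells you nothing about whether the mitigation did anything.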
POST MORTEM
- Conduct a blameless post-mortem
- Document what the issue was
- Identify how to prevent it in the future
- Assess urgency: Is it urgent? When will it happen again?