Debugging Distributed Systems
A systematic workflow for debugging incidents in distributed systems, based on Google’s approach to incident response.
DETECT
Receiving the Alert
- What is the severity of the issue?
TRIAGE
Evaluating the Alert
- Should I create an incident?
- Should I escalate to a more severe incident?
- Should I close as inactionable?
- Is it local or global? Can it become global?
INVESTIGATE
Isolating the Error
- Is there a spike in errors?
- Is there a spike in latency?
- Has there been a change in demand? (QPS)
- What is the error? Check the logs and group them by error type.
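Grouping log lines by error type quickly surfaces the dominant failure mode. A minimal sketch, assuming a simple `ERROR <Type> <message>` log format (the format and error names here are illustrative, not from the source):

```python
# Hypothetical sketch: tally log lines by error type to find the dominant
# failure mode. The log format and error names are assumptions.
from collections import Counter
import re

def group_errors(log_lines):
    """Count occurrences of each error type across a batch of log lines."""
    counts = Counter()
    for line in log_lines:
        match = re.search(r"ERROR\s+(\w+)", line)
        if match:
            counts[match.group(1)] += 1
    return counts

logs = [
    "2024-05-01T12:00:01 ERROR DeadlineExceeded rpc to backend timed out",
    "2024-05-01T12:00:02 ERROR DeadlineExceeded rpc to backend timed out",
    "2024-05-01T12:00:03 ERROR Unavailable backend connection refused",
]
print(group_errors(logs).most_common())
# → [('DeadlineExceeded', 2), ('Unavailable', 1)]
```

In practice this aggregation is usually done by a logging backend, but the grouping step is the same: one error type dominating the count is a strong isolation signal.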
Determine whether it is a service issue or a dependency issue
- What are the problematic dependencies?
Validate that the service is healthy
- What are the SLOs?
- Will they be exceeded?
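One way to answer "will the SLO be exceeded?" is to check how much of the error budget the incident has already spent. A minimal sketch, assuming an availability SLO measured as good events over total events (the numbers are illustrative):

```python
# Hypothetical sketch: fraction of the error budget still unspent for an
# availability SLO. A negative result means the SLO is already breached.
def budget_remaining(slo_target, good_events, total_events):
    """Return the unspent fraction of the error budget."""
    allowed_errors = (1 - slo_target) * total_events
    actual_errors = total_events - good_events
    return 1 - actual_errors / allowed_errors

# A 99.9% SLO over 1,000,000 requests allows 1,000 errors;
# 400 errors so far leaves 60% of the budget.
remaining = budget_remaining(0.999, good_events=999_600, total_events=1_000_000)
print(f"{remaining:.0%}")
```

Projecting the current error rate forward over the remaining SLO window then tells you whether the budget will run out before the incident is mitigated.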
Determine what changed around the service
- Any production changes? (Rollouts, configs, data push, experiments)
- Is there a spike in demand?
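A demand spike can be flagged by comparing current QPS against a recent baseline. A minimal sketch, where the 2x threshold and the sample values are assumptions, not figures from the source:

```python
# Hypothetical sketch: flag a demand spike by comparing current QPS to the
# average of recent samples. The 2x threshold is an assumption.
from statistics import mean

def is_demand_spike(qps_history, current_qps, threshold=2.0):
    """Return True if current QPS exceeds threshold times the recent average."""
    baseline = mean(qps_history)
    return current_qps > threshold * baseline

history = [1200, 1150, 1250, 1180]  # recent per-minute QPS samples
print(is_demand_spike(history, current_qps=4800))  # well above 2x baseline
```

Real monitoring systems use smoother baselines (e.g. time-of-day seasonality), but the question being answered is the same: did demand change, or did the service change?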
MITIGATE
Mitigate the issue
- What mitigations should I take?
- How confident am I that this is the right approach?
Validate that the service is healthy
- Have the mitigations fixed the issue?
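Answering "have the mitigations fixed the issue?" means comparing the error rate before and after the change against the SLO target. A minimal sketch, with illustrative counts and a hypothetical 0.1% target:

```python
# Hypothetical sketch: confirm a mitigation worked by checking that the
# error rate both dropped and returned below the target. Values are
# illustrative assumptions.
def error_rate(errors, total):
    """Errors as a fraction of total requests (0.0 for an empty window)."""
    return errors / total if total else 0.0

def mitigation_effective(rate_before, rate_after, target_rate=0.001):
    """True if the error rate dropped and is back under the target."""
    return rate_after < rate_before and rate_after <= target_rate

before = error_rate(500, 10_000)  # 5% during the incident
after = error_rate(5, 10_000)     # 0.05% after the mitigation
print(mitigation_effective(before, after))  # → True
```

The "dropped" condition matters: an error rate that was always under the target tells you nothing about whether the mitigation did anything.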
POST MORTEM
- Conduct a blameless post-mortem
- Document what the issue was
- Identify how to prevent it in the future
- Assess urgency: Is it urgent? When will it happen again?