Debugging Distributed Systems

A systematic workflow for debugging distributed-systems incidents, as presented in Google’s approach to incident response.

DETECT

Receiving the Alert

  • What is the severity of the issue?

TRIAGE

Evaluating the Alert

  • Should I create an incident?
  • Should I escalate to a more severe incident?
  • Should I close as inactionable?
  • Is it local or global? Can it become global?
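As a rough illustration of the local-versus-global question, the sketch below classifies an incident from per-region error counts. The region names, the 5% threshold, and the rule "global means every region is affected" are assumptions made for this example, not part of the original workflow.

```python
def affected_regions(stats, threshold=0.05):
    """Return regions whose error ratio is at or above the threshold.

    `stats` maps a region name to (error_count, request_count).
    The 5% default threshold is an illustrative assumption.
    """
    return sorted(
        region
        for region, (errors, total) in stats.items()
        if total > 0 and errors / total >= threshold
    )


def incident_scope(stats, threshold=0.05):
    """Classify the incident as 'none', 'local', or 'global'."""
    bad = affected_regions(stats, threshold)
    if not bad:
        return "none"
    # Assumption: "global" means every known region is affected.
    return "global" if len(bad) == len(stats) else "local"


# Hypothetical sample data: one region failing, two healthy.
region_stats = {
    "us-east": (500, 1000),
    "eu-west": (2, 1000),
    "asia": (1, 1000),
}
scope = incident_scope(region_stats)
```

A "local" result still needs the follow-up question from the list: whether the same failure can spread to the healthy regions, for example through a rollout that has not reached them yet.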

INVESTIGATE

Isolating the Error

  • Is there a spike in errors?
  • Is there a spike in latency?
  • Has there been a change in demand? (QPS)
  • What is the error? Check logs and group by error type.
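The last check above, grouping logs by error type, can be sketched in a few lines. The log format (timestamp, level, error type, message) and the 3x spike factor are hypothetical choices for this example. Real logs will need their own parser.

```python
from collections import Counter


def group_errors(log_lines):
    """Count ERROR lines per error type.

    Assumes a hypothetical format: '<timestamp> <LEVEL> <error_type> <message>'.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split(maxsplit=3)
        if len(parts) >= 3 and parts[1] == "ERROR":
            counts[parts[2]] += 1
    return counts


def is_spike(current, baseline, factor=3.0):
    """Flag a spike when current is at least `factor` times the baseline."""
    if baseline <= 0:
        return current > 0
    return current / baseline >= factor


# Hypothetical sample log lines.
sample_logs = [
    "2024-01-01T00:00:01Z ERROR DEADLINE_EXCEEDED backend timed out",
    "2024-01-01T00:00:02Z INFO OK request served",
    "2024-01-01T00:00:03Z ERROR DEADLINE_EXCEEDED backend timed out",
    "2024-01-01T00:00:04Z ERROR UNAVAILABLE connection refused",
]
top_errors = group_errors(sample_logs).most_common()
```

The same `is_spike` comparison can be applied to error counts, latency percentiles, or QPS, as long as the baseline comes from a comparable period.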

Determine whether it’s a service issue or a dependency issue

  • What are the problematic dependencies?
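One possible way to shortlist problematic dependencies is to compare each dependency's current error ratio with its baseline. This is a minimal sketch. The dependency names, the ratios, and the 2-percentage-point cutoff are invented for illustration.

```python
def problematic_dependencies(current, baseline, min_increase=0.02):
    """Return (name, increase) pairs for dependencies whose error ratio
    rose by at least `min_increase` over baseline, largest increase first.

    Both arguments map a dependency name to an error ratio in [0, 1].
    A dependency missing from `baseline` is treated as having ratio 0.
    """
    deltas = {
        name: ratio - baseline.get(name, 0.0)
        for name, ratio in current.items()
    }
    flagged = [(name, delta) for name, delta in deltas.items() if delta >= min_increase]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical error ratios per dependency.
current_ratios = {"auth": 0.01, "storage": 0.20, "cache": 0.05}
baseline_ratios = {"auth": 0.01, "storage": 0.01, "cache": 0.01}
suspect_names = [name for name, _ in problematic_dependencies(current_ratios, baseline_ratios)]
```

If no dependency is flagged while the service's own error rate is elevated, the service itself becomes the more likely culprit.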

Validate whether the service is healthy

  • What are the SLOs?
  • Will they be exceeded?
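To estimate whether an SLO will be exceeded, a common technique is to compute the error-budget burn rate: the observed error ratio divided by the error ratio the SLO allows. The sketch below assumes an availability-style SLO over a fixed window. The function names and the sample numbers are illustrative, not taken from the source.

```python
def burn_rate(error_ratio, slo_target):
    """Observed error ratio divided by the ratio the SLO allows.

    A burn rate of 1.0 spends the budget exactly over the SLO window.
    Higher values spend it proportionally faster.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def hours_until_budget_exhausted(error_ratio, slo_target, window_hours, budget_spent=0.0):
    """Estimate hours until the error budget runs out at the current rate.

    `budget_spent` is the fraction of the budget already used (0.0 to 1.0).
    """
    rate = burn_rate(error_ratio, slo_target)
    if rate <= 0:
        return float("inf")
    remaining = max(0.0, 1.0 - budget_spent)
    return remaining * window_hours / rate


# Hypothetical numbers: 99.9% SLO over 30 days, 1% of requests failing.
rate = burn_rate(error_ratio=0.01, slo_target=0.999)
hours_left = hours_until_budget_exhausted(0.01, 0.999, window_hours=30 * 24)
```

With these sample numbers the burn rate is about 10, so a 30-day budget would be gone in roughly 3 days if nothing changes.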

Determining changes around the service

  • Any production changes? (Rollouts, configs, data push, experiments)
  • Is there a spike in demand?
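Checking for production changes usually means lining up a change log against the incident start time. The sketch below assumes each change is a dict with `kind`, `name`, and `time` fields and uses a 2-hour lookback. Both the record shape and the window are hypothetical.

```python
from datetime import datetime, timedelta


def recent_changes(changes, incident_start, lookback=timedelta(hours=2)):
    """Return changes that landed within `lookback` before the incident,
    newest first, since the most recent change is often the first suspect.
    """
    window_start = incident_start - lookback
    suspects = [c for c in changes if window_start <= c["time"] <= incident_start]
    return sorted(suspects, key=lambda c: c["time"], reverse=True)


# Hypothetical change log covering rollouts, configs, and experiments.
incident_start = datetime(2024, 1, 1, 12, 0)
changes = [
    {"kind": "rollout", "name": "frontend v42", "time": datetime(2024, 1, 1, 11, 40)},
    {"kind": "config", "name": "cache TTL", "time": datetime(2024, 1, 1, 8, 0)},
    {"kind": "experiment", "name": "new ranking", "time": datetime(2024, 1, 1, 10, 30)},
]
suspects = recent_changes(changes, incident_start)
```

A change outside the window is not ruled out, since slow rollouts and data pushes can take effect well after they start. The window only orders where to look first.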

MITIGATE

Mitigate the issue

  • What mitigations should I take?
  • How confident am I that this is the right approach?

Validate whether the service is healthy

  • Have the mitigations fixed the issue?
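A simple way to answer this is to require several consecutive healthy measurements after the mitigation, rather than trusting a single good data point. This is a minimal sketch. The sampling interval, the threshold, and the choice of three consecutive samples are assumptions for illustration.

```python
def mitigation_holds(error_ratios, threshold, required_consecutive=3):
    """Return True if the most recent `required_consecutive` samples
    are all at or below `threshold`.

    `error_ratios` is ordered oldest to newest, one sample per interval.
    """
    if len(error_ratios) < required_consecutive:
        return False  # not enough data after the mitigation yet
    return all(r <= threshold for r in error_ratios[-required_consecutive:])


# Hypothetical per-minute error ratios around a rollback.
recovered = mitigation_holds([0.2, 0.1, 0.004, 0.003, 0.002], threshold=0.005)
relapsed = mitigation_holds([0.2, 0.004, 0.1, 0.003, 0.002], threshold=0.005)
```

Here `recovered` is true and `relapsed` is false, because the second series still has a bad sample inside the last three intervals.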

POST MORTEM

  • Conduct a blameless post-mortem
  • Document what the issue was
  • Identify how to prevent it in the future
  • Assess urgency: How urgent is the fix? How soon could the issue happen again?