O’Reilly SRE
Summary — Main Ideas & Key Points
Core Principles
- Security and reliability are interconnected — redundancy increases reliability but also attack vectors
- CIA Triad: Confidentiality, Integrity, Availability — balance all three
- Zero Trust Architecture: Never trust, always verify — least privilege, MFA, continuous validation
- Zero Touch Production: All production changes through automation, not manual intervention
- Risk-based approach: Calculate opportunity cost vs. prevention cost for negative events
System Design Fundamentals
- System Invariants: Authentication/authorization, audit trails, input validation, graceful degradation, overload handling
- Resilience by design: Each layer independently resilient, automated failure handling, graceful degradation
- Failure domains: Functional and data isolation to contain failures
- Load management: Spread requests, make them cheaper, load shedding, throttling
- Fail-safe vs. fail-secure: Balance reliability (serve as much as possible) vs. security (lock down on uncertainty)
Security Practices
- Threat modeling: Cyber Kill Chain, TTP (Tactics, Techniques, Procedures), insider risk
- Safe proxies: Single entry point for auditing, access control, production protection
- Multi-Party Authorization (MPA): Require multiple approvals for critical changes
- Breakglass mechanism: Emergency bypass with alerting and review
- Top 10 vulnerabilities: SQL injection, broken auth, sensitive data exposure, XSS, insecure deserialization, etc.
Development & Deployment
- Incremental, tested, staged changes: Slow and steady approach with documentation
- Standardization: Software distribution, configuration-as-code, monitoring
- Code quality: Hardened frameworks, code simplicity, mandatory reviews, security-shaped test cases
- Testing strategy: Unit tests (hermetic), integration tests (real dependencies), fuzzing, static analysis
- Provenance-based deployment: Signed artifacts, verified builds, policy enforcement, post-deploy verification
Observability & Debugging
- Methodical debugging: Data → hypothesis → experiment → confirmation
- Observability built-in: Logs, metrics, traces designed from the start
- Understand normal behavior: Baselines critical for spotting anomalies
- Immutable logging: Preserve evidence, don’t log everything thoughtlessly
- Limited, auditable access: Security constraints apply even during incidents
Incident Response
- Disaster planning before crisis: Risk analysis, IR teams, playbooks, tabletop exercises
- Formal incident command: Clear roles, structured response, parallelized work
- Operational security: Attackers may observe your response — protect investigation
- Evidence preservation: Forensic imaging, memory analysis, log analysis before fixing
- Recovery planning: Separate from investigation, verify clean tools, consider attack variants, rebuild from known-good sources
Culture & Continuous Improvement
- Embedded practices: Security and reliability in workflows, not bolted on
- Review culture: Code, configuration, access reviews catch problems early
- Feedback loops: Learn from incidents without blame, create postmortems
- Incentives matter: Reward security and reliability efforts
- Transparency: Overcommunicate, document decisions, create feedback channels
- Continual improvement: Design for complexity and uncertainty, not just current needs
Key Takeaways
- Automation is essential: Standardize, automate resilience measures, CI/CD, policy enforcement
- Structure over speed: Formal processes prevent costly errors during incidents
- Design for failure: Failures are inevitable — plan recovery, graceful degradation, redundancy
- Security during incidents: Don’t compromise security practices even under pressure
- Culture drives behavior: Practices only work when embraced by the team
Details
Chapter 1
- Redundancy increases reliability, but also increases the attack surface
- CIA Triad - Confidentiality, integrity and availability
- Use risk-based approaches to estimate negative events
- Calculate the opportunity cost versus the up-front cost of preventing them
- Systems fail from a reliability perspective under high load or component failures
  - High load: reduce it by spreading requests across instances or making requests cheaper (faster and easier to process)
  - Component failures: mitigate with redundancy and distinct failure domains
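A minimal sketch of the request-spreading idea (instance names are illustrative, not from the book): a round-robin balancer distributes incoming requests evenly across backends so no single instance bears the full load.

```python
import itertools

class RoundRobinBalancer:
    """Spread incoming requests evenly across backend instances."""

    def __init__(self, instances):
        self._cycle = itertools.cycle(instances)

    def pick(self):
        # Each call hands the next request to the next instance in turn.
        return next(self._cycle)

balancer = RoundRobinBalancer(["backend-1", "backend-2", "backend-3"])
picks = [balancer.pick() for _ in range(6)]  # even spread across the three
```

Real load balancers also weight by capacity and health; this only shows the spreading principle.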
Chapter 2
- Model threats, including insider risk
- Limit insider risk by:
  - Least privilege
  - Zero Trust
  - MFA
  - Business justification
  - Auditing and detection
  - Recoverability
- Cyber Kill Chain - plot the progression of an attack
- TTP - Tactics, Techniques and Procedures
Chapter 3
- Safe proxies - single entry point between networks allowing for auditing operations, controlling access to resources and protecting production
- Zero Touch Prod - all the prod changes are done through automated software
- MPA - Multi Party Authorisation
- Breakglass mechanism - lets a user bypass policies so engineers can quickly resolve an outage
Chapter 4
Chapter 5
- Least privilege
- Zero Trust Networking
- Zero Touch - everything through automation
- Classification based on risk
- Denials should almost always “be blind” (do not reveal why access was refused)
Chapter 6
- System’s understandability - small components
- System invariants:
- Only authenticated and authorized users can access a system’s persistent data store
- All operations on sensitive data are audited
- All values received from outside are validated
- The number of backend queries scales proportionally with frontend queries
- Graceful degradation
- Serve overload errors instead of crashing
- Mental Models
- Centralize responsibility for security and reliability
- Understandable interface specs
- Idempotency
- Understandable identities, auth, access control
- Identity - set of attributes / identifiers that relate to an entity
- Credentials - assert the identity of a given entity (i.e. password, OAuth2 token)
- Trusted Computing Base - set of components whose correct functioning is sufficient to ensure that a security policy is enforced (has to uphold security even if any entity outside of TCB misbehaves)
- Threat Models
Summary:
- Construct components that have clear and constrained purposes
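The idempotency point under Mental Models can be sketched as follows (the service and field names are hypothetical): a client-supplied idempotency key makes retries safe, because a duplicate delivery returns the cached result instead of repeating the side effect.

```python
class PaymentService:
    """Toy idempotent endpoint: duplicate requests have no extra effect."""

    def __init__(self):
        self._processed = {}  # idempotency key -> result of first execution

    def charge(self, idempotency_key, amount):
        # A retried request with the same key returns the stored result
        # instead of charging the customer twice.
        if idempotency_key in self._processed:
            return self._processed[idempotency_key]
        result = {"charged": amount, "status": "ok"}
        self._processed[idempotency_key] = result
        return result

svc = PaymentService()
first = svc.charge("req-123", 100)
retry = svc.charge("req-123", 100)  # duplicate delivery, no double charge
```

A production version would persist the key store and expire old keys; the sketch only shows the invariant.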
Chapter 7
Design Changes (slow & steady approach)
- Incremental
- Documented
- Tested
- Isolated
- Qualified
- Staged
Key Points:
- Standardize Software Distribution
- Monitoring
- Reusable Incident / Vulnerability Response Plan
- Know which systems are non-standard or need special attention
Policies for deploying
- Start with the easiest - gains the most traction & proves value quickly
- Or start with the hardest - surfaces the most bugs & edge cases early
Enterprise level change
- Dashboarding
- Instrumentation
Summary:
- Plan changes
- Dashboard
- Standardize as much as possible
Chapter 8
Design for resilience
- Each layer independently resilient
- Prioritize each feature and calculate its cost
- Define boundaries
- Defend against localized failures
- Automate as many resilience measures as possible
Degrade gracefully
- Disable infrequently used features (least critical functions) to free up resources
- Aim for system response measures to take effect quickly and automatically
- Understand which systems are mission critical
Response mechanisms
- Load shedding by returning errors
- Based on the request priority
- Based on the request cost
- Delaying responses and throttling clients
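The load-shedding idea above can be sketched roughly (the capacity number and priority labels are illustrative): once the server is at capacity, low-priority requests receive an overload error instead of crashing the process.

```python
class Server:
    """Toy load shedder: sheds low-priority traffic first when saturated."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = 0  # sketch only; real servers also decrement on completion

    def handle(self, priority):
        # Serve an overload error for low-priority work instead of crashing.
        if self.in_flight >= self.capacity and priority == "low":
            return "503 overloaded"
        self.in_flight += 1
        return "200 ok"

srv = Server(capacity=1)
results = [srv.handle("low"), srv.handle("low"), srv.handle("high")]
```

Real systems shed by request cost as well as priority, as the bullets above note.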
Fail safe vs fail secure
- To maximize reliability i.e. serve as much as possible in the face of uncertainty
- To maximize security - lock fully in the face of uncertainty
Failure Domains
- Functional isolation
- Data isolation
Summary
- Automate resiliency as much as possible
- Analyze the domains
- Introduce load shedding and throttling
Chapter 9
Error categories:
- Random errors (physical)
- Accidental errors (typical human errors)
- Software errors
- Malicious actions
Design Principles for recovery
- Plan early
- Should you allow rollbacks?
  - Allow arbitrary rollbacks - may reintroduce known security vulnerabilities
  - Never allow rollbacks - eliminates the path back to a known stable state; always generate a new version instead
  - Deny-list specific versions
  - Security Version Number & Minimum Security Version Number
  - Rotating signing keys
  - Use an explicit revocation mechanism
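A rough sketch of the Minimum Security Version Number idea (the version numbers and mapping below are invented for illustration): each release carries a security version number (SVN), and a rollback is rejected if the target's SVN is below the current minimum or the version is deny-listed.

```python
# Security version number (SVN) recorded for each release (illustrative).
RELEASE_SVN = {"1.0": 1, "1.1": 2, "1.2": 2, "2.0": 3}

def rollback_allowed(target, min_svn, deny_list=frozenset()):
    """Permit a rollback only to releases at or above the minimum SVN
    that are not explicitly deny-listed."""
    if target in deny_list:
        return False
    return RELEASE_SVN[target] >= min_svn

# After a security fix ships in 1.1, the MSVN is bumped to 2: rolling back
# to the vulnerable 1.0 is rejected, while 1.1 is still permitted.
blocked = rollback_allowed("1.0", min_svn=2)
allowed = rollback_allowed("1.1", min_svn=2)
```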
- Know the intended state
- Version of the code
- Expected configuration
- Testing and Continuous Validation - use policies
- Emergency access
- Critical for reliability and security
- Necessary for most severe outages
- Zero Trust properties
Additional:
- Do not rely on wall-clock time
Chapter 10
Designing for Defense
- Layered defenses
  - Edge routers
  - NLBs
  - ALBs
  - Anycast routing (to spread traffic across different locations)
- Defendable services
  - Caching proxies (Cache-Control)
  - Reduce unnecessary requests
  - Consider spriting (serving small icons in a single larger image)
  - Minimize egress
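The caching-proxy point can be illustrated with a toy sketch (not a real proxy; the backend interface here is an assumption): responses are cached for the max-age the backend declares, so repeated requests never reach the more expensive origin.

```python
import time

class CachingProxy:
    """Toy caching reverse proxy honoring a backend-declared max-age."""

    def __init__(self, backend):
        self.backend = backend
        self.cache = {}  # path -> (expires_at, body)

    def get(self, path):
        hit = self.cache.get(path)
        if hit and hit[0] > time.monotonic():
            return hit[1]  # fresh cache entry; origin is not contacted
        body, max_age = self.backend(path)
        self.cache[path] = (time.monotonic() + max_age, body)
        return body

calls = []
def backend(path):
    calls.append(path)                    # count origin hits
    return f"contents of {path}", 60      # body plus Cache-Control max-age

proxy = CachingProxy(backend)
proxy.get("/logo.png")
proxy.get("/logo.png")  # served from cache; origin hit only once
```

Under DoS, this absorbs repeated identical requests before they load the origin.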
- Monitoring and alerting
  - Mean time to detection (MTTD)
  - Mean time to repair (MTTR)
  - Alert only when demand exceeds service capacity and automated DoS defenses have engaged
- Graceful degradation
  - Reduce user-facing impact to the extent possible
  - e.g. read-only mode / reduced feature set
- Mitigation systems
  - Throttling IP addresses
  - CAPTCHA
  - These systems must be resilient and must not rely on vulnerable production paths
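A minimal per-IP throttling sketch (the limit and addresses are illustrative; real mitigation systems use sliding windows or token buckets with refill, and may escalate to a CAPTCHA):

```python
class PerIpThrottle:
    """Toy fixed-window throttle: each IP gets `limit` requests per window."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = {}  # ip -> requests used this window

    def allow(self, ip):
        used = self.counts.get(ip, 0)
        if used >= self.limit:
            return False  # throttled; a real system might serve a CAPTCHA here
        self.counts[ip] = used + 1
        return True

t = PerIpThrottle(limit=2)
decisions = [t.allow("198.51.100.7") for _ in range(3)]  # third call throttled
```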
- Strategic responses
  - Do not teach attackers how to evade defenses
  - Focus on structural defenses rather than reactive arms races
- Amplification attacks - small spoofed requests sent to thousands of servers elicit much larger responses directed at the victim
Summary:
- Prepare every service for DoS
- Combine all the design principles to have it built in structurally not reactively
Chapter 11
- Secure the code as much as you can for the critical parts
- Data Validation
- Process Isolation
- Memory allocator
- Protect against buffer overflows
Chapter 12
Top 10 Vulnerability Risks
- SQL Injection
- Broken Authentication
- Sensitive data exposure
- XML External entities
- Broken access controls
- Security misconfiguration
- Cross-Site scripting (XSS)
- Insecure Deserialization
- Using components with known vulnerabilities
- Insufficient logging & monitoring
Summary:
- Use hardened framework & libraries
- They maintain invariants, e.g. no SQL injection, correct error handling
- Prioritize code simplicity
- Build a strong review culture
- Integrate automations early for checks & safeguards
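Hardened frameworks uphold the no-SQL-injection invariant by forcing parameterized queries. A small sketch of the principle using the stdlib sqlite3 driver:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

def find_user(name):
    # The `?` placeholder keeps attacker-controlled input as data,
    # never as executable SQL.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()

safe = find_user("alice")
attack = find_user("' OR '1'='1")  # treated as a literal string; matches nothing
```

A hardened framework goes further and makes string-concatenated SQL impossible to express, so the invariant holds by construction.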
Chapter 13
Unit Tests
- Fast & reliable
- Hermetic (repeatable in isolation)
- Include security-shaped cases: negative values, overflow edges, malformed inputs, and “should return safe errors” scenarios
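A hypothetical example of such security-shaped cases (the parser and its limits are invented for illustration): negative values, overflow edges, and malformed input must yield safe errors rather than crashes or silent acceptance.

```python
def parse_quantity(raw):
    """Parse an order quantity; return None (a safe error) on bad input."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return None  # malformed input: safe error, no exception escapes
    if value < 0 or value > 10_000:
        return None  # reject negatives and implausibly large values
    return value

# Security-shaped unit-test cases:
assert parse_quantity("3") == 3
assert parse_quantity("-1") is None               # negative value
assert parse_quantity("99999999999999") is None   # overflow edge
assert parse_quantity("3; DROP TABLE") is None    # malformed input
```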
Integration Tests
- Use real dependencies instead of mocks & stubs (i.e. DB)
- Verify that logging behaves as expected
Dynamic Program Analysis (for security purposes)
- Analyze program behavior at runtime
- Detect race conditions
- Detect use of uninitialized memory
Fuzz Testing
- Generating a large number of inputs to test the code (especially edge cases and unexpected inputs)
- Hardens both security and reliability
- Mainly for finding bugs like memory corruption with security implications
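A toy fuzzing loop, far simpler than real fuzzers like AFL or libFuzzer (the target parser is invented): generate many random inputs and record every uncaught exception as a finding.

```python
import random

def target(data: bytes):
    # Toy parser under test: strictly decodes UTF-8 and splits a header line.
    text = data.decode("utf-8", errors="strict")
    key, _, value = text.partition(":")
    return key.strip(), value.strip()

def fuzz(iterations=500, seed=0):
    rng = random.Random(seed)  # seeded for reproducible findings
    crashes = []
    for _ in range(iterations):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(1, 16)))
        try:
            target(data)
        except Exception as exc:  # any uncaught exception is a finding
            crashes.append((data, type(exc).__name__))
    return crashes

findings = fuzz()  # random bytes routinely break the strict UTF-8 decode
```

Real fuzzers add coverage guidance and corpus minimization; this shows only the generate-and-observe core.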
Static Program Analysis
- Code inspection
- Abstract Syntax Tree (AST)
- i.e. Sonarqube
Chapter 14
Core best practices:
- Mandatory code reviews
- Rely fully on automation
- Verify artifacts
- Accept only images signed by CI/CD system
- Configuration as code
- Never save secrets into source / config repos
Advanced mitigation strategies:
- Binary provenance
- Authenticity
- Output
- Inputs (sources / dependencies)
- Command
- Environment
- Code signing
- Provenance-Based Deployment Policies
- Source code from approved repo
- Peer review happened
- Verified build
- Tests
- Artifact explicitly allowed for this deployment environment
- No vulnerabilities in the code
- Verifiable builds
- Reproducible
- Hermetic
- Verifiable
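A hedged sketch of a provenance-based deployment policy check (field names are illustrative, not a real attestation format): the deployment choke point admits an artifact only if every policy condition holds, and rejections name the failed check so they are actionable.

```python
POLICY = {
    "approved_repos": {"git://source.example/app"},  # hypothetical repo URL
    "require_peer_review": True,
    "require_verified_build": True,
    "allowed_environments": {"prod"},
}

def admit(provenance, environment):
    """Return (admitted, reasons): reasons list each failed policy check."""
    reasons = []
    if provenance["source_repo"] not in POLICY["approved_repos"]:
        reasons.append("source repo not approved")
    if POLICY["require_peer_review"] and not provenance["peer_reviewed"]:
        reasons.append("missing peer review")
    if POLICY["require_verified_build"] and not provenance["verified_build"]:
        reasons.append("build not verified")
    if environment not in POLICY["allowed_environments"]:
        reasons.append("environment not allowed")
    return (len(reasons) == 0, reasons)

ok, why = admit(
    {"source_repo": "git://source.example/app",
     "peer_reviewed": True,
     "verified_build": False},
    "prod",
)
```

In a real pipeline the provenance would be a signed attestation verified cryptographically before these checks run.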
Defend the build system against untrusted and unauthenticated inputs:
- Privilege separation (trusted orchestrator / sandboxed build steps)
- Hermetic fetching of dependencies
Post-Deployment Verification
- Policy change
- Fail open
- Breakglass mechanism exists
- Dry run
- Forensics after incident
How to roll it out:
- Incrementally
- Make rejection errors actionable
  - e.g. “Policy failed by X, fix by Y”
- Ensure unambiguous provenance
- Ensure unambiguous policies
- Include breakglass (but with caution)
Summary:
Implementation checklist:
- Mandatory reviews for code and pipeline changes
- CI builds only from source control
- CD deploys only CI-built artifacts
- Configuration-as-code
- No secrets in repositories
- Signed artifact provenance
- Policy enforcement at deployment choke point
- Post-deploy verification and audit logs
- Breakglass with alerting and review
- (Advanced) Privilege-separated, hermetic builds
- (Advanced) Provenance-based deployment policies
Chapter 15
- Failures are inevitable; investigation skill is what restores reliability.
- Debugging is a methodical process, not intuition or luck.
- Observability (logs, metrics, traces) is essential for diagnosis.
- You must understand normal system behavior to spot anomalies.
- Hypotheses should be tested, not assumed.
- Debugging access must be limited and auditable for security reasons.
- Poor tooling leads to slow, risky investigations.
- Sometimes repeated incidents mean the system needs redesign, not patching.
Common mistakes:
- Mistake: “Let’s just try restarting it.” Why dangerous: hides root causes and causes repeat incidents.
- Mistake: logging everything without thought. Why dangerous: performance issues and leaked sensitive data.
- Misconception: rare bugs don’t matter. Reality: at scale, rare events happen frequently.
- Mistake: giving full prod access during incidents. Why dangerous: security breaches and accidental damage.
Summary:
- Debugging follows: data → hypothesis → experiment → confirmation.
- Observability must be designed in, not added later.
- Understanding baselines is critical for incident response.
- Security constraints still apply during outages.
- Repeated incidents often indicate architectural flaws.
- Immutable logging is important
Chapter 16
Key points:
- Real systems fail in many ways—disasters are inevitable.
- Disaster planning means preparing *before* a crisis happens.
- Start with a risk analysis to prioritize what matters most.
- Define and staff an incident response (IR) team with clear roles.
- Build response plans and detailed playbooks for different scenarios.
- incident reporting
- triage
- SLO
- Roles and responsibilities
- Outreach
- Communications
- Adjust systems and access ahead of time (prestaging).
- Train teams and run exercises to institutionalize response skills.
- Regularly audit and test plans, tools, and procedures.
- Tabletop exercise - nonintrusive exercises to challenge responses & playbooks
- Define severity and priority models
Common Mistakes & Misconceptions
- Mistake: planning only after an incident. Why it’s bad: reaction-only responses are unstructured and slow.
- Mistake: not defining team roles ahead of time. Why it’s bad: confusion and delays during the actual incident.
- Misconception: having good monitoring is enough. Reality: monitoring alerts you, but plans and playbooks guide you.
- Mistake: never testing or updating plans. Why it’s bad: stale plans fail when conditions change.
- Misconception: only large companies need disaster plans. Reality: small systems fail too, and unprepared teams scramble.
Chapter 17
Key points:
- Security incidents are inevitable; chaos is optional.
- Not every alert is a crisis — triage comes first.
- Serious incidents require formal incident command.
- Clear roles and ownership reduce mistakes under pressure.
- Attackers may observe your response — operational security matters.
- Investigation must preserve evidence and timelines.
- Work should be parallelized across focused sub-teams.
- Cleanup and long-term fixes start while investigation is still ongoing.
- Security incidents require incident command, not ad-hoc heroics.
- Triage determines whether to escalate to crisis mode.
- Operational security protects the response itself.
- Evidence preservation is critical for root-cause analysis.
- Parallelizing work shortens recovery time without increasing risk.
Common mistakes:
- Mistake: treating every alert as a full crisis. Why dangerous: causes fatigue and slows real responses.
- Mistake: fixing systems before understanding the compromise. Why dangerous: destroys evidence and hides attacker scope.
- Mistake: discussing incident details in normal chats or email. Why dangerous: attackers may monitor compromised systems.
- Misconception: speed matters more than structure. Reality: unstructured speed causes costly errors.
Investigation process:
- Forensic imaging
- Memory imaging
- File Carving
- Log analysis
- Malware analysis
Summary:
- Have incident commander
- Be prepared - analyse everything thoroughly
- Structure is necessary
Chapter 18
Key points:
- Recovery after an incident is different when an attacker may still be present.
- Prepare formal teams and roles for recovery separate from investigation.
- Establish good information management for notes, docs, and checklists.
- Plan and scope recovery based on what systems and data were compromised.
- Decide when and how to eject an attacker without provoking further harm.
- Ensure your recovery tools and infrastructure haven’t been compromised.
- Consider variants of the attack when restoring systems.
- Use recovery checklists to coordinate tasks and parallelize work.
- Balance short-term mitigation (technical debt) with long-term fixes.
- Recovery teams should be separate from investigation teams to avoid conflict.
- Recovery planning must consider attacker presence and possible reactions.
- Rebuilding from known-good sources is often safer than patching.
- Recovery checklists ensure structured, parallel execution of tasks.
- Recovery infrastructure must itself be verified clean before use.
- Create postmortem
Common mistakes:
- Mistake: jumping straight into recovery without planning. Danger: you may undo investigation work or trigger attacker reactions.
- Mistake: using compromised tools or comms for recovery. Danger: attackers can monitor and adapt to your actions.
- Misconception: short-term fixes aren’t harmful. Reality: temporary mitigations can become permanent technical debt.
- Mistake: ignoring attack variants. Danger: you recover one breach only to leave the system vulnerable to another.
Summary:
- Always create postmortem afterwards
- Plan recovery - do not just jump into it
- Be aware of the compromised tools or comms
Chapter 19
Chapter 20
Key points:
- Security and reliability will become even more interconnected.
- Automation will play a bigger role in detecting and responding to issues faster.
- Teams must design for complexity and uncertainty, not just current needs.
- Engineers should think in terms of continual improvement, not one-time fixes.
- Shared responsibility between security, reliability, and product teams is essential.
- Tools that provide better system visibility will matter more.
- Systems should be resilient by default, not by accident.
- Strong culture and good processes scale better than any single tool.
- Future systems will require cross-discipline ownership of security and reliability.
- Automation is necessary for scaling detection, diagnosis, and response.
- Systems should be designed with resilience built in, not bolted on.
- Engineers should favor continual improvement and feedback loops.
- Visibility tools like tracing, logs, and unified dashboards are critical for complex environments.
Chapter 21
Key points:
- Culture drives behavior — practices only work when people embrace them.
- Security and reliability should be “default” mindsets, not late checkboxes.
- A review culture catches problems earlier and spreads shared responsibility.
- Awareness and training help everyone understand their role and risks.
- Feedback loops (not blame) help teams learn from incidents and improve.
- Incentives and promotions should reward security and reliability efforts.
- Transparency and communication strengthen trust and shared goals.
- Change takes time — start with incremental practices that fit your team.
- Security and reliability must be embedded into workflows, not bolted on later.
- Strong review practices (code, configuration, access) are cultural investments, not burdens.
- Awareness and education shouldn’t be one-off — use interactive and contextual learning.
- Incentives matter — teams behave according to what gets rewarded.
- Cultural change is continuous — pick a few practices, invest consistently, and measure impact.
Summary:
- Overcommunicate
- Be transparent
- Document decisions
- Create feedback channels