Reliability & Availability
Definitions
Reliability
The probability that a system performs correctly during a specific time duration.
Focus: Preventing failures from happening in the first place.
Availability
The percentage of time a system is operational and accessible.
Focus: Minimizing downtime when failures do occur.
Key insight: A system can be available but unreliable (it responds, but returns wrong data), or reliable but not highly available (it works correctly whenever it is up, but suffers frequent or lengthy outages).
Availability Levels & SLAs
Availability targets are usually quoted in "nines"; each extra nine cuts the permitted downtime by a factor of ten:
- 99% ("two nines"): about 3.65 days of downtime per year
- 99.9% ("three nines"): about 8.76 hours per year
- 99.99% ("four nines"): about 52.6 minutes per year
- 99.999% ("five nines"): about 5.26 minutes per year
Reliability Patterns
Common patterns for building reliable systems that can handle failures gracefully.
Redundancy & Failover. Implementation: Load balancers, database replicas, multi-AZ deployment
Circuit Breaker. Implementation: Netflix Hystrix, fail-fast when a dependency is unavailable
Bulkhead. Implementation: Separate thread pools and connection pools per service
Retry with Backoff. Implementation: Exponential backoff, jitter, max retry limits (see the sketch below)
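A minimal sketch of the retry pattern above, assuming a generic `operation` callable and standard-library delays; production code would catch only transient error types rather than every exception.

```python
import random
import time

def retry_with_backoff(operation, max_retries=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky call with exponential backoff, full jitter, and a retry cap."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise  # retries exhausted: surface the failure to the caller
            # Exponential backoff capped at max_delay, with full jitter so many
            # clients retrying at once do not synchronize into a thundering herd.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

The jitter matters as much as the backoff itself: without it, every client that failed at the same moment retries at the same moment.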
MTBF vs MTTR
MTBF (Mean Time Between Failures)
Average time between system failures. Focus on preventing failures.
- Better testing and quality assurance
- Redundant components and systems
- Regular maintenance and updates
MTTR (Mean Time To Recovery)
Average time to restore service after a failure. Focus on fast recovery.
- Automated monitoring and alerting
- Quick rollback mechanisms
- Clear incident response procedures
Formula: Availability = MTBF / (MTBF + MTTR)
Strategy: Increase MTBF and decrease MTTR for higher availability.
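A quick worked example of the formula, with illustrative MTBF and MTTR values:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A component that fails every 1,000 hours and takes 1 hour to restore is
# ~99.9% available; halving MTTR to 30 minutes lifts it to ~99.95%.
print(availability(1000, 1.0))  # 0.9990...
print(availability(1000, 0.5))  # 0.9995...
```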
Reducing MTTR Strategies
Shorten each phase of an incident: detect faster (automated monitoring and alerting), diagnose faster (comprehensive logging, health checks), and recover faster (quick rollback mechanisms, automated failover, well-rehearsed incident response procedures).
Real-world Examples
Netflix: 99.99% Availability
Strategy: Chaos engineering, microservices, multi-region deployment
Netflix can lose an entire AWS region without service impact, and intentionally breaks things in production to test resilience.
Google: 99.99% SLA
Strategy: Global load balancing, automatic failover, SRE practices
Site Reliability Engineering (SRE) teams focus on automation and reducing toil. Error budgets balance reliability vs feature velocity.
AWS: 99.99% SLA
Strategy: Multi-AZ deployment, auto-scaling, health checks
Each service has specific SLA commitments. Availability Zones are isolated from each other to prevent cascading failures.
Common Failure Modes
Hardware Failures
Disk crashes, network outages, power failures. Solution: Redundancy and quick replacement.
Software Bugs
Memory leaks, infinite loops, null pointer exceptions. Solution: Testing, monitoring, quick rollback.
Human Errors
Misconfigurations, accidental deletions, deployment mistakes. Solution: Automation, safeguards, training.
Cascading Failures
One component failure triggers others. Solution: Circuit breakers, bulkhead pattern, graceful degradation (circuit breaker sketch after this list).
Load Spikes
Traffic exceeds capacity. Solution: Auto-scaling, load balancing, rate limiting.
Dependencies
External service failures. Solution: Timeouts, retries, fallback mechanisms, caching.
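A minimal circuit breaker sketch for the cascading-failure case above; the thresholds and the half-open probe policy are illustrative, not a specific library's API.

```python
import time

class CircuitBreaker:
    """Fail fast once a dependency looks unhealthy instead of piling up calls."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one probe call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result
```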
Building Reliable Systems: Checklist
Design Phase
- ✓ Identify single points of failure
- ✓ Plan for redundancy at each layer
- ✓ Design for graceful degradation
- ✓ Define SLA requirements early
- ✓ Plan monitoring and alerting strategy
Implementation Phase
- ✓ Implement circuit breakers
- ✓ Add comprehensive logging
- ✓ Set up health checks (see the sketch after this checklist)
- ✓ Create deployment automation
- ✓ Test failure scenarios
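For the health-check item, here is a standard-library sketch of a `/healthz` endpoint; `check_database()` is a hypothetical probe you would replace with a real dependency ping.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database() -> bool:
    """Hypothetical dependency probe; replace with a real connection ping."""
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        healthy = check_database()
        body = json.dumps({"status": "ok" if healthy else "degraded"}).encode()
        # 200 keeps this instance in the load balancer's rotation;
        # 503 takes it out so traffic shifts to healthy instances.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```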
🧮 SLA Calculator
Calculate actual downtime and costs based on availability targets
Example: 99.9% availability allows about 8.76 hours of downtime per year.
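The calculator's arithmetic is just the availability target applied to the 8,760 hours in a year; a small sketch:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

def downtime_hours_per_year(availability_pct: float) -> float:
    """Yearly downtime permitted by an availability target, e.g. 99.9 -> 8.76."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_hours_per_year(target):.2f} hours/year")
# 99.0% -> 87.60, 99.9% -> 8.76, 99.99% -> 0.88, 99.999% -> 0.09
```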
🎯 Failure Response Scenarios
Learn from real outage responses and their business impact
1. Database Primary Failure
Context
Main database goes down, automatic failover to replica
Outcome
Service restored quickly with minimal data loss. 99.995% monthly availability maintained.
Key Lessons
- Automated failover is crucial for database reliability
- Read replicas can serve traffic during primary failures
- Monitoring with sub-minute detection prevents longer outages
- Regular failover testing ensures systems work under pressure
2. Cascading Service Failure
Context
Authentication service overload causes downstream failures
Outcome
Major outage due to lack of circuit breakers. Lost $500K revenue and customer trust.
Key Lessons
- Circuit breakers prevent cascading failures
- Bulkhead pattern isolates critical services
- Rate limiting protects upstream services (see the sketch after this list)
- Service dependencies should be mapped and monitored
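A token-bucket sketch of per-client rate limiting, as referenced in the lessons above; in production the counters usually live in a shared store (for example Redis) keyed by IP or API key.

```python
import time

class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: reject or queue the request
```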
3. Regional Datacenter Outage
Context
Entire AWS region goes down, multi-region setup saves the day
Outcome
Minimal user impact thanks to multi-region architecture. 99.99% availability maintained.
Key Lessons
- Multi-region deployment prevents regional failures
- DNS failover enables automatic region switching
- Cross-region data replication is essential
- Accept higher costs for critical availability requirements
4. Memory Leak in Production
Context
Gradual memory consumption causes rolling server crashes
Outcome
Quick rollback prevented complete outage. Memory monitoring alerts improved.
Key Lessons
- Memory monitoring with automatic alerts is essential
- Canary deployments catch issues before full rollout
- Quick rollback capability limits blast radius
- Resource limits prevent a single process from crashing the server
5. DDoS Attack During Peak Hours
Context
Massive botnet targets API endpoints during Black Friday
Outcome
CDN and rate limiting absorbed attack. Business continued normally.
Key Lessons
- CDN provides first line of DDoS defense
- Rate limiting by IP and API key prevents abuse
- Geo-blocking can reduce attack surface
- Have DDoS mitigation service on standby
6. Configuration Typo in Production
Context
Database connection string typo causes all new connections to fail
Outcome
Automated rollback triggered by health check failures. Limited customer impact.
Key Lessons
- Configuration validation before deployment is critical (see the sketch after this list)
- Automated health checks enable fast detection
- Staged rollouts limit blast radius
- Configuration as code enables quick rollback
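A small sketch of pre-deployment validation for the connection-string case; the accepted schemes are illustrative, and a real pipeline would also attempt a test connection against staging.

```python
from urllib.parse import urlparse

def validate_database_url(url: str) -> None:
    """Reject obviously malformed connection strings before they ship."""
    parsed = urlparse(url)
    if parsed.scheme not in {"postgresql", "mysql"}:
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    if not parsed.hostname or not parsed.port:
        raise ValueError("connection string is missing host or port")

validate_database_url("postgresql://app:secret@db.internal:5432/orders")  # passes
```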
7. Third-party Payment Provider Outage
Context
Stripe goes down during checkout processing
Outcome
Secondary payment provider prevented total checkout failure. Most customers completed purchase.
Key Lessons
- Multiple payment providers provide resilience
- Graceful degradation maintains partial functionality (see the sketch after this list)
- Queue failed transactions for retry
- Clear communication reduces customer frustration
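A sketch of the provider-fallback idea; `charge_primary`, `charge_secondary`, and `queue_for_retry` are hypothetical callables standing in for real payment clients and a retry queue, not any provider's actual API.

```python
def process_payment(order, charge_primary, charge_secondary, queue_for_retry):
    """Try the primary provider, fall back to the secondary, and queue the
    order for a later retry if both are unavailable (graceful degradation)."""
    for charge in (charge_primary, charge_secondary):
        try:
            return charge(order)   # success: return the provider's receipt
        except ConnectionError:
            continue               # provider unreachable: try the next one
    queue_for_retry(order)         # both down: persist and retry later
    return None                    # caller shows a "payment pending" message
```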
8. Kubernetes Node Failure
Context
Worker node hardware failure during traffic spike
Outcome
Kubernetes automatically redistributed pods. Zero customer impact.
Key Lessons
- Container orchestration provides automatic recovery
- Resource reservations ensure capacity for redistribution
- Node auto-scaling handles capacity changes
- Health checks trigger automatic pod restarts
9. Cache Stampede After Restart
Context
Redis restart causes all clients to hit database simultaneously
Outcome
Database connection pool exhausted. Partial outage until cache warmed.
Key Lessons
- Implement cache warming strategies
- Use jittered TTLs to prevent simultaneous expiry (see the sketch below)
- Circuit breakers protect database from overload
- Consider cache preloading for critical data
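A sketch of the jittered-TTL idea from the lessons above; the 20% jitter fraction is an arbitrary example value.

```python
import random

def jittered_ttl(base_ttl_seconds: int, jitter_fraction: float = 0.2) -> int:
    """Spread expirations over +/- jitter_fraction of the base TTL so entries
    written together do not all expire, and miss, at the same moment."""
    jitter = base_ttl_seconds * jitter_fraction
    return int(base_ttl_seconds + random.uniform(-jitter, jitter))

# Example: cache.set(key, value, ttl=jittered_ttl(300))  # 240-360 seconds
```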