Reliability & Availability
Definitions
Reliability
The probability that a system performs correctly during a specific time duration.
Focus: Preventing failures from happening in the first place.
Availability
The percentage of time a system is operational and accessible.
Focus: Minimizing downtime when failures do occur.
Key insight: A system can be available but unreliable (it responds, but returns wrong data), or reliable but not highly available (it works correctly whenever it is up, but suffers frequent or lengthy outages).
Availability Levels & SLAs
Availability targets are usually quoted in "nines"; each extra nine cuts the permitted downtime by a factor of ten:
- 99% ("two nines"): about 3.65 days of downtime per year
- 99.9% ("three nines"): about 8.76 hours per year
- 99.99% ("four nines"): about 52.6 minutes per year
- 99.999% ("five nines"): about 5.26 minutes per year
Reliability Patterns
Common patterns for building reliable systems that can handle failures gracefully.
Redundancy & Failover. Implementation: Load balancers, database replicas, multi-AZ deployment
Circuit Breaker. Implementation: Netflix Hystrix, fail-fast when a dependency is unavailable
Bulkhead. Implementation: Separate thread pools and connection pools per service
Retry with Backoff. Implementation: Exponential backoff, jitter, max retry limits (see the sketch below)
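A minimal sketch of the retry pattern above, assuming a generic `operation` callable and standard-library delays; production code would catch only transient error types rather than every exception.

```python
import random
import time

def retry_with_backoff(operation, max_retries=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky call with exponential backoff, full jitter, and a retry cap."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise  # retries exhausted: surface the failure to the caller
            # Exponential backoff capped at max_delay, with full jitter so many
            # clients retrying at once do not synchronize into a thundering herd.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

The jitter matters as much as the backoff itself: without it, every client that failed at the same moment retries at the same moment.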
MTBF vs MTTR
MTBF (Mean Time Between Failures)
Average time between system failures. Focus on preventing failures.
- Better testing and quality assurance
- Redundant components and systems
- Regular maintenance and updates
MTTR (Mean Time To Recovery)
Average time to restore service after a failure. Focus on fast recovery.
- Automated monitoring and alerting
- Quick rollback mechanisms
- Clear incident response procedures
Formula: Availability = MTBF / (MTBF + MTTR)
Strategy: Increase MTBF and decrease MTTR for higher availability.
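A quick worked example of the formula, with illustrative MTBF and MTTR values:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A component that fails every 1,000 hours and takes 1 hour to restore is
# ~99.9% available; halving MTTR to 30 minutes lifts it to ~99.95%.
print(availability(1000, 1.0))  # 0.9990...
print(availability(1000, 0.5))  # 0.9995...
```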
Reducing MTTR Strategies
Shorten each phase of an incident: detect faster (automated monitoring and alerting), diagnose faster (comprehensive logging, health checks), and recover faster (quick rollback mechanisms, automated failover, well-rehearsed incident response procedures).
Real-world Examples
Netflix: 99.99% Availability
Strategy: Chaos engineering, microservices, multi-region deployment
Netflix can lose an entire AWS region without service impact, and intentionally breaks things in production to test resilience.
Google: 99.99% SLA
Strategy: Global load balancing, automatic failover, SRE practices
Site Reliability Engineering (SRE) teams focus on automation and reducing toil. Error budgets balance reliability vs feature velocity.
AWS: 99.99% SLA
Strategy: Multi-AZ deployment, auto-scaling, health checks
Each service has specific SLA commitments. Availability Zones are isolated from each other to prevent cascading failures.
Common Failure Modes
Hardware Failures
Disk crashes, network outages, power failures. Solution: Redundancy and quick replacement.
Software Bugs
Memory leaks, infinite loops, null pointer exceptions. Solution: Testing, monitoring, quick rollback.
Human Errors
Misconfigurations, accidental deletions, deployment mistakes. Solution: Automation, safeguards, training.
Cascading Failures
One component failure triggers others. Solution: Circuit breakers, bulkhead pattern, graceful degradation (circuit breaker sketch after this list).
Load Spikes
Traffic exceeds capacity. Solution: Auto-scaling, load balancing, rate limiting.
Dependencies
External service failures. Solution: Timeouts, retries, fallback mechanisms, caching.
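A minimal circuit breaker sketch for the cascading-failure case above; the thresholds and the half-open probe policy are illustrative, not a specific library's API.

```python
import time

class CircuitBreaker:
    """Fail fast once a dependency looks unhealthy instead of piling up calls."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one probe call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result
```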
Building Reliable Systems: Checklist
Design Phase
- ✓ Identify single points of failure
- ✓ Plan for redundancy at each layer
- ✓ Design for graceful degradation
- ✓ Define SLA requirements early
- ✓ Plan monitoring and alerting strategy
Implementation Phase
- ✓ Implement circuit breakers
- ✓ Add comprehensive logging
- ✓ Set up health checks (see the sketch after this checklist)
- ✓ Create deployment automation
- ✓ Test failure scenarios
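For the health-check item, here is a standard-library sketch of a `/healthz` endpoint; `check_database()` is a hypothetical probe you would replace with a real dependency ping.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database() -> bool:
    """Hypothetical dependency probe; replace with a real connection ping."""
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        healthy = check_database()
        body = json.dumps({"status": "ok" if healthy else "degraded"}).encode()
        # 200 keeps this instance in the load balancer's rotation;
        # 503 takes it out so traffic shifts to healthy instances.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```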
🧮 SLA Calculator
Calculate actual downtime and costs based on availability targets
Example: 99.9% availability allows about 8.76 hours of downtime per year.
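The calculator's arithmetic is just the availability target applied to the 8,760 hours in a year; a small sketch:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

def downtime_hours_per_year(availability_pct: float) -> float:
    """Yearly downtime permitted by an availability target, e.g. 99.9 -> 8.76."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_hours_per_year(target):.2f} hours/year")
# 99.0% -> 87.60, 99.9% -> 8.76, 99.99% -> 0.88, 99.999% -> 0.09
```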
🎯 Failure Response Scenarios
Learn from real outage responses and their business impact
1. Database Primary Failure
Context
Main database goes down, automatic failover to replica
Outcome
Service restored quickly with minimal data loss. 99.995% monthly availability maintained.
Key Lessons
- Automated failover is crucial for database reliability
- Read replicas can serve traffic during primary failures
- Monitoring with sub-minute detection prevents longer outages
- Regular failover testing ensures systems work under pressure
2. Cascading Service Failure
Context
Authentication service overload causes downstream failures
Outcome
Major outage due to lack of circuit breakers. Lost $500K revenue and customer trust.
Key Lessons
- Circuit breakers prevent cascading failures
- Bulkhead pattern isolates critical services
- Rate limiting protects upstream services (see the sketch after this list)
- Service dependencies should be mapped and monitored
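A token-bucket sketch of per-client rate limiting, as referenced in the lessons above; in production the counters usually live in a shared store (for example Redis) keyed by IP or API key.

```python
import time

class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: reject or queue the request
```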
3. Regional Datacenter Outage
Context
Entire AWS region goes down, multi-region setup saves the day
Outcome
Minimal user impact thanks to multi-region architecture. 99.99% availability maintained.
Key Lessons
- Multi-region deployment prevents regional failures
- DNS failover enables automatic region switching
- Cross-region data replication is essential
- Accept higher costs for critical availability requirements
4. Memory Leak in Production
Context
Gradual memory consumption causes rolling server crashes
Outcome
Quick rollback prevented complete outage. Memory monitoring alerts improved.
Key Lessons
- Memory monitoring with automatic alerts is essential
- Canary deployments catch issues before full rollout
- Quick rollback capability limits blast radius
- Resource limits prevent a single process from crashing the server
5. DDoS Attack During Peak Hours
Context
Massive botnet targets API endpoints during Black Friday
Outcome
CDN and rate limiting absorbed attack. Business continued normally.
Key Lessons
- CDN provides first line of DDoS defense
- Rate limiting by IP and API key prevents abuse
- Geo-blocking can reduce attack surface
- Have DDoS mitigation service on standby
6. Configuration Typo in Production
Context
Database connection string typo causes all new connections to fail
Outcome
Automated rollback triggered by health check failures. Limited customer impact.
Key Lessons
- Configuration validation before deployment is critical (see the sketch after this list)
- Automated health checks enable fast detection
- Staged rollouts limit blast radius
- Configuration as code enables quick rollback
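A small sketch of pre-deployment validation for the connection-string case; the accepted schemes are illustrative, and a real pipeline would also attempt a test connection against staging.

```python
from urllib.parse import urlparse

def validate_database_url(url: str) -> None:
    """Reject obviously malformed connection strings before they ship."""
    parsed = urlparse(url)
    if parsed.scheme not in {"postgresql", "mysql"}:
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    if not parsed.hostname or not parsed.port:
        raise ValueError("connection string is missing host or port")

validate_database_url("postgresql://app:secret@db.internal:5432/orders")  # passes
```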
7. Third-party Payment Provider Outage
Context
Stripe goes down during checkout processing
Outcome
Secondary payment provider prevented total checkout failure. Most customers completed purchase.
Key Lessons
- Multiple payment providers provide resilience
- Graceful degradation maintains partial functionality (see the sketch after this list)
- Queue failed transactions for retry
- Clear communication reduces customer frustration
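A sketch of the provider-fallback idea; `charge_primary`, `charge_secondary`, and `queue_for_retry` are hypothetical callables standing in for real payment clients and a retry queue, not any provider's actual API.

```python
def process_payment(order, charge_primary, charge_secondary, queue_for_retry):
    """Try the primary provider, fall back to the secondary, and queue the
    order for a later retry if both are unavailable (graceful degradation)."""
    for charge in (charge_primary, charge_secondary):
        try:
            return charge(order)   # success: return the provider's receipt
        except ConnectionError:
            continue               # provider unreachable: try the next one
    queue_for_retry(order)         # both down: persist and retry later
    return None                    # caller shows a "payment pending" message
```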
8. Kubernetes Node Failure
Context
Worker node hardware failure during traffic spike
Outcome
Kubernetes automatically redistributed pods. Zero customer impact.
Key Lessons
- Container orchestration provides automatic recovery
- Resource reservations ensure capacity for redistribution
- Node auto-scaling handles capacity changes
- Health checks trigger automatic pod restarts
9. Cache Stampede After Restart
Context
Redis restart causes all clients to hit database simultaneously
Outcome
Database connection pool exhausted. Partial outage until cache warmed.
Key Lessons
- Implement cache warming strategies
- Use jittered TTLs to prevent simultaneous expiry (see the sketch below)
- Circuit breakers protect database from overload
- Consider cache preloading for critical data
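A sketch of the jittered-TTL idea from the lessons above; the 20% jitter fraction is an arbitrary example value.

```python
import random

def jittered_ttl(base_ttl_seconds: int, jitter_fraction: float = 0.2) -> int:
    """Spread expirations over +/- jitter_fraction of the base TTL so entries
    written together do not all expire, and miss, at the same moment."""
    jitter = base_ttl_seconds * jitter_fraction
    return int(base_ttl_seconds + random.uniform(-jitter, jitter))

# Example: cache.set(key, value, ttl=jittered_ttl(300))  # 240-360 seconds
```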