Reliability & Availability

Definitions

Reliability

The probability that a system performs its intended function correctly over a specified period of time.

Focus: Preventing failures from happening in the first place.

Availability

The percentage of time a system is operational and accessible.

Focus: Minimizing downtime when failures do occur.

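Key insight: The two are related but distinct. A system can be available but unreliable (it responds, but returns wrong data, or it fails often and recovers quickly), or reliable but not highly available (it fails rarely, but each failure or maintenance window causes long downtime).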

Availability Levels & SLAs

Availability   Downtime/year   Service level             Example
99%            3.65 days       Basic                     Internal tools
99.9%          8.76 hours      Standard web service      E-commerce sites
99.99%         52.6 minutes    High availability         Banking systems
99.999%        5.26 minutes    Ultra-high availability   Critical infrastructure

Cost factor: roughly 1x at the basic level (99%), rising to about 100x at the ultra-high level (99.999%, or about 0.004 days of downtime per year).
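
The downtime figures for each availability level follow from simple arithmetic. A minimal sketch, assuming a 365-day year (the function name `downtime_per_year` is illustrative, not from any library):

```python
from datetime import timedelta

HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def downtime_per_year(availability_percent: float) -> timedelta:
    """Return the yearly downtime permitted by an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(hours=HOURS_PER_YEAR * unavailable_fraction)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_per_year(target).total_seconds() / 60
    print(f"{target}% allows about {minutes:.1f} minutes of downtime per year")
```

Each additional "nine" divides the allowed downtime by ten, which is one reason the cost of each step rises so steeply.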

Reliability Patterns

Common patterns for building reliable systems that can handle failures gracefully.

Redundancy
Multiple instances of critical components
Cost: High
Complexity: Medium

Implementation: Load balancers, database replicas, multi-AZ deployment
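
To see why redundancy helps, consider the combined availability of several interchangeable instances. The sketch below assumes failures are independent and that any single healthy instance can serve all traffic; real deployments only approximate both assumptions, and the function name is illustrative.

```python
def parallel_availability(instance_availability: float, instances: int) -> float:
    """Availability of N redundant instances where any one can serve traffic.

    Assumes independent failures: the system is down only when every
    instance is down at the same time.
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return 1 - (1 - instance_availability) ** instances


# Two 99% instances behind a load balancer, under the independence assumption.
print(f"{parallel_availability(0.99, 2):.4%}")
```

Correlated failures (a shared power feed, a bad deploy rolled out everywhere) break the independence assumption, which is why redundancy is usually spread across zones or regions.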

Circuit Breaker
Prevent cascading failures
Cost: Low
Complexity: Low

Implementation: Fail fast when a dependency is unavailable, using a library such as Netflix Hystrix (now in maintenance mode) or Resilience4j
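
The fail-fast behavior can be sketched in a few lines. This is a simplified, single-threaded illustration rather than the API of any particular library; the class names, the default threshold, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of waiting on a broken dependency.
                raise CircuitOpenError("dependency marked unavailable")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as a signal to serve a fallback.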

Bulkhead
Isolate resources to prevent total failure
Cost: Medium
Complexity: Medium

Implementation: Separate thread pools, connection pools per service
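
One lightweight way to approximate a bulkhead is to cap the number of concurrent calls per dependency. The sketch below uses a semaphore instead of a dedicated thread pool; the class and exception names are illustrative assumptions.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when all concurrency slots for a dependency are in use."""


class Bulkhead:
    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately instead of queueing, so one slow dependency
        # cannot tie up every worker in the process.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slots for this dependency")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per downstream service keeps their capacity isolated.
payments_bulkhead = Bulkhead(max_concurrent=10)
search_bulkhead = Bulkhead(max_concurrent=50)
```

If the payments service hangs, at most 10 callers are stuck waiting on it, and search traffic keeps its own 50 slots.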

Timeout & Retry
Handle transient failures gracefully
Cost: Low
Complexity: Low

Implementation: Exponential backoff, jitter, max retry limits
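
A minimal sketch of retrying with exponential backoff, jitter, and a retry limit. It assumes the wrapped call enforces its own timeout and that every exception is worth retrying, which a real client would narrow to transient errors. The function name and the injectable `sleep` and `rng` parameters are illustrative.

```python
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call func, retrying on failure with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry limit reached: surface the last error
            # Exponential ceiling: base, 2x base, 4x base, ... capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random fraction of the ceiling so that many
            # clients do not all retry at the same instant.
            sleep(ceiling * rng())
```

Without jitter, clients that failed together retry together, which can turn a brief blip into a repeated load spike.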

MTBF vs MTTR

MTBF (Mean Time Between Failures)

Average time between system failures. Focus on preventing failures.

  • Better testing and quality assurance
  • Redundant components and systems
  • Regular maintenance and updates

MTTR (Mean Time To Recovery)

Average time to restore service after a failure. Focus on fast recovery.

  • Automated monitoring and alerting
  • Quick rollback mechanisms
  • Clear incident response procedures

Formula: Availability = MTBF / (MTBF + MTTR)
Strategy: Increase MTBF and decrease MTTR for higher availability.
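
The formula can be checked with a few numbers. The MTBF and MTTR values below are made up for illustration; both must use the same time unit.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to recovery."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


baseline = availability(mtbf_hours=720, mttr_hours=1.0)         # about 99.86%
fewer_failures = availability(mtbf_hours=1440, mttr_hours=1.0)  # about 99.93%
faster_recovery = availability(mtbf_hours=720, mttr_hours=0.5)  # about 99.93%

print(f"{baseline:.3%} {fewer_failures:.3%} {faster_recovery:.3%}")
```

Doubling MTBF and halving MTTR give the same availability, because only the ratio of the two matters. In practice, halving recovery time is often the cheaper of the two.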

Reducing MTTR Strategies

Monitoring & Alerting
Prometheus, Grafana, PagerDuty
Goal: Detect issues in < 1 min

Automated Rollback
Blue-green deployment, feature flags
Goal: Restore service in < 5 min

Runbooks
Documentation, automated scripts
Goal: Standardized response procedures

Post-mortem Process
Incident analysis, action items
Goal: Learn and prevent future issues

Real-world Examples

Netflix: 99.99% Availability

Strategy: Chaos engineering, microservices, multi-region deployment

Designed to survive the loss of an entire AWS region with little or no customer impact. Deliberately injects failures in production, using tools such as Chaos Monkey, to test resilience.

Google: 99.99% SLA

Strategy: Global load balancing, automatic failover, SRE practices

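Site Reliability Engineering (SRE) teams focus on automation and reducing toil. An error budget is the amount of unreliability a target permits (for example, 0.1% for a 99.9% target); teams spend it on releases and slow feature work when it runs out, which balances reliability against feature velocity.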
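
An error budget can be expressed as minutes of allowed downtime in a window. A minimal sketch, assuming a 30-day window and counting only full outages (real error budgets are often measured in failed requests instead); the function names are illustrative.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability target permits within the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Minutes of budget left; a negative value means the target was missed."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


# A 99.9% target over 30 days permits about 43.2 minutes of downtime.
print(f"{error_budget_minutes(99.9):.1f} minutes")
```

When `budget_remaining` approaches zero, the usual response is to pause risky releases until the window rolls over.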

AWS: 99.99% SLA

Strategy: Multi-AZ deployment, auto-scaling, health checks

Each service publishes its own SLA, so the exact figure depends on the service and the deployment configuration. Availability Zones are isolated from each other to prevent cascading failures.

Common Failure Modes

Hardware Failures

Disk crashes, network outages, power failures. Solution: Redundancy and quick replacement.

Software Bugs

Memory leaks, infinite loops, null pointer exceptions. Solution: Testing, monitoring, quick rollback.

Human Errors

Misconfigurations, accidental deletions, deployment mistakes. Solution: Automation, safeguards, training.

Cascading Failures

One component failure triggers others. Solution: Circuit breakers, bulkhead pattern, graceful degradation.

Load Spikes

Traffic exceeds capacity. Solution: Auto-scaling, load balancing, rate limiting.
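
Rate limiting is commonly implemented with a token bucket. The sketch below is a single-threaded illustration; the class name and the injectable `clock` parameter are assumptions for the example, and a production limiter would also need locking or a shared store.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.last_refill = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: reject or queue the request
```

Rejected requests are typically answered with HTTP 429 so that well-behaved clients back off.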

Dependencies

External service failures. Solution: Timeouts, retries, fallback mechanisms, caching.
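
A fallback backed by a cache can be sketched as follows. The function name, the plain-dict cache, and the decision to catch every exception are simplifying assumptions; a real client would catch only timeout and connection errors and would bound how stale a cached value may be.

```python
def call_with_fallback(fetch, cache: dict, key):
    """Return fresh data when the dependency responds, else the last cached value."""
    try:
        value = fetch(key)
    except Exception:
        if key in cache:
            return cache[key]  # stale, but better than an error page
        raise  # nothing cached: the failure has to surface
    cache[key] = value  # remember the latest good response
    return value
```

Serving slightly stale data is a form of graceful degradation: the feature keeps working while the dependency recovers.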

Building Reliable Systems: Checklist

Design Phase

  • ✓ Identify single points of failure
  • ✓ Plan for redundancy at each layer
  • ✓ Design for graceful degradation
  • ✓ Define SLA requirements early
  • ✓ Plan monitoring and alerting strategy

Implementation Phase

  • ✓ Implement circuit breakers
  • ✓ Add comprehensive logging
  • ✓ Set up health checks
  • ✓ Create deployment automation
  • ✓ Test failure scenarios

🧮 SLA Calculator

Estimate downtime and its cost for a given availability target.

Example: 99.9% availability means 8.76 hours of downtime per year.

  • Downtime/Year: 8.76 hours
  • Revenue Loss: $87,600 (which implies revenue of $10,000/hour)
  • Infra Cost Multiplier: 2x
  • Total Infra Cost: $10,000/month
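
The downtime and revenue figures above can be reproduced with a short calculation. The revenue rate of $10,000/hour is an assumption inferred from the numbers shown ($87,600 / 8.76 hours), and the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def yearly_downtime_cost(availability_percent: float, revenue_per_hour: float):
    """Return (downtime hours per year, revenue lost during that downtime)."""
    downtime_hours = HOURS_PER_YEAR * (1 - availability_percent / 100)
    return downtime_hours, downtime_hours * revenue_per_hour


hours, loss = yearly_downtime_cost(99.9, revenue_per_hour=10_000)
print(f"{hours:.2f} hours/year, ${loss:,.0f} lost revenue")
# prints "8.76 hours/year, $87,600 lost revenue"
```

Comparing this loss against the extra infrastructure cost of the next availability tier gives a rough basis for choosing a target.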

🎯 Failure Response Scenarios

Learn from real outage responses and their business impact

Scenarios

  • Database Primary Failure: Main database goes down, automatic failover to replica
  • Cascading Service Failure: Authentication service overload causes downstream failures
  • Regional Datacenter Outage: Entire AWS region goes down, multi-region setup saves the day
  • Memory Leak in Production: Gradual memory consumption causes rolling server crashes
  • DDoS Attack During Peak Hours: Massive botnet targets API endpoints during Black Friday
  • Configuration Typo in Production: Database connection string typo causes all new connections to fail
  • Third-party Payment Provider Outage: Stripe goes down during checkout processing
  • Kubernetes Node Failure: Worker node hardware failure during traffic spike
  • Cache Stampede After Restart: Redis restart causes all clients to hit database simultaneously

Example: Database Primary Failure

Context

Main database goes down, automatic failover to replica

Metrics

  • Detection Time: 30 seconds
  • Failover Time: 2 minutes
  • Total Downtime: 2.5 minutes
  • Affected Users: 100%

Outcome

Service restored quickly with minimal data loss. With 2.5 minutes of downtime in a 30-day month, monthly availability stays at about 99.994%.

Key Lessons

  • Automated failover is crucial for database reliability
  • Read replicas can keep serving read traffic while the primary is down
  • Monitoring with sub-minute detection prevents longer outages
  • Regular failover testing ensures systems work under pressure

📝 Reliability & Availability Quiz

What is the difference between reliability and availability?