Chaos Engineering
Building resilient systems through controlled failure testing
Chaos Engineering Principles
1. Build a Hypothesis
Define steady-state behavior and predict system response to failures
2. Vary Real-world Events
Simulate realistic failure scenarios from server crashes to network partitions
3. Run in Production
Test against real traffic and conditions for authentic results
4. Automate Continuously
Make chaos experiments part of regular operational practice
5. Minimize Blast Radius
Start small and gradually increase scope to limit potential damage
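The first and last principles can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the class `SteadyStateHypothesis`, the function `next_blast_radius`, the metric name, and all numeric thresholds are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyStateHypothesis:
    """A measurable claim about normal behavior (illustrative names)."""
    metric: str
    baseline: float
    tolerance: float  # allowed relative deviation, e.g. 0.01 for 1%

    def holds(self, observed: float) -> bool:
        """True if the observed value stays within tolerance of the baseline."""
        return abs(observed - self.baseline) <= self.tolerance * abs(self.baseline)


def next_blast_radius(current_pct: float, hypothesis_held: bool,
                      step: float = 2.0, cap_pct: float = 50.0) -> float:
    """Widen the experiment scope only after a passing run.

    Returns the percentage of traffic or hosts to target next; 0.0 means
    stop and investigate before running again.
    """
    if not hypothesis_held:
        return 0.0
    return min(current_pct * step, cap_pct)


# Hypothetical metric and values, chosen only to exercise the helpers.
hypothesis = SteadyStateHypothesis(metric="checkout_success_rate",
                                   baseline=0.99, tolerance=0.01)
held = hypothesis.holds(0.985)
print(f"hypothesis held: {held}")
print(f"next blast radius: {next_blast_radius(5.0, held)}%")
```

Doubling the scope after each passing run and resetting to zero after a failure is one possible policy; a team might prefer fixed stages or manual approval between steps.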
Chaos Engineering Tools & Platforms
| Tool | Target | Failure Types | Best For |
|---|---|---|---|
| Chaos Monkey | AWS EC2 | Instance termination | Getting started with chaos |
| Litmus | Kubernetes | Pod/Node/Network failures | Cloud-native applications |
| Gremlin | Multi-platform | Infrastructure, network, state | Enterprise environments |
| Chaos Toolkit | Any system | Extensible experiments | Custom experiment design |
| Toxiproxy | TCP connections | Latency, timeouts, bandwidth limits | Network-level testing |
| PowerfulSeal | Kubernetes | Pod killing, node failures | Policy-driven chaos |
Chaos Experiment Design
Example: Database Connection Pool Exhaustion
# Chaos Experiment Configuration
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: db-connection-chaos
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["create","delete","get","list","patch","update"]
    image: "litmuschaos/go-runner:latest"
    args:
      - -c
      - ./experiments -name db-connection-exhaustion
    command:
      - /bin/bash
    env:
      - name: TOTAL_CHAOS_DURATION
        value: '300'
      - name: CONNECTION_LIMIT
        value: '10'
      - name: TARGET_PODS
        value: 'app=backend'
      - name: LIB
        value: 'litmus'
Implementation Example: Circuit Breaker Validation
# Python chaos experiment: circuit breaker testing
import concurrent.futures
import time
from dataclasses import dataclass
from typing import Dict, List

import requests


@dataclass
class ExperimentResult:
    success_rate: float
    avg_response_time: float
    circuit_breaker_triggered: bool
    error_distribution: Dict[str, int]


class CircuitBreakerChaosTest:
    def __init__(self, target_service: str, failure_rate: float = 0.5):
        # Assumes the Toxiproxy proxy is named after the service and that
        # requests to http://<target_service> are routed through that proxy.
        self.target_service = target_service
        self.failure_rate = failure_rate
        self.results = []

    def inject_network_latency(self, latency_ms: int):
        """Inject network latency using toxiproxy"""
        toxic_config = {
            "name": "latency_toxic",
            "type": "latency",
            "attributes": {"latency": latency_ms}
        }
        response = requests.post(
            f"http://toxiproxy:8474/proxies/{self.target_service}/toxics",
            json=toxic_config, timeout=5)
        response.raise_for_status()  # fail loudly if the toxic was not created

    def run_load(self, duration: int, request_rate: int) -> ExperimentResult:
        """Send request_rate requests per second for duration seconds"""
        def make_request():
            start = time.time()
            try:
                response = requests.get(f"http://{self.target_service}/health",
                                        timeout=5)
                return {
                    'success': response.status_code == 200,
                    'latency': (time.time() - start) * 1000,
                    'status_code': response.status_code
                }
            except Exception as e:
                # Record the measured time so fast failures are not counted
                # as full timeouts.
                return {
                    'success': False,
                    'latency': (time.time() - start) * 1000,
                    'error': str(e)
                }

        # Generate concurrent load
        with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
            futures = []
            for _ in range(duration * request_rate):
                futures.append(executor.submit(make_request))
                time.sleep(1 / request_rate)
            results = [f.result() for f in concurrent.futures.as_completed(futures)]
        return self.analyze_results(results)

    def simulate_service_failures(self, duration: int, request_rate: int) -> ExperimentResult:
        """Simulate high failure rate to trigger circuit breaker"""
        # Inject latency so calls exceed the breaker's timeout
        # (assumed here to be below 3 seconds).
        self.inject_network_latency(3000)
        return self.run_load(duration, request_rate)

    def analyze_results(self, results: List[Dict]) -> ExperimentResult:
        """Analyze experiment results"""
        total_requests = len(results)
        if total_requests == 0:
            return ExperimentResult(0.0, 0.0, False, {})
        successful_requests = sum(1 for r in results if r['success'])
        success_rate = successful_requests / total_requests
        avg_latency = sum(r['latency'] for r in results) / total_requests

        # Group failures by exception text or HTTP status code
        error_distribution = {}
        for result in results:
            if not result['success']:
                error_type = result.get('error', f"HTTP_{result.get('status_code', 'Unknown')}")
                error_distribution[error_type] = error_distribution.get(error_type, 0) + 1

        # Circuit breaker should show fast failures after initial timeouts.
        # Assumes an open breaker is reported with the marker 'CircuitBreakerOpen'.
        circuit_breaker_triggered = any(
            'CircuitBreakerOpen' in error_type for error_type in error_distribution)
        return ExperimentResult(
            success_rate=success_rate,
            avg_response_time=avg_latency,
            circuit_breaker_triggered=circuit_breaker_triggered,
            error_distribution=error_distribution
        )

    def cleanup_toxics(self):
        """Remove network toxics"""
        requests.delete(
            f"http://toxiproxy:8474/proxies/{self.target_service}/toxics/latency_toxic",
            timeout=5)


# Usage Example
if __name__ == "__main__":
    chaos_test = CircuitBreakerChaosTest("user-service")
    print("🔥 Starting Chaos Experiment: Circuit Breaker Validation")

    # Baseline measurement: load only, no fault injected
    print("📊 Measuring baseline performance...")
    baseline = chaos_test.run_load(duration=60, request_rate=10)
    print(f" Baseline Success Rate: {baseline.success_rate:.2%}")

    # Chaos injection, with cleanup even if the run fails
    print("💥 Injecting network failures...")
    try:
        chaos_result = chaos_test.simulate_service_failures(duration=300, request_rate=20)
    finally:
        chaos_test.cleanup_toxics()

    print("Results:")
    print(f" Success Rate: {chaos_result.success_rate:.2%}")
    print(f" Avg Response Time: {chaos_result.avg_response_time:.1f}ms")
    print(f" Circuit Breaker Triggered: {chaos_result.circuit_breaker_triggered}")
    print(f" Error Distribution: {chaos_result.error_distribution}")

    # Validate hypothesis
    if chaos_result.circuit_breaker_triggered and chaos_result.avg_response_time < 1000:
        print("✅ Hypothesis validated: Circuit breaker is working correctly")
    else:
        print("❌ Hypothesis failed: Circuit breaker needs improvement")
Real-World Case Studies
Netflix
Pioneered chaos engineering with Chaos Monkey. Regularly terminates production instances to ensure services can handle failures gracefully.
Uber
Uses chaos engineering to test microservices resilience during peak demand periods. Validates circuit breakers and fallback mechanisms.
Amazon
Conducts GameDays, simulated outages that test both system resilience and team response. Focuses on dependency failure scenarios.
Best Practices
⚠️ Important Considerations
Safety First
Always have abort mechanisms and never run experiments without proper monitoring and rollback procedures.
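One way to picture an abort mechanism is a guard that watches recent probe outcomes and stops the experiment once the error rate crosses a limit, with rollback running no matter how the experiment ends. The sketch below is an assumption-laden illustration: `AbortGuard`, `run_with_guard`, the window size, and the thresholds are hypothetical, and the probe is simulated rather than calling a real service.

```python
from collections import deque


class AbortGuard:
    """Trips when the recent error rate exceeds a threshold (hypothetical helper)."""

    def __init__(self, max_error_rate: float, window: int = 100, min_samples: int = 20):
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples
        self._outcomes = deque(maxlen=window)  # True = success, False = failure

    def record(self, success: bool) -> None:
        self._outcomes.append(success)

    def should_abort(self) -> bool:
        if len(self._outcomes) < self.min_samples:
            return False  # not enough data to judge yet
        failures = sum(1 for ok in self._outcomes if not ok)
        return failures / len(self._outcomes) > self.max_error_rate


def run_with_guard(probe, rollback, guard: AbortGuard, max_probes: int) -> bool:
    """Run probes until finished or the guard trips; always roll back.

    Returns True if the experiment ran to completion, False if aborted.
    """
    try:
        for _ in range(max_probes):
            guard.record(probe())
            if guard.should_abort():
                return False
        return True
    finally:
        rollback()  # remove the injected fault on every exit path


# Simulated probe: five healthy responses, then sustained failures.
outcomes = iter([True] * 5 + [False] * 20)
guard = AbortGuard(max_error_rate=0.5, window=10, min_samples=5)
completed = run_with_guard(lambda: next(outcomes),
                           lambda: print("fault removed"),
                           guard, max_probes=25)
print("completed" if completed else "aborted")
```

Placing the rollback in a `finally` block means the fault is removed whether the run completes, aborts, or raises an exception.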
Team Preparation
Ensure on-call teams are aware of experiments and have procedures for handling unexpected outcomes.
Customer Impact
Consider customer experience and avoid experiments during peak business hours or critical operations.
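A scheduling check is one simple way to keep experiments out of sensitive periods. The following is a sketch under stated assumptions: the function `in_experiment_window`, the 02:00 to 05:00 window, and the blocked date are made up for illustration, and real schedules would also need to account for time zones.

```python
from datetime import date, datetime, time


def in_experiment_window(now: datetime, start: time, end: time,
                         blocked_dates: frozenset = frozenset()) -> bool:
    """Return True if an experiment may start at `now`.

    The window is [start, end) in the same time zone as `now`; a window
    whose start is later than its end is treated as crossing midnight.
    """
    if now.date() in blocked_dates:
        return False  # e.g. a launch day or a sales event
    current = now.time()
    if start <= end:
        return start <= current < end
    return current >= start or current < end


# Hypothetical policy: run only between 02:00 and 05:00, never on a blocked date.
blocked = frozenset({date(2024, 11, 29)})
print(in_experiment_window(datetime(2024, 1, 10, 3, 0), time(2, 0), time(5, 0), blocked))
print(in_experiment_window(datetime(2024, 1, 10, 14, 0), time(2, 0), time(5, 0), blocked))
```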
Legal & Compliance
Some industries have regulations that may restrict chaos engineering practices in production systems.