Chaos Engineering
Master building resilient systems through controlled failure testing and systematic reliability experiments
Chaos Engineering Principles
1. Build a Hypothesis
Define steady-state behavior in terms of measurable output, then predict how the system will respond to a failure
2. Vary Real-world Events
Simulate realistic failure scenarios from server crashes to network partitions
3. Run in Production
Test against real traffic and conditions for authentic results
4. Automate Continuously
Make chaos experiments part of regular operational practice
5. Minimize Blast Radius
Start small and gradually increase scope to limit potential damage
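The first and last principles can be sketched together: state the steady-state hypothesis as a measurable threshold, then widen the blast radius one stage at a time and stop as soon as the hypothesis fails. This is a minimal illustration, not a prescribed implementation. `SteadyState`, `run_staged_experiment`, and `fake_measure` are hypothetical names, and `fake_measure` stands in for real fault injection and metric collection.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SteadyState:
    """Hypothesis: the success rate stays at or above a threshold."""
    min_success_rate: float

    def holds(self, success_rate: float) -> bool:
        return success_rate >= self.min_success_rate


def run_staged_experiment(
    hypothesis: SteadyState,
    stages: List[float],
    measure: Callable[[float], float],
) -> List[float]:
    """Run the experiment at each blast-radius stage, smallest first.

    `stages` are fractions of traffic or hosts exposed to the fault.
    `measure(fraction)` injects the fault at that scope and returns the
    observed success rate. Returns the stages that passed and stops at
    the first stage where the hypothesis no longer holds.
    """
    passed = []
    for fraction in sorted(stages):
        observed = measure(fraction)
        if not hypothesis.holds(observed):
            break  # Hypothesis disproved: do not widen the blast radius.
        passed.append(fraction)
    return passed


def fake_measure(fraction: float) -> float:
    """Hypothetical stand-in: success degrades linearly with scope."""
    return 1.0 - 0.1 * fraction


if __name__ == "__main__":
    hypothesis = SteadyState(min_success_rate=0.95)
    stages = [0.01, 0.05, 0.25, 1.0]
    print(run_staged_experiment(hypothesis, stages, fake_measure))
```

In a real experiment, `measure` would call the fault-injection tool and read the success rate from monitoring. The stop-on-first-failure loop is what keeps a disproved hypothesis from reaching a larger share of traffic.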
Chaos Engineering Tools & Platforms
Tool | Target | Failure Types | Best For |
---|---|---|---|
Chaos Monkey | AWS EC2 | Instance termination | Getting started with chaos |
Litmus | Kubernetes | Pod/Node/Network failures | Cloud-native applications |
Gremlin | Multi-platform | Infrastructure, network, state | Enterprise environments |
Chaos Toolkit | Any system | Extensible experiments | Custom experiment design |
Toxiproxy | Network | Latency, timeouts, corruption | Network-level testing |
PowerfulSeal | Kubernetes | Pod killing, node failures | Policy-driven chaos |
Chaos Experiment Design
Example: Database Connection Pool Exhaustion
    # Chaos Experiment Configuration
    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosExperiment
    metadata:
      name: db-connection-chaos
    spec:
      definition:
        scope: Namespaced
        permissions:
          - apiGroups: [""]
            resources: ["pods"]
            verbs: ["create","delete","get","list","patch","update"]
        image: "litmuschaos/go-runner:latest"
        args:
          - -c
          - ./experiments -name db-connection-exhaustion
        command:
          - /bin/bash
        env:
          - name: TOTAL_CHAOS_DURATION
            value: '300'
          - name: CONNECTION_LIMIT
            value: '10'
          - name: TARGET_PODS
            value: 'app=backend'
          - name: LIB
            value: 'litmus'
Implementation Example: Circuit Breaker Validation
    # Python Chaos Experiment: Circuit Breaker Testing
    import concurrent.futures
    import time
    from dataclasses import dataclass
    from typing import Dict, List

    import requests

    # Toxiproxy HTTP API; 8474 is its default API port.
    TOXIPROXY_API = "http://toxiproxy:8474"


    @dataclass
    class ExperimentResult:
        success_rate: float
        avg_response_time: float
        circuit_breaker_triggered: bool
        error_distribution: Dict[str, int]


    class CircuitBreakerChaosTest:
        def __init__(self, target_service: str, failure_rate: float = 0.5):
            # target_service is used both as the Toxiproxy proxy name and as the
            # hostname for requests; traffic must route through that proxy for
            # the toxic to have any effect.
            self.target_service = target_service
            self.failure_rate = failure_rate
            self.results = []

        def inject_network_latency(self, latency_ms: int):
            """Inject network latency using toxiproxy"""
            toxic_config = {
                "name": "latency_toxic",
                "type": "latency",
                "attributes": {"latency": latency_ms},
            }
            response = requests.post(
                f"{TOXIPROXY_API}/proxies/{self.target_service}/toxics",
                json=toxic_config,
                timeout=5,
            )
            response.raise_for_status()

        def simulate_service_failures(self, duration: int, request_rate: int,
                                      latency_ms: int = 3000):
            """Generate load; latency_ms > 0 injects latency first, 0 measures a baseline"""
            def make_request():
                start = time.time()
                try:
                    response = requests.get(f"http://{self.target_service}/health", timeout=5)
                    return {
                        'success': response.status_code == 200,
                        'latency': (time.time() - start) * 1000,
                        'status_code': response.status_code,
                    }
                except Exception as e:
                    # Record the measured time so fast failures are not counted as timeouts
                    return {
                        'success': False,
                        'latency': (time.time() - start) * 1000,
                        'error': str(e),
                    }

            # Inject latency to push calls toward their timeouts
            if latency_ms > 0:
                self.inject_network_latency(latency_ms)

            # Generate concurrent load, pacing submissions at request_rate per second
            with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
                futures = []
                for _ in range(duration * request_rate):
                    futures.append(executor.submit(make_request))
                    time.sleep(1 / request_rate)
                results = [f.result() for f in concurrent.futures.as_completed(futures)]

            return self.analyze_results(results)

        def analyze_results(self, results: List[Dict]) -> ExperimentResult:
            """Analyze experiment results"""
            total_requests = len(results)
            if total_requests == 0:
                raise ValueError("no requests were made")
            successful_requests = sum(1 for r in results if r['success'])
            success_rate = successful_requests / total_requests
            avg_latency = sum(r['latency'] for r in results) / total_requests

            # Check if circuit breaker pattern is working
            error_distribution = {}
            for result in results:
                if not result['success']:
                    error_type = result.get('error', f"HTTP_{result.get('status_code', 'Unknown')}")
                    error_distribution[error_type] = error_distribution.get(error_type, 0) + 1

            # Circuit breaker should show fast failures after initial timeouts.
            # Assumes an open breaker surfaces this marker in the error text.
            circuit_breaker_triggered = any(
                'CircuitBreakerOpen' in error_type for error_type in error_distribution
            )

            return ExperimentResult(
                success_rate=success_rate,
                avg_response_time=avg_latency,
                circuit_breaker_triggered=circuit_breaker_triggered,
                error_distribution=error_distribution,
            )

        def cleanup_toxics(self):
            """Remove network toxics"""
            requests.delete(
                f"{TOXIPROXY_API}/proxies/{self.target_service}/toxics/latency_toxic",
                timeout=5,
            )


    # Usage Example
    if __name__ == "__main__":
        chaos_test = CircuitBreakerChaosTest("user-service")
        print("🔥 Starting Chaos Experiment: Circuit Breaker Validation")

        # Baseline measurement (no fault injected)
        print("📊 Measuring baseline performance...")
        baseline = chaos_test.simulate_service_failures(duration=60, request_rate=10, latency_ms=0)

        # Chaos injection; always remove the toxic, even if the run fails
        print("💥 Injecting network failures...")
        try:
            chaos_result = chaos_test.simulate_service_failures(duration=300, request_rate=20)
        finally:
            chaos_test.cleanup_toxics()

        print("Results:")
        print(f" Baseline Success Rate: {baseline.success_rate:.2%}")
        print(f" Success Rate: {chaos_result.success_rate:.2%}")
        print(f" Avg Response Time: {chaos_result.avg_response_time:.1f}ms")
        print(f" Circuit Breaker Triggered: {chaos_result.circuit_breaker_triggered}")
        print(f" Error Distribution: {chaos_result.error_distribution}")

        # Validate hypothesis
        if chaos_result.circuit_breaker_triggered and chaos_result.avg_response_time < 1000:
            print("✅ Hypothesis validated: Circuit breaker is working correctly")
        else:
            print("❌ Hypothesis failed: Circuit breaker needs improvement")
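The experiment above expects fast failures carrying a `CircuitBreakerOpen` marker once the breaker trips. For context, here is a minimal sketch of the behavior being validated, assuming a simple consecutive-failure threshold and a fixed reset timeout. The `CircuitBreaker` class and its parameters are hypothetical and not taken from any particular library; production breakers usually add rolling windows and concurrency control.

```python
import time
from typing import Callable, Optional


class CircuitBreakerOpen(Exception):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # Injectable so tests need not sleep.
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], object]) -> object:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of waiting on a slow dependency.
                raise CircuitBreakerOpen("CircuitBreakerOpen")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # Open, or re-open after a failed trial.
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2, reset_timeout=5.0)

    def slow_dependency():
        raise TimeoutError("dependency timed out")

    for attempt in range(4):
        try:
            breaker.call(slow_dependency)
        except CircuitBreakerOpen:
            print(f"attempt {attempt}: rejected fast, circuit open")
        except TimeoutError:
            print(f"attempt {attempt}: dependency timed out")
```

The first two attempts reach the dependency and time out; later attempts are rejected immediately. That shift from slow failures to fast ones is what the hypothesis check on average response time is looking for.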
Real-World Case Studies
Netflix
Pioneered chaos engineering with Chaos Monkey. Regularly terminates production instances to ensure services can handle failures gracefully.
Uber
Uses chaos engineering to test microservices resilience during peak demand periods. Validates circuit breakers and fallback mechanisms.
Amazon
Conducts GameDays: simulated outages that test both system resilience and team response. Focuses on dependency failure scenarios.
Best Practices
⚠️ Important Considerations
Safety First
Always have abort mechanisms and never run experiments without proper monitoring and rollback procedures.
Team Preparation
Ensure on-call teams are aware of experiments and have procedures for handling unexpected outcomes.
Customer Impact
Consider customer experience and avoid experiments during peak business hours or critical operations.
Legal & Compliance
Some industries have regulations that may restrict chaos engineering practices in production systems.
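Two of the considerations above, abort mechanisms with guaranteed rollback and avoiding peak hours, lend themselves to a small guard around each experiment. The sketch below is one possible shape under stated assumptions: `guarded_experiment`, `in_blackout`, and `ExperimentAborted` are hypothetical names, blackout windows are same-day time ranges, and the error rate is supplied by the caller rather than read from a real monitoring system.

```python
from contextlib import contextmanager
from datetime import datetime, time as dtime
from typing import Callable, Iterator, List, Tuple


class ExperimentAborted(Exception):
    """Raised when a guardrail stops the experiment."""


def in_blackout(now: datetime, windows: List[Tuple[dtime, dtime]]) -> bool:
    """True if `now` falls inside any same-day (start, end) window."""
    return any(start <= now.time() < end for start, end in windows)


@contextmanager
def guarded_experiment(
    rollback: Callable[[], None],
    now: datetime,
    blackout_windows: List[Tuple[dtime, dtime]],
) -> Iterator[Callable[[float, float], None]]:
    # Refuse to start during peak hours; nothing has been injected yet,
    # so no rollback is needed on this path.
    if in_blackout(now, blackout_windows):
        raise ExperimentAborted("inside a blackout window")

    def check(error_rate: float, max_error_rate: float) -> None:
        """Abort as soon as the observed error rate exceeds the limit."""
        if error_rate > max_error_rate:
            raise ExperimentAborted(f"error rate {error_rate:.1%} exceeds limit")

    try:
        yield check
    finally:
        rollback()  # Runs on success, abort, or any unexpected error.


if __name__ == "__main__":
    peak_hours = [(dtime(9, 0), dtime(18, 0))]
    try:
        with guarded_experiment(
            rollback=lambda: print("fault removed"),
            now=datetime(2024, 1, 1, 22, 0),
            blackout_windows=peak_hours,
        ) as check:
            check(0.02, 0.05)  # Within limits: experiment continues.
            check(0.20, 0.05)  # Over the limit: abort.
    except ExperimentAborted as exc:
        print(f"aborted: {exc}")
```

Putting the rollback in a `finally` clause means cleanup does not depend on the experiment code reaching its last line, which matters most when the run ends unexpectedly.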