Advanced Observability Systems
Design next-generation observability platforms with AI-driven monitoring, distributed tracing, real-time analytics, and intelligent incident response.
What are Advanced Observability Systems?
Advanced observability systems combine traditional monitoring with AI-powered insights, providing comprehensive visibility into complex distributed systems. They unify metrics, logs, traces, and profiles to enable rapid problem detection and resolution.
Modern observability platforms use machine learning for anomaly detection, predictive alerting, and automated root cause analysis; teams that adopt them report reductions in Mean Time to Detection (MTTD) and Mean Time to Recovery (MTTR) of up to 80%.
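To make those two measures concrete, here is a minimal sketch that computes MTTD and MTTR from a handful of incident records. The record fields (started_at, detected_at, resolved_at) are illustrative assumptions, not any particular platform's schema.
from datetime import datetime
from statistics import mean

# Hypothetical incident records; timestamps are illustrative.
incidents = [
    {"started_at": datetime(2024, 5, 1, 10, 0),
     "detected_at": datetime(2024, 5, 1, 10, 4),
     "resolved_at": datetime(2024, 5, 1, 10, 52)},
    {"started_at": datetime(2024, 5, 3, 22, 15),
     "detected_at": datetime(2024, 5, 3, 22, 17),
     "resolved_at": datetime(2024, 5, 3, 23, 5)},
]

# MTTD: average time from fault onset to detection.
mttd_min = mean((i["detected_at"] - i["started_at"]).total_seconds() / 60 for i in incidents)
# MTTR: average time from fault onset to full recovery.
mttr_min = mean((i["resolved_at"] - i["started_at"]).total_seconds() / 60 for i in incidents)
print(f"MTTD: {mttd_min:.1f} min, MTTR: {mttr_min:.1f} min")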
Evolution Beyond the Three Pillars
1. Metrics → Smart Metrics
AI-powered metrics with automatic anomaly detection and predictive alerting.
2. Logs → Structured Events
Structured, searchable events with automatic pattern recognition and clustering.
3. Traces → Service Maps
Visual service topology with dependency analysis and bottleneck identification (a minimal sketch follows this list).
4. Profiles → Continuous Profiling
Always-on profiling with code-level insights and optimization recommendations.
5. Synthetic → User Analytics
Real user monitoring combined with synthetic testing for complete coverage.
6. AIOps → Autonomous Operations
Self-healing systems with automated incident response and remediation.
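Item 3 is straightforward to make concrete: a service map is an aggregation of cross-service parent/child relationships found in spans. The sketch below derives one from a batch of spans; MiniSpan mirrors the Span dataclass used later in this article, and the example services are made up.
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass
class MiniSpan:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    service_name: str

def build_service_map(spans) -> Dict[Tuple[str, str], int]:
    """Count caller -> callee edges whenever a span's parent lives in another service."""
    by_id = {s.span_id: s for s in spans}
    edges: Dict[Tuple[str, str], int] = defaultdict(int)
    for span in spans:
        parent = by_id.get(span.parent_span_id)
        if parent and parent.service_name != span.service_name:
            edges[(parent.service_name, span.service_name)] += 1
    return dict(edges)

spans = [
    MiniSpan("t1", "a", None, "gateway"),
    MiniSpan("t1", "b", "a", "api"),
    MiniSpan("t1", "c", "b", "db"),
]
print(build_service_map(spans))  # {('gateway', 'api'): 1, ('api', 'db'): 1}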
Production Implementation
AI-Powered Observability Engine
import asyncio
import numpy as np
import pandas as pd
from typing import Dict, List, Optional, Tuple
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from sklearn.ensemble import IsolationForest
from sklearn.cluster import DBSCAN
import json
@dataclass
class Metric:
name: str
value: float
timestamp: datetime
labels: Dict[str, str] = field(default_factory=dict)
@dataclass
class LogEvent:
timestamp: datetime
level: str
message: str
service: str
trace_id: Optional[str] = None
attributes: Dict = field(default_factory=dict)
@dataclass
class Span:
trace_id: str
span_id: str
parent_span_id: Optional[str]
operation_name: str
service_name: str
start_time: datetime
end_time: datetime
tags: Dict[str, str] = field(default_factory=dict)
duration_ms: float = 0
@dataclass
class Anomaly:
type: str # 'metric', 'log', 'trace'
severity: str # 'low', 'medium', 'high', 'critical'
description: str
timestamp: datetime
confidence_score: float
affected_services: List[str]
suggested_actions: List[str]
root_cause_hypothesis: Optional[str] = None
class AdvancedObservabilityEngine:
"""AI-powered observability platform with advanced analytics"""
def __init__(self):
self.metrics_store = MetricsStore()
self.log_analyzer = LogAnalyzer()
self.trace_analyzer = TraceAnalyzer()
self.anomaly_detector = AnomalyDetector()
self.root_cause_analyzer = RootCauseAnalyzer()
self.auto_remediator = AutoRemediator()
# ML models for different data types
self.metric_models = {}
self.log_models = {}
self.service_dependency_graph = ServiceDependencyGraph()
async def ingest_telemetry(
self,
metrics: Optional[List[Metric]] = None,
logs: Optional[List[LogEvent]] = None,
spans: Optional[List[Span]] = None
):
"""Ingest and process telemetry data"""
tasks = []
if metrics:
tasks.append(self.process_metrics(metrics))
if logs:
tasks.append(self.process_logs(logs))
if spans:
tasks.append(self.process_traces(spans))
await asyncio.gather(*tasks)
# Run anomaly detection after ingestion
await self.detect_anomalies()
async def process_metrics(self, metrics: List[Metric]):
"""Process incoming metrics with AI analysis"""
for metric in metrics:
# Store metric
await self.metrics_store.store_metric(metric)
# Update or create ML model for this metric
metric_key = f"{metric.name}:{hash(frozenset(metric.labels.items()))}"
if metric_key not in self.metric_models:
self.metric_models[metric_key] = MetricAnomalyModel(metric.name)
# Update model with new data point
await self.metric_models[metric_key].update(metric)
async def process_logs(self, logs: List[LogEvent]):
"""Process logs with pattern recognition"""
for log in logs:
# Extract structured data from unstructured logs
structured_log = await self.log_analyzer.structure_log(log)
# Detect log patterns and anomalies
await self.log_analyzer.analyze_pattern(structured_log)
# Link logs to traces if trace_id present
if log.trace_id:
await self.trace_analyzer.link_log_to_trace(log.trace_id, structured_log)
async def process_traces(self, spans: List[Span]):
"""Process distributed traces with performance analysis"""
# Group spans by trace_id
traces = {}
for span in spans:
if span.trace_id not in traces:
traces[span.trace_id] = []
traces[span.trace_id].append(span)
for trace_id, trace_spans in traces.items():
# Analyze trace performance
trace_analysis = await self.trace_analyzer.analyze_trace(trace_spans)
# Update service dependency graph
await self.service_dependency_graph.update_from_trace(trace_spans)
# Detect trace anomalies (slow requests, errors, etc.)
            await self.detect_trace_anomalies(trace_id, trace_analysis)

    async def detect_trace_anomalies(self, trace_id: str, trace_analysis: Dict):
        """Placeholder for trace-level checks (slow requests, error spikes, etc.)."""
        pass
async def detect_anomalies(self) -> List[Anomaly]:
"""Run comprehensive anomaly detection"""
anomalies = []
# Metric anomalies
metric_anomalies = await self.anomaly_detector.detect_metric_anomalies(
self.metric_models
)
anomalies.extend(metric_anomalies)
# Log anomalies
log_anomalies = await self.log_analyzer.detect_log_anomalies()
anomalies.extend(log_anomalies)
# Service dependency anomalies
dependency_anomalies = await self.service_dependency_graph.detect_anomalies()
anomalies.extend(dependency_anomalies)
# Correlate anomalies across data types
correlated_anomalies = await self.correlate_anomalies(anomalies)
# Run root cause analysis for high-severity anomalies
for anomaly in correlated_anomalies:
if anomaly.severity in ['high', 'critical']:
root_cause = await self.root_cause_analyzer.analyze(anomaly)
anomaly.root_cause_hypothesis = root_cause
# Trigger auto-remediation if configured
await self.auto_remediator.attempt_remediation(anomaly)
return correlated_anomalies
async def correlate_anomalies(self, anomalies: List[Anomaly]) -> List[Anomaly]:
"""Correlate anomalies across different data types"""
# Group anomalies by time window and affected services
time_window = timedelta(minutes=5)
service_groups = {}
for anomaly in anomalies:
# Create time-service key for grouping
time_bucket = anomaly.timestamp.replace(
minute=anomaly.timestamp.minute // 5 * 5,
second=0,
microsecond=0
)
for service in anomaly.affected_services:
key = f"{time_bucket}:{service}"
if key not in service_groups:
service_groups[key] = []
service_groups[key].append(anomaly)
# Create correlated anomalies
correlated = []
processed_anomalies = set()
for group_anomalies in service_groups.values():
if len(group_anomalies) > 1:
# Create a correlated anomaly
primary_anomaly = max(group_anomalies, key=lambda a: a.confidence_score)
if id(primary_anomaly) not in processed_anomalies:
# Enhance with correlation information
primary_anomaly.description += f" (correlated with {len(group_anomalies)-1} other anomalies)"
primary_anomaly.confidence_score = min(1.0, primary_anomaly.confidence_score * 1.2)
correlated.append(primary_anomaly)
processed_anomalies.add(id(primary_anomaly))
else:
# Single anomaly
anomaly = group_anomalies[0]
if id(anomaly) not in processed_anomalies:
correlated.append(anomaly)
processed_anomalies.add(id(anomaly))
return correlated
async def get_service_insights(self, service_name: str, time_range: Tuple[datetime, datetime]) -> Dict:
"""Get comprehensive insights for a specific service"""
start_time, end_time = time_range
# Collect metrics
service_metrics = await self.metrics_store.query_metrics(
service=service_name,
start_time=start_time,
end_time=end_time
)
# Collect logs
service_logs = await self.log_analyzer.query_logs(
service=service_name,
start_time=start_time,
end_time=end_time
)
# Collect traces
service_traces = await self.trace_analyzer.query_traces(
service=service_name,
start_time=start_time,
end_time=end_time
)
# Generate insights
insights = {
"service_name": service_name,
"time_range": [start_time.isoformat(), end_time.isoformat()],
"metrics_summary": self._summarize_metrics(service_metrics),
"error_analysis": await self.log_analyzer.analyze_errors(service_logs),
"performance_analysis": await self.trace_analyzer.analyze_performance(service_traces),
"dependencies": await self.service_dependency_graph.get_service_dependencies(service_name),
"recommendations": await self._generate_recommendations(service_name, service_metrics, service_logs, service_traces),
"health_score": await self._calculate_health_score(service_name, service_metrics, service_logs, service_traces)
}
return insights
def _summarize_metrics(self, metrics: List[Metric]) -> Dict:
"""Summarize metrics with statistical analysis"""
if not metrics:
return {}
# Group metrics by name
metric_groups = {}
for metric in metrics:
if metric.name not in metric_groups:
metric_groups[metric.name] = []
metric_groups[metric.name].append(metric.value)
# Calculate statistics
summary = {}
for name, values in metric_groups.items():
np_values = np.array(values)
summary[name] = {
"count": len(values),
"mean": float(np.mean(np_values)),
"median": float(np.median(np_values)),
"std": float(np.std(np_values)),
"min": float(np.min(np_values)),
"max": float(np.max(np_values)),
"p95": float(np.percentile(np_values, 95)),
"p99": float(np.percentile(np_values, 99))
}
return summary
async def _generate_recommendations(
self,
service_name: str,
metrics: List[Metric],
logs: List[LogEvent],
traces: List[Span]
) -> List[str]:
"""Generate actionable recommendations for service optimization"""
recommendations = []
# Performance recommendations based on traces
if traces:
            durations = [span.duration_ms for span in traces]
            p95_duration = np.percentile(durations, 95)
if p95_duration > 5000: # 5 second p95
recommendations.append(
f"High response times detected (P95: {p95_duration:.0f}ms). "
"Consider adding caching or optimizing database queries."
)
# Error rate recommendations based on logs
error_logs = [log for log in logs if log.level in ['ERROR', 'FATAL']]
if error_logs:
error_rate = len(error_logs) / len(logs) if logs else 0
if error_rate > 0.01: # 1% error rate
recommendations.append(
f"High error rate detected ({error_rate*100:.1f}%). "
"Review recent deployments and add error handling."
)
# Resource utilization recommendations based on metrics
cpu_metrics = [m for m in metrics if 'cpu' in m.name.lower()]
if cpu_metrics:
avg_cpu = np.mean([m.value for m in cpu_metrics])
if avg_cpu > 80:
recommendations.append(
f"High CPU utilization ({avg_cpu:.1f}%). "
"Consider horizontal scaling or performance optimization."
)
return recommendations
async def _calculate_health_score(
self,
service_name: str,
metrics: List[Metric],
logs: List[LogEvent],
traces: List[Span]
) -> float:
"""Calculate overall service health score (0-100)"""
score_components = []
# Error rate component (40% weight)
if logs:
error_logs = [log for log in logs if log.level in ['ERROR', 'FATAL']]
error_rate = len(error_logs) / len(logs)
error_score = max(0, 100 - (error_rate * 1000)) # 1% error = 90 score
score_components.append(('error_rate', error_score, 0.4))
# Performance component (35% weight)
if traces:
p95_duration = np.percentile([span.duration_ms for span in traces], 95)
            # Score based on p95 latency: 2.5 s p95 scores 50, 5 s or more scores 0
            perf_score = max(0, 100 - (p95_duration / 50))
score_components.append(('performance', perf_score, 0.35))
# Resource utilization component (25% weight)
if metrics:
cpu_metrics = [m for m in metrics if 'cpu' in m.name.lower()]
if cpu_metrics:
avg_cpu = np.mean([m.value for m in cpu_metrics])
# Optimal CPU around 70%, penalize high/low usage
cpu_score = 100 - abs(avg_cpu - 70)
score_components.append(('resource', cpu_score, 0.25))
# Calculate weighted average
if not score_components:
return 100.0 # No data = healthy assumption
total_weight = sum(weight for _, _, weight in score_components)
weighted_score = sum(score * weight for _, score, weight in score_components)
return min(100.0, max(0.0, weighted_score / total_weight))
class MetricAnomalyModel:
"""ML model for detecting metric anomalies"""
def __init__(self, metric_name: str):
self.metric_name = metric_name
self.values_buffer = []
self.model = IsolationForest(contamination=0.1, random_state=42)
self.is_trained = False
self.min_samples = 100
async def update(self, metric: Metric):
"""Update model with new metric value"""
self.values_buffer.append({
'value': metric.value,
'timestamp': metric.timestamp,
'labels': metric.labels
})
# Keep rolling buffer
if len(self.values_buffer) > 1000:
self.values_buffer = self.values_buffer[-1000:]
# Train/retrain model periodically
if len(self.values_buffer) >= self.min_samples and len(self.values_buffer) % 50 == 0:
await self._train_model()
async def _train_model(self):
"""Train anomaly detection model"""
if len(self.values_buffer) < self.min_samples:
return
# Prepare features (could include time-based features, moving averages, etc.)
features = []
values = [item['value'] for item in self.values_buffer]
for i, value in enumerate(values):
# Simple features: value, moving average, rate of change
ma_5 = np.mean(values[max(0, i-4):i+1])
rate_change = (value - values[i-1]) if i > 0 else 0
features.append([value, ma_5, rate_change])
# Train model
X = np.array(features)
self.model.fit(X)
self.is_trained = True
async def detect_anomaly(self, metric: Metric) -> Optional[float]:
"""Detect if metric is anomalous, return anomaly score"""
if not self.is_trained or len(self.values_buffer) < 5:
return None
# Prepare features for current metric
values = [item['value'] for item in self.values_buffer[-5:]] + [metric.value]
ma_5 = np.mean(values[-5:])
rate_change = metric.value - values[-2]
features = np.array([[metric.value, ma_5, rate_change]])
# Get anomaly score
anomaly_score = self.model.decision_function(features)[0]
is_anomaly = self.model.predict(features)[0] == -1
return abs(anomaly_score) if is_anomaly else None
# Minimal stubs for the supporting classes so the engine above runs end to end.
# Real implementations would back these with a time-series store, a log index,
# a trace store, and trained models.
class MetricsStore:
    async def store_metric(self, metric: Metric):
        pass
    async def query_metrics(self, service: str, start_time: datetime, end_time: datetime) -> List[Metric]:
        return []

class LogAnalyzer:
    async def structure_log(self, log: LogEvent) -> LogEvent:
        return log
    async def analyze_pattern(self, log):
        pass
    async def detect_log_anomalies(self) -> List[Anomaly]:
        return []
    async def query_logs(self, service: str, start_time: datetime, end_time: datetime) -> List[LogEvent]:
        return []
    async def analyze_errors(self, logs: List[LogEvent]) -> Dict:
        return {}

class TraceAnalyzer:
    async def analyze_trace(self, spans: List[Span]) -> Dict:
        return {}
    async def link_log_to_trace(self, trace_id: str, log):
        pass
    async def query_traces(self, service: str, start_time: datetime, end_time: datetime) -> List[Span]:
        return []
    async def analyze_performance(self, spans: List[Span]) -> Dict:
        return {}

class ServiceDependencyGraph:
    async def update_from_trace(self, spans: List[Span]):
        pass
    async def detect_anomalies(self) -> List[Anomaly]:
        return []
    async def get_service_dependencies(self, service_name: str) -> List[str]:
        return []

class AnomalyDetector:
    async def detect_metric_anomalies(self, models) -> List[Anomaly]:
        return []

class RootCauseAnalyzer:
    async def analyze(self, anomaly: Anomaly) -> Optional[str]:
        return None

class AutoRemediator:
    async def attempt_remediation(self, anomaly: Anomaly):
        pass
# Usage example
async def main():
engine = AdvancedObservabilityEngine()
# Simulate telemetry ingestion
metrics = [
Metric("cpu_usage", 75.5, datetime.utcnow(), {"service": "api", "host": "web-1"}),
Metric("request_duration", 150.2, datetime.utcnow(), {"service": "api", "endpoint": "/users"})
]
logs = [
LogEvent(datetime.utcnow(), "ERROR", "Database connection failed", "api", "trace-123")
]
spans = [
Span("trace-123", "span-1", None, "handle_request", "api",
datetime.utcnow(), datetime.utcnow() + timedelta(milliseconds=150))
]
await engine.ingest_telemetry(metrics=metrics, logs=logs, spans=spans)
# Get service insights
insights = await engine.get_service_insights(
"api",
(datetime.utcnow() - timedelta(hours=1), datetime.utcnow())
)
print(f"Service health score: {insights['health_score']}")
# Run the observability engine
# asyncio.run(main())
Real-time Observability Dashboard
import React, { useState, useEffect } from 'react';
import { AreaChart, Area, XAxis, YAxis, CartesianGrid, Tooltip, ResponsiveContainer, ScatterChart, Scatter } from 'recharts';
interface ServiceHealth {
serviceName: string;
healthScore: number;
errorRate: number;
p95Latency: number;
throughput: number;
cpuUsage: number;
memoryUsage: number;
status: 'healthy' | 'warning' | 'critical';
}
interface Anomaly {
id: string;
type: 'metric' | 'log' | 'trace';
severity: 'low' | 'medium' | 'high' | 'critical';
title: string;
description: string;
timestamp: string;
affectedServices: string[];
confidence: number;
rootCause?: string;
autoRemediated?: boolean;
}
interface TraceData {
traceId: string;
duration: number;
serviceCount: number;
errorCount: number;
timestamp: string;
criticalPath: string[];
}
export const AdvancedObservabilityDashboard: React.FC = () => {
const [services, setServices] = useState<ServiceHealth[]>([]);
const [anomalies, setAnomalies] = useState<Anomaly[]>([]);
const [traces, setTraces] = useState<TraceData[]>([]);
const [selectedService, setSelectedService] = useState<string>('');
const [timeRange, setTimeRange] = useState<'1h' | '4h' | '24h' | '7d'>('4h');
const [realTimeEnabled, setRealTimeEnabled] = useState(true);
// WebSocket connection for real-time updates
useEffect(() => {
if (!realTimeEnabled) return;
const ws = new WebSocket('wss://observability-api.example.com/ws');
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
switch (data.type) {
case 'service_update':
setServices(prev => prev.map(s =>
s.serviceName === data.payload.serviceName
? { ...s, ...data.payload }
: s
));
break;
case 'anomaly_detected':
setAnomalies(prev => [data.payload, ...prev.slice(0, 49)]);
break;
case 'trace_completed':
setTraces(prev => [data.payload, ...prev.slice(0, 99)]);
break;
}
};
return () => ws.close();
}, [realTimeEnabled]);
// Fetch initial data
useEffect(() => {
fetchObservabilityData();
}, [timeRange]);
const fetchObservabilityData = async () => {
try {
const [servicesRes, anomaliesRes, tracesRes] = await Promise.all([
fetch(`/api/observability/services?range=${timeRange}`),
fetch(`/api/observability/anomalies?range=${timeRange}`),
fetch(`/api/observability/traces?range=${timeRange}&limit=100`)
]);
setServices(await servicesRes.json());
setAnomalies(await anomaliesRes.json());
setTraces(await tracesRes.json());
} catch (error) {
console.error('Failed to fetch observability data:', error);
}
};
const getServiceStatusColor = (status: string) => {
switch (status) {
case 'healthy': return 'text-green-600 bg-green-100';
case 'warning': return 'text-yellow-600 bg-yellow-100';
case 'critical': return 'text-red-600 bg-red-100';
default: return 'text-gray-600 bg-gray-100';
}
};
const getSeverityColor = (severity: string) => {
switch (severity) {
case 'low': return 'border-blue-500 bg-blue-50';
case 'medium': return 'border-yellow-500 bg-yellow-50';
case 'high': return 'border-orange-500 bg-orange-50';
case 'critical': return 'border-red-500 bg-red-50';
default: return 'border-gray-500 bg-gray-50';
}
};
const criticalAnomalies = anomalies.filter(a => a.severity === 'critical');
const averageHealthScore = services.reduce((sum, s) => sum + s.healthScore, 0) / services.length || 0;
const totalThroughput = services.reduce((sum, s) => sum + s.throughput, 0);
// Prepare chart data
const healthTrendData = services.map(service => ({
service: service.serviceName,
health: service.healthScore,
errorRate: service.errorRate,
latency: service.p95Latency
}));
const traceLatencyData = traces.slice(0, 50).map(trace => ({
duration: trace.duration,
serviceCount: trace.serviceCount,
errorCount: trace.errorCount,
time: new Date(trace.timestamp).getTime()
}));
return (
<div className="p-6 space-y-6 bg-gray-50 dark:bg-gray-900 min-h-screen">
{/* Header */}
<div className="flex justify-between items-center">
<div>
<h1 className="text-3xl font-bold text-gray-900 dark:text-white">
Advanced Observability
</h1>
<p className="text-gray-600 dark:text-gray-300">
AI-powered monitoring and incident response
</p>
</div>
<div className="flex items-center space-x-4">
{/* Real-time toggle */}
<label className="flex items-center space-x-2">
<input
type="checkbox"
checked={realTimeEnabled}
onChange={(e) => setRealTimeEnabled(e.target.checked)}
className="rounded border-gray-300"
/>
<span className="text-sm text-gray-700 dark:text-gray-300">Real-time</span>
</label>
{/* Time range selector */}
<select
value={timeRange}
onChange={(e) => setTimeRange(e.target.value as any)}
className="px-3 py-2 bg-white dark:bg-gray-800 border border-gray-300 dark:border-gray-600 rounded-lg"
>
<option value="1h">Last Hour</option>
<option value="4h">Last 4 Hours</option>
<option value="24h">Last 24 Hours</option>
<option value="7d">Last 7 Days</option>
</select>
</div>
</div>
{/* Summary Cards */}
<div className="grid grid-cols-1 md:grid-cols-4 gap-6">
<div className="bg-white dark:bg-gray-800 rounded-lg p-6 shadow-lg">
<h3 className="text-sm font-medium text-gray-500 dark:text-gray-400">Avg Health Score</h3>
<p className={`text-2xl font-bold ${
averageHealthScore >= 90 ? 'text-green-600' :
averageHealthScore >= 70 ? 'text-yellow-600' : 'text-red-600'
}`}>
{averageHealthScore.toFixed(1)}%
</p>
</div>
<div className={`rounded-lg p-6 shadow-lg ${
criticalAnomalies.length === 0
? 'bg-green-50 dark:bg-green-900/20'
: 'bg-red-50 dark:bg-red-900/20'
}`}>
<h3 className={`text-sm font-medium ${
criticalAnomalies.length === 0
? 'text-green-600 dark:text-green-400'
: 'text-red-600 dark:text-red-400'
}`}>Critical Issues</h3>
<p className={`text-2xl font-bold ${
criticalAnomalies.length === 0
? 'text-green-900 dark:text-green-100'
: 'text-red-900 dark:text-red-100'
}`}>
{criticalAnomalies.length}
</p>
</div>
<div className="bg-blue-50 dark:bg-blue-900/20 rounded-lg p-6 shadow-lg">
<h3 className="text-sm font-medium text-blue-600 dark:text-blue-400">Total Throughput</h3>
<p className="text-2xl font-bold text-blue-900 dark:text-blue-100">
{(totalThroughput / 1000).toFixed(1)}k RPS
</p>
</div>
<div className="bg-purple-50 dark:bg-purple-900/20 rounded-lg p-6 shadow-lg">
<h3 className="text-sm font-medium text-purple-600 dark:text-purple-400">Services</h3>
<p className="text-2xl font-bold text-purple-900 dark:text-purple-100">
{services.length}
</p>
</div>
</div>
{/* Service Health Overview */}
<div className="bg-white dark:bg-gray-800 rounded-lg p-6 shadow-lg">
<h3 className="text-lg font-semibold mb-4 text-gray-900 dark:text-white">
Service Health Matrix
</h3>
<div className="overflow-x-auto">
<table className="min-w-full divide-y divide-gray-200 dark:divide-gray-700">
<thead className="bg-gray-50 dark:bg-gray-900">
<tr>
<th className="px-6 py-3 text-left text-xs font-medium text-gray-500 dark:text-gray-400 uppercase">
Service
</th>
<th className="px-6 py-3 text-left text-xs font-medium text-gray-500 dark:text-gray-400 uppercase">
Health
</th>
<th className="px-6 py-3 text-left text-xs font-medium text-gray-500 dark:text-gray-400 uppercase">
Error Rate
</th>
<th className="px-6 py-3 text-left text-xs font-medium text-gray-500 dark:text-gray-400 uppercase">
P95 Latency
</th>
<th className="px-6 py-3 text-left text-xs font-medium text-gray-500 dark:text-gray-400 uppercase">
Throughput
</th>
<th className="px-6 py-3 text-left text-xs font-medium text-gray-500 dark:text-gray-400 uppercase">
Resources
</th>
</tr>
</thead>
<tbody className="bg-white dark:bg-gray-800 divide-y divide-gray-200 dark:divide-gray-700">
{services.map((service) => (
<tr
key={service.serviceName}
className="hover:bg-gray-50 dark:hover:bg-gray-700 cursor-pointer"
onClick={() => setSelectedService(service.serviceName)}
>
<td className="px-6 py-4 whitespace-nowrap">
<div className="flex items-center">
<div className={`w-3 h-3 rounded-full mr-2 ${
service.status === 'healthy' ? 'bg-green-500' :
service.status === 'warning' ? 'bg-yellow-500' : 'bg-red-500'
}`}></div>
<span className="text-sm font-medium text-gray-900 dark:text-white">
{service.serviceName}
</span>
</div>
</td>
<td className="px-6 py-4 whitespace-nowrap">
<span className={`inline-flex px-2 py-1 text-xs font-semibold rounded-full ${getServiceStatusColor(service.status)}`}>
{service.healthScore.toFixed(0)}%
</span>
</td>
<td className="px-6 py-4 whitespace-nowrap text-sm text-gray-900 dark:text-white">
{(service.errorRate * 100).toFixed(2)}%
</td>
<td className="px-6 py-4 whitespace-nowrap text-sm text-gray-900 dark:text-white">
{service.p95Latency.toFixed(0)}ms
</td>
<td className="px-6 py-4 whitespace-nowrap text-sm text-gray-900 dark:text-white">
{service.throughput.toFixed(0)} RPS
</td>
<td className="px-6 py-4 whitespace-nowrap text-sm text-gray-900 dark:text-white">
CPU: {service.cpuUsage.toFixed(0)}% | Mem: {service.memoryUsage.toFixed(0)}%
</td>
</tr>
))}
</tbody>
</table>
</div>
</div>
{/* Charts Section */}
<div className="grid grid-cols-1 lg:grid-cols-2 gap-6">
{/* Service Health Trend */}
<div className="bg-white dark:bg-gray-800 rounded-lg p-6 shadow-lg">
<h3 className="text-lg font-semibold mb-4 text-gray-900 dark:text-white">
Health Score by Service
</h3>
<ResponsiveContainer width="100%" height={300}>
<AreaChart data={healthTrendData}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis dataKey="service" />
<YAxis domain={[0, 100]} />
<Tooltip />
<Area type="monotone" dataKey="health" stroke="#10b981" fill="#10b981" fillOpacity={0.3} />
</AreaChart>
</ResponsiveContainer>
</div>
{/* Trace Latency Distribution */}
<div className="bg-white dark:bg-gray-800 rounded-lg p-6 shadow-lg">
<h3 className="text-lg font-semibold mb-4 text-gray-900 dark:text-white">
Trace Latency Distribution
</h3>
<ResponsiveContainer width="100%" height={300}>
<ScatterChart data={traceLatencyData}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis dataKey="serviceCount" name="Services" />
<YAxis dataKey="duration" name="Duration (ms)" />
<Tooltip cursor={{ strokeDasharray: '3 3' }} />
<Scatter dataKey="duration" fill="#8b5cf6" />
</ScatterChart>
</ResponsiveContainer>
</div>
</div>
{/* Recent Anomalies */}
<div className="bg-white dark:bg-gray-800 rounded-lg p-6 shadow-lg">
<h3 className="text-lg font-semibold mb-4 text-gray-900 dark:text-white">
AI-Detected Anomalies
</h3>
<div className="space-y-3">
{anomalies.slice(0, 5).map((anomaly) => (
<div
key={anomaly.id}
className={`border-l-4 pl-4 py-3 ${getSeverityColor(anomaly.severity)}`}
>
<div className="flex justify-between items-start">
<div className="flex-1">
<div className="flex items-center space-x-2 mb-1">
<h4 className="font-semibold text-gray-900 dark:text-white">
{anomaly.title}
</h4>
<span className={`px-2 py-1 text-xs rounded ${
anomaly.severity === 'critical' ? 'bg-red-200 text-red-800' :
anomaly.severity === 'high' ? 'bg-orange-200 text-orange-800' :
anomaly.severity === 'medium' ? 'bg-yellow-200 text-yellow-800' :
'bg-blue-200 text-blue-800'
}`}>
{anomaly.severity.toUpperCase()}
</span>
{anomaly.autoRemediated && (
<span className="px-2 py-1 text-xs bg-green-200 text-green-800 rounded">
AUTO-REMEDIATED
</span>
)}
</div>
<p className="text-sm text-gray-600 dark:text-gray-300 mb-2">
{anomaly.description}
</p>
{anomaly.rootCause && (
<p className="text-sm text-blue-600 dark:text-blue-400 mb-2">
<strong>Root Cause:</strong> {anomaly.rootCause}
</p>
)}
<div className="flex items-center space-x-4 text-xs text-gray-500">
<span>{new Date(anomaly.timestamp).toLocaleString()}</span>
<span>Services: {anomaly.affectedServices.join(', ')}</span>
<span>Confidence: {(anomaly.confidence * 100).toFixed(0)}%</span>
</div>
</div>
</div>
</div>
))}
</div>
</div>
{/* AI Insights Panel */}
<div className="bg-gradient-to-r from-purple-50 to-indigo-50 dark:from-purple-900/20 dark:to-indigo-900/20 rounded-lg p-6 shadow-lg">
<h3 className="text-lg font-semibold mb-4 text-gray-900 dark:text-white flex items-center">
<span className="w-6 h-6 bg-purple-500 rounded-full flex items-center justify-center text-white text-sm mr-2">
AI
</span>
Intelligent Recommendations
</h3>
<div className="grid grid-cols-1 md:grid-cols-3 gap-4">
<div className="bg-white dark:bg-gray-800 rounded-lg p-4">
<h4 className="font-medium text-gray-900 dark:text-white mb-2">Performance Optimization</h4>
<p className="text-sm text-gray-600 dark:text-gray-300">
3 services showing high latency patterns. Consider adding caching layers.
</p>
</div>
<div className="bg-white dark:bg-gray-800 rounded-lg p-4">
<h4 className="font-medium text-gray-900 dark:text-white mb-2">Cost Optimization</h4>
<p className="text-sm text-gray-600 dark:text-gray-300">
Enable intelligent sampling to reduce observability costs by 40%.
</p>
</div>
<div className="bg-white dark:bg-gray-800 rounded-lg p-4">
<h4 className="font-medium text-gray-900 dark:text-white mb-2">Capacity Planning</h4>
<p className="text-sm text-gray-600 dark:text-gray-300">
Traffic growth trend suggests scaling API service within 2 weeks.
</p>
</div>
</div>
</div>
</div>
);
};
Real-World Examples
Netflix
Atlas & Mantis Real-time Analytics
Netflix's observability platform processes 2.5 billion metrics per minute with real-time anomaly detection. Their Atlas time-series database and Mantis stream processing enable sub-second incident detection across 700+ microservices.
Datadog
Watchdog AI & APM
Datadog's Watchdog AI automatically detects anomalies across metrics, logs, and traces using machine learning. It correlates issues across services and provides root cause analysis, reducing MTTR by 50% for customers.
Uber
Observability Platform (M3, Jaeger)
Uber's observability platform handles 100M+ traces per day with M3 metrics and Jaeger distributed tracing. They use ML for intelligent sampling and anomaly detection across their global microservices architecture.
Best Practices
Do's
Implement high-cardinality metrics with intelligent sampling to manage costs
Use distributed tracing to understand service dependencies and bottlenecks
Deploy AI-powered anomaly detection with continuous learning capabilities
Correlate telemetry data across metrics, logs, and traces for root cause analysis
Implement progressive alerting with escalation policies and context-aware notifications (see the sketch below)
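A minimal sketch of the escalation idea above, assuming a hypothetical policy keyed by severity; the channel names and delays are placeholders rather than any specific paging tool's configuration.
from datetime import datetime, timedelta

# Severity -> ordered list of (channel, delay after the alert opens).
ESCALATION_POLICY = {
    "low":      [("slack", timedelta(0))],
    "medium":   [("slack", timedelta(0)), ("email", timedelta(minutes=30))],
    "high":     [("pager", timedelta(0)), ("manager", timedelta(minutes=15))],
    "critical": [("pager", timedelta(0)), ("manager", timedelta(minutes=5)),
                 ("incident_bridge", timedelta(minutes=10))],
}

def due_notifications(severity: str, opened_at: datetime, now: datetime):
    """Return the channels that should have been notified by `now`."""
    age = now - opened_at
    return [channel for channel, delay in ESCALATION_POLICY[severity] if age >= delay]

opened = datetime(2024, 5, 1, 10, 0)
print(due_notifications("critical", opened, opened + timedelta(minutes=7)))
# ['pager', 'manager'] -- the incident bridge would be pulled in at the 10-minute mark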
Don'ts
Don't rely on static thresholds - use dynamic baselines that adapt to traffic patterns
Don't ignore cardinality explosion - it can make your metrics system unusable
Don't create alert fatigue with too many low-value notifications
Don't trace every request in production - use intelligent sampling strategies (sketched after this list)
Don't ignore the cost of observability - it can become 10-30% of infrastructure spend
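To close out the sampling point above, a minimal sketch of intelligent head-based sampling: keep every error and every slow trace, plus a small random fraction of the rest. The 1% base rate and 2000 ms threshold are illustrative assumptions to be tuned per system.
import random

BASE_SAMPLE_RATE = 0.01        # keep ~1% of routine traffic
SLOW_THRESHOLD_MS = 2000.0     # always keep traces slower than this

def should_keep_trace(duration_ms: float, had_error: bool) -> bool:
    if had_error or duration_ms >= SLOW_THRESHOLD_MS:
        return True                             # always keep interesting traces
    return random.random() < BASE_SAMPLE_RATE   # sample the routine majority

kept = sum(should_keep_trace(120.0, False) for _ in range(10_000))
print(f"kept roughly {kept} of 10,000 fast, error-free traces")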