Advanced Observability Systems
Design next-generation observability platforms with AI-driven monitoring, distributed tracing, real-time analytics, and intelligent incident response.
What are Advanced Observability Systems?
Advanced observability systems combine traditional monitoring with AI-powered insights, providing comprehensive visibility into complex distributed systems. They unify metrics, logs, traces, and profiles to enable rapid problem detection and resolution.
Modern observability platforms use machine learning for anomaly detection, predictive alerting, and automated root cause analysis; teams that adopt them report reductions in Mean Time to Detection (MTTD) and Mean Time to Recovery (MTTR) of up to 80%.
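To make those two measures concrete, here is a minimal sketch that computes MTTD and MTTR from a handful of incident records. The record fields (started_at, detected_at, resolved_at) are illustrative assumptions, not any particular platform's schema.
from datetime import datetime
from statistics import mean

# Hypothetical incident records; timestamps are illustrative.
incidents = [
    {"started_at": datetime(2024, 5, 1, 10, 0),
     "detected_at": datetime(2024, 5, 1, 10, 4),
     "resolved_at": datetime(2024, 5, 1, 10, 52)},
    {"started_at": datetime(2024, 5, 3, 22, 15),
     "detected_at": datetime(2024, 5, 3, 22, 17),
     "resolved_at": datetime(2024, 5, 3, 23, 5)},
]

# MTTD: average time from fault onset to detection.
mttd_min = mean((i["detected_at"] - i["started_at"]).total_seconds() / 60 for i in incidents)
# MTTR: average time from fault onset to full recovery.
mttr_min = mean((i["resolved_at"] - i["started_at"]).total_seconds() / 60 for i in incidents)
print(f"MTTD: {mttd_min:.1f} min, MTTR: {mttr_min:.1f} min")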
Evolution Beyond the Three Pillars
1. Metrics → Smart Metrics
AI-powered metrics with automatic anomaly detection and predictive alerting.
2. Logs → Structured Events
Structured, searchable events with automatic pattern recognition and clustering.
3. Traces → Service Maps
Visual service topology with dependency analysis and bottleneck identification (a minimal sketch follows this list).
4. Profiles → Continuous Profiling
Always-on profiling with code-level insights and optimization recommendations.
5. Synthetic → User Analytics
Real user monitoring combined with synthetic testing for complete coverage.
6. AIOps → Autonomous Operations
Self-healing systems with automated incident response and remediation.
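Item 3 is straightforward to make concrete: a service map is an aggregation of cross-service parent/child relationships found in spans. The sketch below derives one from a batch of spans; MiniSpan mirrors the Span dataclass used later in this article, and the example services are made up.
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass
class MiniSpan:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    service_name: str

def build_service_map(spans) -> Dict[Tuple[str, str], int]:
    """Count caller -> callee edges whenever a span's parent lives in another service."""
    by_id = {s.span_id: s for s in spans}
    edges: Dict[Tuple[str, str], int] = defaultdict(int)
    for span in spans:
        parent = by_id.get(span.parent_span_id)
        if parent and parent.service_name != span.service_name:
            edges[(parent.service_name, span.service_name)] += 1
    return dict(edges)

spans = [
    MiniSpan("t1", "a", None, "gateway"),
    MiniSpan("t1", "b", "a", "api"),
    MiniSpan("t1", "c", "b", "db"),
]
print(build_service_map(spans))  # {('gateway', 'api'): 1, ('api', 'db'): 1}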
Production Implementation
AI-Powered Observability Engine
import asyncio
import numpy as np
import pandas as pd
from typing import Dict, List, Optional, Tuple
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from sklearn.ensemble import IsolationForest
from sklearn.cluster import DBSCAN
import json
@dataclass
class Metric:
name: str
value: float
timestamp: datetime
labels: Dict[str, str] = field(default_factory=dict)
@dataclass
class LogEvent:
timestamp: datetime
level: str
message: str
service: str
trace_id: Optional[str] = None
attributes: Dict = field(default_factory=dict)
@dataclass
class Span:
trace_id: str
span_id: str
parent_span_id: Optional[str]
operation_name: str
service_name: str
start_time: datetime
end_time: datetime
tags: Dict[str, str] = field(default_factory=dict)
duration_ms: float = 0
@dataclass
class Anomaly:
type: str # 'metric', 'log', 'trace'
severity: str # 'low', 'medium', 'high', 'critical'
description: str
timestamp: datetime
confidence_score: float
affected_services: List[str]
suggested_actions: List[str]
root_cause_hypothesis: Optional[str] = None
class AdvancedObservabilityEngine:
"""AI-powered observability platform with advanced analytics"""
def __init__(self):
self.metrics_store = MetricsStore()
self.log_analyzer = LogAnalyzer()
self.trace_analyzer = TraceAnalyzer()
self.anomaly_detector = AnomalyDetector()
self.root_cause_analyzer = RootCauseAnalyzer()
self.auto_remediator = AutoRemediator()
# ML models for different data types
self.metric_models = {}
self.log_models = {}
self.service_dependency_graph = ServiceDependencyGraph()
async def ingest_telemetry(
self,
metrics: Optional[List[Metric]] = None,
logs: Optional[List[LogEvent]] = None,
spans: Optional[List[Span]] = None
):
"""Ingest and process telemetry data"""
tasks = []
if metrics:
tasks.append(self.process_metrics(metrics))
if logs:
tasks.append(self.process_logs(logs))
if spans:
tasks.append(self.process_traces(spans))
await asyncio.gather(*tasks)
# Run anomaly detection after ingestion
await self.detect_anomalies()
async def process_metrics(self, metrics: List[Metric]):
"""Process incoming metrics with AI analysis"""
for metric in metrics:
# Store metric
await self.metrics_store.store_metric(metric)
# Update or create ML model for this metric
metric_key = f"{metric.name}:{hash(frozenset(metric.labels.items()))}"
if metric_key not in self.metric_models:
self.metric_models[metric_key] = MetricAnomalyModel(metric.name)
# Update model with new data point
await self.metric_models[metric_key].update(metric)
async def process_logs(self, logs: List[LogEvent]):
"""Process logs with pattern recognition"""
for log in logs:
# Extract structured data from unstructured logs
structured_log = await self.log_analyzer.structure_log(log)
# Detect log patterns and anomalies
await self.log_analyzer.analyze_pattern(structured_log)
# Link logs to traces if trace_id present
if log.trace_id:
await self.trace_analyzer.link_log_to_trace(log.trace_id, structured_log)
async def process_traces(self, spans: List[Span]):
"""Process distributed traces with performance analysis"""
# Group spans by trace_id
traces = {}
for span in spans:
if span.trace_id not in traces:
traces[span.trace_id] = []
traces[span.trace_id].append(span)
for trace_id, trace_spans in traces.items():
# Analyze trace performance
trace_analysis = await self.trace_analyzer.analyze_trace(trace_spans)
# Update service dependency graph
await self.service_dependency_graph.update_from_trace(trace_spans)
# Detect trace anomalies (slow requests, errors, etc.)
            await self.detect_trace_anomalies(trace_id, trace_analysis)

    async def detect_trace_anomalies(self, trace_id: str, trace_analysis: Dict):
        """Placeholder for trace-level checks (slow requests, error spikes, etc.)."""
        pass
async def detect_anomalies(self) -> List[Anomaly]:
"""Run comprehensive anomaly detection"""
anomalies = []
# Metric anomalies
metric_anomalies = await self.anomaly_detector.detect_metric_anomalies(
self.metric_models
)
anomalies.extend(metric_anomalies)
# Log anomalies
log_anomalies = await self.log_analyzer.detect_log_anomalies()
anomalies.extend(log_anomalies)
# Service dependency anomalies
dependency_anomalies = await self.service_dependency_graph.detect_anomalies()
anomalies.extend(dependency_anomalies)
# Correlate anomalies across data types
correlated_anomalies = await self.correlate_anomalies(anomalies)
# Run root cause analysis for high-severity anomalies
for anomaly in correlated_anomalies:
if anomaly.severity in ['high', 'critical']:
root_cause = await self.root_cause_analyzer.analyze(anomaly)
anomaly.root_cause_hypothesis = root_cause
# Trigger auto-remediation if configured
await self.auto_remediator.attempt_remediation(anomaly)
return correlated_anomalies
async def correlate_anomalies(self, anomalies: List[Anomaly]) -> List[Anomaly]:
"""Correlate anomalies across different data types"""
# Group anomalies by time window and affected services
time_window = timedelta(minutes=5)
service_groups = {}
for anomaly in anomalies:
# Create time-service key for grouping
time_bucket = anomaly.timestamp.replace(
minute=anomaly.timestamp.minute // 5 * 5,
second=0,
microsecond=0
)
for service in anomaly.affected_services:
key = f"{time_bucket}:{service}"
if key not in service_groups:
service_groups[key] = []
service_groups[key].append(anomaly)
# Create correlated anomalies
correlated = []
processed_anomalies = set()
for group_anomalies in service_groups.values():
if len(group_anomalies) > 1:
# Create a correlated anomaly
primary_anomaly = max(group_anomalies, key=lambda a: a.confidence_score)
if id(primary_anomaly) not in processed_anomalies:
# Enhance with correlation information
primary_anomaly.description += f" (correlated with {len(group_anomalies)-1} other anomalies)"
primary_anomaly.confidence_score = min(1.0, primary_anomaly.confidence_score * 1.2)
correlated.append(primary_anomaly)
processed_anomalies.add(id(primary_anomaly))
else:
# Single anomaly
anomaly = group_anomalies[0]
if id(anomaly) not in processed_anomalies:
correlated.append(anomaly)
processed_anomalies.add(id(anomaly))
return correlated
async def get_service_insights(self, service_name: str, time_range: Tuple[datetime, datetime]) -> Dict:
"""Get comprehensive insights for a specific service"""
start_time, end_time = time_range
# Collect metrics
service_metrics = await self.metrics_store.query_metrics(
service=service_name,
start_time=start_time,
end_time=end_time
)
# Collect logs
service_logs = await self.log_analyzer.query_logs(
service=service_name,
start_time=start_time,
end_time=end_time
)
# Collect traces
service_traces = await self.trace_analyzer.query_traces(
service=service_name,
start_time=start_time,
end_time=end_time
)
# Generate insights
insights = {
"service_name": service_name,
"time_range": [start_time.isoformat(), end_time.isoformat()],
"metrics_summary": self._summarize_metrics(service_metrics),
"error_analysis": await self.log_analyzer.analyze_errors(service_logs),
"performance_analysis": await self.trace_analyzer.analyze_performance(service_traces),
"dependencies": await self.service_dependency_graph.get_service_dependencies(service_name),
"recommendations": await self._generate_recommendations(service_name, service_metrics, service_logs, service_traces),
"health_score": await self._calculate_health_score(service_name, service_metrics, service_logs, service_traces)
}
return insights
def _summarize_metrics(self, metrics: List[Metric]) -> Dict:
"""Summarize metrics with statistical analysis"""
if not metrics:
return {}
# Group metrics by name
metric_groups = {}
for metric in metrics:
if metric.name not in metric_groups:
metric_groups[metric.name] = []
metric_groups[metric.name].append(metric.value)
# Calculate statistics
summary = {}
for name, values in metric_groups.items():
np_values = np.array(values)
summary[name] = {
"count": len(values),
"mean": float(np.mean(np_values)),
"median": float(np.median(np_values)),
"std": float(np.std(np_values)),
"min": float(np.min(np_values)),
"max": float(np.max(np_values)),
"p95": float(np.percentile(np_values, 95)),
"p99": float(np.percentile(np_values, 99))
}
return summary
async def _generate_recommendations(
self,
service_name: str,
metrics: List[Metric],
logs: List[LogEvent],
traces: List[Span]
) -> List[str]:
"""Generate actionable recommendations for service optimization"""
recommendations = []
# Performance recommendations based on traces
if traces:
            durations = [span.duration_ms for span in traces]
            p95_duration = np.percentile(durations, 95)
if p95_duration > 5000: # 5 second p95
recommendations.append(
f"High response times detected (P95: {p95_duration:.0f}ms). "
"Consider adding caching or optimizing database queries."
)
# Error rate recommendations based on logs
error_logs = [log for log in logs if log.level in ['ERROR', 'FATAL']]
if error_logs:
error_rate = len(error_logs) / len(logs) if logs else 0
if error_rate > 0.01: # 1% error rate
recommendations.append(
f"High error rate detected ({error_rate*100:.1f}%). "
"Review recent deployments and add error handling."
)
# Resource utilization recommendations based on metrics
cpu_metrics = [m for m in metrics if 'cpu' in m.name.lower()]
if cpu_metrics:
avg_cpu = np.mean([m.value for m in cpu_metrics])
if avg_cpu > 80:
recommendations.append(
f"High CPU utilization ({avg_cpu:.1f}%). "
"Consider horizontal scaling or performance optimization."
)
return recommendations
async def _calculate_health_score(
self,
service_name: str,
metrics: List[Metric],
logs: List[LogEvent],
traces: List[Span]
) -> float:
"""Calculate overall service health score (0-100)"""
score_components = []
# Error rate component (40% weight)
if logs:
error_logs = [log for log in logs if log.level in ['ERROR', 'FATAL']]
error_rate = len(error_logs) / len(logs)
error_score = max(0, 100 - (error_rate * 1000)) # 1% error = 90 score
score_components.append(('error_rate', error_score, 0.4))
# Performance component (35% weight)
if traces:
p95_duration = np.percentile([span.duration_ms for span in traces], 95)
            # Score based on p95 latency: 2.5 s p95 scores 50, 5 s or more scores 0
            perf_score = max(0, 100 - (p95_duration / 50))
score_components.append(('performance', perf_score, 0.35))
# Resource utilization component (25% weight)
if metrics:
cpu_metrics = [m for m in metrics if 'cpu' in m.name.lower()]
if cpu_metrics:
avg_cpu = np.mean([m.value for m in cpu_metrics])
# Optimal CPU around 70%, penalize high/low usage
cpu_score = 100 - abs(avg_cpu - 70)
score_components.append(('resource', cpu_score, 0.25))
# Calculate weighted average
if not score_components:
return 100.0 # No data = healthy assumption
total_weight = sum(weight for _, _, weight in score_components)
weighted_score = sum(score * weight for _, score, weight in score_components)
return min(100.0, max(0.0, weighted_score / total_weight))
class MetricAnomalyModel:
"""ML model for detecting metric anomalies"""
def __init__(self, metric_name: str):
self.metric_name = metric_name
self.values_buffer = []
self.model = IsolationForest(contamination=0.1, random_state=42)
self.is_trained = False
self.min_samples = 100
async def update(self, metric: Metric):
"""Update model with new metric value"""
self.values_buffer.append({
'value': metric.value,
'timestamp': metric.timestamp,
'labels': metric.labels
})
# Keep rolling buffer
if len(self.values_buffer) > 1000:
self.values_buffer = self.values_buffer[-1000:]
# Train/retrain model periodically
if len(self.values_buffer) >= self.min_samples and len(self.values_buffer) % 50 == 0:
await self._train_model()
async def _train_model(self):
"""Train anomaly detection model"""
if len(self.values_buffer) < self.min_samples:
return
# Prepare features (could include time-based features, moving averages, etc.)
features = []
values = [item['value'] for item in self.values_buffer]
for i, value in enumerate(values):
# Simple features: value, moving average, rate of change
ma_5 = np.mean(values[max(0, i-4):i+1])
rate_change = (value - values[i-1]) if i > 0 else 0
features.append([value, ma_5, rate_change])
# Train model
X = np.array(features)
self.model.fit(X)
self.is_trained = True
async def detect_anomaly(self, metric: Metric) -> Optional[float]:
"""Detect if metric is anomalous, return anomaly score"""
if not self.is_trained or len(self.values_buffer) < 5:
return None
# Prepare features for current metric
values = [item['value'] for item in self.values_buffer[-5:]] + [metric.value]
ma_5 = np.mean(values[-5:])
rate_change = metric.value - values[-2]
features = np.array([[metric.value, ma_5, rate_change]])
# Get anomaly score
anomaly_score = self.model.decision_function(features)[0]
is_anomaly = self.model.predict(features)[0] == -1
return abs(anomaly_score) if is_anomaly else None
# Minimal stubs for the supporting classes so the engine above runs end to end.
# Real implementations would back these with a time-series store, a log index,
# a trace store, and trained models.
class MetricsStore:
    async def store_metric(self, metric: Metric):
        pass
    async def query_metrics(self, service: str, start_time: datetime, end_time: datetime) -> List[Metric]:
        return []

class LogAnalyzer:
    async def structure_log(self, log: LogEvent) -> LogEvent:
        return log
    async def analyze_pattern(self, log):
        pass
    async def detect_log_anomalies(self) -> List[Anomaly]:
        return []
    async def query_logs(self, service: str, start_time: datetime, end_time: datetime) -> List[LogEvent]:
        return []
    async def analyze_errors(self, logs: List[LogEvent]) -> Dict:
        return {}

class TraceAnalyzer:
    async def analyze_trace(self, spans: List[Span]) -> Dict:
        return {}
    async def link_log_to_trace(self, trace_id: str, log):
        pass
    async def query_traces(self, service: str, start_time: datetime, end_time: datetime) -> List[Span]:
        return []
    async def analyze_performance(self, spans: List[Span]) -> Dict:
        return {}

class ServiceDependencyGraph:
    async def update_from_trace(self, spans: List[Span]):
        pass
    async def detect_anomalies(self) -> List[Anomaly]:
        return []
    async def get_service_dependencies(self, service_name: str) -> List[str]:
        return []

class AnomalyDetector:
    async def detect_metric_anomalies(self, models) -> List[Anomaly]:
        return []

class RootCauseAnalyzer:
    async def analyze(self, anomaly: Anomaly) -> Optional[str]:
        return None

class AutoRemediator:
    async def attempt_remediation(self, anomaly: Anomaly):
        pass
# Usage example
async def main():
engine = AdvancedObservabilityEngine()
# Simulate telemetry ingestion
metrics = [
Metric("cpu_usage", 75.5, datetime.utcnow(), {"service": "api", "host": "web-1"}),
Metric("request_duration", 150.2, datetime.utcnow(), {"service": "api", "endpoint": "/users"})
]
logs = [
LogEvent(datetime.utcnow(), "ERROR", "Database connection failed", "api", "trace-123")
]
spans = [
Span("trace-123", "span-1", None, "handle_request", "api",
datetime.utcnow(), datetime.utcnow() + timedelta(milliseconds=150))
]
await engine.ingest_telemetry(metrics=metrics, logs=logs, spans=spans)
# Get service insights
insights = await engine.get_service_insights(
"api",
(datetime.utcnow() - timedelta(hours=1), datetime.utcnow())
)
print(f"Service health score: {insights['health_score']}")
# Run the observability engine
# asyncio.run(main())
Real-time Observability Dashboard
import React, { useState, useEffect } from 'react';
import { AreaChart, Area, XAxis, YAxis, CartesianGrid, Tooltip, ResponsiveContainer, ScatterChart, Scatter } from 'recharts';
interface ServiceHealth {
serviceName: string;
healthScore: number;
errorRate: number;
p95Latency: number;
throughput: number;
cpuUsage: number;
memoryUsage: number;
status: 'healthy' | 'warning' | 'critical';
}
interface Anomaly {
id: string;
type: 'metric' | 'log' | 'trace';
severity: 'low' | 'medium' | 'high' | 'critical';
title: string;
description: string;
timestamp: string;
affectedServices: string[];
confidence: number;
rootCause?: string;
autoRemediated?: boolean;
}
interface TraceData {
traceId: string;
duration: number;
serviceCount: number;
errorCount: number;
timestamp: string;
criticalPath: string[];
}
export const AdvancedObservabilityDashboard: React.FC = () => {
const [services, setServices] = useState<ServiceHealth[]>([]);
const [anomalies, setAnomalies] = useState<Anomaly[]>([]);
const [traces, setTraces] = useState<TraceData[]>([]);
const [selectedService, setSelectedService] = useState<string>('');
const [timeRange, setTimeRange] = useState<'1h' | '4h' | '24h' | '7d'>('4h');
const [realTimeEnabled, setRealTimeEnabled] = useState(true);
// WebSocket connection for real-time updates
useEffect(() => {
if (!realTimeEnabled) return;
const ws = new WebSocket('wss://observability-api.example.com/ws');
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
switch (data.type) {
case 'service_update':
setServices(prev => prev.map(s =>
s.serviceName === data.payload.serviceName
? { ...s, ...data.payload }
: s
));
break;
case 'anomaly_detected':
setAnomalies(prev => [data.payload, ...prev.slice(0, 49)]);
break;
case 'trace_completed':
setTraces(prev => [data.payload, ...prev.slice(0, 99)]);
break;
}
};
return () => ws.close();
}, [realTimeEnabled]);
// Fetch initial data
useEffect(() => {
fetchObservabilityData();
}, [timeRange]);
const fetchObservabilityData = async () => {
try {
const [servicesRes, anomaliesRes, tracesRes] = await Promise.all([
fetch(`/api/observability/services?range=${timeRange}`),
fetch(`/api/observability/anomalies?range=${timeRange}`),
fetch(`/api/observability/traces?range=${timeRange}&limit=100`)
]);
setServices(await servicesRes.json());
setAnomalies(await anomaliesRes.json());
setTraces(await tracesRes.json());
} catch (error) {
console.error('Failed to fetch observability data:', error);
}
};
const getServiceStatusColor = (status: string) => {
switch (status) {
case 'healthy': return 'text-green-600 bg-green-100';
case 'warning': return 'text-yellow-600 bg-yellow-100';
case 'critical': return 'text-red-600 bg-red-100';
default: return 'text-gray-600 bg-gray-100';
}
};
const getSeverityColor = (severity: string) => {
switch (severity) {
case 'low': return 'border-blue-500 bg-blue-50';
case 'medium': return 'border-yellow-500 bg-yellow-50';
case 'high': return 'border-orange-500 bg-orange-50';
case 'critical': return 'border-red-500 bg-red-50';
default: return 'border-gray-500 bg-gray-50';
}
};
const criticalAnomalies = anomalies.filter(a => a.severity === 'critical');
const averageHealthScore = services.reduce((sum, s) => sum + s.healthScore, 0) / services.length || 0;
const totalThroughput = services.reduce((sum, s) => sum + s.throughput, 0);
// Prepare chart data
const healthTrendData = services.map(service => ({
service: service.serviceName,
health: service.healthScore,
errorRate: service.errorRate,
latency: service.p95Latency
}));
const traceLatencyData = traces.slice(0, 50).map(trace => ({
duration: trace.duration,
serviceCount: trace.serviceCount,
errorCount: trace.errorCount,
time: new Date(trace.timestamp).getTime()
}));
return (
<div className="p-6 space-y-6 bg-gray-50 dark:bg-gray-900 min-h-screen">
{/* Header */}
<div className="flex justify-between items-center">
<div>
<h1 className="text-3xl font-bold text-gray-900 dark:text-white">
Advanced Observability
</h1>
<p className="text-gray-600 dark:text-gray-300">
AI-powered monitoring and incident response
</p>
</div>
<div className="flex items-center space-x-4">
{/* Real-time toggle */}
<label className="flex items-center space-x-2">
<input
type="checkbox"
checked={realTimeEnabled}
onChange={(e) => setRealTimeEnabled(e.target.checked)}
className="rounded border-gray-300"
/>
<span className="text-sm text-gray-700 dark:text-gray-300">Real-time</span>
</label>
{/* Time range selector */}
<select
value={timeRange}
onChange={(e) => setTimeRange(e.target.value as any)}
className="px-3 py-2 bg-white dark:bg-gray-800 border border-gray-300 dark:border-gray-600 rounded-lg"
>
<option value="1h">Last Hour</option>
<option value="4h">Last 4 Hours</option>
<option value="24h">Last 24 Hours</option>
<option value="7d">Last 7 Days</option>
</select>
</div>
</div>
{/* Summary Cards */}
<div className="grid grid-cols-1 md:grid-cols-4 gap-6">
<div className="bg-white dark:bg-gray-800 rounded-lg p-6 shadow-lg">
<h3 className="text-sm font-medium text-gray-500 dark:text-gray-400">Avg Health Score</h3>
<p className={`text-2xl font-bold ${
averageHealthScore >= 90 ? 'text-green-600' :
averageHealthScore >= 70 ? 'text-yellow-600' : 'text-red-600'
}`}>
{averageHealthScore.toFixed(1)}%
</p>
</div>
<div className={`rounded-lg p-6 shadow-lg ${
criticalAnomalies.length === 0
? 'bg-green-50 dark:bg-green-900/20'
: 'bg-red-50 dark:bg-red-900/20'
}`}>
<h3 className={`text-sm font-medium ${
criticalAnomalies.length === 0
? 'text-green-600 dark:text-green-400'
: 'text-red-600 dark:text-red-400'
}`}>Critical Issues</h3>
<p className={`text-2xl font-bold ${
criticalAnomalies.length === 0
? 'text-green-900 dark:text-green-100'
: 'text-red-900 dark:text-red-100'
}`}>
{criticalAnomalies.length}
</p>
</div>
<div className="bg-blue-50 dark:bg-blue-900/20 rounded-lg p-6 shadow-lg">
<h3 className="text-sm font-medium text-blue-600 dark:text-blue-400">Total Throughput</h3>
<p className="text-2xl font-bold text-blue-900 dark:text-blue-100">
{(totalThroughput / 1000).toFixed(1)}k RPS
</p>
</div>
<div className="bg-purple-50 dark:bg-purple-900/20 rounded-lg p-6 shadow-lg">
<h3 className="text-sm font-medium text-purple-600 dark:text-purple-400">Services</h3>
<p className="text-2xl font-bold text-purple-900 dark:text-purple-100">
{services.length}
</p>
</div>
</div>
{/* Service Health Overview */}
<div className="bg-white dark:bg-gray-800 rounded-lg p-6 shadow-lg">
<h3 className="text-lg font-semibold mb-4 text-gray-900 dark:text-white">
Service Health Matrix
</h3>
<div className="overflow-x-auto">
<table className="min-w-full divide-y divide-gray-200 dark:divide-gray-700">
<thead className="bg-gray-50 dark:bg-gray-900">
<tr>
<th className="px-6 py-3 text-left text-xs font-medium text-gray-500 dark:text-gray-400 uppercase">
Service
</th>
<th className="px-6 py-3 text-left text-xs font-medium text-gray-500 dark:text-gray-400 uppercase">
Health
</th>
<th className="px-6 py-3 text-left text-xs font-medium text-gray-500 dark:text-gray-400 uppercase">
Error Rate
</th>
<th className="px-6 py-3 text-left text-xs font-medium text-gray-500 dark:text-gray-400 uppercase">
P95 Latency
</th>
<th className="px-6 py-3 text-left text-xs font-medium text-gray-500 dark:text-gray-400 uppercase">
Throughput
</th>
<th className="px-6 py-3 text-left text-xs font-medium text-gray-500 dark:text-gray-400 uppercase">
Resources
</th>
</tr>
</thead>
<tbody className="bg-white dark:bg-gray-800 divide-y divide-gray-200 dark:divide-gray-700">
{services.map((service) => (
<tr
key={service.serviceName}
className="hover:bg-gray-50 dark:hover:bg-gray-700 cursor-pointer"
onClick={() => setSelectedService(service.serviceName)}
>
<td className="px-6 py-4 whitespace-nowrap">
<div className="flex items-center">
<div className={`w-3 h-3 rounded-full mr-2 ${
service.status === 'healthy' ? 'bg-green-500' :
service.status === 'warning' ? 'bg-yellow-500' : 'bg-red-500'
}`}></div>
<span className="text-sm font-medium text-gray-900 dark:text-white">
{service.serviceName}
</span>
</div>
</td>
<td className="px-6 py-4 whitespace-nowrap">
<span className={`inline-flex px-2 py-1 text-xs font-semibold rounded-full ${getServiceStatusColor(service.status)}`}>
{service.healthScore.toFixed(0)}%
</span>
</td>
<td className="px-6 py-4 whitespace-nowrap text-sm text-gray-900 dark:text-white">
{(service.errorRate * 100).toFixed(2)}%
</td>
<td className="px-6 py-4 whitespace-nowrap text-sm text-gray-900 dark:text-white">
{service.p95Latency.toFixed(0)}ms
</td>
<td className="px-6 py-4 whitespace-nowrap text-sm text-gray-900 dark:text-white">
{service.throughput.toFixed(0)} RPS
</td>
<td className="px-6 py-4 whitespace-nowrap text-sm text-gray-900 dark:text-white">
CPU: {service.cpuUsage.toFixed(0)}% | Mem: {service.memoryUsage.toFixed(0)}%
</td>
</tr>
))}
</tbody>
</table>
</div>
</div>
{/* Charts Section */}
<div className="grid grid-cols-1 lg:grid-cols-2 gap-6">
{/* Service Health Trend */}
<div className="bg-white dark:bg-gray-800 rounded-lg p-6 shadow-lg">
<h3 className="text-lg font-semibold mb-4 text-gray-900 dark:text-white">
Health Score by Service
</h3>
<ResponsiveContainer width="100%" height={300}>
<AreaChart data={healthTrendData}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis dataKey="service" />
<YAxis domain={[0, 100]} />
<Tooltip />
<Area type="monotone" dataKey="health" stroke="#10b981" fill="#10b981" fillOpacity={0.3} />
</AreaChart>
</ResponsiveContainer>
</div>
{/* Trace Latency Distribution */}
<div className="bg-white dark:bg-gray-800 rounded-lg p-6 shadow-lg">
<h3 className="text-lg font-semibold mb-4 text-gray-900 dark:text-white">
Trace Latency Distribution
</h3>
<ResponsiveContainer width="100%" height={300}>
<ScatterChart data={traceLatencyData}>
<CartesianGrid strokeDasharray="3 3" />
<XAxis dataKey="serviceCount" name="Services" />
<YAxis dataKey="duration" name="Duration (ms)" />
<Tooltip cursor={{ strokeDasharray: '3 3' }} />
<Scatter dataKey="duration" fill="#8b5cf6" />
</ScatterChart>
</ResponsiveContainer>
</div>
</div>
{/* Recent Anomalies */}
<div className="bg-white dark:bg-gray-800 rounded-lg p-6 shadow-lg">
<h3 className="text-lg font-semibold mb-4 text-gray-900 dark:text-white">
AI-Detected Anomalies
</h3>
<div className="space-y-3">
{anomalies.slice(0, 5).map((anomaly) => (
<div
key={anomaly.id}
className={`border-l-4 pl-4 py-3 ${getSeverityColor(anomaly.severity)}`}
>
<div className="flex justify-between items-start">
<div className="flex-1">
<div className="flex items-center space-x-2 mb-1">
<h4 className="font-semibold text-gray-900 dark:text-white">
{anomaly.title}
</h4>
<span className={`px-2 py-1 text-xs rounded ${
anomaly.severity === 'critical' ? 'bg-red-200 text-red-800' :
anomaly.severity === 'high' ? 'bg-orange-200 text-orange-800' :
anomaly.severity === 'medium' ? 'bg-yellow-200 text-yellow-800' :
'bg-blue-200 text-blue-800'
}`}>
{anomaly.severity.toUpperCase()}
</span>
{anomaly.autoRemediated && (
<span className="px-2 py-1 text-xs bg-green-200 text-green-800 rounded">
AUTO-REMEDIATED
</span>
)}
</div>
<p className="text-sm text-gray-600 dark:text-gray-300 mb-2">
{anomaly.description}
</p>
{anomaly.rootCause && (
<p className="text-sm text-blue-600 dark:text-blue-400 mb-2">
<strong>Root Cause:</strong> {anomaly.rootCause}
</p>
)}
<div className="flex items-center space-x-4 text-xs text-gray-500">
<span>{new Date(anomaly.timestamp).toLocaleString()}</span>
<span>Services: {anomaly.affectedServices.join(', ')}</span>
<span>Confidence: {(anomaly.confidence * 100).toFixed(0)}%</span>
</div>
</div>
</div>
</div>
))}
</div>
</div>
{/* AI Insights Panel */}
<div className="bg-gradient-to-r from-purple-50 to-indigo-50 dark:from-purple-900/20 dark:to-indigo-900/20 rounded-lg p-6 shadow-lg">
<h3 className="text-lg font-semibold mb-4 text-gray-900 dark:text-white flex items-center">
<span className="w-6 h-6 bg-purple-500 rounded-full flex items-center justify-center text-white text-sm mr-2">
AI
</span>
Intelligent Recommendations
</h3>
<div className="grid grid-cols-1 md:grid-cols-3 gap-4">
<div className="bg-white dark:bg-gray-800 rounded-lg p-4">
<h4 className="font-medium text-gray-900 dark:text-white mb-2">Performance Optimization</h4>
<p className="text-sm text-gray-600 dark:text-gray-300">
3 services showing high latency patterns. Consider adding caching layers.
</p>
</div>
<div className="bg-white dark:bg-gray-800 rounded-lg p-4">
<h4 className="font-medium text-gray-900 dark:text-white mb-2">Cost Optimization</h4>
<p className="text-sm text-gray-600 dark:text-gray-300">
Enable intelligent sampling to reduce observability costs by 40%.
</p>
</div>
<div className="bg-white dark:bg-gray-800 rounded-lg p-4">
<h4 className="font-medium text-gray-900 dark:text-white mb-2">Capacity Planning</h4>
<p className="text-sm text-gray-600 dark:text-gray-300">
Traffic growth trend suggests scaling API service within 2 weeks.
</p>
</div>
</div>
</div>
</div>
);
};
Real-World Examples
Netflix
Atlas & Mantis Real-time Analytics
Netflix's observability platform processes 2.5 billion metrics per minute with real-time anomaly detection. Their Atlas time-series database and Mantis stream processing enable sub-second incident detection across 700+ microservices.
Datadog
Watchdog AI & APM
Datadog's Watchdog AI automatically detects anomalies across metrics, logs, and traces using machine learning. It correlates issues across services and provides root cause analysis, reducing MTTR by 50% for customers.
Uber
Observability Platform (M3, Jaeger)
Uber's observability platform handles 100M+ traces per day with M3 metrics and Jaeger distributed tracing. They use ML for intelligent sampling and anomaly detection across their global microservices architecture.
Best Practices
Do's
Implement high-cardinality metrics with intelligent sampling to manage costs
Use distributed tracing to understand service dependencies and bottlenecks
Deploy AI-powered anomaly detection with continuous learning capabilities
Correlate telemetry data across metrics, logs, and traces for root cause analysis
Implement progressive alerting with escalation policies and context-aware notifications (see the sketch below)
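A minimal sketch of the escalation idea above, assuming a hypothetical policy keyed by severity; the channel names and delays are placeholders rather than any specific paging tool's configuration.
from datetime import datetime, timedelta

# Severity -> ordered list of (channel, delay after the alert opens).
ESCALATION_POLICY = {
    "low":      [("slack", timedelta(0))],
    "medium":   [("slack", timedelta(0)), ("email", timedelta(minutes=30))],
    "high":     [("pager", timedelta(0)), ("manager", timedelta(minutes=15))],
    "critical": [("pager", timedelta(0)), ("manager", timedelta(minutes=5)),
                 ("incident_bridge", timedelta(minutes=10))],
}

def due_notifications(severity: str, opened_at: datetime, now: datetime):
    """Return the channels that should have been notified by `now`."""
    age = now - opened_at
    return [channel for channel, delay in ESCALATION_POLICY[severity] if age >= delay]

opened = datetime(2024, 5, 1, 10, 0)
print(due_notifications("critical", opened, opened + timedelta(minutes=7)))
# ['pager', 'manager'] -- the incident bridge would be pulled in at the 10-minute mark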
Don'ts
Don't rely on static thresholds - use dynamic baselines that adapt to traffic patterns
Don't ignore cardinality explosion - it can make your metrics system unusable
Don't create alert fatigue with too many low-value notifications
Don't trace every request in production - use intelligent sampling strategies (sketched after this list)
Don't ignore the cost of observability - it can become 10-30% of infrastructure spend
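To close out the sampling point above, a minimal sketch of intelligent head-based sampling: keep every error and every slow trace, plus a small random fraction of the rest. The 1% base rate and 2000 ms threshold are illustrative assumptions to be tuned per system.
import random

BASE_SAMPLE_RATE = 0.01        # keep ~1% of routine traffic
SLOW_THRESHOLD_MS = 2000.0     # always keep traces slower than this

def should_keep_trace(duration_ms: float, had_error: bool) -> bool:
    if had_error or duration_ms >= SLOW_THRESHOLD_MS:
        return True                             # always keep interesting traces
    return random.random() < BASE_SAMPLE_RATE   # sample the routine majority

kept = sum(should_keep_trace(120.0, False) for _ in range(10_000))
print(f"kept roughly {kept} of 10,000 fast, error-free traces")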