📊 Why Monitor GenAI Systems?
GenAI systems behave like black boxes and bring monitoring challenges that traditional services don't. Unlike conventional APIs, LLM responses are non-deterministic, costs scale with token usage and can fluctuate wildly, and quality degradation is often subtle but high-impact.
Reality Check: A single misconfigured prompt can burn through $10K+ in hours. Quality issues may go unnoticed until user complaints spike. Monitoring isn't optional.
⚠️ Without Monitoring
- • Runaway costs with no early warning
- • Quality degradation goes unnoticed
- • Performance issues affect users
- • No visibility into usage patterns
✅ With Comprehensive Monitoring
- • Proactive cost and performance alerts
- • Real-time quality and safety tracking
- • Detailed usage analytics
- • Predictable, optimized operations
🔧 Monitoring Frameworks
LLM Observability Stack
Comprehensive monitoring with OpenTelemetry and custom metrics
- • Full request tracing
- • Custom LLM metrics
- • Integration ready
- • Production scalable
Implementation
from dataclasses import dataclass
import time

from opentelemetry import metrics, trace
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.metrics import MeterProvider

# Initialize OpenTelemetry: export metrics through the Prometheus reader
# (expose them with prometheus_client.start_http_server() in your app entrypoint).
# Configure a TracerProvider + exporter in your app setup to actually export spans.
metrics.set_meter_provider(MeterProvider(metric_readers=[PrometheusMetricReader()]))
tracer = trace.get_tracer("llm.service")
meter = metrics.get_meter("llm.metrics")

# Define custom metrics
request_counter = meter.create_counter(
    "llm_requests_total",
    description="Total number of LLM requests"
)
token_histogram = meter.create_histogram(
    "llm_tokens_used",
    description="Token usage distribution"
)
latency_histogram = meter.create_histogram(
    "llm_request_duration_seconds",
    description="LLM request duration"
)


@dataclass
class LLMMetrics:
    model: str
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    duration: float
    cost: float
    status: str
    user_id: str


class LLMObservability:
    def __init__(self):
        self.active_requests = {}

    async def trace_llm_request(self, request_id: str, model: str, prompt: str, **kwargs):
        """Trace an LLM request with comprehensive metrics."""
        with tracer.start_as_current_span("llm_request") as span:
            # Set span attributes
            span.set_attribute("llm.model", model)
            span.set_attribute("llm.request_id", request_id)
            span.set_attribute("llm.prompt_length", len(prompt))

            start_time = time.time()
            try:
                # Make the LLM request
                response = await self._make_llm_request(model, prompt, **kwargs)

                # Calculate metrics
                duration = time.time() - start_time
                llm_metrics = LLMMetrics(
                    model=model,
                    prompt_tokens=response.get('prompt_tokens', 0),
                    completion_tokens=response.get('completion_tokens', 0),
                    total_tokens=response.get('total_tokens', 0),
                    duration=duration,
                    cost=self._calculate_cost(response),
                    status='success',
                    user_id=kwargs.get('user_id', 'anonymous')
                )

                # Record metrics
                self._record_metrics(llm_metrics)
                span.set_attribute("llm.status", "success")
                return response

            except Exception as e:
                # Record error metrics
                duration = time.time() - start_time
                error_metrics = LLMMetrics(
                    model=model, prompt_tokens=0, completion_tokens=0,
                    total_tokens=0, duration=duration, cost=0.0,
                    status='error', user_id=kwargs.get('user_id', 'anonymous')
                )
                self._record_metrics(error_metrics)
                span.set_attribute("llm.status", "error")
                span.set_attribute("llm.error", str(e))
                raise

    async def _make_llm_request(self, model: str, prompt: str, **kwargs) -> dict:
        """Placeholder: wire this to your actual LLM provider client."""
        raise NotImplementedError

    def _calculate_cost(self, response: dict) -> float:
        """Placeholder: map token counts onto your provider's pricing."""
        return 0.0

    def _record_metrics(self, llm_metrics: LLMMetrics):
        """Record metrics to the monitoring system."""
        # Increment request counter
        request_counter.add(1, {
            "model": llm_metrics.model,
            "status": llm_metrics.status,
            "user_id": llm_metrics.user_id
        })
        # Record token usage
        token_histogram.record(llm_metrics.total_tokens, {
            "model": llm_metrics.model,
            "type": "total"
        })
        # Record latency
        latency_histogram.record(llm_metrics.duration, {
            "model": llm_metrics.model,
            "status": llm_metrics.status
        })
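To see how this class is used end to end, here is a minimal wiring sketch. It assumes you subclass LLMObservability and implement _make_llm_request against an OpenAI-style async client; the OpenAIObservability name, the model name, and handle_user_query are illustrative, not part of the stack above.

import asyncio
import uuid

from openai import AsyncOpenAI  # assumption: OpenAI's v1 Python SDK; swap in your provider


class OpenAIObservability(LLMObservability):
    """Hypothetical subclass that wires tracing to a concrete provider."""

    def __init__(self, client: AsyncOpenAI):
        super().__init__()
        self.client = client

    async def _make_llm_request(self, model: str, prompt: str, **kwargs) -> dict:
        # Call the provider and normalize its response into the dict shape
        # trace_llm_request expects (prompt_tokens, completion_tokens, total_tokens).
        completion = await self.client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        usage = completion.usage
        return {
            "text": completion.choices[0].message.content,
            "prompt_tokens": usage.prompt_tokens,
            "completion_tokens": usage.completion_tokens,
            "total_tokens": usage.total_tokens,
        }


async def handle_user_query(obs: OpenAIObservability, prompt: str, user_id: str):
    # A unique request ID lets traces, logs, and metrics be correlated later.
    return await obs.trace_llm_request(
        request_id=str(uuid.uuid4()),
        model="gpt-4o-mini",  # model name is illustrative
        prompt=prompt,
        user_id=user_id,
    )


if __name__ == "__main__":
    observability = OpenAIObservability(AsyncOpenAI())
    print(asyncio.run(handle_user_query(observability, "Summarize our Q3 results.", "user-123")))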
📈 Essential Metrics
Performance Metrics
Latency (including tail percentiles such as P95), throughput, and response quality indicators such as relevance scores and error rates
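If you want these numbers without standing up a full metrics backend, a small in-process window can approximate them. This is a sketch, not a substitute for the histogram metrics above; the PerformanceWindow name and the nearest-rank percentile math are choices made here, not a prescribed approach.

import math
import time
from collections import deque


class PerformanceWindow:
    """Keeps the most recent requests in memory and derives basic performance metrics."""

    def __init__(self, max_size: int = 1000):
        # Each record is (timestamp, duration_seconds, success)
        self.records = deque(maxlen=max_size)

    def observe(self, duration: float, success: bool) -> None:
        self.records.append((time.time(), duration, success))

    def snapshot(self) -> dict:
        if not self.records:
            return {"p50_latency_s": 0.0, "p95_latency_s": 0.0,
                    "throughput_rpm": 0.0, "success_rate": 0.0}
        durations = sorted(d for _, d, _ in self.records)
        n = len(durations)
        p95_index = min(math.ceil(0.95 * n) - 1, n - 1)  # nearest-rank P95
        window_s = max(time.time() - self.records[0][0], 1e-6)
        return {
            "p50_latency_s": durations[n // 2],
            "p95_latency_s": durations[p95_index],
            "throughput_rpm": n / window_s * 60,
            "success_rate": sum(1 for _, _, ok in self.records if ok) / n,
        }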
💻 Implementation Examples
# Basic LLM Monitoring Setup
import json
import logging
import time
from typing import Any, Dict


class BasicLLMMonitor:
    def __init__(self):
        self.logger = logging.getLogger("llm_monitor")
        self.metrics = {
            'requests': 0,
            'total_tokens': 0,
            'total_cost': 0.0,
            'errors': 0
        }

    def log_request(self, model: str, prompt_tokens: int,
                    completion_tokens: int, cost: float,
                    duration: float, success: bool):
        """Log basic request metrics."""
        self.metrics['requests'] += 1
        self.metrics['total_tokens'] += prompt_tokens + completion_tokens
        self.metrics['total_cost'] += cost
        if not success:
            self.metrics['errors'] += 1

        # Emit a structured (JSON) log line
        self.logger.info(json.dumps({
            'event': 'llm_request',
            'model': model,
            'prompt_tokens': prompt_tokens,
            'completion_tokens': completion_tokens,
            'total_tokens': prompt_tokens + completion_tokens,
            'cost': cost,
            'duration': duration,
            'success': success,
            'timestamp': time.time()
        }))

    def get_summary(self) -> Dict[str, Any]:
        """Get a monitoring summary."""
        if self.metrics['requests'] > 0:
            error_rate = (self.metrics['errors'] / self.metrics['requests']) * 100
            avg_cost_per_request = self.metrics['total_cost'] / self.metrics['requests']
            avg_tokens_per_request = self.metrics['total_tokens'] / self.metrics['requests']
        else:
            error_rate = 0
            avg_cost_per_request = 0
            avg_tokens_per_request = 0

        return {
            'total_requests': self.metrics['requests'],
            'total_cost': self.metrics['total_cost'],
            'total_tokens': self.metrics['total_tokens'],
            'error_rate': error_rate,
            'avg_cost_per_request': avg_cost_per_request,
            'avg_tokens_per_request': avg_tokens_per_request
        }
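A quick usage sketch for the monitor above; the token counts, cost, and model name are illustrative placeholders rather than real measurements.

# Illustrative wiring; place the log_request call next to your real provider call.
monitor = BasicLLMMonitor()

start = time.time()
# response = call_your_llm(...)   # placeholder for the actual provider call
monitor.log_request(
    model="gpt-4o-mini",          # model name is illustrative
    prompt_tokens=420,
    completion_tokens=180,
    cost=0.0009,                  # compute from your provider's pricing
    duration=time.time() - start,
    success=True,
)

print(monitor.get_summary())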
🚨 Smart Alerting Strategies
Critical Alerts
- • Cost spikes: >200% hourly increase
- • High error rates: >5% failed requests
- • Response latency: P95 >10 seconds
- • Safety violations: Any content policy breach
Warning Alerts
- • Budget burn rate: >80% daily budget
- • Quality degradation: Low relevance scores
- • Usage spikes: 3x normal traffic
- • Cache miss rate: >30%
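One way to encode the thresholds listed above is as declarative rules evaluated against a periodic metrics snapshot. The AlertRule structure and the snapshot key names below are assumptions to illustrate the idea, not a required schema.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AlertRule:
    name: str
    severity: str                                  # "critical" or "warning"
    condition: Callable[[Dict[str, float]], bool]  # fires when True
    message: str


# Snapshot keys are assumptions; adapt them to whatever your pipeline emits.
ALERT_RULES: List[AlertRule] = [
    AlertRule("cost_spike", "critical",
              lambda m: m["hourly_cost_change_pct"] > 200,
              "Hourly cost increased by more than 200%"),
    AlertRule("high_error_rate", "critical",
              lambda m: m["error_rate_pct"] > 5,
              "More than 5% of requests are failing"),
    AlertRule("slow_responses", "critical",
              lambda m: m["p95_latency_seconds"] > 10,
              "P95 latency exceeds 10 seconds"),
    AlertRule("budget_burn", "warning",
              lambda m: m["daily_budget_used_pct"] > 80,
              "More than 80% of the daily budget is spent"),
    AlertRule("usage_spike", "warning",
              lambda m: m["traffic_vs_baseline"] > 3,
              "Traffic is more than 3x the normal baseline"),
    AlertRule("cache_misses", "warning",
              lambda m: m["cache_miss_rate_pct"] > 30,
              "Cache miss rate exceeds 30%"),
]


def evaluate_alerts(snapshot: Dict[str, float]) -> List[AlertRule]:
    """Return the rules that fire for the current metrics snapshot."""
    return [rule for rule in ALERT_RULES if rule.condition(snapshot)]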
📱 Dashboard Design Principles
Executive View
- • High-level KPIs
- • Cost trends
- • Usage growth
- • ROI metrics
Operations View
- • Real-time performance
- • Error rates & types
- • System health
- • Alert status
Engineering View
- • Detailed traces
- • Performance bottlenecks
- • Resource utilization
- • Debug information
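These views stay consistent when they are defined as code, so each audience's panels are versioned and reviewable. The layout below is purely illustrative; the panel names are assumptions that would map onto whatever your metrics backend actually exposes.

# Hypothetical "dashboards as code" layout for the three stakeholder views.
DASHBOARDS = {
    "executive": {
        "refresh": "1d",
        "panels": ["total_cost_trend", "usage_growth", "cost_per_successful_interaction"],
    },
    "operations": {
        "refresh": "30s",
        "panels": ["p95_latency", "error_rate_by_type", "active_alerts", "system_health"],
    },
    "engineering": {
        "refresh": "10s",
        "panels": ["trace_explorer", "slowest_requests", "token_usage_by_model", "debug_logs"],
    },
}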
✅ Best Practices
Monitoring Strategy
- ✓ Start with business metrics
- ✓ Use structured logging consistently
- ✓ Implement distributed tracing
- ✓ Set up automated anomaly detection (see the sketch after this list)
- ✓ Monitor both technical and business KPIs
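As a concrete example of the anomaly-detection practice above, a rolling baseline with a z-score test is one simple starting point. Production systems often need seasonality-aware models, so treat this as a sketch with assumed window and threshold values.

import statistics
from collections import deque


class CostAnomalyDetector:
    """Flags hourly cost values that deviate sharply from a rolling baseline."""

    def __init__(self, window_hours: int = 24 * 7, z_threshold: float = 3.0):
        self.history = deque(maxlen=window_hours)   # one cost value per hour
        self.z_threshold = z_threshold

    def check(self, hourly_cost: float) -> bool:
        """Return True if the new value looks anomalous against the baseline."""
        is_anomaly = False
        if len(self.history) >= 12:                 # wait for some baseline data
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history)
            if stdev > 0 and (hourly_cost - mean) / stdev > self.z_threshold:
                is_anomaly = True
        self.history.append(hourly_cost)
        return is_anomaly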
Alert Management
- ✓ Minimize false positives with smart thresholds
- ✓ Use escalation chains for different severities
- ✓ Include context in alert messages
- ✓ Set up alert fatigue prevention (see the deduplication sketch after this list)
- ✓ Test alert channels regularly
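For contextual messages and fatigue prevention, a minimal sketch might look like this; the cooldown length and the AlertDeduplicator/format_alert names are assumptions.

import time
from typing import Dict


class AlertDeduplicator:
    """Suppresses repeat notifications for the same alert within a cooldown window."""

    def __init__(self, cooldown_seconds: int = 1800):
        self.cooldown = cooldown_seconds
        self.last_sent: Dict[str, float] = {}

    def should_notify(self, alert_name: str) -> bool:
        now = time.time()
        if now - self.last_sent.get(alert_name, 0.0) < self.cooldown:
            return False            # still in cooldown: drop the duplicate
        self.last_sent[alert_name] = now
        return True


def format_alert(rule_name: str, severity: str, value: float, threshold: float) -> str:
    """Include enough context for the on-call engineer to act without digging."""
    return (f"[{severity.upper()}] {rule_name}: current value {value:.2f} "
            f"exceeds threshold {threshold:.2f}")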
🎯 Key Takeaways
Monitor from day one: GenAI costs and quality issues compound quickly
Focus on business metrics: Cost per successful interaction matters more than raw latency
Use structured logging: Consistent data enables powerful analytics and alerting
Implement smart alerts: Prevent cost surprises with proactive anomaly detection
Design for stakeholders: Different views for executives, ops, and engineering teams