GenAI Monitoring & Observability

Implement comprehensive monitoring, alerting, and observability for production GenAI systems.

40 min read · Advanced

📊 Why Monitor GenAI Systems?

GenAI systems are black boxes with unique monitoring challenges. Unlike traditional APIs, LLM responses vary, costs fluctuate wildly, and quality degradation can be subtle but impactful.

Reality Check: A single misconfigured prompt can burn through $10K+ in hours. Quality issues may go unnoticed until user complaints spike. Monitoring isn't optional.

⚠️ Without Monitoring

  • Runaway costs with no early warning
  • Quality degradation goes unnoticed
  • Performance issues affect users
  • No visibility into usage patterns

✅ With Comprehensive Monitoring

  • Proactive cost and performance alerts
  • Real-time quality and safety tracking
  • Detailed usage analytics
  • Predictable, optimized operations

🔧 Monitoring Frameworks

LLM Observability Stack

Comprehensive monitoring with OpenTelemetry and custom metrics:

  • Full request tracing
  • Custom LLM metrics
  • Integration ready
  • Production scalable

Implementation

from opentelemetry import trace, metrics
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.metrics import MeterProvider
from dataclasses import dataclass
import time

# Initialize OpenTelemetry: export metrics to Prometheus via the SDK meter provider
metrics.set_meter_provider(MeterProvider(metric_readers=[PrometheusMetricReader()]))

tracer = trace.get_tracer("llm.service")
meter = metrics.get_meter("llm.metrics")

# Define custom metrics
request_counter = meter.create_counter(
    "llm_requests_total",
    description="Total number of LLM requests"
)

token_histogram = meter.create_histogram(
    "llm_tokens_used",
    description="Token usage distribution"
)

latency_histogram = meter.create_histogram(
    "llm_request_duration_seconds",
    description="LLM request duration"
)

@dataclass
class LLMMetrics:
    model: str
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    duration: float
    cost: float
    status: str
    user_id: str

class LLMObservability:
    def __init__(self):
        self.active_requests = {}
        
    async def trace_llm_request(self, request_id: str, model: str, prompt: str, **kwargs):
        """Trace LLM request with comprehensive metrics"""
        
        with tracer.start_as_current_span("llm_request") as span:
            # Set span attributes
            span.set_attribute("llm.model", model)
            span.set_attribute("llm.request_id", request_id)
            span.set_attribute("llm.prompt_length", len(prompt))
            
            start_time = time.time()
            
            try:
                # Call the model (provider-specific helper, assumed to return a
                # dict containing the completion plus token counts)
                response = await self._make_llm_request(model, prompt, **kwargs)

                # Build the metrics record (named to avoid shadowing the `metrics` module)
                duration = time.time() - start_time
                llm_metrics = LLMMetrics(
                    model=model,
                    prompt_tokens=response.get('prompt_tokens', 0),
                    completion_tokens=response.get('completion_tokens', 0),
                    total_tokens=response.get('total_tokens', 0),
                    duration=duration,
                    cost=self._calculate_cost(response),  # pricing helper, assumed implemented
                    status='success',
                    user_id=kwargs.get('user_id', 'anonymous')
                )

                # Record metrics
                self._record_metrics(llm_metrics)
                span.set_attribute("llm.status", "success")
                
                return response
                
            except Exception as e:
                # Record error metrics
                duration = time.time() - start_time
                error_metrics = LLMMetrics(
                    model=model, prompt_tokens=0, completion_tokens=0,
                    total_tokens=0, duration=duration, cost=0.0,
                    status='error', user_id=kwargs.get('user_id', 'anonymous')
                )
                
                self._record_metrics(error_metrics)
                span.set_attribute("llm.status", "error")
                span.set_attribute("llm.error", str(e))
                raise
                
    def _record_metrics(self, metrics: LLMMetrics):
        """Record metrics to monitoring system"""
        # Increment request counter
        request_counter.add(1, {
            "model": metrics.model,
            "status": metrics.status,
            "user_id": metrics.user_id
        })
        
        # Record token usage
        token_histogram.record(metrics.total_tokens, {
            "model": metrics.model,
            "type": "total"
        })
        
        # Record latency
        latency_histogram.record(metrics.duration, {
            "model": metrics.model,
            "status": metrics.status
        })

📈 Essential Metrics

Performance Metrics

Latency, throughput, and response quality indicators

  • Response time (P50, P95, P99; see the percentile sketch below)
  • Requests per second/minute
  • Token generation rate
  • Time to first token
  • Queue wait time
  • Error rates by type
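The percentile latencies above can be computed from durations you already record (for example, `LLMMetrics.duration`). A minimal nearest-rank sketch; the function name is illustrative:

from typing import List

def latency_percentiles(durations: List[float]) -> dict:
    """Nearest-rank P50/P95/P99 from request durations in seconds."""
    if not durations:
        return {"p50": 0.0, "p95": 0.0, "p99": 0.0}
    ordered = sorted(durations)

    def pct(p: float) -> float:
        # Index of the p-th percentile in the sorted list (nearest rank)
        idx = min(len(ordered) - 1, int(round(p * (len(ordered) - 1))))
        return ordered[idx]

    return {"p50": pct(0.50), "p95": pct(0.95), "p99": pct(0.99)}

# Example: latency_percentiles([0.8, 0.9, 1.1, 1.2, 4.5])
# -> {'p50': 1.1, 'p95': 4.5, 'p99': 4.5}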

💻 Implementation Examples

# Basic LLM Monitoring Setup
import time
import logging
from typing import Dict, Any

class BasicLLMMonitor:
    def __init__(self):
        self.logger = logging.getLogger("llm_monitor")
        self.metrics = {
            'requests': 0,
            'total_tokens': 0,
            'total_cost': 0.0,
            'errors': 0
        }
    
    def log_request(self, model: str, prompt_tokens: int, 
                   completion_tokens: int, cost: float, 
                   duration: float, success: bool):
        """Log basic request metrics"""
        
        self.metrics['requests'] += 1
        self.metrics['total_tokens'] += prompt_tokens + completion_tokens
        self.metrics['total_cost'] += cost
        
        if not success:
            self.metrics['errors'] += 1
        
        # Log structured data
        self.logger.info({
            'event': 'llm_request',
            'model': model,
            'prompt_tokens': prompt_tokens,
            'completion_tokens': completion_tokens,
            'total_tokens': prompt_tokens + completion_tokens,
            'cost': cost,
            'duration': duration,
            'success': success,
            'timestamp': time.time()
        })
    
    def get_summary(self) -> Dict[str, Any]:
        """Get monitoring summary"""
        if self.metrics['requests'] > 0:
            error_rate = (self.metrics['errors'] / self.metrics['requests']) * 100
            avg_cost_per_request = self.metrics['total_cost'] / self.metrics['requests']
            avg_tokens_per_request = self.metrics['total_tokens'] / self.metrics['requests']
        else:
            error_rate = 0
            avg_cost_per_request = 0
            avg_tokens_per_request = 0
        
        return {
            'total_requests': self.metrics['requests'],
            'total_cost': self.metrics['total_cost'],
            'total_tokens': self.metrics['total_tokens'],
            'error_rate': error_rate,
            'avg_cost_per_request': avg_cost_per_request,
            'avg_tokens_per_request': avg_tokens_per_request
        }

🚨 Smart Alerting Strategies

Critical Alerts

  • Cost spikes: >200% hourly increase (see the threshold sketch after these lists)
  • High error rates: >5% failed requests
  • Response latency: P95 >10 seconds
  • Safety violations: Any content policy breach

Warning Alerts

  • Budget burn rate: >80% daily budget
  • Quality degradation: Low relevance scores
  • Usage spikes: 3x normal traffic
  • Cache miss rates: >30% misses
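A minimal sketch of checking the critical thresholds above against aggregated metrics; the function and its inputs are illustrative assumptions, not any particular alerting product's API:

def evaluate_critical_alerts(hourly_cost: float, prev_hourly_cost: float,
                             error_rate: float, p95_latency_s: float,
                             safety_violations: int) -> list:
    """Return critical alert messages based on the thresholds listed above."""
    alerts = []
    # Cost spike: spend increased by more than 200% hour over hour
    if prev_hourly_cost > 0 and (hourly_cost - prev_hourly_cost) / prev_hourly_cost > 2.0:
        alerts.append("CRITICAL: cost spike >200% vs previous hour")
    # Error rate above 5% of requests (error_rate expressed as a fraction, 0.0-1.0)
    if error_rate > 0.05:
        alerts.append("CRITICAL: error rate above 5%")
    # P95 latency above 10 seconds
    if p95_latency_s > 10.0:
        alerts.append("CRITICAL: P95 latency above 10 seconds")
    # Any safety / content-policy violation
    if safety_violations > 0:
        alerts.append("CRITICAL: content policy violation detected")
    return alerts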

📱 Dashboard Design Principles

Executive View

  • High-level KPIs
  • Cost trends
  • Usage growth
  • ROI metrics

Operations View

  • Real-time performance
  • Error rates & types
  • System health
  • Alert status

Engineering View

  • Detailed traces
  • Performance bottlenecks
  • Resource utilization
  • Debug information

✅ Best Practices

Monitoring Strategy

  • ✓ Start with business metrics first
  • ✓ Use structured logging consistently
  • ✓ Implement distributed tracing
  • ✓ Set up automated anomaly detection (a baseline sketch follows this list)
  • ✓ Monitor both technical and business KPIs
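For the anomaly-detection item above, a simple rolling-baseline check on hourly spend is often enough to start. This sketch assumes a 24-hour window and a z-score threshold; both are illustrative choices, not a fixed recipe:

from statistics import mean, pstdev
from typing import List

def is_cost_anomaly(hourly_costs: List[float], current_cost: float,
                    z_threshold: float = 3.0) -> bool:
    """Flag the current hour's spend if it falls far outside the recent baseline."""
    if len(hourly_costs) < 24:           # not enough history to form a baseline
        return False
    baseline = hourly_costs[-24:]        # rolling 24-hour window
    mu, sigma = mean(baseline), pstdev(baseline)
    if sigma == 0:
        return current_cost > 2 * mu     # flat baseline: fall back to a ratio check
    return (current_cost - mu) / sigma > z_threshold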

Alert Management

  • ✓ Minimize false positives with smart thresholds
  • ✓ Use escalation chains for different severities
  • ✓ Include context in alert messages
  • ✓ Set up alert fatigue prevention
  • ✓ Test alert channels regularly

🎯 Key Takeaways

Monitor from day one: GenAI costs and quality issues compound quickly

Focus on business metrics: Cost per successful interaction matters more than raw latency

Use structured logging: Consistent data enables powerful analytics and alerting

Implement smart alerts: Prevent cost surprises with proactive anomaly detection

Design for stakeholders: Different views for executives, ops, and engineering teams

📝 GenAI Monitoring Expertise Check


What is the most critical metric to monitor first in a GenAI system?