
AI Gateway Patterns

Master AI Gateway patterns: intelligent routing, load balancing, fallback strategies, and unified AI API management.

45 min read · Advanced

What are AI Gateway Patterns?

AI Gateway patterns provide a unified entry point for managing multiple AI models and services. They handle intelligent routing, load balancing, fallback strategies, rate limiting, authentication, and observability across diverse AI providers and models.

Key Benefits: Unified API interface, intelligent model routing, cost optimization, improved reliability through fallbacks, and centralized governance and monitoring.

AI Gateway Performance Calculator

Example output for a workload of 1000 RPS spread across 3 models:

  • Efficiency: 92%
  • Avg Latency: 90 ms
  • Requests/Model: 333 RPS (1000 RPS ÷ 3 models)
  • Cost Multiplier: 0.99x
  • Availability: 99.95%
  • Complexity: Medium

Core AI Gateway Patterns

Model Router Pattern

  • Route requests based on model capabilities
  • Version-based routing for A/B testing (see the sketch below)
  • Request classification and smart routing
  • Dynamic model selection
  • Context-aware model assignment
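
As a concrete illustration of version-based routing, here is a minimal sketch that buckets users deterministically for an A/B test. The model names and the 10% traffic split are illustrative assumptions, not part of any gateway shown later:

import hashlib

# Hypothetical A/B split: send a stable 10% of users to the candidate model
AB_SPLIT = {"control": "gpt-4", "candidate": "gpt-4-turbo"}
CANDIDATE_TRAFFIC = 0.10

def select_model_version(user_id: str) -> str:
    """Deterministically bucket a user so they always see the same variant."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map the hash into [0, 1]
    return AB_SPLIT["candidate"] if bucket < CANDIDATE_TRAFFIC else AB_SPLIT["control"]

print(select_model_version("user-123"))  # stable assignment per user

Hashing the user ID keeps assignments sticky, so each user sees a consistent model version across requests while aggregate traffic still splits 90/10.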

Load Balancer Pattern

  • Distribute load across model instances
  • Health check and failover management
  • Weighted routing based on performance
  • Queue management and backpressure (see the sketch below)
  • Auto-scaling based on demand
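
Queue management and backpressure can start as a bounded concurrency limit per model pool. Below is a minimal sketch that fails fast when the pool is saturated; the limit and the rejection policy are illustrative assumptions:

import asyncio

class BoundedModelPool:
    """Cap in-flight requests instead of letting queues grow without bound."""

    def __init__(self, max_concurrent: int = 10):
        self.semaphore = asyncio.Semaphore(max_concurrent)

    async def submit(self, call_model):
        # Reject early when saturated, signaling backpressure to callers
        if self.semaphore.locked():
            raise RuntimeError("Pool saturated: shed load or queue upstream")
        async with self.semaphore:
            return await call_model()

async def demo():
    pool = BoundedModelPool(max_concurrent=2)

    async def fake_model():
        await asyncio.sleep(0.1)  # stand-in for a model API call
        return "ok"

    print(await pool.submit(fake_model))

asyncio.run(demo())

Rejecting at the gateway keeps tail latency bounded; a production pool would typically return a 429 with a Retry-After hint rather than raising.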

Fallback Pattern

  • Cascade to backup models on failure
  • Quality-based fallback strategies (see the sketch below)
  • Cross-provider failover
  • Graceful degradation patterns
  • Circuit breaker implementation
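
A sketch of quality-based fallback with graceful degradation: a cheap model answers first, and the request escalates only when a quality heuristic rejects the answer. The cascade order and the length-based heuristic are stand-in assumptions; real gateways use judge models or logprob-based checks:

import asyncio

CASCADE = ["llama-2", "claude-3", "gpt-4"]  # cheapest first (names illustrative)

def quality_score(response: str) -> float:
    """Placeholder heuristic; substitute a real quality check."""
    return 0.9 if len(response) >= 20 else 0.0

async def answer_with_escalation(prompt: str, call_model) -> str:
    response = ""
    for model_id in CASCADE:
        response = await call_model(model_id, prompt)
        if quality_score(response) >= 0.8:
            return response  # good enough, stop escalating
    return response  # graceful degradation: best effort from the last model

async def demo():
    async def fake_call(model_id, prompt):
        return "too short" if model_id == "llama-2" else f"{model_id}: a detailed answer here"

    print(await answer_with_escalation("Explain circuit breakers", fake_call))

asyncio.run(demo())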

Request Aggregator

  • Batch multiple requests for efficiency
  • Multi-model ensemble patterns (see the sketch below)
  • Response merging and consensus
  • Parallel request processing
  • Result ranking and selection
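
A sketch of the ensemble side of this pattern: fan one request out to several models in parallel and merge responses by majority vote. The fake models and the voting rule are illustrative; production aggregators often rank candidates with a scoring model instead:

import asyncio
from collections import Counter

async def ensemble_answer(prompt: str, call_model, model_ids) -> str:
    """Query all models concurrently and return the consensus response."""
    responses = await asyncio.gather(
        *(call_model(m, prompt) for m in model_ids),
        return_exceptions=True,  # one failure shouldn't sink the ensemble
    )
    answers = [r for r in responses if isinstance(r, str)]
    if not answers:
        raise RuntimeError("All ensemble members failed")
    return Counter(answers).most_common(1)[0][0]  # majority vote

async def demo():
    async def fake_call(model_id, prompt):
        return "42" if model_id != "model-c" else "41"

    print(await ensemble_answer("?", fake_call, ["model-a", "model-b", "model-c"]))

asyncio.run(demo())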

Implementation Examples

AI Gateway Service Architecture

from fastapi import FastAPI, HTTPException
from typing import Any, Dict, List
import asyncio
import time
import random
from enum import Enum
from pydantic import BaseModel

class RoutingStrategy(str, Enum):
    ROUND_ROBIN = "round_robin"
    WEIGHTED = "weighted"
    LATENCY_BASED = "latency_based"
    COST_OPTIMIZED = "cost_optimized"

class ModelConfig(BaseModel):
    model_id: str
    endpoint_url: str
    weight: float = 1.0
    max_tokens: int = 4096
    cost_per_token: float = 0.0001
    avg_latency_ms: float = 100
    capabilities: List[str] = []
    fallback_priority: int = 1

class AIGateway:
    def __init__(self):
        self.models: Dict[str, ModelConfig] = {}
        self.routing_strategy = RoutingStrategy.WEIGHTED
        self.fallback_enabled = True
        self.circuit_breakers = {}
        self.request_stats = {}
        self._rr_counter = 0  # cursor for round-robin selection
        
    def register_model(self, config: ModelConfig):
        """Register a new model with the gateway"""
        self.models[config.model_id] = config
        self.circuit_breakers[config.model_id] = {
            'failures': 0,
            'last_failure': 0,
            'state': 'closed',  # closed, open, half_open
            'failure_threshold': 5
        }
        
    async def route_request(self, request: Dict[str, Any]) -> Dict[str, Any]:
        """Main request routing logic"""
        try:
            # 1. Select model based on routing strategy
            selected_model = self._select_model(request)
            
            # 2. Check circuit breaker
            if not self._is_circuit_closed(selected_model.model_id):
                if self.fallback_enabled:
                    selected_model = self._get_fallback_model(selected_model.model_id)
                else:
                    raise HTTPException(503, f"Model {selected_model.model_id} unavailable")
            
            # 3. Execute request with monitoring
            start_time = time.time()
            response = await self._execute_request(selected_model, request)
            latency = (time.time() - start_time) * 1000
            
            # 4. Update stats and circuit breaker
            self._update_stats(selected_model.model_id, latency, success=True)
            self._reset_circuit_breaker(selected_model.model_id)
            
            return {
                'response': response,
                'model_used': selected_model.model_id,
                'latency_ms': latency,
                'routing_strategy': self.routing_strategy.value
            }
            
        except HTTPException:
            # Propagate explicit HTTP errors (e.g. 503s) without rewrapping
            raise
        except Exception as e:
            # Record the failure so the circuit breaker can react
            if 'selected_model' in locals():
                self._handle_failure(selected_model.model_id, str(e))

                # Try fallback if enabled
                if self.fallback_enabled:
                    return await self._try_fallback(selected_model.model_id, request)

            raise HTTPException(500, f"AI Gateway error: {str(e)}")
    
    def _select_model(self, request: Dict[str, Any]) -> ModelConfig:
        """Select best model based on routing strategy"""
        available_models = [m for m in self.models.values() 
                          if self._is_circuit_closed(m.model_id)]
        
        if not available_models:
            raise HTTPException(503, "No models available")
        
        if self.routing_strategy == RoutingStrategy.WEIGHTED:
            return self._weighted_selection(available_models)
        elif self.routing_strategy == RoutingStrategy.LATENCY_BASED:
            return min(available_models, key=lambda m: m.avg_latency_ms)
        elif self.routing_strategy == RoutingStrategy.COST_OPTIMIZED:
            return min(available_models, key=lambda m: m.cost_per_token)
        else:  # ROUND_ROBIN
            return self._round_robin_selection(available_models)
    
    def _weighted_selection(self, models: List[ModelConfig]) -> ModelConfig:
        """Weighted random selection based on model weights"""
        total_weight = sum(m.weight for m in models)
        rand_value = random.uniform(0, total_weight)
        
        current_weight = 0
        for model in models:
            current_weight += model.weight
            if rand_value <= current_weight:
                return model
        
        return models[0]  # Fallback

    def _round_robin_selection(self, models: List[ModelConfig]) -> ModelConfig:
        """Cycle through the healthy models in registration order"""
        model = models[self._rr_counter % len(models)]
        self._rr_counter += 1
        return model
    
    def _is_circuit_closed(self, model_id: str) -> bool:
        """Check if circuit breaker allows requests"""
        breaker = self.circuit_breakers.get(model_id, {})
        
        if breaker.get('state') == 'open':
            # Check if we should try half-open
            if time.time() - breaker.get('last_failure', 0) > 30:  # 30 sec cooldown
                breaker['state'] = 'half_open'
                return True
            return False
        
        return True
    
    async def _execute_request(self, model: ModelConfig, request: Dict[str, Any]):
        """Execute request against selected model"""
        # Simulated model API call
        await asyncio.sleep(model.avg_latency_ms / 1000)
        
        # Simulate occasional failures
        if random.random() < 0.05:  # 5% failure rate
            raise Exception(f"Model {model.model_id} request failed")
        
        return {
            'text': f"Response from {model.model_id}",
            'tokens_used': random.randint(50, 200),
            'model_version': '1.0'
        }
    
    def _handle_failure(self, model_id: str, error: str):
        """Handle model failure and update circuit breaker"""
        breaker = self.circuit_breakers[model_id]
        breaker['failures'] += 1
        breaker['last_failure'] = time.time()
        
        if breaker['failures'] >= breaker['failure_threshold']:
            breaker['state'] = 'open'
            print(f"Circuit breaker opened for {model_id}")
    
    async def _try_fallback(self, failed_model_id: str, request: Dict[str, Any]):
        """Try fallback models in priority order"""
        fallback_models = sorted(
            [m for m in self.models.values() 
             if m.model_id != failed_model_id and self._is_circuit_closed(m.model_id)],
            key=lambda m: m.fallback_priority
        )
        
        for model in fallback_models:
            try:
                response = await self._execute_request(model, request)
                return {
                    'response': response,
                    'model_used': model.model_id,
                    'fallback_from': failed_model_id,
                    'routing_strategy': 'fallback'
                }
            except Exception:
                continue
        
        raise HTTPException(503, "All fallback models failed")

# FastAPI Application
app = FastAPI(title="AI Gateway Service")
gateway = AIGateway()

# Register models
gateway.register_model(ModelConfig(
    model_id="gpt-4",
    endpoint_url="https://api.openai.com/v1/chat/completions",
    weight=2.0,
    cost_per_token=0.00003,
    avg_latency_ms=800,
    capabilities=["reasoning", "coding", "analysis"],
    fallback_priority=1
))

gateway.register_model(ModelConfig(
    model_id="claude-3",
    endpoint_url="https://api.anthropic.com/v1/messages",
    weight=1.8,
    cost_per_token=0.000025,
    avg_latency_ms=650,
    capabilities=["reasoning", "writing", "analysis"],
    fallback_priority=2
))

@app.post("/chat/completions")
async def chat_completions(request: Dict[str, Any]):
    return await gateway.route_request(request)

@app.get("/gateway/status")
async def gateway_status():
    return {
        'models': len(gateway.models),
        'routing_strategy': gateway.routing_strategy.value,
        'fallback_enabled': gateway.fallback_enabled,
        'circuit_breakers': {
            model_id: breaker['state'] 
            for model_id, breaker in gateway.circuit_breakers.items()
        }
    }
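
With the service running (for example via "uvicorn main:app"), a client can exercise the gateway. This snippet assumes the httpx client library and the default local port; the payload shape matches what route_request consumes in this example:

import httpx

payload = {
    "messages": [{"role": "user", "content": "Summarize this document"}],
    "context": {"budget": "low"},
}

with httpx.Client(base_url="http://localhost:8000") as client:
    result = client.post("/chat/completions", json=payload).json()
    print(result["model_used"], result["latency_ms"])
    print(client.get("/gateway/status").json())  # breaker states per model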

Intelligent Request Classification

import re
from typing import Any, Dict

class RequestClassifier:
    def __init__(self):
        self.model_capabilities = {
            'gpt-4': {'reasoning': 0.95, 'coding': 0.9, 'creative': 0.85, 'analysis': 0.9},
            'claude-3': {'reasoning': 0.9, 'writing': 0.95, 'analysis': 0.85, 'safety': 0.95},
            'llama-2': {'coding': 0.8, 'general': 0.85, 'cost_efficient': 0.95},
            'gemini-pro': {'multimodal': 0.9, 'reasoning': 0.85, 'coding': 0.8}
        }
        
        self.task_patterns = {
            'coding': [r'write.*code', r'implement.*function', r'debug', r'refactor'],
            'reasoning': [r'explain.*why', r'analyze', r'compare', r'evaluate'],
            'creative': [r'write.*story', r'generate.*content', r'brainstorm'],
            'analysis': [r'summarize', r'extract.*data', r'analyze.*document']
        }
    
    def classify_request(self, request_text: str, context: Dict = None) -> Dict[str, Any]:
        """Classify request and recommend best model"""
        
        # Extract task type from request
        task_scores = {}
        for task_type, patterns in self.task_patterns.items():
            score = 0
            for pattern in patterns:
                if re.search(pattern, request_text.lower()):
                    score += 1
            task_scores[task_type] = score / len(patterns)
        
        # Find dominant task type; default to 'general' when nothing matched
        primary_task = (max(task_scores, key=task_scores.get)
                        if any(task_scores.values()) else 'general')
        
        # Score models for this task
        model_scores = {}
        for model_id, capabilities in self.model_capabilities.items():
            score = capabilities.get(primary_task, 0.5)
            
            # Apply context-based adjustments
            if context:
                if context.get('budget') == 'low':
                    score *= capabilities.get('cost_efficient', 0.8)
                if context.get('latency_critical'):
                    # Capability maps don't include latency, so this uses the
                    # 500 ms default; wire in measured latency in practice
                    score *= (1.0 - capabilities.get('avg_latency_ms', 500) / 1000)
                if context.get('safety_critical'):
                    score *= capabilities.get('safety', 0.7)
            
            model_scores[model_id] = score
        
        # Recommend top models
        ranked_models = sorted(model_scores.items(), key=lambda x: x[1], reverse=True)
        
        return {
            'primary_task': primary_task,
            'task_confidence': max(task_scores.values()) if task_scores else 0,
            'recommended_models': ranked_models[:3],
            'routing_reason': f"Best for {primary_task} tasks"
        }

# Usage in gateway
class EnhancedAIGateway(AIGateway):
    def __init__(self):
        super().__init__()
        self.classifier = RequestClassifier()
    
    def _select_model(self, request: Dict[str, Any]) -> ModelConfig:
        """Enhanced model selection with intelligent classification"""
        
        # Classify the request (guard against an empty message list)
        classification = self.classifier.classify_request(
            (request.get('messages') or [{}])[-1].get('content', ''),
            request.get('context', {})
        )
        
        # Get recommended model
        recommended_model_id = classification['recommended_models'][0][0]
        
        # Check if recommended model is available
        if (recommended_model_id in self.models and 
            self._is_circuit_closed(recommended_model_id)):
            return self.models[recommended_model_id]
        
        # Fallback to standard routing
        return super()._select_model(request)
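
To see the classifier in isolation, a quick usage example (the prompt and context are illustrative):

classifier = RequestClassifier()
result = classifier.classify_request(
    "Please debug this function and refactor it",
    context={"budget": "low"},
)
print(result["primary_task"])        # "coding"
print(result["recommended_models"])  # ranked (model_id, score) pairs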

Real-World AI Gateway Implementations

OpenAI Gateway

  • Routes between GPT-3.5, GPT-4, and specialized models
  • Handles 100M+ API requests daily
  • Cost-based routing saves 30% on inference costs
  • Multi-region failover with <50ms switching
  • Used by ChatGPT, GitHub Copilot, Microsoft products

Anthropic Claude Gateway

  • Intelligent routing based on safety requirements
  • Handles constitutional AI filtering at gateway level
  • Context-aware model selection (Claude vs Claude Instant)
  • Real-time quality monitoring and fallbacks
  • Powers enterprise Claude deployments

Hugging Face Inference API

  • Routes across 50,000+ open-source models
  • Dynamic scaling based on model popularity
  • Automatic fallback to similar models
  • Serves 1B+ inferences monthly
  • Powers startups and enterprise ML workflows

Google Vertex AI Gateway

  • Unified access to PaLM, Gemini, and custom models
  • Intelligent routing based on model capabilities
  • Enterprise-grade security and compliance
  • Integrated with Google Cloud services
  • Used by major enterprise customers

AI Gateway Best Practices

✅ Do

  • Implement comprehensive health checks for all models (see the sketch after this list)
  • Use circuit breakers to prevent cascade failures
  • Monitor latency, cost, and quality metrics continuously
  • Implement graceful fallback strategies
  • Use request classification for intelligent routing
  • Implement proper authentication and rate limiting
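
Health checks can drive the same circuit-breaker state the gateway above already maintains. A background-loop sketch, assuming a lightweight probe request and a 15-second interval (both arbitrary choices):

import asyncio

async def health_check_loop(gateway: AIGateway, interval_s: float = 15.0):
    """Periodically probe each model and trip or clear its circuit breaker."""
    while True:
        for model_id, config in gateway.models.items():
            try:
                # Illustrative probe: a tiny request against the model
                await gateway._execute_request(config, {"probe": True})
                gateway._reset_circuit_breaker(model_id)
            except Exception as exc:
                gateway._handle_failure(model_id, str(exc))
        await asyncio.sleep(interval_s)

# Run alongside the API, e.g. from a FastAPI startup hook:
# @app.on_event("startup")
# async def start_health_checks():
#     asyncio.create_task(health_check_loop(gateway))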

❌ Don't

  • Route blindly without understanding request context
  • Ignore model-specific rate limits and quotas (a token-bucket sketch follows this list)
  • Create single points of failure in the gateway
  • Skip monitoring and observability implementation
  • Hard-code model endpoints or configurations
  • Neglect security and access control
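
To avoid tripping provider quotas, a token bucket per model is often enough. A minimal sketch; the per-minute rates are assumptions, not any provider's real limits:

import time

class TokenBucket:
    """Per-model rate limiter: refuse requests once the quota is spent."""

    def __init__(self, rate_per_min: float = 60.0):
        self.capacity = rate_per_min
        self.tokens = rate_per_min
        self.refill_per_s = rate_per_min / 60.0
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_min=2)
print([bucket.allow() for _ in range(3)])  # [True, True, False] in a burst

On a refusal the gateway can queue the request, route to another provider, or surface a 429 to the caller.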