💰 Understanding Token Economics
Tokens are the currency of GenAI. Every prompt and response consumes tokens, directly impacting costs and performance. Effective token management can reduce costs by 50-90% while maintaining quality.
Cost Reality: A single ChatGPT-style conversation can cost $0.02-$0.20. At scale (1M requests/month), poor token management can mean $20K+ in unnecessary costs.
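To sanity-check those numbers, here is a back-of-the-envelope calculation using the GPT-4 rates quoted later in this section ($0.03 per 1K prompt tokens, $0.06 per 1K completion tokens); the per-request token counts are illustrative assumptions.

```python
PROMPT_PRICE = 0.03 / 1000      # USD per prompt token (GPT-4, per-1K pricing below)
COMPLETION_PRICE = 0.06 / 1000  # USD per completion token

# Assume a typical chat turn: ~1,500 prompt tokens (system prompt + history)
# and ~500 completion tokens.
per_request = 1500 * PROMPT_PRICE + 500 * COMPLETION_PRICE
print(f"per request: ${per_request:.3f}")   # ~$0.075, inside the $0.02-$0.20 range

# If ~700 of those prompt tokens are redundant boilerplate that caching or
# compression could remove, the monthly waste at 1M requests is:
wasted = 700 * PROMPT_PRICE * 1_000_000
print(f"monthly waste: ${wasted:,.0f}")     # ~$21,000
```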
❌ Poor Token Management
- No usage tracking or limits
- Inefficient prompt design
- No caching strategy
- Unpredictable costs
✅ Optimized Token Management
- Real-time usage monitoring
- Smart caching and compression
- User budgets and rate limiting
- 50-90% cost reduction
🎯 Management Strategies
Streaming Token Management
Real-time token tracking with early termination.
- 35% reduction in token usage
- ~200 ms to first token
- 40% cost savings
- Constant O(1) memory overhead
Implementation
```python
import tiktoken
from typing import AsyncIterator, Dict, Optional
from dataclasses import dataclass


@dataclass
class TokenUsage:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    estimated_cost: float


class StreamingTokenManager:
    def __init__(self, model: str = "gpt-4"):
        self.model = model
        self.encoder = tiktoken.encoding_for_model(model)
        # Prices in USD per 1K tokens
        self.pricing = {
            "gpt-4": {"prompt": 0.03, "completion": 0.06},
            "gpt-3.5-turbo": {"prompt": 0.0015, "completion": 0.002},
        }

    def _calculate_cost(self, prompt_tokens: int, completion_tokens: int) -> float:
        """Estimate cost in USD from the per-1K-token pricing table."""
        rates = self.pricing[self.model]
        return (prompt_tokens * rates["prompt"] +
                completion_tokens * rates["completion"]) / 1000

    async def stream_with_tracking(
        self, prompt: str, max_tokens: Optional[int] = None
    ) -> AsyncIterator[Dict]:
        """Stream responses with real-time token tracking.

        Relies on self._stream_completion(prompt, max_tokens), an async
        generator wrapping the provider's streaming API that yields
        {'content': str} chunks.
        """
        prompt_tokens = len(self.encoder.encode(prompt))
        completion_tokens = 0

        # Early termination if the prompt alone already exceeds the limit
        if max_tokens and prompt_tokens > max_tokens:
            raise ValueError(f"Prompt exceeds limit: {prompt_tokens} > {max_tokens}")

        async for chunk in self._stream_completion(prompt, max_tokens):
            if chunk.get('content'):
                chunk_tokens = len(self.encoder.encode(chunk['content']))
                completion_tokens += chunk_tokens

                # Check the limit during streaming and stop early if exceeded
                if max_tokens and (prompt_tokens + completion_tokens) > max_tokens:
                    yield {'content': chunk['content'], 'finish_reason': 'length'}
                    break

                yield {
                    'content': chunk['content'],
                    'tokens_used': prompt_tokens + completion_tokens,
                    'estimated_cost': self._calculate_cost(prompt_tokens, completion_tokens),
                }

        # Final usage summary
        yield {
            'usage': TokenUsage(
                prompt_tokens=prompt_tokens,
                completion_tokens=completion_tokens,
                total_tokens=prompt_tokens + completion_tokens,
                estimated_cost=self._calculate_cost(prompt_tokens, completion_tokens),
            )
        }
```
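A minimal usage sketch, assuming you have supplied a `_stream_completion` implementation that wraps your provider's streaming API:

```python
import asyncio

async def main():
    manager = StreamingTokenManager(model="gpt-4")
    async for event in manager.stream_with_tracking("Summarize token economics.", max_tokens=500):
        if 'content' in event:
            print(event['content'], end="", flush=True)
        elif 'usage' in event:
            usage = event['usage']
            print(f"\n{usage.total_tokens} tokens, ~${usage.estimated_cost:.4f}")

asyncio.run(main())
```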
⚡ Optimization Techniques
Smart Caching
Cache responses based on semantic similarity
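A minimal sketch of the idea: embed each incoming prompt and reuse a cached response when an earlier prompt is close enough in embedding space. The `embed` callable and the 0.95 similarity threshold are assumptions; plug in your embedding model and tune the cutoff against real traffic.

```python
import numpy as np

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.95):
        self.embed = embed          # callable: str -> np.ndarray (assumed, e.g. an embedding API)
        self.threshold = threshold  # cosine-similarity cutoff for a cache hit
        self.entries: list[tuple[np.ndarray, str]] = []

    def get(self, prompt: str) -> str | None:
        query = self.embed(prompt)
        for vector, response in self.entries:
            similarity = float(np.dot(query, vector) /
                               (np.linalg.norm(query) * np.linalg.norm(vector)))
            if similarity >= self.threshold:
                return response     # near-duplicate prompt: reuse the cached response
        return None

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed(prompt), response))
```

In production you would back this with a vector index (FAISS, pgvector, a managed vector store) rather than a linear scan, but the hit/miss logic stays the same.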
💻 Implementation Examples
```python
# Basic Token Counter
import tiktoken

class TokenCounter:
    def __init__(self, model="gpt-4"):
        # tiktoken covers OpenAI models; other providers ship their own tokenizers
        self.encoder = tiktoken.encoding_for_model(model)
        self.model_limits = {
            "gpt-4": 8192,
            "gpt-3.5-turbo": 4096,
            "claude-3": 200000,
        }

    def count_tokens(self, text: str) -> int:
        return len(self.encoder.encode(text))

    def estimate_cost(self, prompt_tokens: int, completion_tokens: int) -> float:
        # USD per 1K tokens (GPT-4 rates shown; extend the table for other models)
        pricing = {
            "gpt-4": {"prompt": 0.03, "completion": 0.06}
        }
        return (prompt_tokens * pricing["gpt-4"]["prompt"] +
                completion_tokens * pricing["gpt-4"]["completion"]) / 1000
```
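Usage is straightforward; the 150-token reply length is just an illustrative guess:

```python
counter = TokenCounter(model="gpt-4")
prompt = "Explain token budgeting in one paragraph."
n = counter.count_tokens(prompt)
print(n, "prompt tokens")
print(f"~${counter.estimate_cost(prompt_tokens=n, completion_tokens=150):.4f} for a ~150-token reply")
```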
📊 Cost Analysis
Model Pricing (per 1K tokens)
- gpt-4: $0.03 prompt / $0.06 completion
- gpt-3.5-turbo: $0.0015 prompt / $0.002 completion
Optimization Impact
- Smart Caching: up to 90% cost reduction
- Prompt Compression: up to 40% token savings
- Request Batching: up to 50% throughput boost
- Context Management: up to 60% efficiency gain (see the history-trimming sketch below)
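One way to realize the context-management gain is to trim conversation history to a fixed token budget before every call. A minimal sketch with tiktoken; the 3,000-token budget and keep-newest policy are illustrative choices, not prescriptions.

```python
import tiktoken

def trim_history(messages: list[dict], budget: int = 3000, model: str = "gpt-4") -> list[dict]:
    """Keep the system message plus the newest messages that fit in `budget` tokens.

    `messages` are {"role": ..., "content": ...} dicts, oldest first.
    """
    enc = tiktoken.encoding_for_model(model)
    system, rest = messages[0], messages[1:]
    kept: list[dict] = []
    used = len(enc.encode(system["content"]))

    # Walk backwards from the newest message and stop once the budget is spent
    for msg in reversed(rest):
        cost = len(enc.encode(msg["content"]))
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost

    return [system] + list(reversed(kept))
```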
✅ Best Practices
Cost Control
- ✓ Implement user/feature budgets (see the budget sketch after this list)
- ✓ Monitor usage in real-time
- ✓ Set up cost alerts
- ✓ Cache frequent queries
- ✓ Use smaller models when possible
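A minimal sketch of a per-user daily token budget; the 100K daily limit and the in-memory dict are illustrative stand-ins for your real limits and datastore.

```python
import time
from collections import defaultdict

class TokenBudget:
    """Tracks daily token spend per user and rejects requests over budget."""

    def __init__(self, daily_limit: int = 100_000):
        self.daily_limit = daily_limit
        self.usage: dict[tuple[str, str], int] = defaultdict(int)

    def _today(self) -> str:
        return time.strftime("%Y-%m-%d")

    def check(self, user_id: str, requested_tokens: int) -> bool:
        """True if the request still fits within today's remaining budget."""
        return self.usage[(user_id, self._today())] + requested_tokens <= self.daily_limit

    def record(self, user_id: str, tokens_used: int) -> None:
        self.usage[(user_id, self._today())] += tokens_used
```

Call `check()` before dispatching a request and `record()` with actual usage afterwards; the same pattern extends to per-feature or per-tenant budgets.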
Performance
- ✓ Stream responses for better UX
- ✓ Batch requests efficiently
- ✓ Compress prompts semantically
- ✓ Implement graceful degradation (see the fallback sketch after this list)
- ✓ Track and optimize token efficiency
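A minimal sketch of graceful degradation via model fallback. `call_model` is an assumed async helper that wraps your provider's API and raises on rate limits, timeouts, or overload; the model order reflects the "smallest model that meets quality requirements" rule.

```python
async def generate_with_fallback(prompt: str, call_model) -> str:
    """Try the preferred model first, then degrade to cheaper models on failure."""
    for model in ("gpt-4", "gpt-3.5-turbo"):
        try:
            return await call_model(model, prompt)
        except Exception:
            continue  # degrade to the next (cheaper) model
    raise RuntimeError("All models failed; serve a cached or static response instead")
```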
🎯 Key Takeaways
Measure everything: Track tokens, costs, and efficiency metrics in real-time
Cache intelligently: 90% cost reduction possible with smart caching strategies
Stream for UX: Streaming improves perceived performance and enables early termination
Budget and limit: Prevent runaway costs with user budgets and rate limiting
Right-size models: Use the smallest model that meets quality requirements