Token Management & Cost Optimization

Master token economics, usage optimization, and cost management for production GenAI applications.

35 min read · Advanced

💰 Understanding Token Economics

Tokens are the currency of GenAI. Every prompt and response consumes tokens, directly impacting costs and performance. Effective token management can reduce costs by 50-90% while maintaining quality.

Cost Reality: A single ChatGPT-style conversation can cost $0.02-$0.20. At scale (1M requests/month), poor token management can mean $20K+ in unnecessary costs.

❌ Poor Token Management

  • No usage tracking or limits
  • Inefficient prompt design
  • No caching strategy
  • Unpredictable costs

✅ Optimized Token Management

  • Real-time usage monitoring
  • Smart caching and compression
  • User budgets and rate limiting
  • 50-90% cost reduction

🎯 Management Strategies

Streaming Token Management

Real-time token tracking with early termination

  • 35% reduction in token usage
  • 200 ms to first token
  • 40% cost savings
  • Constant O(1) memory overhead

Implementation

import tiktoken
from typing import AsyncIterator, Dict, Optional
from dataclasses import dataclass

@dataclass
class TokenUsage:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    estimated_cost: float

class StreamingTokenManager:
    def __init__(self, model: str = "gpt-4"):
        self.model = model
        self.encoder = tiktoken.encoding_for_model(model)
        self.pricing = {
            "gpt-4": {"prompt": 0.03, "completion": 0.06},
            "gpt-3.5-turbo": {"prompt": 0.0015, "completion": 0.002}
        }
        
    async def stream_with_tracking(self, prompt: str, max_tokens: Optional[int] = None):
        """Stream responses with real-time token tracking"""
        prompt_tokens = len(self.encoder.encode(prompt))
        completion_tokens = 0
        
        # Early termination if exceeding limit
        if max_tokens and prompt_tokens > max_tokens:
            raise ValueError(f"Prompt exceeds limit: {prompt_tokens} > {max_tokens}")
        
        async for chunk in self._stream_completion(prompt, max_tokens):
            if chunk.get('content'):
                chunk_tokens = len(self.encoder.encode(chunk['content']))
                completion_tokens += chunk_tokens
                
                # Check limit during streaming
                if max_tokens and (prompt_tokens + completion_tokens) > max_tokens:
                    yield {'content': chunk['content'], 'finish_reason': 'length'}
                    break
                    
                yield {
                    'content': chunk['content'],
                    'tokens_used': prompt_tokens + completion_tokens,
                    'estimated_cost': self._calculate_cost(prompt_tokens, completion_tokens)
                }
        
        # Final usage summary
        yield {
            'usage': TokenUsage(
                prompt_tokens=prompt_tokens,
                completion_tokens=completion_tokens,
                total_tokens=prompt_tokens + completion_tokens,
                estimated_cost=self._calculate_cost(prompt_tokens, completion_tokens)
            )
        }

    def _calculate_cost(self, prompt_tokens: int, completion_tokens: int) -> float:
        """Convert token counts to an estimated dollar cost using per-1K pricing."""
        rates = self.pricing.get(self.model, self.pricing["gpt-4"])
        return (prompt_tokens * rates["prompt"] +
                completion_tokens * rates["completion"]) / 1000

    async def _stream_completion(self, prompt: str, max_tokens: Optional[int]) -> AsyncIterator[Dict]:
        """Wrap your model provider's streaming API here; yield dicts with a 'content' key."""
        raise NotImplementedError("Plug in your provider's streaming call")
        yield {}  # unreachable; makes this function an async generator
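
Usage

For context, here is a minimal sketch of how the manager above might be consumed, assuming _stream_completion has been wired to a real provider's streaming API (the prompt and the 200-token limit are illustrative):

import asyncio

async def main():
    manager = StreamingTokenManager(model="gpt-3.5-turbo")
    # Consume the tracked stream; the final item carries the TokenUsage summary.
    async for event in manager.stream_with_tracking("Summarize token budgeting.", max_tokens=200):
        if 'usage' in event:
            usage = event['usage']
            print(f"\n{usage.total_tokens} tokens, ~${usage.estimated_cost:.4f}")
        elif 'content' in event:
            print(event['content'], end="")

asyncio.run(main())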

⚡ Optimization Techniques

Smart Caching

Cache responses based on semantic similarity

Impact: 90% cost reduction for repeated queries
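
A minimal sketch of that idea, assuming an embed function that maps text to a vector (for example, a call to an embeddings API) and an illustrative 0.95 cosine-similarity threshold. A production version would use a vector index rather than a linear scan:

import numpy as np
from typing import Callable, List, Optional, Tuple

class SemanticCache:
    """Cache responses keyed by prompt embeddings; reuse answers for near-duplicate queries."""

    def __init__(self, embed: Callable[[str], List[float]], threshold: float = 0.95):
        self.embed = embed          # any embedding function, e.g. an embeddings API call
        self.threshold = threshold  # cosine-similarity cutoff for a cache hit
        self.entries: List[Tuple[np.ndarray, str]] = []

    @staticmethod
    def _cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def get(self, prompt: str) -> Optional[str]:
        query = np.array(self.embed(prompt))
        for vector, response in self.entries:
            if self._cosine(query, vector) >= self.threshold:
                return response  # cache hit: no model call, no completion tokens spent
        return None

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((np.array(self.embed(prompt)), response))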

💻 Implementation Examples

# Basic Token Counter
import tiktoken

class TokenCounter:
    def __init__(self, model="gpt-4"):
        self.model = model
        self.encoder = tiktoken.encoding_for_model(model)
        # Context-window limits (tokens) for validating prompts before sending a request
        self.model_limits = {
            "gpt-4": 8192,
            "gpt-3.5-turbo": 4096,
            "claude-3": 200000
        }
        # Per-1K-token pricing used for cost estimates
        self.pricing = {
            "gpt-4": {"prompt": 0.03, "completion": 0.06},
            "gpt-3.5-turbo": {"prompt": 0.0015, "completion": 0.002}
        }

    def count_tokens(self, text: str) -> int:
        return len(self.encoder.encode(text))

    def estimate_cost(self, prompt_tokens: int, completion_tokens: int) -> float:
        rates = self.pricing.get(self.model, self.pricing["gpt-4"])
        return (prompt_tokens * rates["prompt"] +
                completion_tokens * rates["completion"]) / 1000

📊 Cost Analysis

Model Pricing (per 1K tokens)

  • GPT-4 Turbo: $0.01 input / $0.03 output
  • GPT-3.5 Turbo: $0.0005 input / $0.0015 output
  • Claude 3 Opus: $0.015 input / $0.075 output
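
To make the comparison concrete, a quick projection helper using the prices listed above (the traffic profile is illustrative):

# Rough monthly cost projection from the per-1K-token prices above.
PRICES_PER_1K = {
    "gpt-4-turbo": {"prompt": 0.01, "completion": 0.03},
    "gpt-3.5-turbo": {"prompt": 0.0005, "completion": 0.0015},
    "claude-3-opus": {"prompt": 0.015, "completion": 0.075},
}

def monthly_cost(model: str, requests: int, prompt_tokens: int, completion_tokens: int) -> float:
    rates = PRICES_PER_1K[model]
    per_request = (prompt_tokens * rates["prompt"] + completion_tokens * rates["completion"]) / 1000
    return per_request * requests

# Example: 1M requests/month, 500 prompt + 300 completion tokens each.
for model in PRICES_PER_1K:
    print(f"{model}: ${monthly_cost(model, 1_000_000, 500, 300):,.0f}/month")

At these assumed volumes, the spread between the cheapest and most expensive model is more than an order of magnitude, which is why right-sizing the model is the single largest lever.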

Optimization Impact

  • Smart Caching: 90% cost reduction
  • Prompt Compression: 40% token savings
  • Request Batching: 50% throughput boost
  • Context Management: 60% efficiency gain (see the history-trimming sketch below)
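
One way to realize the context-management gain above is to trim conversation history to a token budget before every call. A minimal sketch with tiktoken, assuming the first message is the system prompt and keeping only the most recent turns that fit (the 3,000-token budget is arbitrary):

import tiktoken

def trim_to_budget(messages, budget_tokens=3000, model="gpt-3.5-turbo"):
    """Keep the system message plus the most recent turns that fit within the token budget."""
    enc = tiktoken.encoding_for_model(model)
    system, turns = messages[0], messages[1:]
    kept, used = [], len(enc.encode(system["content"]))
    for msg in reversed(turns):                # walk newest -> oldest
        cost = len(enc.encode(msg["content"]))
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))     # restore chronological order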

✅ Best Practices

Cost Control

  • ✓ Implement user/feature budgets
  • ✓ Monitor usage in real-time
  • ✓ Set up cost alerts
  • ✓ Cache frequent queries
  • ✓ Use smaller models when possible

Performance

  • ✓ Stream responses for better UX
  • ✓ Batch requests efficiently
  • ✓ Compress prompts semantically
  • ✓ Implement graceful degradation
  • ✓ Track and optimize token efficiency

🎯 Key Takeaways

Measure everything: Track tokens, costs, and efficiency metrics in real-time

Cache intelligently: 90% cost reduction possible with smart caching strategies

Stream for UX: Streaming improves perceived performance and enables early termination

Budget and limit: Prevent runaway costs with user budgets and rate limiting
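
A minimal in-memory sketch of that idea; a production system would persist counters in Redis or a database and reset them on a daily schedule, and the limits here are arbitrary:

import time
from collections import defaultdict

class UsageBudget:
    """Per-user daily token budget plus a simple requests-per-minute rate limit."""

    def __init__(self, daily_token_limit=100_000, requests_per_minute=20):
        self.daily_token_limit = daily_token_limit
        self.requests_per_minute = requests_per_minute
        self.tokens_today = defaultdict(int)     # real systems reset this once per day
        self.request_times = defaultdict(list)

    def check(self, user_id: str, requested_tokens: int) -> bool:
        now = time.time()
        # Rate limit: count requests in the trailing 60 seconds.
        recent = [t for t in self.request_times[user_id] if now - t < 60]
        self.request_times[user_id] = recent
        if len(recent) >= self.requests_per_minute:
            return False
        # Budget: reject if this request would exceed the daily token allowance.
        if self.tokens_today[user_id] + requested_tokens > self.daily_token_limit:
            return False
        return True

    def record(self, user_id: str, tokens_used: int) -> None:
        self.tokens_today[user_id] += tokens_used
        self.request_times[user_id].append(time.time())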

Right-size models: Use the smallest model that meets quality requirements

📝 Token Management Mastery Check


What is the primary benefit of streaming token management?