💰 Understanding Token Economics
Tokens are the currency of GenAI. Every prompt and response consumes tokens, directly impacting costs and performance. Effective token management can reduce costs by 50-90% while maintaining quality.
Cost Reality: A single ChatGPT-style conversation can cost $0.02-$0.20. At scale (1M requests/month), poor token management can mean $20K+ in unnecessary costs.
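To sanity-check those numbers, here is a back-of-the-envelope calculation using the GPT-4 rates quoted later in this section ($0.03 per 1K prompt tokens, $0.06 per 1K completion tokens); the per-request token counts are illustrative assumptions.

```python
PROMPT_PRICE = 0.03 / 1000      # USD per prompt token (GPT-4, per-1K pricing below)
COMPLETION_PRICE = 0.06 / 1000  # USD per completion token

# Assume a typical chat turn: ~1,500 prompt tokens (system prompt + history)
# and ~500 completion tokens.
per_request = 1500 * PROMPT_PRICE + 500 * COMPLETION_PRICE
print(f"per request: ${per_request:.3f}")   # ~$0.075, inside the $0.02-$0.20 range

# If ~700 of those prompt tokens are redundant boilerplate that caching or
# compression could remove, the monthly waste at 1M requests is:
wasted = 700 * PROMPT_PRICE * 1_000_000
print(f"monthly waste: ${wasted:,.0f}")     # ~$21,000
```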
❌ Poor Token Management
- No usage tracking or limits
- Inefficient prompt design
- No caching strategy
- Unpredictable costs
✅ Optimized Token Management
- Real-time usage monitoring
- Smart caching and compression
- User budgets and rate limiting
- 50-90% cost reduction
🎯 Management Strategies
Streaming Token Management
Real-time token tracking with early termination.
- 35% reduction in token usage
- ~200 ms to first token
- 40% cost savings
- Constant O(1) memory overhead
Implementation
```python
import tiktoken
from typing import AsyncIterator, Dict, Optional
from dataclasses import dataclass


@dataclass
class TokenUsage:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    estimated_cost: float


class StreamingTokenManager:
    def __init__(self, model: str = "gpt-4"):
        self.model = model
        self.encoder = tiktoken.encoding_for_model(model)
        # Prices in USD per 1K tokens
        self.pricing = {
            "gpt-4": {"prompt": 0.03, "completion": 0.06},
            "gpt-3.5-turbo": {"prompt": 0.0015, "completion": 0.002},
        }

    def _calculate_cost(self, prompt_tokens: int, completion_tokens: int) -> float:
        """Estimate cost in USD from the per-1K-token pricing table."""
        rates = self.pricing[self.model]
        return (prompt_tokens * rates["prompt"] +
                completion_tokens * rates["completion"]) / 1000

    async def stream_with_tracking(
        self, prompt: str, max_tokens: Optional[int] = None
    ) -> AsyncIterator[Dict]:
        """Stream responses with real-time token tracking.

        Relies on self._stream_completion(prompt, max_tokens), an async
        generator wrapping the provider's streaming API that yields
        {'content': str} chunks.
        """
        prompt_tokens = len(self.encoder.encode(prompt))
        completion_tokens = 0

        # Early termination if the prompt alone already exceeds the limit
        if max_tokens and prompt_tokens > max_tokens:
            raise ValueError(f"Prompt exceeds limit: {prompt_tokens} > {max_tokens}")

        async for chunk in self._stream_completion(prompt, max_tokens):
            if chunk.get('content'):
                chunk_tokens = len(self.encoder.encode(chunk['content']))
                completion_tokens += chunk_tokens

                # Check the limit during streaming and stop early if exceeded
                if max_tokens and (prompt_tokens + completion_tokens) > max_tokens:
                    yield {'content': chunk['content'], 'finish_reason': 'length'}
                    break

                yield {
                    'content': chunk['content'],
                    'tokens_used': prompt_tokens + completion_tokens,
                    'estimated_cost': self._calculate_cost(prompt_tokens, completion_tokens),
                }

        # Final usage summary
        yield {
            'usage': TokenUsage(
                prompt_tokens=prompt_tokens,
                completion_tokens=completion_tokens,
                total_tokens=prompt_tokens + completion_tokens,
                estimated_cost=self._calculate_cost(prompt_tokens, completion_tokens),
            )
        }
```
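A minimal usage sketch, assuming you have supplied a `_stream_completion` implementation that wraps your provider's streaming API:

```python
import asyncio

async def main():
    manager = StreamingTokenManager(model="gpt-4")
    async for event in manager.stream_with_tracking("Summarize token economics.", max_tokens=500):
        if 'content' in event:
            print(event['content'], end="", flush=True)
        elif 'usage' in event:
            usage = event['usage']
            print(f"\n{usage.total_tokens} tokens, ~${usage.estimated_cost:.4f}")

asyncio.run(main())
```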
⚡ Optimization Techniques
Smart Caching
Cache responses based on semantic similarity
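A minimal sketch of the idea: embed each incoming prompt and reuse a cached response when an earlier prompt is close enough in embedding space. The `embed` callable and the 0.95 similarity threshold are assumptions; plug in your embedding model and tune the cutoff against real traffic.

```python
import numpy as np

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.95):
        self.embed = embed          # callable: str -> np.ndarray (assumed, e.g. an embedding API)
        self.threshold = threshold  # cosine-similarity cutoff for a cache hit
        self.entries: list[tuple[np.ndarray, str]] = []

    def get(self, prompt: str) -> str | None:
        query = self.embed(prompt)
        for vector, response in self.entries:
            similarity = float(np.dot(query, vector) /
                               (np.linalg.norm(query) * np.linalg.norm(vector)))
            if similarity >= self.threshold:
                return response     # near-duplicate prompt: reuse the cached response
        return None

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed(prompt), response))
```

In production you would back this with a vector index (FAISS, pgvector, a managed vector store) rather than a linear scan, but the hit/miss logic stays the same.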
💻 Implementation Examples
```python
# Basic Token Counter
import tiktoken

class TokenCounter:
    def __init__(self, model="gpt-4"):
        # tiktoken covers OpenAI models; other providers ship their own tokenizers
        self.encoder = tiktoken.encoding_for_model(model)
        self.model_limits = {
            "gpt-4": 8192,
            "gpt-3.5-turbo": 4096,
            "claude-3": 200000,
        }

    def count_tokens(self, text: str) -> int:
        return len(self.encoder.encode(text))

    def estimate_cost(self, prompt_tokens: int, completion_tokens: int) -> float:
        # USD per 1K tokens (GPT-4 rates shown; extend the table for other models)
        pricing = {
            "gpt-4": {"prompt": 0.03, "completion": 0.06}
        }
        return (prompt_tokens * pricing["gpt-4"]["prompt"] +
                completion_tokens * pricing["gpt-4"]["completion"]) / 1000
```
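Usage is straightforward; the 150-token reply length is just an illustrative guess:

```python
counter = TokenCounter(model="gpt-4")
prompt = "Explain token budgeting in one paragraph."
n = counter.count_tokens(prompt)
print(n, "prompt tokens")
print(f"~${counter.estimate_cost(prompt_tokens=n, completion_tokens=150):.4f} for a ~150-token reply")
```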
📊 Cost Analysis
Model Pricing (per 1K tokens)
- gpt-4: $0.03 prompt / $0.06 completion
- gpt-3.5-turbo: $0.0015 prompt / $0.002 completion
Optimization Impact
- Smart Caching: up to 90% cost reduction
- Prompt Compression: up to 40% token savings
- Request Batching: up to 50% throughput boost
- Context Management: up to 60% efficiency gain (see the history-trimming sketch below)
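One way to realize the context-management gain is to trim conversation history to a fixed token budget before every call. A minimal sketch with tiktoken; the 3,000-token budget and keep-newest policy are illustrative choices, not prescriptions.

```python
import tiktoken

def trim_history(messages: list[dict], budget: int = 3000, model: str = "gpt-4") -> list[dict]:
    """Keep the system message plus the newest messages that fit in `budget` tokens.

    `messages` are {"role": ..., "content": ...} dicts, oldest first.
    """
    enc = tiktoken.encoding_for_model(model)
    system, rest = messages[0], messages[1:]
    kept: list[dict] = []
    used = len(enc.encode(system["content"]))

    # Walk backwards from the newest message and stop once the budget is spent
    for msg in reversed(rest):
        cost = len(enc.encode(msg["content"]))
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost

    return [system] + list(reversed(kept))
```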
✅ Best Practices
Cost Control
- ✓ Implement user/feature budgets (see the budget sketch after this list)
- ✓ Monitor usage in real-time
- ✓ Set up cost alerts
- ✓ Cache frequent queries
- ✓ Use smaller models when possible
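A minimal sketch of a per-user daily token budget; the 100K daily limit and the in-memory dict are illustrative stand-ins for your real limits and datastore.

```python
import time
from collections import defaultdict

class TokenBudget:
    """Tracks daily token spend per user and rejects requests over budget."""

    def __init__(self, daily_limit: int = 100_000):
        self.daily_limit = daily_limit
        self.usage: dict[tuple[str, str], int] = defaultdict(int)

    def _today(self) -> str:
        return time.strftime("%Y-%m-%d")

    def check(self, user_id: str, requested_tokens: int) -> bool:
        """True if the request still fits within today's remaining budget."""
        return self.usage[(user_id, self._today())] + requested_tokens <= self.daily_limit

    def record(self, user_id: str, tokens_used: int) -> None:
        self.usage[(user_id, self._today())] += tokens_used
```

Call `check()` before dispatching a request and `record()` with actual usage afterwards; the same pattern extends to per-feature or per-tenant budgets.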
Performance
- ✓ Stream responses for better UX
- ✓ Batch requests efficiently
- ✓ Compress prompts semantically
- ✓ Implement graceful degradation (see the fallback sketch after this list)
- ✓ Track and optimize token efficiency
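A minimal sketch of graceful degradation via model fallback. `call_model` is an assumed async helper that wraps your provider's API and raises on rate limits, timeouts, or overload; the model order reflects the "smallest model that meets quality requirements" rule.

```python
async def generate_with_fallback(prompt: str, call_model) -> str:
    """Try the preferred model first, then degrade to cheaper models on failure."""
    for model in ("gpt-4", "gpt-3.5-turbo"):
        try:
            return await call_model(model, prompt)
        except Exception:
            continue  # degrade to the next (cheaper) model
    raise RuntimeError("All models failed; serve a cached or static response instead")
```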
🎯 Key Takeaways
Measure everything: Track tokens, costs, and efficiency metrics in real-time
Cache intelligently: 90% cost reduction possible with smart caching strategies
Stream for UX: Streaming improves perceived performance and enables early termination
Budget and limit: Prevent runaway costs with user budgets and rate limiting
Right-size models: Use the smallest model that meets quality requirements