RAG Architecture Deep Dive
Master advanced Retrieval-Augmented Generation architectures, from simple pipelines to complex agentic systems. Learn hierarchical retrieval, context management, quality optimization, and production deployment patterns.
4 Patterns
Architecture patterns
Production Ready
Battle-tested code
Advanced Topics
Reranking, hybrid search
RAG Architecture Patterns
Choose the right architecture pattern based on your use case and complexity requirements.
Simple RAG Pipeline
# Simple RAG Implementation
def simple_rag(query, vector_store, llm):
    # 1. Retrieve relevant documents
    docs = vector_store.search(query, k=5)

    # 2. Create context from documents
    context = "\n".join([doc.content for doc in docs])

    # 3. Generate answer with context
    prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
    answer = llm.generate(prompt)
    return answer

Advanced Chunking Strategies
Semantic Chunking
- Group semantically related content
- Maintain context boundaries
- Dynamic chunk sizes based on content
- Preserve document structure
Sliding Window
- Overlapping chunks for continuity
- Configurable overlap percentage
- Context preservation at boundaries
- Better retrieval coverage
Hierarchical Chunking
- Multiple granularity levels
- Document → Section → Paragraph (see the sketch after these cards)
- Efficient for large documents
- Contextual navigation
Adaptive Chunking
- Content-aware boundaries
- Respects markdown/HTML structure
- Code block preservation
- Table and list handling
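The SmartChunker below covers structure-aware and sliding-window chunking but not the hierarchical pattern, so here is a minimal sketch of hierarchical chunking, assuming markdown-style headings mark section boundaries (the function name and fields are illustrative, not part of any library):

import re
from typing import Dict, List

def hierarchical_chunks(doc_id: str, text: str) -> List[Dict]:
    """Split a document into document-, section-, and paragraph-level chunks,
    each carrying a pointer to its parent so retrieval can climb levels."""
    chunks = [{"id": doc_id, "level": "document", "parent": None, "text": text}]
    # Split before markdown-style headings (#, ##, ###) to get sections
    sections = [s for s in re.split(r"\n(?=#{1,3} )", text) if s.strip()]
    for s_idx, section in enumerate(sections):
        s_id = f"{doc_id}/s{s_idx}"
        chunks.append({"id": s_id, "level": "section", "parent": doc_id, "text": section})
        paragraphs = [p for p in section.split("\n\n") if p.strip()]
        for p_idx, para in enumerate(paragraphs):
            chunks.append({"id": f"{s_id}/p{p_idx}", "level": "paragraph",
                           "parent": s_id, "text": para})
    return chunks

At query time, paragraph-level hits can be expanded to their parent section for more context, which is the "contextual navigation" the card above refers to.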
Smart Chunking Implementation
from typing import List, Dict
import re
from langchain.text_splitter import RecursiveCharacterTextSplitter

class SmartChunker:
    """Advanced document chunking with semantic awareness"""

    def __init__(self, chunk_size: int = 512, chunk_overlap: int = 128):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        # Separators in order of preference
        self.separators = [
            "\n\n\n",  # Triple newline (major sections)
            "\n\n",    # Double newline (paragraphs)
            "\n",      # Single newline
            ". ",      # Sentence boundary
            "? ",      # Question boundary
            "! ",      # Exclamation boundary
            "; ",      # Semicolon
            ", ",      # Comma
            " "        # Space
        ]

    def chunk_by_structure(self, text: str) -> List[str]:
        """Chunk based on document structure"""
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            separators=self.separators,
            length_function=len
        )
        return splitter.split_text(text)

    def _split_sentences(self, text: str) -> List[str]:
        """Naive sentence splitter on terminal punctuation"""
        return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

    def sliding_window_chunks(self, text: str, window_size: int = 5, stride: int = 3) -> List[Dict]:
        """Create overlapping chunks with metadata"""
        sentences = self._split_sentences(text)
        if len(sentences) <= window_size:
            # Short documents become a single chunk
            return [{'text': ' '.join(sentences),
                     'metadata': {'chunk_id': 'chunk_0', 'position': 0,
                                  'prev_context': '', 'next_context': ''}}]
        chunks = []
        for i in range(0, len(sentences) - window_size + 1, stride):
            chunk_sentences = sentences[i:i + window_size]
            chunk_text = ' '.join(chunk_sentences)
            # Add context from surrounding sentences
            prev_context = ' '.join(sentences[max(0, i - 2):i]) if i > 0 else ""
            next_context = ' '.join(sentences[i + window_size:min(len(sentences), i + window_size + 2)])
            chunks.append({
                'text': chunk_text,
                'metadata': {
                    'chunk_id': f"chunk_{i}",
                    'position': i,
                    'prev_context': prev_context[:200],
                    'next_context': next_context[:200]
                }
            })
        return chunks

Advanced Context Management
Context Window Optimization
- Token Budget Management: Allocate tokens between context and response (see the sketch after this list)
- Context Compression: Summarize less relevant information
- Sliding Windows: Maintain conversation history efficiently
- Relevance Ranking: Prioritize most important context
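A minimal sketch of the token budget idea above, assuming retrieved chunks arrive pre-sorted by relevance and using tiktoken's cl100k_base encoding as a stand-in for the target model's tokenizer:

from typing import List
import tiktoken

def pack_context(chunks: List[str], question: str, max_context_tokens: int = 3000) -> str:
    """Greedily pack the highest-ranked chunks into a fixed token budget,
    leaving the rest of the window for the question and the response."""
    enc = tiktoken.get_encoding("cl100k_base")
    budget = max_context_tokens - len(enc.encode(question))
    selected = []
    for chunk in chunks:  # chunks are assumed to be pre-sorted by relevance
        cost = len(enc.encode(chunk))
        if cost > budget:
            break
        selected.append(chunk)
        budget -= cost
    return "\n\n".join(selected)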
Quality Optimization
- Reranking Models: Cross-encoder for precision improvement
- Hybrid Search: Combine vector and keyword search (see the sketch after this list)
- Query Expansion: Enrich queries with synonyms
- Answer Validation: Verify response grounding
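Hybrid search is typically run as two separate retrievals (vector and keyword/BM25) whose ranked results are then fused. A minimal sketch using reciprocal rank fusion, assuming each retriever returns an ordered list of document IDs:

from typing import Dict, List

def reciprocal_rank_fusion(result_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs (e.g. one from vector search,
    one from BM25) into a single ranking."""
    scores: Dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fused_ids = reciprocal_rank_fusion([vector_result_ids, keyword_result_ids])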
Production RAG Pipeline
Complete Production RAG System
import asyncio
from typing import List, Dict, Optional
from dataclasses import dataclass
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import redis
import hashlib
import json
import uuid
@dataclass
class RAGConfig:
    embedding_model: str = "all-mpnet-base-v2"
    rerank_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"
    chunk_size: int = 512
    chunk_overlap: int = 128
    top_k: int = 5
    rerank_top_k: int = 3
    cache_ttl: int = 3600

class ProductionRAG:
    """Production-ready RAG system with caching and reranking"""

    def __init__(self, config: RAGConfig):
        self.config = config
        self.embedder = SentenceTransformer(config.embedding_model)
        self.reranker = CrossEncoder(config.rerank_model)
        self.vector_store = QdrantClient(host="localhost", port=6333)
        self.cache = redis.Redis(host='localhost', port=6379)
        self.collection_name = "rag_documents"
        self._init_collection()

    def _init_collection(self):
        """Initialize vector store collection"""
        self.vector_store.recreate_collection(
            collection_name=self.collection_name,
            vectors_config=VectorParams(
                size=self.embedder.get_sentence_embedding_dimension(),
                distance=Distance.COSINE
            )
        )
    async def ingest_documents(self, documents: List[Dict]) -> Dict:
        """Ingest and index documents with smart chunking"""
        from smart_chunker import SmartChunker  # the SmartChunker class defined above
        chunker = SmartChunker(self.config.chunk_size, self.config.chunk_overlap)
        all_chunks = []
        for doc in documents:
            # Smart chunking
            chunks = chunker.chunk_by_structure(doc['content'])
            for i, chunk in enumerate(chunks):
                # Generate embedding
                embedding = self.embedder.encode(chunk)
                # Create point for vector store
                # (Qdrant point IDs must be unsigned ints or UUIDs, so the
                #  MD5 digest is converted to canonical UUID form)
                point = PointStruct(
                    id=str(uuid.UUID(hashlib.md5(f"{doc['id']}_{i}".encode()).hexdigest())),
                    vector=embedding.tolist(),
                    payload={
                        "content": chunk,
                        "doc_id": doc['id'],
                        "chunk_index": i,
                        "metadata": doc.get('metadata', {})
                    }
                )
                all_chunks.append(point)
        # Batch upsert to vector store
        self.vector_store.upsert(
            collection_name=self.collection_name,
            points=all_chunks
        )
        return {
            "documents_processed": len(documents),
            "chunks_created": len(all_chunks)
        }
    async def retrieve(self, query: str, use_cache: bool = True) -> List[Dict]:
        """Retrieve relevant documents with caching and reranking"""
        # Check cache
        cache_key = f"rag:{hashlib.md5(query.encode()).hexdigest()}"
        if use_cache:
            cached = self.cache.get(cache_key)
            if cached:
                return json.loads(cached)

        # Generate query embedding
        query_embedding = self.embedder.encode(query)

        # Vector search (get more than needed for reranking)
        results = self.vector_store.search(
            collection_name=self.collection_name,
            query_vector=query_embedding.tolist(),
            limit=self.config.top_k * 3
        )

        # Extract documents
        docs = [{
            'content': hit.payload['content'],
            'score': hit.score,
            'metadata': hit.payload.get('metadata', {})
        } for hit in results]

        # Rerank documents with the cross-encoder and keep the top-k
        if len(docs) > 0:
            pairs = [[query, doc['content']] for doc in docs]
            rerank_scores = self.reranker.predict(pairs)
            ranked_docs = sorted(
                zip(docs, rerank_scores),
                key=lambda x: x[1],
                reverse=True
            )[:self.config.rerank_top_k]
            final_docs = [
                {**doc, 'rerank_score': float(score)}
                for doc, score in ranked_docs
            ]
        else:
            final_docs = []

        # Cache results
        if use_cache and final_docs:
            self.cache.setex(
                cache_key,
                self.config.cache_ttl,
                json.dumps(final_docs)
            )
        return final_docs
    async def query(self, question: str, llm_client) -> Dict:
        """Complete RAG query pipeline"""
        # Retrieve relevant documents
        retrieved_docs = await self.retrieve(question)
        if not retrieved_docs:
            return {
                "answer": "No relevant documents found.",
                "sources": [],
                "confidence": 0.0
            }

        # Create context
        context = "\n\n".join([
            f"[Source {i+1}]\n{doc['content']}"
            for i, doc in enumerate(retrieved_docs)
        ])

        # Generate answer
        prompt = f"""Answer based on the context provided.
Context:
{context}
Question: {question}
Answer:"""
        answer = await llm_client.generate(prompt)

        # Rough confidence proxy: average cross-encoder score clamped to [0, 1]
        avg_score = float(np.mean([d.get('rerank_score', 0) for d in retrieved_docs]))
        return {
            "answer": answer,
            "sources": retrieved_docs,
            "confidence": max(0.0, min(avg_score, 1.0))
        }

Real-World Examples
Microsoft Copilot
Uses hierarchical RAG to ground responses in Office 365 data with sub-100ms retrieval.
- 100M+ documents indexed
- Multi-tenant isolation
- 95% answer accuracy
Perplexity AI
Implements iterative RAG with real-time web crawling and multi-source aggregation.
- Real-time indexing
- Citation tracking
- Query refinement
GitHub Copilot
Leverages agentic RAG for context-aware code suggestions across repositories.
- 1B+ lines of code
- Multi-language support
- Semantic code search
RAG Best Practices
✅ Do's
- Use chunk overlap (20-25%) for context continuity
- Implement hybrid search for better recall
- Use reranking for precision improvement
- Cache frequent queries for performance
- Version your document embeddings
- Monitor retrieval quality metrics (see the sketch after this list)
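One way to act on the "monitor retrieval quality metrics" item: compute recall@k and reciprocal rank per query against a small labeled evaluation set. A minimal sketch, with the retrieved IDs and relevance judgments as assumed inputs:

from typing import Dict, List, Set

def retrieval_metrics(retrieved: List[str], relevant: Set[str], k: int = 5) -> Dict[str, float]:
    """Recall@k and reciprocal rank for one query, given the ranked list of
    retrieved doc IDs and the set of doc IDs judged relevant."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    recall_at_k = hits / len(relevant) if relevant else 0.0
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            reciprocal_rank = 1.0 / rank
            break
    return {"recall_at_k": recall_at_k, "reciprocal_rank": reciprocal_rank}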
❌ Don'ts
- Don't use fixed chunk sizes for all content
- Don't ignore document metadata
- Don't skip deduplication (see the sketch at the end of this section)
- Don't retrieve too many chunks
- Don't forget out-of-domain handling
- Don't expose raw documents unfiltered
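And a minimal sketch of the deduplication step from the Don'ts list, dropping exact duplicates by hashing case- and whitespace-normalized chunk text (near-duplicate detection, e.g. MinHash, needs more machinery than shown here):

import hashlib
from typing import List

def deduplicate_chunks(chunks: List[str]) -> List[str]:
    """Drop exact duplicates by hashing normalized chunk text."""
    seen, unique = set(), []
    for chunk in chunks:
        key = hashlib.sha1(" ".join(chunk.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique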