RAG Architecture Deep Dive
Master advanced Retrieval-Augmented Generation architectures, from simple pipelines to complex agentic systems. Learn hierarchical retrieval, context management, quality optimization, and production deployment patterns.
4 Patterns
Architecture patterns
Production Ready
Battle-tested code
Advanced Topics
Reranking, hybrid search
RAG Architecture Patterns
Choose the right architecture pattern based on your use case and complexity requirements.
Simple RAG Pipeline
# Simple RAG Implementation
def simple_rag(query, vector_store, llm):
    # 1. Retrieve relevant documents
    docs = vector_store.search(query, k=5)

    # 2. Create context from documents
    context = "\n".join([doc.content for doc in docs])

    # 3. Generate answer with context
    prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
    answer = llm.generate(prompt)
    return answer

Advanced Chunking Strategies
Semantic Chunking
- Group semantically related content
- Maintain context boundaries
- Dynamic chunk sizes based on content
- Preserve document structure
Sliding Window
- Overlapping chunks for continuity
- Configurable overlap percentage
- Context preservation at boundaries
- Better retrieval coverage
Hierarchical Chunking
- Multiple granularity levels
- Document → Section → Paragraph (see the sketch after these cards)
- Efficient for large documents
- Contextual navigation
Adaptive Chunking
- Content-aware boundaries
- Respects markdown/HTML structure
- Code block preservation
- Table and list handling
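The SmartChunker below covers structure-aware and sliding-window chunking but not the hierarchical pattern, so here is a minimal sketch of hierarchical chunking, assuming markdown-style headings mark section boundaries (the function name and fields are illustrative, not part of any library):

import re
from typing import Dict, List

def hierarchical_chunks(doc_id: str, text: str) -> List[Dict]:
    """Split a document into document-, section-, and paragraph-level chunks,
    each carrying a pointer to its parent so retrieval can climb levels."""
    chunks = [{"id": doc_id, "level": "document", "parent": None, "text": text}]
    # Split before markdown-style headings (#, ##, ###) to get sections
    sections = [s for s in re.split(r"\n(?=#{1,3} )", text) if s.strip()]
    for s_idx, section in enumerate(sections):
        s_id = f"{doc_id}/s{s_idx}"
        chunks.append({"id": s_id, "level": "section", "parent": doc_id, "text": section})
        paragraphs = [p for p in section.split("\n\n") if p.strip()]
        for p_idx, para in enumerate(paragraphs):
            chunks.append({"id": f"{s_id}/p{p_idx}", "level": "paragraph",
                           "parent": s_id, "text": para})
    return chunks

At query time, paragraph-level hits can be expanded to their parent section for more context, which is the "contextual navigation" the card above refers to.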
Smart Chunking Implementation
from typing import List, Dict
import re
from langchain.text_splitter import RecursiveCharacterTextSplitter

class SmartChunker:
    """Advanced document chunking with semantic awareness"""

    def __init__(self, chunk_size: int = 512, chunk_overlap: int = 128):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        # Separators in order of preference
        self.separators = [
            "\n\n\n",  # Triple newline (major sections)
            "\n\n",    # Double newline (paragraphs)
            "\n",      # Single newline
            ". ",      # Sentence boundary
            "? ",      # Question boundary
            "! ",      # Exclamation boundary
            "; ",      # Semicolon
            ", ",      # Comma
            " "        # Space
        ]

    def chunk_by_structure(self, text: str) -> List[str]:
        """Chunk based on document structure"""
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            separators=self.separators,
            length_function=len
        )
        return splitter.split_text(text)

    def _split_sentences(self, text: str) -> List[str]:
        """Naive sentence splitter on terminal punctuation"""
        return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

    def sliding_window_chunks(self, text: str, window_size: int = 5, stride: int = 3) -> List[Dict]:
        """Create overlapping chunks with metadata"""
        sentences = self._split_sentences(text)
        if len(sentences) <= window_size:
            # Short documents become a single chunk
            return [{'text': ' '.join(sentences),
                     'metadata': {'chunk_id': 'chunk_0', 'position': 0,
                                  'prev_context': '', 'next_context': ''}}]
        chunks = []
        for i in range(0, len(sentences) - window_size + 1, stride):
            chunk_sentences = sentences[i:i + window_size]
            chunk_text = ' '.join(chunk_sentences)
            # Add context from surrounding sentences
            prev_context = ' '.join(sentences[max(0, i - 2):i]) if i > 0 else ""
            next_context = ' '.join(sentences[i + window_size:min(len(sentences), i + window_size + 2)])
            chunks.append({
                'text': chunk_text,
                'metadata': {
                    'chunk_id': f"chunk_{i}",
                    'position': i,
                    'prev_context': prev_context[:200],
                    'next_context': next_context[:200]
                }
            })
        return chunks

Advanced Context Management
Context Window Optimization
- Token Budget Management: Allocate tokens between context and response (see the sketch after this list)
- Context Compression: Summarize less relevant information
- Sliding Windows: Maintain conversation history efficiently
- Relevance Ranking: Prioritize most important context
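A minimal sketch of the token budget idea above, assuming retrieved chunks arrive pre-sorted by relevance and using tiktoken's cl100k_base encoding as a stand-in for the target model's tokenizer:

from typing import List
import tiktoken

def pack_context(chunks: List[str], question: str, max_context_tokens: int = 3000) -> str:
    """Greedily pack the highest-ranked chunks into a fixed token budget,
    leaving the rest of the window for the question and the response."""
    enc = tiktoken.get_encoding("cl100k_base")
    budget = max_context_tokens - len(enc.encode(question))
    selected = []
    for chunk in chunks:  # chunks are assumed to be pre-sorted by relevance
        cost = len(enc.encode(chunk))
        if cost > budget:
            break
        selected.append(chunk)
        budget -= cost
    return "\n\n".join(selected)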
Quality Optimization
- Reranking Models: Cross-encoder for precision improvement
- Hybrid Search: Combine vector and keyword search (see the sketch after this list)
- Query Expansion: Enrich queries with synonyms
- Answer Validation: Verify response grounding
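Hybrid search is typically run as two separate retrievals (vector and keyword/BM25) whose ranked results are then fused. A minimal sketch using reciprocal rank fusion, assuming each retriever returns an ordered list of document IDs:

from typing import Dict, List

def reciprocal_rank_fusion(result_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs (e.g. one from vector search,
    one from BM25) into a single ranking."""
    scores: Dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fused_ids = reciprocal_rank_fusion([vector_result_ids, keyword_result_ids])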
Production RAG Pipeline
Complete Production RAG System
import asyncio
from typing import List, Dict, Optional
from dataclasses import dataclass
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import redis
import hashlib
import json
import uuid
@dataclass
class RAGConfig:
    embedding_model: str = "all-mpnet-base-v2"
    rerank_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"
    chunk_size: int = 512
    chunk_overlap: int = 128
    top_k: int = 5
    rerank_top_k: int = 3
    cache_ttl: int = 3600

class ProductionRAG:
    """Production-ready RAG system with caching and reranking"""

    def __init__(self, config: RAGConfig):
        self.config = config
        self.embedder = SentenceTransformer(config.embedding_model)
        self.reranker = CrossEncoder(config.rerank_model)
        self.vector_store = QdrantClient(host="localhost", port=6333)
        self.cache = redis.Redis(host='localhost', port=6379)
        self.collection_name = "rag_documents"
        self._init_collection()

    def _init_collection(self):
        """Initialize vector store collection"""
        self.vector_store.recreate_collection(
            collection_name=self.collection_name,
            vectors_config=VectorParams(
                size=self.embedder.get_sentence_embedding_dimension(),
                distance=Distance.COSINE
            )
        )
    async def ingest_documents(self, documents: List[Dict]) -> Dict:
        """Ingest and index documents with smart chunking"""
        from smart_chunker import SmartChunker  # the SmartChunker class defined above
        chunker = SmartChunker(self.config.chunk_size, self.config.chunk_overlap)
        all_chunks = []
        for doc in documents:
            # Smart chunking
            chunks = chunker.chunk_by_structure(doc['content'])
            for i, chunk in enumerate(chunks):
                # Generate embedding
                embedding = self.embedder.encode(chunk)
                # Create point for vector store
                # (Qdrant point IDs must be unsigned ints or UUIDs, so the
                #  MD5 digest is converted to canonical UUID form)
                point = PointStruct(
                    id=str(uuid.UUID(hashlib.md5(f"{doc['id']}_{i}".encode()).hexdigest())),
                    vector=embedding.tolist(),
                    payload={
                        "content": chunk,
                        "doc_id": doc['id'],
                        "chunk_index": i,
                        "metadata": doc.get('metadata', {})
                    }
                )
                all_chunks.append(point)
        # Batch upsert to vector store
        self.vector_store.upsert(
            collection_name=self.collection_name,
            points=all_chunks
        )
        return {
            "documents_processed": len(documents),
            "chunks_created": len(all_chunks)
        }
    async def retrieve(self, query: str, use_cache: bool = True) -> List[Dict]:
        """Retrieve relevant documents with caching and reranking"""
        # Check cache
        cache_key = f"rag:{hashlib.md5(query.encode()).hexdigest()}"
        if use_cache:
            cached = self.cache.get(cache_key)
            if cached:
                return json.loads(cached)

        # Generate query embedding
        query_embedding = self.embedder.encode(query)

        # Vector search (get more than needed for reranking)
        results = self.vector_store.search(
            collection_name=self.collection_name,
            query_vector=query_embedding.tolist(),
            limit=self.config.top_k * 3
        )

        # Extract documents
        docs = [{
            'content': hit.payload['content'],
            'score': hit.score,
            'metadata': hit.payload.get('metadata', {})
        } for hit in results]

        # Rerank documents with the cross-encoder and keep the top-k
        if len(docs) > 0:
            pairs = [[query, doc['content']] for doc in docs]
            rerank_scores = self.reranker.predict(pairs)
            ranked_docs = sorted(
                zip(docs, rerank_scores),
                key=lambda x: x[1],
                reverse=True
            )[:self.config.rerank_top_k]
            final_docs = [
                {**doc, 'rerank_score': float(score)}
                for doc, score in ranked_docs
            ]
        else:
            final_docs = []

        # Cache results
        if use_cache and final_docs:
            self.cache.setex(
                cache_key,
                self.config.cache_ttl,
                json.dumps(final_docs)
            )
        return final_docs
    async def query(self, question: str, llm_client) -> Dict:
        """Complete RAG query pipeline"""
        # Retrieve relevant documents
        retrieved_docs = await self.retrieve(question)
        if not retrieved_docs:
            return {
                "answer": "No relevant documents found.",
                "sources": [],
                "confidence": 0.0
            }

        # Create context
        context = "\n\n".join([
            f"[Source {i+1}]\n{doc['content']}"
            for i, doc in enumerate(retrieved_docs)
        ])

        # Generate answer
        prompt = f"""Answer based on the context provided.
Context:
{context}
Question: {question}
Answer:"""
        answer = await llm_client.generate(prompt)

        # Rough confidence proxy: average cross-encoder score clamped to [0, 1]
        avg_score = float(np.mean([d.get('rerank_score', 0) for d in retrieved_docs]))
        return {
            "answer": answer,
            "sources": retrieved_docs,
            "confidence": max(0.0, min(avg_score, 1.0))
        }

Real-World Examples
Microsoft Copilot
Uses hierarchical RAG to ground responses in Office 365 data with sub-100ms retrieval.
- 100M+ documents indexed
- Multi-tenant isolation
- 95% answer accuracy
Perplexity AI
Implements iterative RAG with real-time web crawling and multi-source aggregation.
- Real-time indexing
- Citation tracking
- Query refinement
GitHub Copilot
Leverages agentic RAG for context-aware code suggestions across repositories.
- 1B+ lines of code
- Multi-language support
- Semantic code search
RAG Best Practices
✅ Do's
- Use chunk overlap (20-25%) for context continuity
- Implement hybrid search for better recall
- Use reranking for precision improvement
- Cache frequent queries for performance
- Version your document embeddings
- Monitor retrieval quality metrics (see the sketch after this list)
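One way to act on the "monitor retrieval quality metrics" item: compute recall@k and reciprocal rank per query against a small labeled evaluation set. A minimal sketch, with the retrieved IDs and relevance judgments as assumed inputs:

from typing import Dict, List, Set

def retrieval_metrics(retrieved: List[str], relevant: Set[str], k: int = 5) -> Dict[str, float]:
    """Recall@k and reciprocal rank for one query, given the ranked list of
    retrieved doc IDs and the set of doc IDs judged relevant."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    recall_at_k = hits / len(relevant) if relevant else 0.0
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            reciprocal_rank = 1.0 / rank
            break
    return {"recall_at_k": recall_at_k, "reciprocal_rank": reciprocal_rank}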
❌ Don'ts
- Don't use fixed chunk sizes for all content
- Don't ignore document metadata
- Don't skip deduplication (see the sketch at the end of this section)
- Don't retrieve too many chunks
- Don't forget out-of-domain handling
- Don't expose raw documents unfiltered
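And a minimal sketch of the deduplication step from the Don'ts list, dropping exact duplicates by hashing case- and whitespace-normalized chunk text (near-duplicate detection, e.g. MinHash, needs more machinery than shown here):

import hashlib
from typing import List

def deduplicate_chunks(chunks: List[str]) -> List[str]:
    """Drop exact duplicates by hashing normalized chunk text."""
    seen, unique = set(), []
    for chunk in chunks:
        key = hashlib.sha1(" ".join(chunk.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique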