Design a RAG (Retrieval-Augmented Generation) System

Build a system that combines external knowledge retrieval with large language models to provide accurate, contextual, and up-to-date responses.

System Requirements

Functional Requirements

  • Document ingestion and chunking
  • Vector embedding generation
  • Semantic search and retrieval
  • Context-aware response generation
  • Multi-modal content support
  • Real-time knowledge updates

Non-Functional Requirements

  • Query response time < 3 seconds
  • Support 10M+ documents
  • Handle 10K concurrent users
  • 99.9% availability
  • Accurate retrieval (recall > 85%)
  • Coherent generation (BLEU > 0.4)

Retrieval Strategies

Dense Retrieval (87% accuracy)

Use neural embeddings for semantic similarity matching.

Pros

  • Better semantic understanding
  • Handles synonyms well
  • Context-aware

Cons

  • Computationally expensive
  • Requires training data
  • Black box similarity

Embedding: OpenAI text-embedding-3-large
Retrieval: Vector database (Pinecone/Weaviate)
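
A minimal sketch of the dense path, assuming the official openai and pinecone Python clients; the index name and the chunk_text metadata field are placeholders:

```python
# Dense retrieval sketch: embed the query, then search the vector index.
# Assumes OPENAI_API_KEY and PINECONE_API_KEY are set in the environment.
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
index = Pinecone().Index("rag-chunks")  # placeholder index name

def dense_retrieve(query: str, top_k: int = 5):
    resp = client.embeddings.create(model="text-embedding-3-large", input=query)
    query_vec = resp.data[0].embedding  # 3072-dimensional vector
    res = index.query(vector=query_vec, top_k=top_k, include_metadata=True)
    return [
        (m.id, m.score, (m.metadata or {}).get("chunk_text"))
        for m in res.matches
    ]
```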

Sparse Retrieval (72% accuracy)

Traditional keyword-based search (BM25, TF-IDF).

Pros

  • Fast and interpretable
  • Works well for exact matches
  • Lower compute cost

Cons

  • Misses semantic similarity
  • Poor with synonyms
  • Keyword dependence

Scoring: BM25 (term statistics, no learned embeddings)
Retrieval: Elasticsearch/Solr
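
A small sketch of the sparse path using the rank_bm25 package; the whitespace tokenizer is a stand-in for a real analyzer such as Elasticsearch's:

```python
# Sparse retrieval sketch with BM25 (rank_bm25 package).
from rank_bm25 import BM25Okapi

corpus = [
    "retrieval augmented generation combines search with LLMs",
    "vector databases store dense embeddings",
    "BM25 is a classic keyword ranking function",
]
tokenized = [doc.lower().split() for doc in corpus]  # naive tokenizer
bm25 = BM25Okapi(tokenized)

query = "keyword ranking with BM25".lower().split()
scores = bm25.get_scores(query)              # one score per document
top_docs = bm25.get_top_n(query, corpus, n=2)
print(top_docs[0])  # the BM25-ranked best match
```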

Hybrid Retrieval (91% accuracy)

Combine dense and sparse retrieval with a fusion ranking step.

Pros

  • Best of both approaches
  • Robust performance
  • Handles diverse queries

Cons

  • Complex implementation
  • Higher latency
  • Requires tuning

Embedding: Multi-vector approach
Retrieval: Fusion ranking algorithms
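
One common fusion ranking algorithm is reciprocal rank fusion (RRF); a minimal sketch that merges two ranked ID lists, such as the dense and sparse results above:

```python
# Reciprocal rank fusion: score(d) = sum over rankers of 1 / (k + rank_d).
def rrf(dense_ids: list[str], sparse_ids: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. rrf(["d3", "d1", "d7"], ["d1", "d9", "d3"]) -> ["d1", "d3", "d9", "d7"]
```

The constant k = 60 dampens the influence of any single ranker's top result and is the value commonly used in the RRF literature.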

System Architecture

User Query → Query Processing → Embedding Service
    ↓ (queries the Vector Database: Pinecone/Weaviate)
Similarity Search → Document Retrieval → Context Ranking
    ↓ (enriched from the Knowledge Base: Documents/APIs)
Prompt Construction → LLM Service → Response Generation
    ↓
Response Validation → Answer Delivery
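
The flow above reduces to a short orchestration function. A sketch reusing dense_retrieve from the dense retrieval section, with a placeholder model name and validation reduced to a grounding instruction in the prompt:

```python
# End-to-end sketch: retrieve, build prompt, generate.
from openai import OpenAI

llm = OpenAI()

def answer(query: str) -> str:
    chunks = dense_retrieve(query, top_k=5)            # retrieval stage
    context = "\n\n".join(text for _, _, text in chunks)
    prompt = (
        "Answer using only the context below. If the answer is not in the "
        f"context, say you don't know.\n\nContext:\n{context}\n\nQuestion: {query}"
    )
    resp = llm.chat.completions.create(                # generation stage
        model="gpt-4o-mini",                           # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```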

Data Ingestion

  • Document parsing
  • Chunk generation (see the sketch after this list)
  • Metadata extraction
  • Quality filtering
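
A minimal fixed-size chunker with overlap; the 800/200 character defaults are illustrative, not prescribed above:

```python
# Fixed-size chunking with overlap; positions match the chunks schema fields.
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 200):
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():  # quality filter: drop whitespace-only chunks
            chunks.append(
                {"text": piece, "start": start, "end": start + len(piece)}
            )
    return chunks
```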

Embedding Pipeline

  • Text embeddings
  • Image embeddings
  • Batch processing (see the sketch below)
  • Vector storage
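
A batching sketch for the text-embedding step, assuming the openai client; the embeddings endpoint accepts a list of inputs per call, which cuts request overhead:

```python
# Batch embedding sketch: send chunks in batches, not one call per chunk.
from openai import OpenAI

client = OpenAI()

def embed_batch(texts: list[str], batch_size: int = 128) -> list[list[float]]:
    vectors = []
    for i in range(0, len(texts), batch_size):
        resp = client.embeddings.create(
            model="text-embedding-3-large",
            input=texts[i:i + batch_size],
        )
        vectors.extend(d.embedding for d in resp.data)
    return vectors
```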

Retrieval Engine

  • Query processing
  • Similarity search
  • Result ranking
  • Context filtering

Generation Service

  • Prompt engineering (see the sketch below)
  • LLM inference
  • Response synthesis
  • Fact checking
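
A sketch of citation-aware prompt construction: numbering each retrieved chunk lets generated answers be traced back to sources, which supports the fact-checking step. Field names here are illustrative:

```python
# Prompt construction sketch: number chunks so the model can cite sources.
def build_prompt(query: str, chunks: list[dict]) -> str:
    context = "\n".join(
        f"[{i}] {c['text']} (source: {c.get('source_url', 'unknown')})"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered context passages. "
        "Cite passages as [n]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```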

Capacity Estimation

Query Processing & Storage

Query Types: 70% simple, 30% complex
Response Times: 500 ms retrieval, 2,000 ms generation
Storage Needs: 2 TB vectors, 8 TB documents
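
A back-of-envelope check that the 2 TB vector budget is consistent with 10M documents, assuming 3072-dimensional float32 embeddings from text-embedding-3-large (the dimension and per-document chunk count are assumptions, not stated above):

```python
# Back-of-envelope: does 2 TB of vectors square with 10M documents?
dim = 3072                      # text-embedding-3-large output size
bytes_per_vector = dim * 4      # float32 -> 12,288 B (~12 KB)
vector_budget = 2e12            # 2 TB
n_vectors = vector_budget / bytes_per_vector     # ~163M vectors
chunks_per_doc = n_vectors / 10e6                # ~16 chunks per document
print(round(n_vectors / 1e6), round(chunks_per_doc))  # -> 163 16
```

Roughly 16 chunks per document is plausible for the 800-character chunks sketched earlier, so the 2 TB figure holds together.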

Performance Metrics

Daily Queries: 1M+ (peak 2K QPS)
Retrieval Accuracy: 91% (hybrid approach)
Response Time P95: 2.8 s (including generation)
Knowledge Base Size: 10M docs (500 GB processed)

Infrastructure Requirements

Vector Database: 2 TB of vectors plus replicas
LLM Service: GPU clusters for inference
Document Store: 8 TB of raw documents

Database Schema

documents

document_id (UUID, Primary Key), title, content, source_url, author, created_at, updated_at, document_type (pdf/html/txt), language, metadata (JSON), status (processed/pending)

chunks

chunk_id (UUID, Primary Key), document_id (Foreign Key), chunk_text, chunk_index, start_position, end_position, embedding_vector (VECTOR), chunk_size, overlap_size

queries

query_id (UUID, Primary Key), user_id, query_text, query_embedding (VECTOR), timestamp, response_time, retrieval_results (JSON), generated_response, feedback_score

embeddings

embedding_id (UUID, Primary Key), content_hash, embedding_vector (VECTOR), model_version, created_at, dimension_size, embedding_type (query/document)
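
Given the embedding_vector (VECTOR) columns, one concrete realization is Postgres with the pgvector extension; a retrieval sketch against the chunks table, assuming psycopg and cosine distance, with a placeholder connection string:

```python
# Nearest-neighbour lookup over the chunks table (Postgres + pgvector assumed).
import psycopg

def nearest_chunks(query_vec: list[float], top_k: int = 5):
    with psycopg.connect("dbname=rag") as conn:  # placeholder DSN
        return conn.execute(
            """
            SELECT chunk_id, chunk_text
            FROM chunks
            ORDER BY embedding_vector <=> %s::vector  -- cosine distance
            LIMIT %s
            """,
            (str(query_vec), top_k),
        ).fetchall()
```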

Advanced RAG Techniques

Multi-hop Retrieval

• Iterative information gathering
• Question decomposition (see the sketch after this list)
• Evidence chain building
• Complex reasoning support
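
A minimal sketch of the iterative loop, where decompose and generate_answer are hypothetical LLM-backed helpers and dense_retrieve is the earlier sketch:

```python
# Multi-hop sketch: decompose, retrieve per sub-question, accumulate evidence.
def multi_hop(query: str, max_hops: int = 3) -> str:
    evidence: list = []
    for _ in range(max_hops):
        sub_qs = decompose(query, evidence)         # hypothetical LLM call
        if not sub_qs:                              # question fully covered
            break
        for sub_q in sub_qs:
            evidence.extend(dense_retrieve(sub_q))  # per-hop evidence chain
    return generate_answer(query, evidence)         # hypothetical final step
```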

Self-RAG

• Relevance reflection
• Answer quality checking
• Adaptive retrieval triggers (see the sketch below)
• Self-correction loops
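
A sketch of the adaptive loop; needs_retrieval, generate_answer, and is_grounded are hypothetical LLM-backed helpers:

```python
# Self-RAG sketch: retrieve only when needed, then reflect on the draft.
def self_rag(query: str, max_rounds: int = 2) -> str:
    context: list = []
    draft = ""
    for _ in range(max_rounds):
        if needs_retrieval(query, context):      # hypothetical relevance check
            context.extend(dense_retrieve(query))
        draft = generate_answer(query, context)  # hypothetical LLM call
        if is_grounded(draft, context):          # hypothetical critique call
            return draft                         # passes the self-check
    return draft                                 # best effort after max rounds
```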

Graph RAG

• Knowledge graph construction
• Entity relationship extraction
• Graph-based retrieval (see the sketch below)
• Structured reasoning paths
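
A minimal sketch of graph-based candidate expansion over extracted (head, relation, tail) triples; the graph itself would be built during ingestion:

```python
# Graph RAG sketch: expand from query entities over an entity-relation graph.
from collections import defaultdict

graph = defaultdict(list)  # entity -> [(relation, entity)], built at ingest

def add_triple(head: str, relation: str, tail: str) -> None:
    graph[head].append((relation, tail))

def neighborhood(entities: list[str], hops: int = 2) -> set[str]:
    seen = set(entities)
    frontier = list(entities)
    for _ in range(hops):
        frontier = [t for e in frontier for _, t in graph[e] if t not in seen]
        seen.update(frontier)
    return seen  # entities whose linked chunks become retrieval candidates
```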

Evaluation Metrics

Retrieval Quality (see the sketch below):
  • Recall@K (fraction of relevant docs retrieved in the top K)
  • Precision@K (fraction of the top K that is relevant)
  • MRR (Mean Reciprocal Rank)
  • NDCG (Normalized Discounted Cumulative Gain)
Generation Quality:
  • BLEU (n-gram overlap with a reference answer)
  • ROUGE (recall-oriented overlap, common for summarization)
  • Faithfulness (response grounded in the retrieved context)
  • Relevance (response addresses the query)
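
A sketch of the per-query retrieval metrics (MRR averages the reciprocal rank over a query set; NDCG is omitted for brevity):

```python
# Retrieval metrics sketch: Recall@K, Precision@K, and reciprocal rank.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return len(set(retrieved[:k]) & relevant) / k

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# Example: retrieved = ["d2", "d5", "d1"], relevant = {"d1", "d9"}
# recall@3 = 0.5, precision@3 = 0.33, reciprocal rank = 1/3
```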

Practice Questions

1. How would you handle document updates in real time while maintaining consistency across embeddings?

2. Design a chunking strategy for different document types (PDFs, code, structured data). How do you optimize chunk size?

3. Implement multi-modal RAG that handles text, images, and tables. How do you handle cross-modal retrieval?

4. Design an attribution and citation system. How do you trace generated answers back to source documents?

5. Handle hallucination detection and prevention. How do you ensure generated responses stay grounded in retrieved context?