Design a Search Engine (Google/Bing)
Build a web-scale search engine with distributed crawling, indexing, ranking algorithms, and real-time query processing for billions of documents.
System Requirements
Functional Requirements
- Crawl and index billions of web pages
- Process search queries with sub-second latency
- Rank results by relevance and quality
- Handle different content types (text, images, PDFs)
- Support advanced search operators and filters
- Provide search suggestions and auto-complete
- Fresh content indexing and real-time updates
- Personalized search results based on user history
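Auto-complete is typically served from a prefix structure built over popular queries. A minimal in-memory sketch (class and field names are illustrative; a production system would shard this and precompute top-k per node):

```python
from collections import defaultdict

class SuggestionTrie:
    """Prefix trie for auto-complete, ranking completions by query frequency."""
    def __init__(self):
        self.children = defaultdict(SuggestionTrie)
        self.query = None      # completed query stored at terminal nodes
        self.frequency = 0     # how often users issued this query

    def insert(self, query, frequency):
        node = self
        for ch in query:
            node = node.children[ch]
        node.query, node.frequency = query, frequency

    def suggest(self, prefix, k=5):
        """Return up to k completions of `prefix`, most frequent first."""
        node = self
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # Collect all completed queries under this prefix node.
        results, stack = [], [node]
        while stack:
            n = stack.pop()
            if n.query is not None:
                results.append((n.frequency, n.query))
            stack.extend(n.children.values())
        return [q for _, q in sorted(results, reverse=True)[:k]]

trie = SuggestionTrie()
for q, f in [("search engine", 90), ("search api", 40), ("seo tips", 70)]:
    trie.insert(q, f)
# trie.suggest("sea") returns completions ordered by frequency
```

Real deployments avoid the full subtree walk at query time by caching the top-k suggestions at each trie node.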
Non-Functional Requirements
- 100B+ web pages indexed
- 100K+ search queries per second
- Search latency < 200ms globally
- 99.9% search availability
- Handle 10B+ new/updated pages daily
- Support 50+ languages and regions
- Crawl respectfully (robots.txt compliance)
- Storage for 100+ petabytes of content
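Robots.txt compliance can be handled with Python's standard `urllib.robotparser`. A small sketch; the rules and the `ExampleBot` agent name are examples, not a real policy:

```python
from urllib import robotparser

# Parse an example robots.txt body; a crawler would fetch this per host.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
])

def allowed(url, agent="ExampleBot"):
    """Check a URL against the parsed rules before fetching it."""
    return rp.can_fetch(agent, url)
```

A polite crawler also honors `rp.crawl_delay(agent)` between requests to the same host.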
Capacity Estimation
Back-of-envelope figures derived from the requirements above.
Search Engine Scale
- Corpus: 100B+ pages in 100+ PB of storage, i.e. roughly 1 MB of stored content per page on average
- Crawl throughput: 10B new/updated pages per day ≈ 116K pages/second sustained
Search Traffic
- 100K queries/second ≈ 8.6B queries per day
- Each query must fan out to index shards and aggregate results within the < 200ms global budget
Infrastructure Scale
- With 3x replication (a common assumption), raw storage is on the order of 300 PB
- Index shards and serving replicas spread across thousands of servers in multiple regions to meet latency and availability targets
Web Crawling Strategies
Breadth-First Crawling
Crawl pages level by level from seed URLs
Pros
- Discovers popular pages first
- Good coverage
- Simple to implement
Cons
- May get stuck in large sites
- Slow to find deep content
Best For
Initial broad crawls that need wide coverage from seed URLs
Focused Crawling
Prioritize pages based on topic relevance or quality
Pros
- Efficient resource usage
- High-quality content
- Topic-specific
Cons
- Complex topic classification
- May miss diverse content
Best For
Vertical or domain-specific search engines
Continuous Crawling
Monitor and re-crawl pages based on update frequency
Pros
- Fresh content
- Efficient updates
- Change detection
Cons
- Complex scheduling
- Resource intensive
Best For
News sites and other frequently changing content
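Continuous crawling reduces to a priority queue keyed by next-due time, with revisit intervals driven by how often a page has been observed to change. A minimal sketch; the interval heuristic and its bounds are assumptions, not a production policy:

```python
import heapq
import time

class CrawlFrontier:
    """Re-crawl scheduler: pages that change often are revisited sooner."""
    def __init__(self):
        self.heap = []  # min-heap of (next_crawl_time, url)

    def schedule(self, url, change_rate, now=None):
        """change_rate: observed changes per day for this URL."""
        now = time.time() if now is None else now
        # Higher change rate -> shorter revisit interval, floored at 1 hour.
        interval = max(3600, 86400 / max(change_rate, 0.04))
        heapq.heappush(self.heap, (now + interval, url))

    def next_due(self, now=None):
        """Pop the next URL whose revisit time has arrived, else None."""
        now = time.time() if now is None else now
        if self.heap and self.heap[0][0] <= now:
            return heapq.heappop(self.heap)[1]
        return None

frontier = CrawlFrontier()
frontier.schedule("https://news.example/front", change_rate=24, now=0)
frontier.schedule("https://static.example/about", change_rate=1, now=0)
```

A production frontier also partitions by host for politeness and persists the queue, but the due-time ordering is the core idea.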
System Architecture
Web Crawler
Distributed crawlers, politeness policies, robots.txt compliance
Content Processor
MapReduce/Spark, NLP pipelines, content extraction
Indexing System
Distributed storage, compression, incremental updates
Query Processor
Query parsing, term expansion, result aggregation
The pipeline spans three layers: Crawling Layer → Indexing Layer → Serving Layer
Search Ranking Algorithm
Content Relevance: Very High Weight
Page Authority: High Weight
User Context: High Weight
Technical Quality: Medium Weight
Ranking Pipeline
Document Retrieval → Feature Scoring → Final Ranking
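The final-ranking stage can be sketched as a weighted blend of the signal tiers above. The weights and the log-damping of PageRank are illustrative assumptions, not a production formula (real rankers combine hundreds of signals, often with learned models):

```python
import math

# Weights mirror the tiers above: relevance highest, then authority
# and personalization. Values are assumptions for illustration.
WEIGHTS = {"relevance": 0.5, "authority": 0.3, "personalization": 0.2}

def final_score(relevance, pagerank, personalization):
    # Log-damp PageRank so a few very popular pages don't dominate.
    authority = math.log1p(pagerank)
    return (WEIGHTS["relevance"] * relevance
            + WEIGHTS["authority"] * authority
            + WEIGHTS["personalization"] * personalization)

def rank(candidates):
    """candidates: list of (doc_id, relevance, pagerank, personalization)."""
    return sorted(candidates, key=lambda c: final_score(*c[1:]), reverse=True)
```

A/B testing such a formula usually means serving a small traffic slice with perturbed weights and comparing click-through and dwell-time metrics.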
Performance Metrics
Typical search-quality and serving metrics: NDCG/precision of top results, click-through rate, query latency percentiles, and index freshness
Technical Challenges & Solutions
Scale of Indexing: billions of documents must be indexed across thousands of machines with incremental, low-latency updates
Inverted Index Size: posting lists need aggressive compression (delta encoding, variable-byte codes) to keep the serving index affordable
Content Duplicate Detection: near-duplicate pages waste crawl and index capacity, so documents are fingerprinted and collapsed
Ranking Signal Integration: many signals must be combined and continuously evaluated without blowing the latency budget
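Near-duplicate detection is commonly done with locality-sensitive fingerprints such as SimHash, which gives similar documents fingerprints that differ in only a few bits. A toy version with naive whitespace tokenization (real systems use shingles and weighted features):

```python
import hashlib

def simhash(text, bits=64):
    """SimHash fingerprint: each token votes +1/-1 per bit position;
    the sign of each position's total becomes that fingerprint bit."""
    vector = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16) % (1 << bits)
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if vector[i] > 0)

def hamming_distance(a, b):
    """Number of differing bits; small distance suggests near-duplicates."""
    return bin(a ^ b).count("1")
```

In practice, fingerprints within a small Hamming distance (e.g. 3 bits of 64) are treated as duplicates, found efficiently via permuted-table lookups rather than pairwise comparison.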
Index and Storage Design
Inverted Index Structure
Inverted Index:
Term → Posting List
"search" → {
document_frequency: 1000000,
postings: [
{doc_id: 123, tf: 5, positions: [10,45,67]},
{doc_id: 456, tf: 2, positions: [23,89]},
{doc_id: 789, tf: 1, positions: [156]},
...
]
}
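A toy builder for this posting-list layout (in-memory only; a real index is sharded by term or document and heavily compressed):

```python
from collections import defaultdict

def build_index(docs):
    """Build a positional inverted index:
    term -> list of {doc_id, tf, positions}, matching the layout above."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, token in enumerate(text.lower().split()):
            positions[token].append(pos)
        for term, pos_list in positions.items():
            index[term].append({"doc_id": doc_id, "tf": len(pos_list),
                                "positions": pos_list})
    return index

index = build_index({
    123: "search engines index the web",
    456: "web search at scale",
})
```

The stored positions enable phrase and proximity queries; `tf` and `document_frequency` (the posting-list length) feed relevance scoring.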
Document Store
documents table:
- doc_id (bigint, PK)
- url (varchar, unique)
- title (varchar)
- content_hash (varchar)
- crawl_date (timestamp)
- pagerank_score (float)
- content_type (varchar)
- language (varchar)
- word_count (int)
Link Graph Schema
links table:
- from_doc_id (bigint)
- to_doc_id (bigint)
- anchor_text (varchar)
- link_type (internal/external)
Primary key: (from_doc_id, to_doc_id)
Index: (to_doc_id) for backlink analysis
Used for PageRank computation
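PageRank itself is a power iteration over this link graph. An in-memory sketch; a real engine runs this as a distributed batch job (e.g. MapReduce/Spark) over the sharded links table:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.
    `links` maps from_doc_id -> list of to_doc_ids."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1 - damping) / n for node in nodes}
        for src in nodes:
            targets = links.get(src, [])
            if targets:
                # Each page shares its damped rank among its outlinks.
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Dangling page: spread its rank over every page.
                for node in nodes:
                    new_rank[node] += damping * rank[src] / n
        rank = new_rank
    return rank

ranks = pagerank({1: [2, 3], 2: [3], 3: [1]})
```

Scores sum to 1 and converge in a few dozen iterations; at web scale the iteration is expressed as joins between the rank vector and the link table.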
Practice Questions
How would you design a web crawler that respects robots.txt and avoids being blocked by websites?
Design an inverted index that can handle billions of documents. How do you shard and compress it?
How would you implement PageRank at scale? Design the computation and storage strategy.
Design a ranking algorithm that combines relevance, authority, and personalization. How do you A/B test it?
How would you handle incremental index updates? Design a system that keeps the index fresh in real-time.