Design a Search Engine (Google/Bing)

Build a web-scale search engine with distributed crawling, indexing, ranking algorithms, and real-time query processing for billions of documents.

System Requirements

Functional Requirements

  • Crawl and index billions of web pages
  • Process search queries with sub-second latency
  • Rank results by relevance and quality
  • Handle different content types (text, images, PDFs)
  • Support advanced search operators and filters
  • Provide search suggestions and auto-complete
  • Index fresh content with real-time updates
  • Personalized search results based on user history

Non-Functional Requirements

  • 100B+ web pages indexed
  • 100K+ search queries per second
  • Search latency < 200ms globally
  • 99.9% search availability
  • Handle 10B+ new/updated pages daily
  • Support 50+ languages and regions
  • Crawl respectfully (robots.txt compliance)
  • Storage for 100+ petabytes of content

Capacity Estimation

Search Engine Scale

Read vs Write
  • Index writes: 1x
  • Search reads: 1000x
Content Types
  • Text: 70%
  • Multimedia: 30%
Query Complexity
  • Simple queries: 80%
  • Complex queries: 20%

Search Traffic

  • Daily Searches: 8B queries (Google-scale search volume)
  • Peak QPS: 100K/second (global peak traffic)
  • Average Latency: < 200ms (end-to-end search time)
  • Index Coverage: 100B pages (estimated crawlable web)
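The traffic numbers above can be sanity-checked with quick back-of-envelope arithmetic:

```python
# Back-of-envelope check: 8B daily searches vs the quoted QPS.
daily_queries = 8_000_000_000            # 8B searches per day
seconds_per_day = 24 * 60 * 60           # 86,400
avg_qps = daily_queries / seconds_per_day
print(round(avg_qps))                    # 92593 -> ~93K average QPS
```

An average of ~93K QPS means the quoted 100K/s peak assumes a fairly flat global load curve; regional peaks would be provisioned higher.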

Infrastructure Scale

  • Crawling Infrastructure: 10K+ crawlers, 1M pages/second
  • Index Storage: 100+ petabytes of compressed indexes
  • Query Servers: 100K+ servers globally

Web Crawling Strategies

1. Breadth-First Crawling

Crawl pages level by level from seed URLs

Pros

  • Discovers popular pages first
  • Good coverage
  • Simple to implement

Cons

  • May get stuck in large sites
  • Slow to find deep content

Best For

Initial web discovery and popular content

2. Focused Crawling

Prioritize pages based on topic relevance or quality

Pros

  • Efficient resource usage
  • High-quality content
  • Topic-specific

Cons

  • Complex topic classification
  • May miss diverse content

Best For

Domain-specific search engines

3. Continuous Crawling

Monitor and re-crawl pages based on update frequency

Pros

  • Fresh content
  • Efficient updates
  • Change detection

Cons

  • Complex scheduling
  • Resource intensive

Best For

News, social media, and dynamic content
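The breadth-first strategy above can be sketched as a URL frontier with per-host politeness delays. This is a minimal in-memory sketch; the HTTP fetch, parsing, and frontier persistence are deliberately left out:

```python
import time
from collections import deque
from urllib.parse import urlparse

class BfsFrontier:
    """FIFO URL frontier with per-host politeness delays (a sketch)."""

    def __init__(self, seeds, politeness_delay=1.0):
        self.queue = deque(seeds)
        self.seen = set(seeds)           # dedupe discovered URLs
        self.next_allowed = {}           # host -> earliest fetch time
        self.delay = politeness_delay

    def pop_ready(self):
        """Return the next URL whose host is past its politeness delay."""
        for _ in range(len(self.queue)):
            url = self.queue.popleft()
            host = urlparse(url).netloc
            if time.monotonic() >= self.next_allowed.get(host, 0.0):
                self.next_allowed[host] = time.monotonic() + self.delay
                return url
            self.queue.append(url)       # host not ready yet; re-queue
        return None

    def add_links(self, links):
        """Enqueue newly discovered links, skipping already-seen URLs."""
        for url in links:
            if url not in self.seen:
                self.seen.add(url)
                self.queue.append(url)
```

Because the queue is FIFO, pages close to the seeds are fetched first, which is exactly the breadth-first property; the `seen` set doubles as URL-level deduplication.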

System Architecture

Indexing Pipeline:
Crawler Fleet → URL Queue → Content Fetcher → Content Parser →
Text Extraction → Language Detection → Content Analysis →
Tokenization → Indexing → Inverted Index Shards

Query Pipeline:
User Query → Query Processor → Index Lookup →
Result Retrieval → Ranking Algorithm → Result Aggregation →
Personalization → Response Generation → Client

Web Crawler

Technologies: Distributed crawlers, politeness policies, robots.txt
Purpose: Discover and fetch web pages at massive scale
Scale: 10B+ pages crawled daily

Content Processor

Technologies: MapReduce/Spark, NLP pipelines, content extraction
Purpose: Parse, extract, and clean content from web pages
Scale: 1M+ pages processed per minute

Indexing System

Technologies: Distributed storage, compression, incremental updates
Purpose: Build and maintain distributed inverted indexes
Scale: 100PB+ index storage

Query Processor

Technologies: Query parsing, term expansion, result aggregation
Purpose: Parse queries and retrieve relevant documents
Scale: 100K+ queries per second

Crawling Layer

• Distributed crawler fleet
• Politeness and rate limiting
• Content deduplication
• Robots.txt compliance
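Robots.txt compliance can be checked with Python's standard-library parser. A minimal sketch; the rules are supplied inline here, while a real crawler would fetch and cache `/robots.txt` per host with a TTL:

```python
from urllib.robotparser import RobotFileParser

# Robots.txt compliance check using the stdlib parser.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
])

print(rp.can_fetch("MyCrawler", "https://example.com/public/page"))   # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/page"))  # False
print(rp.crawl_delay("MyCrawler"))                                    # 2
```

Honoring the returned `crawl_delay` per host is what keeps the crawler from being rate-limited or blocked.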

Indexing Layer

• Inverted index construction
• Compression and sharding
• Incremental updates
• Link graph analysis

Serving Layer

• Query processing and parsing
• Result ranking and scoring
• Personalization engine
• Caching and optimization

Search Ranking Algorithm

Content Relevance (Very High Weight)

  • Term frequency and proximity
  • Semantic matching and synonyms
  • Content freshness and recency
  • Language and locale matching

Page Authority (High Weight)

  • PageRank and link analysis
  • Domain authority and trust
  • Citation and reference quality
  • Social signals and engagement

User Context (High Weight)

  • Search history and preferences
  • Geographic location
  • Device type and capabilities
  • Time and seasonality

Technical Quality (Medium Weight)

  • Page load speed and performance
  • Mobile friendliness
  • HTTPS and security
  • Structured data markup
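One simple way to combine these signal groups is a weighted sum. The numeric weights below are illustrative assumptions mirroring the qualitative labels (very high / high / medium), not real production values:

```python
# Illustrative combination of the four signal groups into one score.
# The weights are assumptions, not real ranking parameters.
WEIGHTS = {
    "content_relevance": 0.40,   # very high weight
    "page_authority":    0.25,   # high weight
    "user_context":      0.25,   # high weight
    "technical_quality": 0.10,   # medium weight
}

def combined_score(signals):
    """Weighted sum of per-group scores, each assumed in [0, 1]."""
    return sum(WEIGHTS[g] * signals.get(g, 0.0) for g in WEIGHTS)

doc = {"content_relevance": 0.9, "page_authority": 0.6,
       "user_context": 0.5, "technical_quality": 0.8}
print(round(combined_score(doc), 3))  # 0.715
```

Production systems replace this linear blend with learned models, but a weighted sum is a reasonable baseline and makes the relative importance of each group explicit.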

Ranking Pipeline

Document Retrieval

• Term matching in inverted index
• Boolean and phrase queries
• Synonym expansion
• Candidate set generation

Feature Scoring

• TF-IDF and BM25 scoring
• PageRank integration
• Freshness and quality signals
• Machine learning features

Final Ranking

• Learning-to-rank models
• Personalization layer
• Diversity optimization
• Result presentation
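The BM25 scoring mentioned in the feature-scoring stage can be sketched as follows; k1 and b use the common defaults, and N (corpus size), df (document frequency), and the length statistics would come from the index in practice:

```python
import math

# BM25 term scoring for the feature-scoring stage (a sketch).
def bm25(tf, df, doc_len, avg_doc_len, N, k1=1.2, b=0.75):
    """Score contribution of one query term in one document."""
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

# A document's score for a query is the sum over its query terms:
# score = sum(bm25(tf[t], df[t], dl, avgdl, N) for t in query_terms)
```

The length normalization penalizes long documents that match only incidentally, and the saturating tf term prevents keyword stuffing from scaling the score linearly.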

Performance Metrics

  • Index Lookup (inverted index query time): < 10ms
  • Ranking Computation (feature scoring and ML inference): < 50ms
  • Result Assembly (snippet generation and formatting): < 20ms
  • Total Query Time (end-to-end search latency): < 200ms
  • Index Compression (compressed vs raw index size): 10:1 ratio
  • Cache Hit Rate (query result caching effectiveness): 80%
  • Crawl Coverage (unique pages in index): billions
  • Relevance Score (user satisfaction metrics): 95%

Technical Challenges & Solutions

1. Scale of Indexing

Problem: Process and index 10B+ pages daily with real-time updates
Solution: Distributed indexing pipeline with MapReduce/Spark processing
Implementation: Horizontal sharding, parallel processing, incremental updates

2. Inverted Index Size

Problem: Store and efficiently query massive inverted indexes
Solution: Compressed indexes with delta encoding and block-based storage
Implementation: Index sharding by term, distributed across clusters
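The delta encoding mentioned above can be sketched with variable-byte (varint) compression: store the gaps between sorted doc IDs, then pack each gap into 7-bit groups with a continuation bit:

```python
# Delta + varint compression of a posting list (a sketch).
def encode_postings(doc_ids):
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:                 # must be sorted ascending
        gap = doc_id - prev                # small gaps for common terms
        prev = doc_id
        while gap >= 0x80:                 # 7 payload bits per byte
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)                    # final byte, high bit clear
    return bytes(out)

def decode_postings(data):
    doc_ids, cur, shift, prev = [], 0, 0, 0
    for byte in data:
        cur |= (byte & 0x7F) << shift
        if byte & 0x80:                    # continuation bit set
            shift += 7
        else:                              # gap complete; undo delta
            prev += cur
            doc_ids.append(prev)
            cur, shift = 0, 0
    return doc_ids

ids = [1000, 1003, 1010, 1_500_000]
assert decode_postings(encode_postings(ids)) == ids
```

For common terms most gaps fit in one or two bytes, which is where compression ratios like the 10:1 figure cited earlier come from; production indexes typically use block-based variants of this idea for faster skipping.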

3. Content Duplicate Detection

Problem: Identify and handle duplicate or near-duplicate content
Solution: Shingling and locality-sensitive hashing for similarity detection
Implementation: MinHash algorithms, content fingerprinting
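The shingling-plus-MinHash approach above can be sketched as follows; seed-salted SHA-1 stands in for a family of independent hash functions (real systems use faster non-cryptographic hashes):

```python
import hashlib

def shingles(text, k=3):
    """Set of k-word shingles from the document text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(text, num_hashes=64):
    """Keep the minimum hash of the shingle set per seeded hash."""
    shingle_set = shingles(text)
    sig = []
    for seed in range(num_hashes):
        salt = str(seed).encode()
        sig.append(min(
            int.from_bytes(hashlib.sha1(salt + s.encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return sig

def similarity(sig_a, sig_b):
    """Fraction of matching slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two pages that differ by a few words share most shingles, so most signature slots match; unrelated pages match almost none. Locality-sensitive hashing then bands the signatures so candidate pairs can be found without comparing every document against every other.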

4. Ranking Signal Integration

Problem: Combine hundreds of ranking signals in real-time
Solution: Machine learning models with feature preprocessing
Implementation: Gradient boosting, neural networks, A/B testing framework

Index and Storage Design

Inverted Index Structure

Inverted Index: Term → Posting List

"search" → {
  document_frequency: 1000000,
  postings: [
    {doc_id: 123, tf: 5, positions: [10, 45, 67]},
    {doc_id: 456, tf: 2, positions: [23, 89]},
    {doc_id: 789, tf: 1, positions: [156]},
    ...
  ]
}
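A small in-memory sketch of building these positional posting lists; real indexes are built in distributed batches, sharded by term, and compressed:

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text} -> {term: [(doc_id, tf, positions), ...]}"""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):   # keep postings sorted by doc_id
        term_positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            term_positions[term].append(pos)
        for term, positions in term_positions.items():
            index[term].append((doc_id, len(positions), positions))
    return index

docs = {1: "search engines index the web", 2: "web search at scale"}
index = build_index(docs)
print(index["search"])  # [(1, 1, [0]), (2, 1, [1])]
```

Keeping postings sorted by doc_id is what makes the delta compression and fast list intersection of multi-term queries possible; the positions support phrase and proximity queries.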

Document Store

documents table:
  - doc_id (bigint, PK)
  - url (varchar, unique)
  - title (varchar)
  - content_hash (varchar)
  - crawl_date (timestamp)
  - pagerank_score (float)
  - content_type (varchar)
  - language (varchar)
  - word_count (int)

Link Graph Schema

links table:
  - from_doc_id (bigint)
  - to_doc_id (bigint)
  - anchor_text (varchar)
  - link_type (internal/external)

Primary key: (from_doc_id, to_doc_id)
Index: (to_doc_id) for backlink analysis
Used for PageRank computation
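PageRank over this link graph can be sketched with power iteration. At web scale this runs as an iterative distributed job (MapReduce or a Pregel-style system), but the update rule is the same:

```python
# Power-iteration PageRank over (from_doc_id, to_doc_id) edges.
def pagerank(links, damping=0.85, iterations=20):
    nodes = {n for edge in links for n in edge}
    out_degree = {n: 0 for n in nodes}
    for src, _ in links:
        out_degree[src] += 1
    rank = {n: 1.0 / len(nodes) for n in nodes}     # uniform start
    for _ in range(iterations):
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, dst in links:                      # pass rank along edges
            new_rank[dst] += damping * rank[src] / out_degree[src]
        # dangling pages (no outlinks) spread their rank uniformly
        dangling = sum(rank[n] for n in nodes if out_degree[n] == 0)
        for n in nodes:
            new_rank[n] += damping * dangling / len(nodes)
        rank = new_rank
    return rank

edges = [(1, 2), (2, 3), (3, 1), (1, 3)]
ranks = pagerank(edges)
print(max(ranks, key=ranks.get))  # 3 -- the page with two inlinks
```

Each iteration reads the old rank vector and writes a new one, which maps directly onto a shuffle-by-destination distributed job; convergence typically takes tens of iterations.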

Practice Questions

1. How would you design a web crawler that respects robots.txt and avoids being blocked by websites?

2. Design an inverted index that can handle billions of documents. How do you shard and compress it?

3. How would you implement PageRank at scale? Design the computation and storage strategy.

4. Design a ranking algorithm that combines relevance, authority, and personalization. How do you A/B test it?

5. How would you handle incremental index updates? Design a system that keeps the index fresh in real-time.