
Design a Web Search Engine

Build a Google-scale search engine handling 100B+ web pages, 100K+ QPS, with sub-200ms global latency and advanced ranking algorithms.

Distributed Systems · Information Retrieval · Web Scale

1.1 Clarifying Questions

Q: What's the scope and scale of the search engine we're building?
A: Build a general-purpose web search engine like Google, handling 100B+ web pages with 100K+ queries per second globally.
Engineering Implications: Massive web-scale system requiring distributed crawling, petabyte-scale storage, sub-200ms query latency, and 99.9% availability. Need to compete with Google's quality and performance.
Q: What types of content and search capabilities should we support?
A: Index text, images, PDFs, videos. Support basic keyword search, phrase queries, boolean operators, auto-complete, and personalized results.
Engineering Implications: Multi-modal content indexing requires specialized parsers and different ranking signals. Advanced search operators need sophisticated query processing. Personalization requires user profiling and ML models.
Q: What are our crawling and indexing requirements?
A: Crawl 10B+ pages daily, respect robots.txt, handle duplicate detection, and provide fresh results. Index updates should be near real-time for news content.
Engineering Implications: Distributed crawler fleet with politeness policies. Incremental indexing for freshness vs. full rebuilds for quality. Duplicate detection at web scale requires approximate algorithms like MinHash.
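The MinHash technique mentioned above can be shown in a few lines. This is an illustrative Python sketch (word shingling plus salted MD5 standing in for true random permutations), not a production implementation:

```python
import hashlib
import re

def shingles(text, k=3):
    """Split text into overlapping k-word shingles."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(text, num_hashes=64):
    """One min-hash per 'permutation', simulated by salting a base hash."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two near-duplicate pages yield signatures that agree in most slots, so comparing fixed-size signatures replaces comparing full documents, which is what makes deduplication tractable at 100B-page scale.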
Q: What are the performance and quality expectations?
A: Sub-200ms search latency globally, 99.9% availability, highly relevant results comparable to Google's quality standards.
Engineering Implications: Global latency requires edge deployment and sophisticated caching. High relevance needs advanced ranking algorithms combining text relevance, link analysis, and user signals. Quality demands continuous experimentation.
Q: What business and regulatory constraints do we need to consider?
A: Operate globally with GDPR compliance, handle right-to-be-forgotten, provide search result explanations, and support multiple languages/regions.
Engineering Implications: Privacy regulations affect data collection and retention. Multi-language support needs language-specific processing. Regional compliance may require data residency and content filtering.

1.2 Back-of-the-Envelope Calculations

Search Traffic & Load

Daily search queries: 8B queries/day (Google-scale search volume)
Peak QPS: ~100K queries/second (8B/day averages ≈93K QPS, so provision headroom above the mean for peak hours)
Average query length: 3-4 words, ~20 characters (typical user search behavior)
Results per query: top 10 + snippets (standard SERP format)

Storage & Index Requirements

Web pages indexed: 100B documents (estimated crawlable web size)
Average page size: 50KB text + metadata (after content extraction)
Raw content storage: 5PB (100B × 50KB; compression reduces the stored footprint further)
Inverted index size: 500TB (~10% of raw content size)

Infrastructure Scale

Crawler fleet: 10K crawler instances (10B pages/day ≈ 116K pages/second, roughly 12 pages/second per instance)
Query servers: 100K servers globally (at ~1K QPS per server this far exceeds 100K QPS of frontend traffic, because each query fans out across thousands of index shards)
Index shards: 10K shards (distributed across data centers)
Global data centers: 50+ regions (edge deployment for latency)

Monthly Operating Costs

Compute infrastructure: $200M/month (query servers + crawlers + indexing)
Storage & networking: $50M/month (distributed storage + CDN + bandwidth)
Data center operations: $30M/month (power, cooling, facilities, staff)
Total operational cost: $280M/month (Google-scale infrastructure costs)
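The headline figures above can be sanity-checked with quick arithmetic (all inputs are the assumed estimates from these tables, not measurements):

```python
SECONDS_PER_DAY = 86_400

queries_per_day = 8e9
avg_qps = queries_per_day / SECONDS_PER_DAY      # ≈ 93K QPS average

docs = 100e9
page_kb = 50
raw_storage_pb = docs * page_kb / 1e12           # KB -> PB: 5 PB raw
index_tb = raw_storage_pb * 1000 * 0.10          # ~10% of raw -> 500 TB

pages_per_day = 10e9
crawl_rate = pages_per_day / SECONDS_PER_DAY     # ≈ 116K pages/second

print(f"avg QPS     ≈ {avg_qps:,.0f}")
print(f"raw storage = {raw_storage_pb:.0f} PB")
print(f"index size  ≈ {index_tb:.0f} TB")
print(f"crawl rate  ≈ {crawl_rate:,.0f} pages/s")
```

Note that 8B queries/day already averages ~93K QPS, so the 100K QPS figure is close to the daily mean rather than a 3x peak; real capacity planning would budget well above it.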

Interview Practice Questions

Practice these open-ended questions to prepare for system design interviews. Think through each scenario and discuss trade-offs.

1

Distributed Crawling System: Design a web crawler that processes 10B+ pages daily while respecting politeness policies. Address URL frontier management, duplicate detection using MinHash, robots.txt compliance, dynamic content rendering, and handling crawler traps.
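The politeness side of the URL frontier can be sketched as per-host queues plus a schedule of when each host may next be hit. This is a toy in-memory Python version (class and method names are illustrative; a production frontier persists state and prioritizes by page value):

```python
import heapq
import time
from collections import deque
from urllib.parse import urlparse

class Frontier:
    """Toy URL frontier: one FIFO queue per host, plus a heap keyed by the
    earliest time each host may be fetched again (the politeness delay)."""

    def __init__(self, delay=1.0):
        self.delay = delay    # minimum seconds between requests to one host
        self.queues = {}      # host -> deque of pending URLs
        self.ready = []       # heap of (next_allowed_fetch_time, host)
        self.seen = set()     # exact-URL dedup; near-dupes need MinHash etc.

    def add(self, url):
        if url in self.seen:
            return
        self.seen.add(url)
        host = urlparse(url).netloc
        if host not in self.queues:
            self.queues[host] = deque()
            heapq.heappush(self.ready, (0.0, host))
        self.queues[host].append(url)

    def next_url(self, now=None):
        """Return a URL that is polite to fetch now, or None if none is ready."""
        if now is None:
            now = time.monotonic()
        while self.ready:
            t, host = self.ready[0]
            if t > now:
                return None   # every host is still inside its delay window
            heapq.heappop(self.ready)
            queue = self.queues[host]
            if not queue:
                continue      # host drained; discard its stale heap entry
            heapq.heappush(self.ready, (now + self.delay, host))
            return queue.popleft()
        return None
```

The heap naturally interleaves hosts, so a burst of URLs from one site never monopolizes the fleet.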

2

Inverted Index Architecture: Build an inverted index system for 100B documents with sub-second query performance. Design sharding strategies (term-based vs document-based), compression techniques, posting list formats, and strategies for handling updates without full rebuilds.
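Posting-list compression is what makes the index fit in ~10% of raw content size. A minimal Python sketch of the standard delta-plus-varint encoding (illustrative, without the block structure and skip pointers a real format would add):

```python
def varint_encode(numbers):
    """Encode non-negative ints as variable-length bytes, 7 bits per byte."""
    out = bytearray()
    for n in numbers:
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)   # high bit set: more bytes follow
            n >>= 7
        out.append(n)
    return bytes(out)

def varint_decode(data):
    numbers, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(n)
            n, shift = 0, 0
    return numbers

def compress_postings(doc_ids):
    """Store sorted doc IDs as gaps; gaps are small and varint-friendly."""
    deltas = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return varint_encode(deltas)

def decompress_postings(data):
    doc_ids, total = [], 0
    for delta in varint_decode(data):
        total += delta
        doc_ids.append(total)
    return doc_ids
```

Because doc IDs in a posting list are sorted, gap encoding turns large IDs into small integers, and varints then store most gaps in one or two bytes instead of eight.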

3

Ranking Pipeline: Implement a multi-stage ranking system combining text relevance, PageRank, user signals, and ML models. Address feature extraction at query time, model serving infrastructure, A/B testing framework, and real-time personalization.
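The multi-stage idea can be illustrated with a cheap BM25 first stage and a linear blend of signals standing in for the learned second stage. Weights and signal names here are invented for illustration; production systems replace the blend with a trained learning-to-rank model:

```python
import math

def bm25_score(tf, df, doc_len, avg_len, n_docs, k1=1.2, b=0.75):
    """Okapi BM25 for a single query term (stage 1: cheap text relevance)."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

def final_score(text_score, pagerank, click_rate,
                w_text=1.0, w_link=0.5, w_clicks=0.3):
    """Stage 2: blend signals. The log tames PageRank's huge dynamic range."""
    return (w_text * text_score
            + w_link * math.log(1 + pagerank * 1e6)
            + w_clicks * click_rate)

def rank(candidates, top_k=10):
    """candidates: list of (doc_id, text_score, pagerank, click_rate)."""
    scored = [(final_score(t, p, c), doc) for doc, t, p, c in candidates]
    return [doc for _, doc in sorted(scored, reverse=True)[:top_k]]
```

Stage 1 prunes 100B documents to a few thousand candidates per shard; only those survivors pay the cost of full feature extraction and model scoring.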

4

Query Understanding & Expansion: Design systems for spell correction, synonym expansion, query intent classification, and semantic understanding. Include handling of ambiguous queries, multi-language support, and voice search processing.
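Spell correction is often sketched as generating all candidates within edit distance 1 and scoring them by query-log frequency. A Norvig-style Python toy, where the `vocab` frequency dictionary is an assumed input:

```python
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def correct(word, vocab):
    """Prefer known words, then 1-edit candidates by query-log frequency;
    fall back to the input unchanged if nothing matches."""
    if word in vocab:
        return word
    candidates = edits1(word) & vocab.keys()
    return max(candidates, key=vocab.get) if candidates else word
```

At web scale this exhaustive generation is replaced by precomputed correction dictionaries and noisy-channel models mined from query logs, but the candidate-plus-frequency structure is the same.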

5

Real-Time Indexing: Build a system for indexing breaking news and trending content within seconds. Design the pipeline from content discovery to serving, handling duplicate detection, quality filtering, and integration with the main index.
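The delta-index pattern behind near-real-time serving can be sketched in Python. This is a toy with in-memory tiers; real systems use on-disk segments with background merges (Lucene-style), but the query-time union is the same idea:

```python
class TieredIndex:
    """Toy two-tier index: a small in-memory delta for fresh documents,
    periodically flushed into the main index. Queries union both tiers,
    so new content is searchable within seconds of discovery."""

    def __init__(self):
        self.main = {}    # term -> set of doc_ids (the big, slow-to-rebuild tier)
        self.delta = {}   # term -> set of doc_ids (fresh content)

    def add_fresh(self, doc_id, terms):
        for t in terms:
            self.delta.setdefault(t, set()).add(doc_id)

    def flush(self):
        """Simulates the periodic delta-to-main merge."""
        for t, docs in self.delta.items():
            self.main.setdefault(t, set()).update(docs)
        self.delta.clear()

    def search(self, term):
        return self.main.get(term, set()) | self.delta.get(term, set())
```

Keeping the delta small bounds the extra query-time cost, while the merge amortizes the expensive main-index update.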

6

Search Quality & Spam Detection: Implement comprehensive quality control including spam detection, content farms identification, SEO manipulation prevention, and fake news filtering. Design ML pipelines for continuous improvement and adversarial robustness.
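A feature-based scorer makes the spam-detection pipeline concrete. The features and weights below are invented for illustration; production systems learn thousands of features, including link-graph and host-level signals, with models retrained continuously against new spam patterns:

```python
def spam_features(doc):
    """A few hand-picked illustrative signals computed from page content."""
    words = doc["text"].lower().split()
    unique_ratio = len(set(words)) / max(len(words), 1)
    return {
        "keyword_stuffing": 1.0 - unique_ratio,          # repetitive text
        "link_density": doc["n_links"] / max(len(words), 1),
        "hidden_text": float(doc.get("hidden_text", False)),
    }

def spam_score(doc, weights=None):
    """Linear scorer standing in for a trained classifier
    (e.g. gradient-boosted trees in a real pipeline)."""
    weights = weights or {"keyword_stuffing": 2.0,
                          "link_density": 1.5,
                          "hidden_text": 3.0}
    feats = spam_features(doc)
    return sum(weights[k] * v for k, v in feats.items())
```

Pages above a score threshold would be demoted or dropped; the adversarial part is that spammers adapt to whatever features are used, which is why the pipeline must be retrained and re-evaluated continuously.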