Design a Web Search Engine
Build a Google-scale search engine that indexes 100B+ web pages and serves 100K+ QPS with sub-200ms latency worldwide, using advanced ranking algorithms.
1.1 Clarifying Questions
1.2 Back-of-the-Envelope Calculations
Search Traffic & Load
Storage & Index Requirements
Infrastructure Scale
Monthly Operating Costs
Interview Practice Questions
Practice these open-ended questions to prepare for system design interviews. Think through each scenario and discuss trade-offs.
Distributed Crawling System: Design a web crawler that processes 10B+ pages daily while respecting politeness policies. Address URL frontier management, duplicate detection using MinHash, robots.txt compliance, dynamic content rendering, and handling crawler traps.
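As a starting point for the duplicate-detection part of this question, here is a minimal sketch of MinHash over word shingles. The function names, the 3-word shingle size, the 64 hash functions, and the use of seeded BLAKE2b as the hash family are all illustrative assumptions, not a prescribed design; a production crawler would typically add locality-sensitive hashing on top so that signatures do not have to be compared pairwise.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles in text (k=3 is an assumed default)."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(shingle, seed):
    """64-bit hash of a shingle; each seed acts as a different hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=64, k=3):
    """Keep the minimum hash value per seed across all shingles."""
    shingle_set = shingles(text, k)
    if not shingle_set:
        return [0] * num_hashes
    return [
        min(_seeded_hash(s, seed) for s in shingle_set)
        for seed in range(num_hashes)
    ]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


page_a = "breaking news about the new search engine launch today"
page_b = "breaking news about the new search engine launch today"
page_c = "recipe for slow cooked lentil soup with fresh herbs"

sig_a = minhash_signature(page_a)
sig_b = minhash_signature(page_b)
sig_c = minhash_signature(page_c)
```

Two pages whose estimated similarity exceeds a chosen threshold (the threshold itself is a tuning decision) would be treated as near-duplicates, and only one of them crawled or indexed.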
Inverted Index Architecture: Build an inverted index system for 100B documents with sub-second query performance. Design sharding strategies (term-based vs document-based), compression techniques, posting list formats, and strategies for handling updates without full rebuilds.
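One common compression technique worth discussing here is gap (delta) encoding of sorted document IDs followed by variable-byte encoding. The sketch below is a minimal illustration under assumed conventions (strictly increasing, non-negative integer doc IDs; 7 payload bits per byte with the high bit as a continuation flag); the function names are hypothetical and real engines often use block-based schemes instead.

```python
def encode_postings(doc_ids):
    """Delta-encode sorted doc IDs, then varint-encode each gap."""
    out = bytearray()
    prev = 0
    for i, doc_id in enumerate(doc_ids):
        gap = doc_id - prev
        # The first ID may be 0; every later gap must be positive.
        if gap < 0 or (i > 0 and gap == 0):
            raise ValueError("doc IDs must be non-negative and strictly increasing")
        # Emit 7 bits at a time, low bits first; high bit means "more bytes follow".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
        prev = doc_id
    return bytes(out)


def decode_postings(data):
    """Inverse of encode_postings: rebuild absolute doc IDs from gaps."""
    doc_ids = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation bit set: keep accumulating this gap
        else:
            current += gap
            doc_ids.append(current)
            gap = 0
            shift = 0
    return doc_ids


postings = [3, 7, 300, 100000]
encoded = encode_postings(postings)
decoded = decode_postings(encoded)
```

Because frequent terms have dense posting lists, most gaps are small and fit in a single byte, which is where the savings come from relative to fixed-width IDs.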
Ranking Pipeline: Implement a multi-stage ranking system combining text relevance, PageRank, user signals, and ML models. Address feature extraction at query time, model serving infrastructure, A/B testing framework, and real-time personalization.
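The multi-stage idea can be sketched as a cheap first pass that shortlists candidates, followed by a more expensive rerank over the shortlist only. Everything specific below is an assumption for illustration: the `Candidate` fields, the linear blend standing in for an ML model, and the weights.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: str
    text_score: float   # e.g. a BM25-style relevance score, assumed precomputed
    pagerank: float     # link-based authority, assumed normalized to [0, 1]
    click_rate: float   # aggregated user signal, assumed normalized to [0, 1]


def first_stage(candidates, k):
    """Cheap stage: keep the top-k documents by text relevance alone."""
    return heapq.nlargest(k, candidates, key=lambda c: c.text_score)


def second_stage_score(c, weights=(0.6, 0.3, 0.1)):
    """Costlier stage: a linear blend standing in for a learned model."""
    w_text, w_pr, w_click = weights
    return w_text * c.text_score + w_pr * c.pagerank + w_click * c.click_rate


def rank(candidates, first_k=100, final_k=10):
    """Shortlist with the cheap score, then rerank the shortlist only."""
    shortlist = first_stage(candidates, first_k)
    reranked = sorted(shortlist, key=second_stage_score, reverse=True)
    return reranked[:final_k]


candidates = [
    Candidate("a", text_score=0.9, pagerank=0.1, click_rate=0.1),
    Candidate("b", text_score=0.8, pagerank=0.9, click_rate=0.5),
    Candidate("c", text_score=0.1, pagerank=1.0, click_rate=1.0),
]
results = rank(candidates, first_k=2, final_k=2)
result_ids = [c.doc_id for c in results]
```

In this toy run, document `c` is dropped by the first stage despite its strong authority signals, which illustrates the trade-off to discuss: a small shortlist keeps the expensive stage fast but can discard documents the full model would have ranked highly.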
Query Understanding & Expansion: Design systems for spell correction, synonym expansion, query intent classification, and semantic understanding. Include handling of ambiguous queries, multi-language support, and voice search processing.
Real-Time Indexing: Build a system for indexing breaking news and trending content within seconds. Design the pipeline from content discovery to serving, handling duplicate detection, quality filtering, and integration with the main index.
Search Quality & Spam Detection: Implement comprehensive quality control, including spam detection, content-farm identification, prevention of SEO manipulation, and fake-news filtering. Design ML pipelines for continuous improvement and adversarial robustness.