Web Crawler Architecture

Build scalable web crawlers: distributed architecture, rate limiting, politeness policies, and data processing.


Problem Statement

Design a web crawler that can systematically discover and download billions of web pages for a search engine. The system must be scalable, respectful of websites, and handle various content types while avoiding traps and duplicate content.

Key Requirements

  • Crawl 1 billion pages per month
  • Respect robots.txt and rate limiting
  • Handle different content types (HTML, PDF, images)
  • Avoid crawler traps and infinite loops
  • Prioritize important pages and fresh content

Step 1: Clarify Requirements

Functional Requirements

  • Web Crawling: Download web pages from given seed URLs
  • Content Types: HTML, PDF, images, videos (configurable)
  • Politeness: Respect robots.txt and rate limits
  • Freshness: Re-crawl pages based on update frequency

Non-Functional Requirements

  • Scale: 1 billion pages/month (~400 pages/second)
  • Robustness: Handle failures, malicious websites
  • Efficiency: Avoid duplicate crawling, optimize bandwidth
  • Extensibility: Easy to add new content processors

Step 2: Back-of-Envelope Estimation

Crawl Rate

  • 1B pages/month ≈ 400 pages/second on average
  • Peak: ~800 pages/second
  • Requires distributed crawling

Storage

  • Average page: 500KB
  • Page content: 1B × 500KB = 500TB/month
  • Metadata: ~50TB
  • URLs to crawl: ~10TB

Bandwidth

  • 400 pages/sec × 500KB = 200 MB/second
  • ≈ 17TB per day
  • Requires high aggregate bandwidth
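The arithmetic above can be checked with a few lines of Python; the inputs (1 billion pages/month, 500KB average page) are just the estimates from this section.

```python
# Back-of-envelope check of the estimates above.
PAGES_PER_MONTH = 1_000_000_000
SECONDS_PER_MONTH = 30 * 24 * 3600          # ~2.59M seconds
AVG_PAGE_SIZE_KB = 500

pages_per_second = PAGES_PER_MONTH / SECONDS_PER_MONTH
bandwidth_mb_per_sec = pages_per_second * AVG_PAGE_SIZE_KB / 1000
storage_tb_per_month = PAGES_PER_MONTH * AVG_PAGE_SIZE_KB / 1e9   # KB -> TB

print(f"Crawl rate: {pages_per_second:,.0f} pages/sec")        # ~386
print(f"Bandwidth:  {bandwidth_mb_per_sec:,.0f} MB/sec")        # ~193
print(f"Content:    {storage_tb_per_month:,.0f} TB/month")      # 500
```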

Step 3: High-Level Design

Crawler Architecture

Seed URLs → URL Frontier → Crawler Workers → Content Processor → Storage

URL Frontier

  • Queue of URLs to be crawled
  • Priority-based ordering
  • Politeness per domain
  • Freshness scheduling

Crawler Workers

  • Download pages from URLs
  • Respect robots.txt
  • Handle timeouts/errors
  • Extract new URLs

Content Processor

  • Parse HTML and extract links
  • Detect duplicate content
  • Content validation
  • Malware scanning

Storage Layer

  • Store crawled content
  • URL database and metadata
  • Duplicate detection cache
  • Robots.txt cache
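To make the hand-off between these four components concrete, here is a minimal single-process sketch. It is illustrative only: the class and function names are invented, and `fetch`, `parse`, and `store` stand in for the crawler workers, content processor, and storage layer described above.

```python
from collections import deque

class URLFrontier:
    """Holds URLs to crawl; a real frontier adds priority and per-domain queues."""
    def __init__(self, seeds):
        self.queue = deque(seeds)
        self.seen = set(seeds)

    def add(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        return self.queue.popleft() if self.queue else None

def crawl(seeds, fetch, parse, store, limit=100):
    """fetch(url) -> html or None, parse(html) -> links, store(url, html) are injected."""
    frontier = URLFrontier(seeds)
    crawled = 0
    while crawled < limit:
        url = frontier.next_url()
        if url is None:
            break
        html = fetch(url)            # crawler worker: download the page
        if html is None:
            continue                 # skip failed downloads
        store(url, html)             # storage layer: persist content
        for link in parse(html):     # content processor: extract new URLs
            frontier.add(link)
        crawled += 1
```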

Step 4: Deep Dive - URL Frontier

Politeness & Priority Management

Politeness (Per Domain)

  • Separate queue per domain/host
  • Configurable delay between requests (scheduling sketched below)
  • Respect robots.txt crawl-delay
  • Monitor server response times

Priority Queues

  • High: news sites, popular domains
  • Medium: business sites, blogs
  • Low: personal pages, forums
  • Dynamic priority based on PageRank
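The per-domain politeness rules above can be sketched with one FIFO queue per domain plus a heap that decides which domain may be fetched next. This is a simplified, single-threaded illustration (names are invented); a production frontier, such as the Mercator-style design, layers priority front-queues on top of these per-domain back-queues.

```python
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

class PoliteFrontier:
    """One FIFO queue per domain; a heap schedules which domain is fetched next."""
    def __init__(self, default_delay=1.0):
        self.queues = defaultdict(deque)                  # domain -> pending URLs
        self.delay = defaultdict(lambda: default_delay)   # crawl delay per domain
        self.next_allowed = defaultdict(float)            # earliest next fetch per domain
        self.heap = []                                    # (next_allowed_time, domain)
        self.scheduled = set()                            # domains currently in the heap

    def add(self, url):
        domain = urlsplit(url).netloc
        self.queues[domain].append(url)
        if domain not in self.scheduled:
            self.scheduled.add(domain)
            heapq.heappush(self.heap, (self.next_allowed[domain], domain))

    def next_url(self):
        while self.heap:
            ready_at, domain = heapq.heappop(self.heap)
            if not self.queues[domain]:
                self.scheduled.discard(domain)
                continue
            wait = ready_at - time.monotonic()
            if wait > 0:
                time.sleep(wait)                          # enforce per-domain politeness
            url = self.queues[domain].popleft()
            self.next_allowed[domain] = time.monotonic() + self.delay[domain]
            heapq.heappush(self.heap, (self.next_allowed[domain], domain))
            return url
        return None
```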

Duplicate Detection Strategies

URL Level

  • Normalize URLs (remove fragments)
  • Hash table for seen URLs
  • Bloom filter for memory efficiency (sketched below)
  • Handle URL redirects

Content Level

  • Hash content (MD5, SHA-256)
  • SimHash for near-duplicates
  • Compare text similarity
  • Handle dynamic content

Performance

  • Distributed hash tables
  • LRU cache for hot URLs
  • Periodic cleanup
  • Probabilistic data structures
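A sketch of the URL-level ideas above: normalization plus a small hand-rolled Bloom filter for the "seen URL" set, and a SHA-256 fingerprint for exact content duplicates. The filter size and hash count are placeholder values; a real system would tune them for the expected URL volume, and would use SimHash or MinHash for near-duplicates.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Drop fragments, lowercase scheme/host, strip default ports and trailing slashes."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""
    default_port = {"http": 80, "https": 443}.get(scheme)
    netloc = host if parts.port in (None, default_port) else f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))

class BloomFilter:
    """Probabilistic 'seen URL' set: no false negatives, tunable false positives."""
    def __init__(self, size_bits=1 << 24, hashes=4):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def content_fingerprint(body: bytes) -> str:
    """Exact-duplicate check only; near-duplicates need SimHash/MinHash instead."""
    return hashlib.sha256(body).hexdigest()

seen = BloomFilter()
url = normalize("HTTPS://Example.com:443/a/b/#section")   # -> https://example.com/a/b
if url not in seen:
    seen.add(url)          # only now schedule the URL for crawling
```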

Robustness & Error Handling

Crawler Traps

  • Infinite calendar/pagination pages
  • Enforce URL length limits (~2KB)
  • Enforce depth limits per domain
  • Detect cycles with heuristics (URL filter sketched below)
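A simple URL filter along these lines can reject many traps before a URL ever enters the frontier; the thresholds below are illustrative, not recommendations.

```python
from urllib.parse import urlsplit

MAX_URL_LENGTH = 2048        # reject suspiciously long URLs
MAX_PATH_DEPTH = 10          # path segments allowed per URL
MAX_REPEATED_SEGMENT = 3     # /calendar/calendar/calendar/... is a trap signal

def looks_like_trap(url: str) -> bool:
    if len(url) > MAX_URL_LENGTH:
        return True
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > MAX_PATH_DEPTH:
        return True
    # Repeated path segments often indicate an infinite loop (e.g. calendar pages).
    if any(segments.count(s) > MAX_REPEATED_SEGMENT for s in segments):
        return True
    return False
```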

Failure Handling

  • Retry with exponential backoff (sketched below)
  • Circuit breaker for bad hosts
  • Timeout configuration
  • Dead letter queue
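Retry with exponential backoff and jitter might look like the sketch below; `fetch` is any download callable that raises on failure (an assumed interface), and a real worker would also route repeated failures to a dead letter queue and a per-host circuit breaker.

```python
import random
import time

def fetch_with_retries(fetch, url, max_attempts=4, base_delay=1.0):
    """Retry transient failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # hand off to a dead letter queue upstream
            # 1s, 2s, 4s, ... plus jitter so workers don't retry in lockstep
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
```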

Scalability & Performance

Horizontal Scaling

  • Partition URLs by domain hash (sketched below)
  • Independent crawler workers
  • Shared distributed queue (e.g., Kafka)
  • Auto-scaling based on queue depth
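Partitioning by domain hash keeps every URL of a host on the same worker, so that worker alone owns the host's politeness state and robots.txt cache. The modulo routing below is the simplest version; production systems usually prefer consistent hashing so that resizing the worker pool moves fewer domains.

```python
import hashlib
from urllib.parse import urlsplit

NUM_WORKERS = 64

def worker_for(url: str) -> int:
    """Route all URLs of a domain to the same worker."""
    domain = urlsplit(url).netloc.lower()
    digest = hashlib.md5(domain.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_WORKERS

assert worker_for("https://example.com/a") == worker_for("https://example.com/b")
```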

Caching Strategy

  • DNS caching (reduce lookups)
  • Robots.txt cache per domain (sketched below)
  • URL seen cache (Bloom filter)
  • Content hash cache
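A robots.txt cache can be built on the standard-library `urllib.robotparser`. The TTL and the decision to allow crawling when robots.txt is unreachable are policy choices, not fixed rules.

```python
import time
from urllib import robotparser
from urllib.parse import urlsplit

ROBOTS_TTL = 24 * 3600           # re-fetch robots.txt once a day
_robots_cache = {}               # domain -> (fetched_at, parser or None)

def allowed(url: str, user_agent: str = "MyCrawler") -> bool:
    """Check robots.txt, fetching and caching it once per domain per TTL."""
    parts = urlsplit(url)
    domain = f"{parts.scheme}://{parts.netloc}"
    entry = _robots_cache.get(domain)
    if entry is None or time.time() - entry[0] > ROBOTS_TTL:
        parser = robotparser.RobotFileParser(f"{domain}/robots.txt")
        try:
            parser.read()        # one network fetch per domain per TTL
        except OSError:
            parser = None        # unreachable robots.txt: policy decision to allow
        entry = (time.time(), parser)
        _robots_cache[domain] = entry
    parser = entry[1]
    return True if parser is None else parser.can_fetch(user_agent, url)
```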

Storage Optimization

  • Compress content before storage
  • Deduplicate at the storage layer
  • Hot/cold data separation
  • Archive old crawl data

Network Optimization

  • HTTP/2 for multiplexing
  • Connection pooling per domain (sketched below)
  • Geographic distribution
  • Bandwidth rate limiting
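With the third-party `requests` library, per-host connection pooling and a descriptive User-Agent look roughly like this (the bot name and URL are placeholders). Note that `requests` speaks HTTP/1.1; HTTP/2 multiplexing would need a different client such as `httpx`.

```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
session.headers.update({"User-Agent": "MyCrawler/1.0 (+https://example.com/bot)"})

# Reuse TCP/TLS connections across requests to the same hosts.
adapter = HTTPAdapter(pool_connections=100,   # number of host pools to keep
                      pool_maxsize=10)        # connections kept per host
session.mount("https://", adapter)
session.mount("http://", adapter)

response = session.get("https://example.com/", timeout=10)
```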

Content Processing Pipeline

Processing Stages

1. Download
  • HTTP client
  • Handle redirects
  • Timeout management
2. Parse
  • Extract links (extractor sketched below)
  • Parse HTML/XML
  • Handle malformed content
3. Filter
  • Content validation
  • Spam detection
  • Quality scoring
4. Store
  • Index content
  • Update metadata
  • Queue new URLs
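The Parse stage's link extraction can be done with the standard-library `HTMLParser`, which tolerates malformed markup. This sketch only handles `<a href>` tags and resolves relative links against the page URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute links from <a href="..."> tags; tolerant of malformed HTML."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and not href.startswith(("javascript:", "mailto:", "#")):
                self.links.append(urljoin(self.base_url, href))

def extract_links(base_url: str, html: str) -> list[str]:
    extractor = LinkExtractor(base_url)
    extractor.feed(html)
    return extractor.links

print(extract_links("https://example.com/page",
                    '<a href="/about">About</a> <a href="#top">Top</a>'))
# ['https://example.com/about']
```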

Real-World Web Crawlers

Googlebot

  • Crawls trillions of pages
  • Mobile-first indexing
  • JavaScript rendering (Chromium)
  • Respects crawl budget per site

Common Crawl

  • Non-profit web crawling project
  • Monthly large-scale crawls of the public web
  • Open dataset for research
  • Processes 25+ billion pages

Internet Archive

  • Preserves web history
  • Crawls for long-term storage
  • Heritrix crawler framework
  • 70+ billion web pages archived

Scrapy (Open Source)

  • Python web crawling framework
  • Built-in duplicate filtering
  • Async networking with Twisted
  • Extensible middleware system

Crawler Ethics & Best Practices

  • Always respect robots.txt directives
  • Implement rate limiting to avoid overwhelming servers
  • Use appropriate User-Agent headers
  • Handle error responses gracefully
  • Consider legal and copyright implications
  • Monitor and respond to website owner requests
