Web-Scale Data Sources
Master web-scale data collection for pretraining: Common Crawl processing, data pipelines, and quality filtering at scale
What are Web-Scale Data Sources?
Web-scale data sources are massive collections of internet content used to pretrain large language models. These include Common Crawl (petabytes of web pages), Wikipedia, GitHub code, academic papers, books, and curated datasets. Processing these sources requires distributed systems capable of handling billions of documents.
Building Pretraining Datasets at Scale
Modern language models require trillions of tokens for pretraining, necessitating sophisticated data collection and processing pipelines that can handle petabyte-scale datasets while maintaining quality, diversity, and legal compliance.
Major Data Sources
- Common Crawl: 250+ billion pages
- Wikipedia: 6M+ English articles across 300+ language editions
- GitHub: 200M+ repositories
- ArXiv: 2M+ academic papers
- Books3: 196,640 books
- Reddit: 2B+ comments
- Stack Exchange: 50M+ Q&As
Processing Challenges
- Deduplication at scale
- Quality filtering
- Language detection (sketched after this list)
- PII removal
- Copyright compliance
- Format standardization
- Toxicity filtering
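Two of these challenges, language detection and PII removal, have reasonable off-the-shelf starting points. The sketch below is a minimal illustration, assuming the pretrained fastText lid.176.bin language-identification model has been downloaded locally; the email and phone regexes are illustrative, not a complete PII solution.

```python
import re
import fasttext  # pip install fasttext

# Assumption: the pretrained fastText language-ID model lid.176.bin
# sits in the working directory.
lang_model = fasttext.load_model("lid.176.bin")

# Illustrative (not exhaustive) PII patterns: emails and US-style phone numbers.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def detect_language(text: str) -> tuple[str, float]:
    """Return (language code, confidence) for a document."""
    # fastText predicts one line at a time, so strip newlines first.
    labels, probs = lang_model.predict(text.replace("\n", " "), k=1)
    return labels[0].replace("__label__", ""), float(probs[0])

def scrub_pii(text: str) -> str:
    """Replace obvious email addresses and phone numbers with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)

doc = "Contact john.doe@example.com or 555-123-4567 for details."
lang, conf = detect_language(doc)
print(lang, conf, scrub_pii(doc))
```

In production, PII detection usually combines many more patterns with named-entity models, but the shape of the stage is the same: classify, scrub, and record what was removed.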
Dataset Processing Calculator
Estimate storage, processing time, and quality for your pretraining dataset pipeline:
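The underlying arithmetic is simple. The sketch below is a back-of-the-envelope version; the constants (retention rate after filtering, bytes per token, per-core throughput) are assumptions to replace with your own measurements.

```python
# Back-of-the-envelope pipeline estimates; every default below is an
# assumption, not a measured value.
def estimate_pipeline(
    raw_tb: float,                  # raw input size in terabytes
    retention: float = 0.15,        # fraction of text surviving filtering/dedup (assumed)
    bytes_per_token: float = 4.0,   # rough UTF-8 bytes per token (assumed)
    mb_per_core_sec: float = 5.0,   # processing throughput per CPU core (assumed)
    cores: int = 256,
) -> dict:
    raw_bytes = raw_tb * 1e12
    kept_bytes = raw_bytes * retention
    tokens = kept_bytes / bytes_per_token
    wall_clock_hours = raw_bytes / 1e6 / mb_per_core_sec / cores / 3600
    return {
        "kept_tb": kept_bytes / 1e12,
        "approx_tokens_billion": tokens / 1e9,
        "wall_clock_hours": wall_clock_hours,
    }

print(estimate_pipeline(raw_tb=100))  # e.g. 100 TB of extracted WET text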
Common Crawl Architecture
Common Crawl is the largest publicly available web crawl dataset, with over 250 billion web pages collected since 2008. Processing it requires specialized infrastructure and techniques.
Data Formats
WARC Files
Raw web archive format
- Complete HTTP responses
- ~1GB compressed files
- Preserves all metadata
WET Files
Extracted plain text
- Pre-extracted text
- ~200MB per file
- Faster processing
WAT Files
Metadata and links
- JSON metadata
- Link structure
- Headers and tags
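As a concrete starting point, the sketch below streams records from a single WET file using the warcio library; the file name is a placeholder for whichever segment you download from a crawl's wet.paths listing.

```python
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

# Placeholder path to a locally downloaded WET file.
WET_PATH = "CC-MAIN-example.warc.wet.gz"

with open(WET_PATH, "rb") as stream:
    # ArchiveIterator transparently handles the gzipped record format.
    for record in ArchiveIterator(stream):
        # WET files store extracted plain text as 'conversion' records.
        if record.rec_type != "conversion":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="replace")
        print(url, len(text))
```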
Processing Pipeline Stages
Common Crawl Processing Pipeline
A production pipeline for processing Common Crawl data is typically built with Apache Spark, combining quality filtering, deduplication, and language detection at scale.
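A full production pipeline is beyond the scope of a snippet, but the minimal PySpark sketch below shows the overall shape: load pre-extracted text (assumed here to be JSON lines with url and text fields), apply a simple length gate, exact-deduplicate on a content hash, and write Parquet. Paths, field names, and thresholds are illustrative assumptions; language detection and fuzzy deduplication would slot in as additional stages.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cc-pipeline-sketch").getOrCreate()

# Assumption: documents already extracted to JSON lines with {"url": ..., "text": ...}
docs = spark.read.json("s3://my-bucket/cc-extracted/*.jsonl")

cleaned = (
    docs
    # Heuristic quality gate: keep documents between 500 chars and 1M chars.
    .filter((F.length("text") > 500) & (F.length("text") < 1_000_000))
    # Exact deduplication on a SHA-256 hash of the document text.
    .withColumn("doc_hash", F.sha2(F.col("text"), 256))
    .dropDuplicates(["doc_hash"])
)

cleaned.write.mode("overwrite").parquet("s3://my-bucket/cc-cleaned/")
```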
Quality Filtering System
Implement sophisticated quality filters to ensure high-quality training data, including heuristic rules, classifier-based filtering, and perplexity scoring.
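A minimal heuristic filter in the spirit of C4/Gopher-style rules might look like the sketch below; all thresholds are illustrative assumptions, and classifier-based filtering and perplexity scoring are omitted.

```python
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}

def passes_heuristics(text: str,
                      min_words: int = 50,
                      max_words: int = 100_000,
                      max_mean_word_len: float = 10.0,
                      min_stopword_frac: float = 0.02,
                      max_symbol_frac: float = 0.1) -> bool:
    """Return True if a document clears simple heuristic quality rules.

    Thresholds are illustrative assumptions, not tuned values.
    """
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False
    # Very long average word length usually indicates boilerplate or junk.
    if sum(len(w) for w in words) / len(words) > max_mean_word_len:
        return False
    # Natural English text contains a minimum share of common stop words.
    if sum(w.lower() in STOP_WORDS for w in words) / len(words) < min_stopword_frac:
        return False
    # Penalize documents dominated by markup-like symbols.
    if sum(text.count(s) for s in ("#", "...", "{", "}")) / len(words) > max_symbol_frac:
        return False
    return True

# True: repetitive text still passes these rules, which is why deduplication
# and repetition/perplexity filters are applied as separate stages.
print(passes_heuristics("the " * 200))
```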
Deduplication at Scale
Remove duplicate content using MinHash LSH for fuzzy deduplication and exact hash matching, essential for preventing memorization and improving training efficiency.
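The sketch below shows fuzzy deduplication with MinHash LSH using the datasketch library; the shingle size, number of permutations, and similarity threshold are assumptions to tune on your own data.

```python
from datasketch import MinHash, MinHashLSH  # pip install datasketch

def minhash(text: str, num_perm: int = 128, shingle: int = 5) -> MinHash:
    """Build a MinHash over word 5-gram shingles of a document."""
    m = MinHash(num_perm=num_perm)
    words = text.lower().split()
    for i in range(max(1, len(words) - shingle + 1)):
        m.update(" ".join(words[i:i + shingle]).encode("utf-8"))
    return m

# Jaccard threshold of 0.7 is an assumption to tune; must share num_perm with MinHash.
lsh = MinHashLSH(threshold=0.7, num_perm=128)

docs = {
    "doc1": "the quick brown fox jumps over the lazy dog near the river bank",
    "doc2": "the quick brown fox jumps over the lazy dog near the river bend",
    "doc3": "completely unrelated text about compilers and register allocation",
}

kept = []
for doc_id, text in docs.items():
    m = minhash(text)
    if lsh.query(m):        # near-duplicate of a document already kept
        continue
    lsh.insert(doc_id, m)
    kept.append(doc_id)

print(kept)  # doc2 is likely flagged as a near-duplicate of doc1
```

In practice the MinHash signatures are computed in a distributed job and the LSH index is sharded by band, but the insert/query logic per document is the same as above.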
Production Best Practices
✅ Do
- Use incremental processing for new crawls
- Implement checkpoint recovery
- Monitor data quality metrics continuously
- Version control your filtering rules
- Sample and manually review filtered data
- Maintain provenance tracking (a minimal manifest sketch follows these lists)
- Use distributed deduplication
- Implement PII detection early
❌ Don't
- Process entire dataset in single pass
- Skip quality validation steps
- Ignore legal and copyright concerns
- Use overly aggressive filtering
- Neglect language diversity
- Store uncompressed data
- Process without deduplication
- Mix data sources without tracking
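Provenance tracking and checkpoint recovery can start as simply as an append-only manifest recording, per shard, the source, filter version, and status. The sketch below is a minimal JSON-lines version; the manifest path and all field names are assumptions.

```python
import json
from pathlib import Path

MANIFEST = Path("pipeline_manifest.jsonl")  # hypothetical manifest location

def record_shard(shard_id: str, source: str, filter_version: str, status: str) -> None:
    """Append one provenance record per processed shard (append-only aids recovery)."""
    entry = {
        "shard_id": shard_id,
        "source": source,                  # e.g. "common_crawl/CC-MAIN-2024-10"
        "filter_version": filter_version,  # version-controlled filtering rules
        "status": status,                  # "done" or "failed"
    }
    with MANIFEST.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def completed_shards() -> set[str]:
    """Shards already marked done; skip these when resuming after a crash."""
    if not MANIFEST.exists():
        return set()
    done = set()
    with MANIFEST.open() as f:
        for line in f:
            entry = json.loads(line)
            if entry["status"] == "done":
                done.add(entry["shard_id"])
    return done
```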
Data Source Comparison
| Source | Size | Quality | Processing | Use Case |
|---|---|---|---|---|
| Common Crawl | 250B+ pages | Variable | Complex | General knowledge |
| Wikipedia | 20GB text | High | Simple | Factual knowledge |
| GitHub | 1TB+ code | Variable | Moderate | Code understanding |
| ArXiv | 100GB | High | Moderate | Scientific knowledge |