ML Datasets: Deep Dive
Comprehensive guide to ML datasets: collection methods, popular datasets, and finding quality training data
The Foundation of Machine Learning
Datasets are the lifeblood of machine learning. The quality, scale, and diversity of your training data fundamentally determine what your model can learn and how well it will perform. Understanding how datasets are collected, processed, and utilized is crucial for building effective ML systems.
Key Insight
"Data is the new oil" - but unlike oil, data improves with refinement, curation, and careful collection. The best models aren't just trained on more data, but on better data.
How Datasets Are Collected
1. Web Crawling at Scale
The internet is the largest source of text data. Web crawlers systematically traverse the web, downloading and processing billions of pages. This process involves:
- Discovery: Starting from seed URLs and following links
- Politeness: Respecting robots.txt and rate limiting
- Deduplication: Removing duplicate content using MinHash or SimHash
- Quality filtering: Removing spam, ads, and low-quality content
- Language detection: Identifying and categorizing by language
Scale Example: Common Crawl processes 3+ billion web pages per month, producing 200-300TB of uncompressed content.
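A minimal sketch of a polite crawler that respects robots.txt and rate limits; the user agent, delay, and overall structure are illustrative assumptions, not a production pipeline:

```python
# Minimal polite crawler sketch (user agent, delay, and limits are
# illustrative assumptions)
import time
import urllib.robotparser
from urllib.parse import urljoin

import requests

USER_AGENT = "example-research-crawler/0.1"
CRAWL_DELAY_SECONDS = 1.0  # simple fixed rate limit between requests

def allowed_by_robots(url):
    """Check the site's robots.txt before fetching a page."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urljoin(url, "/robots.txt"))
    try:
        rp.read()
    except OSError:
        return False  # be conservative if robots.txt is unreachable
    return rp.can_fetch(USER_AGENT, url)

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl from seed URLs, respecting robots.txt."""
    queue, pages = list(seed_urls), []
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if not allowed_by_robots(url):
            continue
        resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
        if resp.ok and "text/html" in resp.headers.get("Content-Type", ""):
            pages.append({"url": url, "html": resp.text})
            # Link discovery, deduplication, and language detection would follow here
        time.sleep(CRAWL_DELAY_SECONDS)  # politeness: rate limiting
    return pages
```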
2. Curated Sources
High-quality, structured sources provide cleaner training data:
- Wikipedia: Encyclopedic knowledge, well-structured, multilingual
- Books: Long-form, coherent text (BookCorpus, Project Gutenberg)
- Academic papers: Technical knowledge (ArXiv, PubMed)
- News articles: Current events, formal writing style
- Reference works: Dictionaries, manuals, documentation
3. User-Generated Content
Platforms with user contributions offer diverse, conversational data:
- Q&A Sites: StackExchange, Quora - technical discussions
- Forums: Reddit, HackerNews - informal conversations
- Code repositories: GitHub - programming languages and documentation
- Social media: Twitter, blogs - real-time, informal text
4. Synthetic Data Generation
Creating artificial data to augment real datasets:
- Data augmentation: Paraphrasing, back-translation
- Template-based generation: Structured variations
- Model-generated data: Using existing models to create training data
- Simulation: Physics engines, game environments
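As one concrete example of augmentation via back-translation (first bullet above), here is a minimal sketch; it assumes the Hugging Face transformers library and the Helsinki-NLP opus-mt translation models:

```python
# Back-translation augmentation sketch (assumes the `transformers`
# library and the Helsinki-NLP opus-mt models)
from transformers import pipeline

en_to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(text):
    """Paraphrase English text by round-tripping it through French."""
    french = en_to_fr(text)[0]["translation_text"]
    return fr_to_en(french)[0]["translation_text"]

print(back_translate("High-quality data matters more than sheer volume."))
```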
Major Datasets: Deep Dive
C4 (Colossal Clean Crawled Corpus)
- Size: ~750GB
- Tokens: ~156 billion
Created by Google for T5 model training, C4 is a cleaned version of Common Crawl data. The cleaning process is sophisticated and has become a standard approach:
```python
# C4 Cleaning Pipeline (Conceptual - helper functions are placeholders)
def clean_c4_style(text):
    # 1. Language detection - English only
    if detect_language(text) != 'en':
        return None
    # 2. Length filtering
    if len(text.split()) < 5:
        return None
    # 3. Remove pages with bad words
    if contains_blocklist_words(text):
        return None
    # 4. Remove duplicate lines
    lines = text.split('\n')
    if has_duplicate_lines(lines, threshold=0.3):
        return None
    # 5. Remove pages with lorem ipsum
    if 'lorem ipsum' in text.lower():
        return None
    # 6. Terminal punctuation check
    if not ends_with_punctuation(text):
        return None
    return text
```
Key Applications:
- T5, mT5 model pre-training
- Baseline for many LLM training pipelines
- Benchmark for data cleaning techniques
Common Crawl
- Monthly size: 200-300TB
- Pages per month: 3+ billion
Common Crawl is the raw material for many derived datasets: it provides open web crawl data that serves as the foundation for C4, RefinedWeb, and many other curated corpora. Processing involves:
- WARC format: Web ARChive files containing raw HTTP responses
- WET format: Extracted plain text
- WAT format: Metadata and extracted links
```python
# Processing Common Crawl WARC files with warcio
from warcio.archiveiterator import ArchiveIterator

def process_common_crawl(warc_path):
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                # Extract URL
                url = record.rec_headers.get_header('WARC-Target-URI')
                # Get content (raw HTTP response body)
                content = record.content_stream().read()
                # Extract text from HTML (placeholder helper)
                text = extract_text_from_html(content)
                # Apply quality filters (placeholder helper)
                if passes_quality_checks(text):
                    yield {
                        'url': url,
                        'text': text,
                        'length': len(text),
                    }
```
Wikipedia
- Articles (EN): 6.7M+
- Languages: 300+
Wikipedia provides high-quality, structured encyclopedic content. It's included in virtually every major language model's training data due to its reliability and broad coverage.
Why Wikipedia is Valuable:
- Neutral point of view policy ensures balanced content
- Citations and references provide factual grounding
- Regular updates keep information current
- Structured format (infoboxes, categories) aids extraction
- Cross-language links enable multilingual training
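Wikipedia dumps are also easy to pull into a pipeline. A minimal sketch, assuming the Hugging Face datasets library and the wikimedia/wikipedia copy hosted on the Hub (the dump name below is one published snapshot):

```python
# Streaming an English Wikipedia dump from the Hugging Face Hub
# (assumes the `datasets` library and the wikimedia/wikipedia dataset;
# "20231101.en" is one published snapshot name)
from datasets import load_dataset

wiki = load_dataset("wikimedia/wikipedia", "20231101.en", streaming=True)
for article in wiki["train"].take(3):
    print(article["title"], "-", article["text"][:80])
```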
Stack Exchange Network
- Questions: 23M+
- Sites: 180+
Stack Exchange provides high-quality Q&A pairs across diverse domains. The voting system naturally filters for quality, making it excellent for training conversational models.
Top Sites for ML:
- Stack Overflow (programming)
- Mathematics
- Cross Validated (statistics)
- Data Science
Data Quality Signals:
- Accepted answers
- Vote scores
- User reputation
- View counts
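A minimal sketch of using these signals to select Q&A training pairs; the record fields and score thresholds below are hypothetical illustrations, not an official Stack Exchange schema:

```python
# Selecting Q&A pairs by quality signals (illustrative; field names
# and thresholds are hypothetical, not the official data schema)
MIN_QUESTION_SCORE = 5
MIN_ANSWER_SCORE = 3

def select_qa_pairs(questions):
    """Keep questions that have an accepted, well-voted answer."""
    for q in questions:
        if q["score"] < MIN_QUESTION_SCORE:
            continue
        accepted = [a for a in q["answers"] if a["is_accepted"]]
        if not accepted or accepted[0]["score"] < MIN_ANSWER_SCORE:
            continue
        yield {
            "prompt": q["title"] + "\n\n" + q["body"],
            "response": accepted[0]["body"],
        }
```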
GitHub Code
- Public repos: 200M+
- Languages: 300+
GitHub provides vast amounts of code in multiple programming languages, along with documentation, comments, and commit messages. It is a primary source for training code generation models such as Codex and Code Llama.
```python
# Filtering GitHub data for quality (helper functions are placeholders)
def filter_github_code(repo_data):
    filters = {
        'min_stars': 10,              # Repository popularity
        'has_license': True,          # Legal compliance
        'not_fork': True,             # Original content
        'language_confidence': 0.8,   # Clear language detection
        'max_file_size': 1_000_000,   # Avoid huge generated files
        'exclude_extensions': ['.min.js', '.csv', '.json'],
        'exclude_generated': True,    # No auto-generated code
    }
    # Additional quality checks
    if passes_filters(repo_data, filters):
        # Remove sensitive information
        code = remove_secrets(repo_data.code)
        code = remove_api_keys(code)
        return code
    return None
```
RedPajama
- Total size: 1.2T tokens
- Components: 7 sources
Open reproduction of LLaMA's training dataset, combining multiple high-quality sources:
| Source | Tokens | Percentage |
|---|---|---|
| Common Crawl | 878B | 72.6% |
| C4 | 175B | 14.4% |
| GitHub | 59B | 4.9% |
| Wikipedia | 24B | 2.0% |
| Books | 26B | 2.1% |
| ArXiv | 28B | 2.3% |
| StackExchange | 20B | 1.7% |
The Pile
- Size: 825GB
- Components: 22 sources
Created by EleutherAI, The Pile combines diverse, high-quality text sources. It's notable for including many specialized datasets not found elsewhere:
Unique Components:
- PubMed Central (biomedical)
- USPTO (patents)
- Ubuntu IRC (technical chat)
- YouTube subtitles
- EuroParl (multilingual)
Quality Features:
- Careful deduplication
- Domain diversity
- Quality filtering per source
- Documented biases
- Reproducible pipeline
Dataset Quality and Processing
Quality Dimensions
✓ High Quality Indicators
- Grammatically correct text
- Factual accuracy with citations
- Diverse vocabulary and topics
- Natural language flow
- Appropriate content length
- Minimal repetition
✗ Quality Issues to Filter
- Spam and advertisements
- Machine-generated text
- Excessive boilerplate
- Navigation menus
- Error messages
- Personal information
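A minimal sketch of rule-based filters in the spirit of the indicators above; the thresholds are illustrative assumptions, not taken from any specific pipeline:

```python
# Heuristic quality filters (thresholds are illustrative assumptions)
def passes_heuristic_filters(text):
    words = text.split()
    # Length bounds: drop very short or suspiciously long pages
    if len(words) < 50 or len(words) > 100_000:
        return False
    # Excessive repetition: low ratio of unique words
    if len(set(words)) / len(words) < 0.3:
        return False
    # Boilerplate / navigation menus: mostly very short lines
    lines = [ln for ln in text.splitlines() if ln.strip()]
    short_lines = sum(1 for ln in lines if len(ln.split()) < 4)
    if lines and short_lines / len(lines) > 0.5:
        return False
    # Symbol-heavy content (spam, error dumps): low alphabetic ratio
    alpha_ratio = sum(ch.isalpha() for ch in text) / max(len(text), 1)
    if alpha_ratio < 0.6:
        return False
    return True
```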
Deduplication Strategies
```python
# Deduplication approaches
import hashlib
from datasketch import MinHash

def deduplicate_dataset(documents):
    # 1. Exact deduplication via content hashes
    seen_hashes = set()
    unique_docs = []
    for doc in documents:
        doc_hash = hashlib.sha256(doc.encode()).hexdigest()
        if doc_hash not in seen_hashes:
            seen_hashes.add(doc_hash)
            unique_docs.append(doc)

    # 2. Fuzzy deduplication with MinHash
    # (pairwise comparison for clarity; at scale, use an LSH index)
    minhashes = {}
    threshold = 0.8  # Jaccard similarity threshold
    for doc in unique_docs:
        minhash = MinHash()
        for token in doc.split():
            minhash.update(token.encode())
        # Check similarity with already-kept documents
        is_duplicate = False
        for existing_hash in minhashes.values():
            if minhash.jaccard(existing_hash) > threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            minhashes[doc] = minhash
    return list(minhashes.keys())
```
Data Contamination Prevention
Preventing test set leakage is crucial for fair evaluation:
Contamination Check Methods:
- N-gram overlap analysis with benchmark datasets
- Temporal splitting (training on data before benchmark creation)
- Explicit filtering of known evaluation sets
- Canary string insertion for detection
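As a concrete example of the first check, a minimal sketch of n-gram overlap detection; the 13-gram window follows common practice but is an illustrative choice here:

```python
# N-gram overlap contamination check (sketch; the 13-gram window is an
# illustrative choice)
def ngrams(text, n=13):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_doc, benchmark_examples, n=13):
    """Flag a training document sharing any n-gram with a benchmark example."""
    doc_ngrams = ngrams(train_doc, n)
    return any(doc_ngrams & ngrams(example, n) for example in benchmark_examples)
```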
Finding and Evaluating Datasets
Where to Find Datasets
🤗 Hugging Face Hub
Largest repository of ML datasets with easy integration
- 100,000+ datasets
- Streaming support for large datasets
- Dataset cards with documentation
- Version control and citations
```python
from datasets import load_dataset

# Load a dataset easily (the C4 corpus is hosted under allenai/c4)
dataset = load_dataset("allenai/c4", "en", streaming=True)
for example in dataset['train'].take(5):
    print(example['text'][:100])
```
📊 Academic Sources
Research-grade datasets with papers
- Papers with Code
- Google Dataset Search
- Kaggle Datasets
- UCI ML Repository
- OpenML
- DataHub.io
🏢 Corporate Datasets
Industry-released datasets
- Common Crawl (web data)
- LAION (image-text pairs)
- OpenWebText (GPT-2 recreation)
- BigQuery Public Datasets
- AWS Open Data
🔧 Specialized Domains
Domain-specific collections
- PubMed (biomedical)
- ArXiv (scientific papers)
- CORD-19 (COVID research)
- Financial datasets (Quandl)
- Legal (Case.Law)
Evaluating Dataset Quality
Dataset Quality Checklist
Documentation Quality
Dataset card, collection methodology, known biases, intended uses
Legal Compliance
Clear license, privacy compliance, copyright status, usage restrictions
Data Statistics
Size, diversity metrics, class balance, temporal coverage
Preprocessing Requirements
Cleaning needed, format conversions, tokenization approach
Validation Results
Models trained on it, benchmark scores, known issues
Dataset Mixing Strategies
Modern language models are rarely trained on a single dataset. The art lies in finding the right mixture of diverse data sources to create a balanced, high-quality training corpus.
Example: GPT-3 Data Mix
| Dataset | Weight in Training | Epochs |
|---|---|---|
| Common Crawl (filtered) | 60% | 0.44 |
| WebText2 | 22% | 2.9 |
| Books1 | 8% | 1.9 |
| Books2 | 8% | 0.43 |
| Wikipedia | 3% | 3.4 |
Note: Higher-quality sources (Wikipedia, WebText2, Books1) are upsampled and seen for multiple epochs, while the much larger Common Crawl portion is seen for less than one epoch.
```python
# Dataset mixing with sampling weights
import numpy as np
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def create_mixed_dataloader(datasets, weights, batch_size):
    """
    Create a dataloader that samples from multiple datasets
    according to specified weights.
    """
    # Normalize weights
    weights = np.array(weights) / np.sum(weights)
    # Per-dataset sizes
    dataset_sizes = [len(d) for d in datasets]
    total_size = sum(dataset_sizes)
    # Calculate per-example sampling weight
    example_weights = []
    for size, weight in zip(dataset_sizes, weights):
        # Weight for each example in this dataset
        example_weight = weight * total_size / size
        example_weights.extend([example_weight] * size)
    # Create weighted sampler
    sampler = WeightedRandomSampler(
        weights=example_weights,
        num_samples=total_size,
        replacement=True,
    )
    # Concatenate datasets and wrap in a DataLoader
    combined_dataset = ConcatDataset(datasets)
    return DataLoader(
        combined_dataset,
        batch_size=batch_size,
        sampler=sampler,
    )
```
Future Trends in ML Datasets
Synthetic Data Generation
Using AI to create training data, reducing reliance on human-generated content:
- Constitutional AI training
- Self-instruct methods
- Data augmentation with LLMs
- Physics simulations for robotics
Multimodal Datasets
Combining text, images, audio, and video for comprehensive understanding:
- LAION-5B (image-text pairs)
- WebVid (video-text)
- AudioSet (audio classification)
- Ego4D (first-person video)
Quality Over Quantity
Focus shifting from scale to curation:
- Textbooks and educational content
- Expert-curated datasets
- Deduplication at scale
- Active learning for data selection
Privacy-Preserving Data
Techniques for training on sensitive data:
- Differential privacy
- Federated learning datasets
- Anonymization techniques
- Consent-based data collection
Best Practices for Dataset Usage
✓ Do's
- Document everything: Collection process, filtering, biases
- Version control: Track dataset changes over time
- Validate thoroughly: Check for leakage, quality issues
- Mix thoughtfully: Balance diverse sources
- Respect licenses: Understand usage restrictions
- Monitor drift: Track distribution changes
- Test incrementally: Start small, scale up
✗ Don'ts
- Ignore biases: All datasets have limitations
- Skip deduplication: Duplicates hurt generalization
- Mix blindly: Random mixing can degrade quality
- Forget privacy: Personal info must be removed
- Assume quality: Always validate, even popular datasets
- Overlook legal issues: Copyright matters
- Train on test sets: Contamination invalidates evaluation