What is Boilerplate Detection?
Boilerplate detection is the process of automatically identifying and removing template content, navigation elements, advertisements, and other non-essential parts of web pages to extract the main content. This technology is fundamental to web scraping, content aggregation, search engine indexing, and readability tools. By distinguishing between actual article content and surrounding page elements, boilerplate detection enables cleaner data extraction and better user experiences in applications like news readers, research tools, and content management systems.
Modern boilerplate detection combines machine learning techniques, DOM analysis, visual rendering information, and heuristic rules to achieve high accuracy across diverse website layouts and content management systems. The challenge lies in adapting to constantly evolving web designs, dynamic content loading, and the vast heterogeneity of website structures while maintaining processing efficiency for large-scale applications.
Boilerplate Detection Techniques
DOM-based Analysis
Analyzes HTML structure, element positions, and content distribution patterns.
Visual Rendering
Uses browser rendering to analyze visual layout and content positioning; a minimal browser-based sketch appears at the end of the implementation examples below.
Machine Learning
Trained models using features from DOM, text, and visual analysis combined.
Heuristic Rules
Hand-crafted rules based on common web design patterns and structures.
Implementation Examples
DOM-based Content Extraction
from bs4 import BeautifulSoup, Comment
import requests

class BoilerplateDetector:
    def __init__(self):
        # Common boilerplate indicators
        self.boilerplate_classes = {
            'nav', 'navigation', 'menu', 'header', 'footer', 'sidebar',
            'ads', 'advertisement', 'social', 'share', 'comments',
            'related', 'recommended', 'tags', 'categories'
        }
        self.boilerplate_tags = {'nav', 'header', 'footer', 'aside', 'script', 'style'}
        self.content_tags = {'article', 'main', 'section', 'div', 'p'}

    def calculate_text_density(self, element):
        """Calculate text-to-markup ratio for an element"""
        text_length = len(element.get_text(strip=True))
        markup_length = len(str(element))
        if markup_length == 0:
            return 0
        return text_length / markup_length

    def calculate_link_density(self, element):
        """Calculate ratio of link text to total text"""
        total_text = element.get_text(strip=True)
        if not total_text:
            return 1.0  # High link density if no text
        link_text = ' '.join(a.get_text(strip=True) for a in element.find_all('a'))
        return len(link_text) / len(total_text)

    def get_content_indicators(self, element):
        """Extract features that indicate main content"""
        indicators = {
            'text_density': self.calculate_text_density(element),
            'link_density': self.calculate_link_density(element),
            'text_length': len(element.get_text(strip=True)),
            'tag_name': element.name,
            'has_content_class': any(cls in str(element.get('class', [])).lower()
                                     for cls in ['content', 'article', 'main', 'post']),
            'has_boilerplate_class': any(cls in str(element.get('class', [])).lower()
                                         for cls in self.boilerplate_classes),
            'paragraph_count': len(element.find_all('p')),
            'image_count': len(element.find_all('img')),
            'dom_depth': len(list(element.parents))
        }
        return indicators

    def is_boilerplate(self, element, threshold=0.3):
        """Determine if an element is likely boilerplate"""
        indicators = self.get_content_indicators(element)
        # Rule-based scoring
        score = 0.5  # neutral starting point
        # Positive indicators (likely content)
        if indicators['text_density'] > 0.3:
            score += 0.2
        if indicators['text_length'] > 100:
            score += 0.1
        if indicators['has_content_class']:
            score += 0.3
        if indicators['paragraph_count'] > 2:
            score += 0.1
        # Negative indicators (likely boilerplate)
        if indicators['link_density'] > 0.5:
            score -= 0.2
        if indicators['has_boilerplate_class']:
            score -= 0.3
        if indicators['tag_name'] in self.boilerplate_tags:
            score -= 0.4
        if indicators['text_length'] < 50:
            score -= 0.2
        return score < threshold

    def extract_main_content(self, html):
        """Extract main content from HTML, removing boilerplate"""
        soup = BeautifulSoup(html, 'html.parser')
        # Remove obviously non-content elements
        for element in soup.find_all(['script', 'style', 'nav', 'header', 'footer']):
            element.decompose()
        # Remove comments
        for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
            comment.extract()
        # Find content candidates
        content_candidates = []
        for element in soup.find_all(['article', 'main', 'section', 'div']):
            if not self.is_boilerplate(element):
                indicators = self.get_content_indicators(element)
                content_candidates.append((element, indicators))
        # Sort by content quality score: long, dense text with few links first
        content_candidates.sort(
            key=lambda x: (x[1]['text_length'] * x[1]['text_density'] *
                           (1 - x[1]['link_density'])),
            reverse=True
        )
        # Return the best content candidate
        if content_candidates:
            best_element, best_indicators = content_candidates[0]
            quality_score = (best_indicators['text_length'] * best_indicators['text_density'] *
                             (1 - best_indicators['link_density']))
            return {
                'title': self._extract_title(soup),
                'content': best_element.get_text(strip=True, separator=' '),
                'html': str(best_element),
                'quality_score': quality_score
            }
        return None

    def _extract_title(self, soup):
        """Extract page title using multiple strategies"""
        # Try different title sources, most specific first
        title_selectors = [
            'article h1',
            '.post-title',
            '.article-title',
            'h1',
            'title'
        ]
        for selector in title_selectors:
            element = soup.select_one(selector)
            if element and element.get_text(strip=True):
                return element.get_text(strip=True)
        return "No title found"

# Usage example
def extract_article(url):
    """Extract article content from a web page"""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        detector = BoilerplateDetector()
        result = detector.extract_main_content(response.text)
        if result:
            return {
                'url': url,
                'title': result['title'],
                'content': result['content'],
                'word_count': len(result['content'].split()),
                'extraction_quality': result['quality_score']
            }
        else:
            return {'error': 'No main content found'}
    except Exception as e:
        return {'error': str(e)}

# Test the extractor
article = extract_article('https://example-news-site.com/article')
if 'error' not in article:
    print(f"Title: {article['title']}")
    print(f"Word count: {article['word_count']}")
    print(f"Content preview: {article['content'][:200]}...")
else:
    print(f"Extraction failed: {article['error']}")
Advanced ML-based Content Extraction
import numpy as np
import pickle
from bs4 import BeautifulSoup
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

class MLBoilerplateDetector:
    def __init__(self):
        self.model = None
        # Reserved for optional text-based features; kept so saved models stay self-contained
        self.vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
        self.feature_names = [
            'text_density', 'link_density', 'text_length', 'tag_depth',
            'has_content_class', 'has_nav_class', 'paragraph_count',
            'list_count', 'image_count', 'form_count', 'table_count',
            'position_in_dom', 'sibling_content_ratio', 'parent_content_ratio',
            'css_visibility', 'semantic_score'
        ]

    def extract_features(self, element, soup_context):
        """Extract comprehensive features for ML classification"""
        text = element.get_text(strip=True)
        features = {
            'text_density': len(text) / max(1, len(str(element))),
            'link_density': len(''.join(a.get_text() for a in element.find_all('a'))) / max(1, len(text)),
            'text_length': len(text),
            'tag_depth': len(list(element.parents)),
            'has_content_class': int(any(cls in str(element.get('class', [])).lower()
                                         for cls in ['content', 'article', 'main', 'post', 'body'])),
            'has_nav_class': int(any(cls in str(element.get('class', [])).lower()
                                     for cls in ['nav', 'menu', 'sidebar', 'footer', 'header'])),
            'paragraph_count': len(element.find_all('p')),
            'list_count': len(element.find_all(['ul', 'ol'])),
            'image_count': len(element.find_all('img')),
            'form_count': len(element.find_all('form')),
            'table_count': len(element.find_all('table')),
        }
        # Additional contextual features
        features.update({
            'position_in_dom': self._get_element_position(element, soup_context),
            'sibling_content_ratio': self._calculate_sibling_content_ratio(element),
            'parent_content_ratio': self._calculate_parent_content_ratio(element),
            'css_visibility': self._check_css_visibility(element),
            'semantic_score': self._calculate_semantic_score(text)
        })
        return np.array([features[name] for name in self.feature_names])

    def _get_element_position(self, element, soup):
        """Calculate relative position of element in DOM"""
        all_elements = soup.find_all()
        try:
            position = all_elements.index(element)
            return position / len(all_elements)
        except ValueError:
            return 0.5

    def _calculate_sibling_content_ratio(self, element):
        """Ratio of content vs non-content siblings"""
        if not element.parent:
            return 0.5
        siblings = element.parent.find_all(recursive=False)
        if len(siblings) <= 1:
            return 0.5
        content_siblings = sum(1 for sib in siblings if len(sib.get_text(strip=True)) > 50)
        return content_siblings / len(siblings)

    def _calculate_parent_content_ratio(self, element):
        """Content density of parent elements"""
        if not element.parent:
            return 0.5
        parent_text = element.parent.get_text(strip=True)
        element_text = element.get_text(strip=True)
        if not parent_text:
            return 0
        return len(element_text) / len(parent_text)

    def _check_css_visibility(self, element):
        """Check if element is visually hidden (simplified)"""
        style = element.get('style', '').lower()
        classes = ' '.join(element.get('class', [])).lower()
        hidden_indicators = ['display:none', 'visibility:hidden', 'hidden', 'sr-only']
        return int(not any(indicator in style or indicator in classes
                           for indicator in hidden_indicators))

    def _calculate_semantic_score(self, text):
        """Simple semantic content scoring"""
        if not text:
            return 0
        # Common content words vs navigation words
        content_words = ['article', 'story', 'news', 'post', 'content', 'text', 'paragraph']
        nav_words = ['menu', 'navigation', 'links', 'home', 'contact', 'about']
        text_lower = text.lower()
        content_score = sum(text_lower.count(word) for word in content_words)
        nav_score = sum(text_lower.count(word) for word in nav_words)
        total_score = content_score + nav_score
        return content_score / max(1, total_score)

    def train_model(self, training_data):
        """Train the ML model on labeled (html_fragment, is_content) pairs"""
        features = []
        labels = []
        for html, is_content in training_data:
            soup = BeautifulSoup(html, 'html.parser')
            for element in soup.find_all(['div', 'section', 'article', 'p']):
                feature_vector = self.extract_features(element, soup)
                features.append(feature_vector)
                labels.append(1 if is_content else 0)
        X = np.array(features)
        y = np.array(labels)
        # Train Random Forest classifier
        self.model = RandomForestClassifier(
            n_estimators=100,
            max_depth=10,
            random_state=42
        )
        self.model.fit(X, y)
        return self.model.score(X, y)

    def predict_content(self, html):
        """Predict main content using trained model"""
        if not self.model:
            raise ValueError("Model not trained yet")
        soup = BeautifulSoup(html, 'html.parser')
        content_candidates = []
        for element in soup.find_all(['article', 'main', 'section', 'div']):
            if len(element.get_text(strip=True)) > 50:  # Minimum content length
                features = self.extract_features(element, soup)
                probability = self.model.predict_proba([features])[0][1]  # Probability of being content
                content_candidates.append({
                    'element': element,
                    'probability': probability,
                    'text_length': len(element.get_text(strip=True))
                })
        # Sort by probability weighted by (log) text length
        content_candidates.sort(
            key=lambda x: x['probability'] * np.log(x['text_length'] + 1),
            reverse=True
        )
        if content_candidates and content_candidates[0]['probability'] > 0.6:
            return {
                'content': content_candidates[0]['element'].get_text(strip=True),
                'confidence': content_candidates[0]['probability'],
                'html': str(content_candidates[0]['element'])
            }
        return None

    def save_model(self, filepath):
        """Save trained model to file"""
        with open(filepath, 'wb') as f:
            pickle.dump({'model': self.model, 'vectorizer': self.vectorizer}, f)

    def load_model(self, filepath):
        """Load trained model from file"""
        with open(filepath, 'rb') as f:
            data = pickle.load(f)
        self.model = data['model']
        self.vectorizer = data['vectorizer']

# Usage example
ml_detector = MLBoilerplateDetector()
# Train on labeled data (you would need to collect this)
# training_data = [(html_fragment, is_main_content), ...]
# accuracy = ml_detector.train_model(training_data)
# Save the trained model
# ml_detector.save_model('boilerplate_model.pkl')
# Use for content extraction
# result = ml_detector.predict_content(html_content)
# if result:
#     print(f"Extracted content with {result['confidence']:.2f} confidence")
#     print(result['content'][:200] + "...")
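Visual Rendering-based Extraction (Sketch)
The visual-rendering technique listed earlier relies on an actual browser rather than raw HTML. The sketch below is one minimal way to approximate it, assuming Playwright is installed; the function name, viewport size, and 200-character cutoff are illustrative choices, and it simply ranks rendered blocks by on-screen area times visible text length, whereas production systems combine many more layout signals.

from playwright.sync_api import sync_playwright

def largest_visible_block(url):
    """Rank rendered blocks by on-screen area times visible text length (rough visual heuristic)."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={'width': 1280, 'height': 900})
        page.goto(url, wait_until='networkidle')
        best = None
        for handle in page.query_selector_all('article, main, section, div'):
            box = handle.bounding_box()  # None if the element is not rendered
            if not box or box['width'] == 0 or box['height'] == 0:
                continue
            text = handle.inner_text().strip()
            if len(text) < 200:  # arbitrary cutoff to skip nav strips and small widgets
                continue
            score = box['width'] * box['height'] * len(text)
            if best is None or score > best[0]:
                best = (score, text)
        browser.close()
        return best[1] if best else None

# print(largest_visible_block('https://example-news-site.com/article'))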
Real-World Boilerplate Detection Implementations
Mozilla Readability
Powers Firefox Reader Mode and is used across many content extraction applications.
- Heuristic-based content scoring algorithm
- Handles diverse website layouts and CMSs
- Optimized for speed and browser integration
- Open-source with wide adoption across tools
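Mozilla Readability itself is a JavaScript library, but a Python port of the same original Readability heuristics, readability-lxml, exposes a similar one-call interface; a minimal usage sketch, assuming that package is installed:

from readability import Document  # pip install readability-lxml
import requests

html = requests.get('https://example-news-site.com/article', timeout=10).text
doc = Document(html)
print(doc.title())    # best-guess article title
print(doc.summary())  # cleaned HTML of the detected main content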
Google Web Crawler
Uses advanced boilerplate detection for search indexing and content understanding at web scale.
- ML-based content quality assessment
- Visual rendering analysis for content blocks
- Handles JavaScript-generated content
- Processes billions of pages daily
Pocket (Read Later Service)
Extracts clean article content for distraction-free reading across millions of saved articles.
- Hybrid approach combining multiple techniques
- User feedback loop for improving accuracy
- Mobile-optimized content formatting
- Real-time processing for instant saves
Mercury Parser (Postlight)
Content extraction engine from Postlight, originally offered as a hosted API and now open source, used by news aggregators and content management systems.
- Site-specific extraction rules and fallbacks
- High accuracy across news and blog sites
- Metadata extraction (author, publish date)
- Enterprise-grade reliability and support
Boilerplate Detection Best Practices
✅ Do
- Combine multiple detection techniques for robustness
- Test extensively across diverse website types
- Handle dynamic content loading and SPAs (render JavaScript-heavy pages before extraction)
- Implement fallback mechanisms for edge cases
- Monitor extraction quality and user feedback
- Cache results to improve processing efficiency (see the sketch after these lists)
❌ Don't
- Rely solely on CSS selectors or class names
- Ignore website-specific extraction patterns
- Process pages without considering load performance
- Skip validation of extracted content quality
- Assume all sites follow similar layout patterns
- Forget to handle malformed or unusual HTML
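The caching advice above can be as simple as memoizing extraction results keyed by a hash of the fetched HTML, so unchanged pages are never re-processed. A minimal sketch using the BoilerplateDetector defined earlier (the in-memory dict and function name are illustrative; a real deployment would use Redis or a similar shared store):

import hashlib

_extraction_cache = {}  # HTML hash -> extraction result

def extract_with_cache(html, detector):
    """Memoize extraction results keyed by a hash of the raw HTML."""
    key = hashlib.sha256(html.encode('utf-8')).hexdigest()
    if key not in _extraction_cache:
        _extraction_cache[key] = detector.extract_main_content(html)
    return _extraction_cache[key]

# detector = BoilerplateDetector()
# result = extract_with_cache(page_html, detector)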