
Boilerplate Detection

Master boilerplate detection: content extraction from web pages, DOM analysis, and main content identification algorithms.


What is Boilerplate Detection?

Boilerplate detection is the process of automatically identifying and removing template content, navigation elements, advertisements, and other non-essential parts of web pages to extract the main content. This technology is fundamental to web scraping, content aggregation, search engine indexing, and readability tools. By distinguishing between actual article content and surrounding page elements, boilerplate detection enables cleaner data extraction and better user experiences in applications like news readers, research tools, and content management systems.

Modern boilerplate detection combines machine learning techniques, DOM analysis, visual rendering information, and heuristic rules to achieve high accuracy across diverse website layouts and content management systems. The challenge lies in adapting to constantly evolving web designs, dynamic content loading, and the vast heterogeneity of website structures while maintaining processing efficiency for large-scale applications.


Boilerplate Detection Techniques

DOM-based Analysis

Analyzes HTML structure, element positions, and content distribution patterns.

Features: Text density, link density, DOM depth, element types
Speed: Fast, minimal computational overhead
Accuracy: Good for structured pages, struggles with dynamic content

Visual Rendering

Uses browser rendering to analyze visual layout and content positioning.

Features: Visual blocks, font sizes, content area detection
Speed: Slower, requires browser engine processing
Accuracy: Excellent, handles modern web layouts well
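
Visual approaches need a rendering engine, which is where the extra cost comes from. The sketch below is illustrative and assumes a headless browser such as Playwright is available; the selectors and the area threshold are assumptions, not part of any standard.

Visual Block Ranking with Playwright (illustrative sketch)
# Hypothetical sketch: approximate "visual blocks" by rendered bounding boxes.
# Assumes Playwright and a Chromium build are installed; thresholds are illustrative.
from playwright.sync_api import sync_playwright

def visual_content_candidates(url, min_area=50_000):
    """Return large rendered blocks, largest first, as rough main-content candidates."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")

        candidates = []
        for el in page.query_selector_all("article, main, section, div"):
            box = el.bounding_box()  # None if the element is not rendered
            if not box or box["width"] * box["height"] < min_area:
                continue
            candidates.append({
                "area": box["width"] * box["height"],
                "top": box["y"],
                "text_preview": el.inner_text()[:200],
            })
        browser.close()

    # Largest visible blocks first; real systems also weight position, font size, and whitespace
    return sorted(candidates, key=lambda c: c["area"], reverse=True)

The browser round trip is what the "slower" note above refers to: each page costs a full render before any scoring happens.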

Machine Learning

Trained models that combine features from DOM, text, and visual analysis.

Features: Multi-modal feature vectors, ensemble methods
Speed: Variable, depends on model complexity
Accuracy: Highest, adapts to new layouts and patterns

Heuristic Rules

Hand-crafted rules based on common web design patterns and structures.

Features: CSS selectors, class names, element attributes
Speed: Very fast, simple pattern matching
Accuracy: Moderate, requires manual maintenance
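
A rule-based pass is essentially a selector list plus element removal, which is why it is very fast and why it needs manual upkeep. A minimal sketch follows; the selector list is illustrative, not a standard rule set.

Heuristic Boilerplate Stripping with CSS Selectors (illustrative sketch)
# Minimal heuristic pass: strip elements matched by common boilerplate selectors.
# The selector list below is illustrative; production rule sets are larger and site-tuned.
from bs4 import BeautifulSoup

BOILERPLATE_SELECTORS = [
    "nav", "header", "footer", "aside",
    "[class*='sidebar']", "[class*='advert']", "[id*='cookie']", "[class*='share']",
]

def strip_boilerplate(html):
    soup = BeautifulSoup(html, "html.parser")
    for selector in BOILERPLATE_SELECTORS:
        for element in soup.select(selector):
            element.decompose()
    # Prefer semantic containers if the page provides them
    main = soup.select_one("article, main") or soup.body or soup
    return main.get_text(separator=" ", strip=True)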

Implementation Examples

DOM-based Content Extraction

Python Boilerplate Detection with BeautifulSoup
from bs4 import BeautifulSoup, Comment
import re
from collections import Counter
import requests

class BoilerplateDetector:
    def __init__(self):
        # Common boilerplate indicators
        self.boilerplate_classes = {
            'nav', 'navigation', 'menu', 'header', 'footer', 'sidebar', 
            'ads', 'advertisement', 'social', 'share', 'comments', 
            'related', 'recommended', 'tags', 'categories'
        }
        
        self.boilerplate_tags = {'nav', 'header', 'footer', 'aside', 'script', 'style'}
        self.content_tags = {'article', 'main', 'section', 'div', 'p'}
    
    def calculate_text_density(self, element):
        """Calculate text-to-markup ratio for an element"""
        text_length = len(element.get_text(strip=True))
        markup_length = len(str(element))
        
        if markup_length == 0:
            return 0
        
        return text_length / markup_length
    
    def calculate_link_density(self, element):
        """Calculate ratio of link text to total text"""
        total_text = element.get_text(strip=True)
        if not total_text:
            return 1.0  # High link density if no text
        
        link_text = ' '.join([a.get_text(strip=True) for a in element.find_all('a')])
        return len(link_text) / len(total_text)
    
    def get_content_indicators(self, element):
        """Extract features that indicate main content"""
        indicators = {
            'text_density': self.calculate_text_density(element),
            'link_density': self.calculate_link_density(element),
            'text_length': len(element.get_text(strip=True)),
            'tag_name': element.name,
            'has_content_class': any(cls in str(element.get('class', [])).lower() 
                                   for cls in ['content', 'article', 'main', 'post']),
            'has_boilerplate_class': any(cls in str(element.get('class', [])).lower() 
                                       for cls in self.boilerplate_classes),
            'paragraph_count': len(element.find_all('p')),
            'image_count': len(element.find_all('img')),
            'dom_depth': len(list(element.parents))
        }
        
        return indicators
    
    def is_boilerplate(self, element, threshold=0.3):
        """Determine if an element is likely boilerplate"""
        indicators = self.get_content_indicators(element)
        
        # Rule-based scoring
        score = 0.5  # neutral starting point
        
        # Positive indicators (likely content)
        if indicators['text_density'] > 0.3:
            score += 0.2
        if indicators['text_length'] > 100:
            score += 0.1
        if indicators['has_content_class']:
            score += 0.3
        if indicators['paragraph_count'] > 2:
            score += 0.1
        
        # Negative indicators (likely boilerplate)
        if indicators['link_density'] > 0.5:
            score -= 0.2
        if indicators['has_boilerplate_class']:
            score -= 0.3
        if indicators['tag_name'] in self.boilerplate_tags:
            score -= 0.4
        if indicators['text_length'] < 50:
            score -= 0.2
        
        return score < threshold
    
    def extract_main_content(self, html):
        """Extract main content from HTML, removing boilerplate"""
        soup = BeautifulSoup(html, 'html.parser')
        
        # Remove obviously non-content elements
        for element in soup.find_all(['script', 'style', 'nav', 'header', 'footer']):
            element.decompose()
        
        # Remove comments
        for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
            comment.extract()
        
        # Find content candidates
        content_candidates = []
        
        for element in soup.find_all(['article', 'main', 'section', 'div']):
            if not self.is_boilerplate(element):
                indicators = self.get_content_indicators(element)
                content_candidates.append((element, indicators))
        
        # Sort by content quality score: long, text-dense blocks with few links win
        def quality_score(indicators):
            return (indicators['text_length'] * indicators['text_density'] *
                    (1 - indicators['link_density']))

        content_candidates.sort(key=lambda x: quality_score(x[1]), reverse=True)

        # Return the best content candidate
        if content_candidates:
            best_element, best_indicators = content_candidates[0]
            return {
                'title': self._extract_title(soup),
                'content': best_element.get_text(strip=True, separator=' '),
                'html': str(best_element),
                'quality_score': quality_score(best_indicators)
            }
        
        return None
    
    def _extract_title(self, soup):
        """Extract page title using multiple strategies"""
        # Try different title sources, most specific first
        title_selectors = [
            'article h1',
            '.post-title',
            '.article-title',
            'h1',
            'title'
        ]
        
        for selector in title_selectors:
            element = soup.select_one(selector)
            if element and element.get_text(strip=True):
                return element.get_text(strip=True)
        
        return "No title found"

# Usage example
def extract_article(url):
    """Extract article content from a web page"""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        
        detector = BoilerplateDetector()
        result = detector.extract_main_content(response.text)
        
        if result:
            return {
                'url': url,
                'title': result['title'],
                'content': result['content'],
                'word_count': len(result['content'].split()),
                'extraction_quality': result['quality_score']
            }
        else:
            return {'error': 'No main content found'}
    
    except Exception as e:
        return {'error': str(e)}

# Test the extractor
article = extract_article('https://example-news-site.com/article')
if 'error' not in article:
    print(f"Title: {article['title']}")
    print(f"Word count: {article['word_count']}")
    print(f"Content preview: {article['content'][:200]}...")
else:
    print(f"Extraction failed: {article['error']}")

Advanced ML-based Content Extraction

Machine Learning Content Extractor
import numpy as np
from bs4 import BeautifulSoup
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle

class MLBoilerplateDetector:
    def __init__(self):
        self.model = None
        self.vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
        # All features returned by extract_features, including the contextual ones
        self.feature_names = [
            'text_density', 'link_density', 'text_length', 'tag_depth',
            'has_content_class', 'has_nav_class', 'paragraph_count',
            'list_count', 'image_count', 'form_count', 'table_count',
            'position_in_dom', 'sibling_content_ratio', 'parent_content_ratio',
            'css_visibility', 'semantic_score'
        ]
    
    def extract_features(self, element, soup_context):
        """Extract comprehensive features for ML classification"""
        text = element.get_text(strip=True)
        
        features = {
            'text_density': len(text) / max(1, len(str(element))),
            'link_density': len(''.join(a.get_text() for a in element.find_all('a'))) / max(1, len(text)),
            'text_length': len(text),
            'tag_depth': len(list(element.parents)),
            'has_content_class': int(any(cls in str(element.get('class', [])).lower() 
                                       for cls in ['content', 'article', 'main', 'post', 'body'])),
            'has_nav_class': int(any(cls in str(element.get('class', [])).lower() 
                                   for cls in ['nav', 'menu', 'sidebar', 'footer', 'header'])),
            'paragraph_count': len(element.find_all('p')),
            'list_count': len(element.find_all(['ul', 'ol'])),
            'image_count': len(element.find_all('img')),
            'form_count': len(element.find_all('form')),
            'table_count': len(element.find_all('table')),
        }
        
        # Additional contextual features
        features.update({
            'position_in_dom': self._get_element_position(element, soup_context),
            'sibling_content_ratio': self._calculate_sibling_content_ratio(element),
            'parent_content_ratio': self._calculate_parent_content_ratio(element),
            'css_visibility': self._check_css_visibility(element),
            'semantic_score': self._calculate_semantic_score(text)
        })
        
        return np.array([features[name] for name in self.feature_names])
    
    def _get_element_position(self, element, soup):
        """Calculate relative position of element in DOM"""
        all_elements = soup.find_all()
        try:
            position = all_elements.index(element)
            return position / len(all_elements)
        except ValueError:
            return 0.5
    
    def _calculate_sibling_content_ratio(self, element):
        """Ratio of content vs non-content siblings"""
        if not element.parent:
            return 0.5
        
        siblings = element.parent.find_all(recursive=False)
        if len(siblings) <= 1:
            return 0.5
        
        content_siblings = sum(1 for sib in siblings if len(sib.get_text(strip=True)) > 50)
        return content_siblings / len(siblings)
    
    def _calculate_parent_content_ratio(self, element):
        """Content density of parent elements"""
        if not element.parent:
            return 0.5
        
        parent_text = element.parent.get_text(strip=True)
        element_text = element.get_text(strip=True)
        
        if not parent_text:
            return 0
        
        return len(element_text) / len(parent_text)
    
    def _check_css_visibility(self, element):
        """Check if element is visually hidden (simplified)"""
        style = element.get('style', '').lower()
        classes = ' '.join(element.get('class', [])).lower()
        
        hidden_indicators = ['display:none', 'visibility:hidden', 'hidden', 'sr-only']
        return int(not any(indicator in style or indicator in classes 
                          for indicator in hidden_indicators))
    
    def _calculate_semantic_score(self, text):
        """Simple semantic content scoring"""
        if not text:
            return 0
        
        # Common content words vs navigation words
        content_words = ['article', 'story', 'news', 'post', 'content', 'text', 'paragraph']
        nav_words = ['menu', 'navigation', 'links', 'home', 'contact', 'about']
        
        text_lower = text.lower()
        content_score = sum(text_lower.count(word) for word in content_words)
        nav_score = sum(text_lower.count(word) for word in nav_words)
        
        total_score = content_score + nav_score
        return content_score / max(1, total_score)
    
    def train_model(self, training_data):
        """Train the ML model on labeled data"""
        features = []
        labels = []
        
        for html, is_content in training_data:
            soup = BeautifulSoup(html, 'html.parser')
            for element in soup.find_all(['div', 'section', 'article', 'p']):
                feature_vector = self.extract_features(element, soup)
                features.append(feature_vector)
                labels.append(1 if is_content else 0)
        
        X = np.array(features)
        y = np.array(labels)
        
        # Train Random Forest classifier
        self.model = RandomForestClassifier(
            n_estimators=100,
            max_depth=10,
            random_state=42
        )
        self.model.fit(X, y)
        
        return self.model.score(X, y)
    
    def predict_content(self, html):
        """Predict main content using trained model"""
        if not self.model:
            raise ValueError("Model not trained yet")
        
        soup = BeautifulSoup(html, 'html.parser')
        content_candidates = []
        
        for element in soup.find_all(['article', 'main', 'section', 'div']):
            if len(element.get_text(strip=True)) > 50:  # Minimum content length
                features = self.extract_features(element, soup)
                probability = self.model.predict_proba([features])[0][1]  # Probability of being content
                
                content_candidates.append({
                    'element': element,
                    'probability': probability,
                    'text_length': len(element.get_text(strip=True))
                })
        
        # Sort by probability and text length
        content_candidates.sort(
            key=lambda x: x['probability'] * np.log(x['text_length'] + 1),
            reverse=True
        )
        
        if content_candidates and content_candidates[0]['probability'] > 0.6:
            return {
                'content': content_candidates[0]['element'].get_text(strip=True),
                'confidence': content_candidates[0]['probability'],
                'html': str(content_candidates[0]['element'])
            }
        
        return None
    
    def save_model(self, filepath):
        """Save trained model to file"""
        with open(filepath, 'wb') as f:
            pickle.dump({'model': self.model, 'vectorizer': self.vectorizer}, f)
    
    def load_model(self, filepath):
        """Load trained model from file"""
        with open(filepath, 'rb') as f:
            data = pickle.load(f)
            self.model = data['model']
            self.vectorizer = data['vectorizer']

# Usage example
ml_detector = MLBoilerplateDetector()

# Train on labeled data (you would need to collect this)
# training_data = [(html_content, is_main_content), ...]
# accuracy = ml_detector.train_model(training_data)

# Save the trained model
# ml_detector.save_model('boilerplate_model.pkl')

# Use for content extraction
# result = ml_detector.predict_content(html_content)
# if result:
#     print(f"Extracted content with {result['confidence']:.2f} confidence")
#     print(result['content'][:200] + "...")
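
The commented-out training call above assumes labeled data is already available. As a toy illustration only, the fragments below are invented placeholders showing the expected (html, is_main_content) shape; a real training set would come from annotated pages.

Toy Training Data for MLBoilerplateDetector (illustrative)
# Toy labeled examples for train_model(); each entry is (html_fragment, is_main_content).
# These fragments are placeholders, not a real corpus.
toy_training_data = [
    ("<div class='article-body'><p>" + "Long article sentence. " * 40 + "</p>"
     "<p>" + "Another paragraph of body text. " * 40 + "</p></div>", True),
    ("<div class='post'><p>" + "Relevant reporting continues here. " * 50 + "</p></div>", True),
    ("<div class='nav menu'><a href='/'>Home</a><a href='/about'>About</a>"
     "<a href='/contact'>Contact</a></div>", False),
    ("<div class='footer'><a href='/terms'>Terms</a><a href='/privacy'>Privacy</a></div>", False),
]

detector = MLBoilerplateDetector()
training_accuracy = detector.train_model(toy_training_data)
print(f"Training accuracy on toy data: {training_accuracy:.2f}")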

Real-World Boilerplate Detection Implementations

Mozilla Readability

Powers Firefox Reader Mode and is used across many content extraction applications.

  • Heuristic-based content scoring algorithm
  • Handles diverse website layouts and CMSs
  • Optimized for speed and browser integration
  • Open-source with wide adoption across tools
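
Outside Firefox, the same algorithm is available through ports. The sketch below assumes the readability-lxml Python port is installed and only illustrates its basic calls; the URL is the same placeholder used earlier on this page.

Using the readability-lxml Port (illustrative sketch)
# Sketch using the readability-lxml port of Mozilla's Readability algorithm.
# Assumes: pip install readability-lxml requests
import requests
from readability import Document

response = requests.get("https://example-news-site.com/article", timeout=10)
doc = Document(response.text)

print("Title:", doc.short_title())
main_html = doc.summary()  # cleaned HTML of the detected main content
print("Extracted HTML length:", len(main_html))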

Google Web Crawler

Uses advanced boilerplate detection for search indexing and content understanding at web scale.

  • ML-based content quality assessment
  • Visual rendering analysis for content blocks
  • Handles JavaScript-generated content
  • Processes billions of pages daily

Pocket (Read Later Service)

Extracts clean article content for distraction-free reading across millions of saved articles.

  • Hybrid approach combining multiple techniques
  • User feedback loop for improving accuracy
  • Mobile-optimized content formatting
  • Real-time processing for instant saves

Mercury Parser (Postlight)

Commercial-grade content extraction API used by news aggregators and content management systems.

  • Site-specific extraction rules and fallbacks
  • High accuracy across news and blog sites
  • Metadata extraction (author, publish date)
  • Enterprise-grade reliability and support

Boilerplate Detection Best Practices

✅ Do

  • Combine multiple detection techniques for robustness
  • Test extensively across diverse website types
  • Handle dynamic content loading and SPAs
  • Implement fallback mechanisms for edge cases (see the sketch after these lists)
  • Monitor extraction quality and user feedback
  • Cache results to improve processing efficiency

❌ Don't

  • Rely solely on CSS selectors or class names
  • Ignore website-specific extraction patterns
  • Process pages without considering load performance
  • Skip validation of extracted content quality
  • Assume all sites follow similar layout patterns
  • Forget to handle malformed or unusual HTML
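
Several of the recommendations above, notably fallback mechanisms, quality validation, and caching, can be combined in one small wrapper. This is a hedged sketch that reuses the BoilerplateDetector class defined earlier; extract_with_fallbacks and its fallback_extractors argument are hypothetical names, not an established API.

Fallback Chain with Caching (illustrative sketch)
# Sketch of a fallback chain with caching. fallback_extractors is a sequence of
# (name, callable) pairs you supply, e.g. a Readability port as a second stage.
import hashlib

_extraction_cache = {}

def extract_with_fallbacks(url, html, fallback_extractors=()):
    cache_key = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if cache_key in _extraction_cache:
        return _extraction_cache[cache_key]

    result = None
    detector = BoilerplateDetector()
    primary = detector.extract_main_content(html)
    # Validate extraction quality before accepting it (here: a crude word-count check)
    if primary and len(primary['content'].split()) > 100:
        result = {'url': url, 'source': 'dom-heuristics', **primary}
    else:
        for name, extractor in fallback_extractors:
            fallback = extractor(html)
            if fallback:
                result = {'url': url, 'source': name, 'content': fallback}
                break

    _extraction_cache[cache_key] = result
    return result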