What is Boilerplate Detection?
Boilerplate detection is the process of automatically identifying and removing template content, navigation elements, advertisements, and other non-essential parts of web pages to extract the main content. This technology is fundamental to web scraping, content aggregation, search engine indexing, and readability tools. By distinguishing between actual article content and surrounding page elements, boilerplate detection enables cleaner data extraction and better user experiences in applications like news readers, research tools, and content management systems.
Modern boilerplate detection combines machine learning techniques, DOM analysis, visual rendering information, and heuristic rules to achieve high accuracy across diverse website layouts and content management systems. The challenge lies in adapting to constantly evolving web designs, dynamic content loading, and the vast heterogeneity of website structures while maintaining processing efficiency for large-scale applications.
Boilerplate Detection Techniques
DOM-based Analysis
Analyzes HTML structure, element positions, and content distribution patterns.
Visual Rendering
Uses browser rendering to analyze visual layout and content positioning; a minimal browser-based sketch appears at the end of the implementation examples below.
Machine Learning
Trained models using features from DOM, text, and visual analysis combined.
Heuristic Rules
Hand-crafted rules based on common web design patterns and structures.
Implementation Examples
DOM-based Content Extraction
from bs4 import BeautifulSoup, Comment
import requests

class BoilerplateDetector:
    def __init__(self):
        # Common boilerplate indicators
        self.boilerplate_classes = {
            'nav', 'navigation', 'menu', 'header', 'footer', 'sidebar',
            'ads', 'advertisement', 'social', 'share', 'comments',
            'related', 'recommended', 'tags', 'categories'
        }
        self.boilerplate_tags = {'nav', 'header', 'footer', 'aside', 'script', 'style'}
        self.content_tags = {'article', 'main', 'section', 'div', 'p'}

    def calculate_text_density(self, element):
        """Calculate text-to-markup ratio for an element"""
        text_length = len(element.get_text(strip=True))
        markup_length = len(str(element))
        if markup_length == 0:
            return 0
        return text_length / markup_length

    def calculate_link_density(self, element):
        """Calculate ratio of link text to total text"""
        total_text = element.get_text(strip=True)
        if not total_text:
            return 1.0  # High link density if no text
        link_text = ' '.join(a.get_text(strip=True) for a in element.find_all('a'))
        return len(link_text) / len(total_text)

    def get_content_indicators(self, element):
        """Extract features that indicate main content"""
        indicators = {
            'text_density': self.calculate_text_density(element),
            'link_density': self.calculate_link_density(element),
            'text_length': len(element.get_text(strip=True)),
            'tag_name': element.name,
            'has_content_class': any(cls in str(element.get('class', [])).lower()
                                     for cls in ['content', 'article', 'main', 'post']),
            'has_boilerplate_class': any(cls in str(element.get('class', [])).lower()
                                         for cls in self.boilerplate_classes),
            'paragraph_count': len(element.find_all('p')),
            'image_count': len(element.find_all('img')),
            'dom_depth': len(list(element.parents))
        }
        return indicators

    def is_boilerplate(self, element, threshold=0.3):
        """Determine if an element is likely boilerplate"""
        indicators = self.get_content_indicators(element)
        # Rule-based scoring
        score = 0.5  # neutral starting point
        # Positive indicators (likely content)
        if indicators['text_density'] > 0.3:
            score += 0.2
        if indicators['text_length'] > 100:
            score += 0.1
        if indicators['has_content_class']:
            score += 0.3
        if indicators['paragraph_count'] > 2:
            score += 0.1
        # Negative indicators (likely boilerplate)
        if indicators['link_density'] > 0.5:
            score -= 0.2
        if indicators['has_boilerplate_class']:
            score -= 0.3
        if indicators['tag_name'] in self.boilerplate_tags:
            score -= 0.4
        if indicators['text_length'] < 50:
            score -= 0.2
        return score < threshold

    def extract_main_content(self, html):
        """Extract main content from HTML, removing boilerplate"""
        soup = BeautifulSoup(html, 'html.parser')
        # Remove obviously non-content elements
        for element in soup.find_all(['script', 'style', 'nav', 'header', 'footer']):
            element.decompose()
        # Remove comments
        for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
            comment.extract()
        # Find content candidates
        content_candidates = []
        for element in soup.find_all(['article', 'main', 'section', 'div']):
            if not self.is_boilerplate(element):
                indicators = self.get_content_indicators(element)
                content_candidates.append((element, indicators))
        # Sort by content quality score: long, dense text with few links first
        content_candidates.sort(
            key=lambda x: (x[1]['text_length'] * x[1]['text_density'] *
                           (1 - x[1]['link_density'])),
            reverse=True
        )
        # Return the best content candidate
        if content_candidates:
            best_element, best_indicators = content_candidates[0]
            quality_score = (best_indicators['text_length'] * best_indicators['text_density'] *
                             (1 - best_indicators['link_density']))
            return {
                'title': self._extract_title(soup),
                'content': best_element.get_text(strip=True, separator=' '),
                'html': str(best_element),
                'quality_score': quality_score
            }
        return None

    def _extract_title(self, soup):
        """Extract page title using multiple strategies"""
        # Try different title sources, most specific first
        title_selectors = [
            'article h1',
            '.post-title',
            '.article-title',
            'h1',
            'title'
        ]
        for selector in title_selectors:
            element = soup.select_one(selector)
            if element and element.get_text(strip=True):
                return element.get_text(strip=True)
        return "No title found"

# Usage example
def extract_article(url):
    """Extract article content from a web page"""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        detector = BoilerplateDetector()
        result = detector.extract_main_content(response.text)
        if result:
            return {
                'url': url,
                'title': result['title'],
                'content': result['content'],
                'word_count': len(result['content'].split()),
                'extraction_quality': result['quality_score']
            }
        else:
            return {'error': 'No main content found'}
    except Exception as e:
        return {'error': str(e)}

# Test the extractor
article = extract_article('https://example-news-site.com/article')
if 'error' not in article:
    print(f"Title: {article['title']}")
    print(f"Word count: {article['word_count']}")
    print(f"Content preview: {article['content'][:200]}...")
else:
    print(f"Extraction failed: {article['error']}")
Advanced ML-based Content Extraction
import numpy as np
import pickle
from bs4 import BeautifulSoup
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

class MLBoilerplateDetector:
    def __init__(self):
        self.model = None
        # Reserved for optional text-based features; kept so saved models stay self-contained
        self.vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
        self.feature_names = [
            'text_density', 'link_density', 'text_length', 'tag_depth',
            'has_content_class', 'has_nav_class', 'paragraph_count',
            'list_count', 'image_count', 'form_count', 'table_count',
            'position_in_dom', 'sibling_content_ratio', 'parent_content_ratio',
            'css_visibility', 'semantic_score'
        ]

    def extract_features(self, element, soup_context):
        """Extract comprehensive features for ML classification"""
        text = element.get_text(strip=True)
        features = {
            'text_density': len(text) / max(1, len(str(element))),
            'link_density': len(''.join(a.get_text() for a in element.find_all('a'))) / max(1, len(text)),
            'text_length': len(text),
            'tag_depth': len(list(element.parents)),
            'has_content_class': int(any(cls in str(element.get('class', [])).lower()
                                         for cls in ['content', 'article', 'main', 'post', 'body'])),
            'has_nav_class': int(any(cls in str(element.get('class', [])).lower()
                                     for cls in ['nav', 'menu', 'sidebar', 'footer', 'header'])),
            'paragraph_count': len(element.find_all('p')),
            'list_count': len(element.find_all(['ul', 'ol'])),
            'image_count': len(element.find_all('img')),
            'form_count': len(element.find_all('form')),
            'table_count': len(element.find_all('table')),
        }
        # Additional contextual features
        features.update({
            'position_in_dom': self._get_element_position(element, soup_context),
            'sibling_content_ratio': self._calculate_sibling_content_ratio(element),
            'parent_content_ratio': self._calculate_parent_content_ratio(element),
            'css_visibility': self._check_css_visibility(element),
            'semantic_score': self._calculate_semantic_score(text)
        })
        return np.array([features[name] for name in self.feature_names])

    def _get_element_position(self, element, soup):
        """Calculate relative position of element in DOM"""
        all_elements = soup.find_all()
        try:
            position = all_elements.index(element)
            return position / len(all_elements)
        except ValueError:
            return 0.5

    def _calculate_sibling_content_ratio(self, element):
        """Ratio of content vs non-content siblings"""
        if not element.parent:
            return 0.5
        siblings = element.parent.find_all(recursive=False)
        if len(siblings) <= 1:
            return 0.5
        content_siblings = sum(1 for sib in siblings if len(sib.get_text(strip=True)) > 50)
        return content_siblings / len(siblings)

    def _calculate_parent_content_ratio(self, element):
        """Content density of parent elements"""
        if not element.parent:
            return 0.5
        parent_text = element.parent.get_text(strip=True)
        element_text = element.get_text(strip=True)
        if not parent_text:
            return 0
        return len(element_text) / len(parent_text)

    def _check_css_visibility(self, element):
        """Check if element is visually hidden (simplified)"""
        style = element.get('style', '').lower()
        classes = ' '.join(element.get('class', [])).lower()
        hidden_indicators = ['display:none', 'visibility:hidden', 'hidden', 'sr-only']
        return int(not any(indicator in style or indicator in classes
                           for indicator in hidden_indicators))

    def _calculate_semantic_score(self, text):
        """Simple semantic content scoring"""
        if not text:
            return 0
        # Common content words vs navigation words
        content_words = ['article', 'story', 'news', 'post', 'content', 'text', 'paragraph']
        nav_words = ['menu', 'navigation', 'links', 'home', 'contact', 'about']
        text_lower = text.lower()
        content_score = sum(text_lower.count(word) for word in content_words)
        nav_score = sum(text_lower.count(word) for word in nav_words)
        total_score = content_score + nav_score
        return content_score / max(1, total_score)

    def train_model(self, training_data):
        """Train the ML model on labeled (html_fragment, is_content) pairs"""
        features = []
        labels = []
        for html, is_content in training_data:
            soup = BeautifulSoup(html, 'html.parser')
            for element in soup.find_all(['div', 'section', 'article', 'p']):
                feature_vector = self.extract_features(element, soup)
                features.append(feature_vector)
                labels.append(1 if is_content else 0)
        X = np.array(features)
        y = np.array(labels)
        # Train Random Forest classifier
        self.model = RandomForestClassifier(
            n_estimators=100,
            max_depth=10,
            random_state=42
        )
        self.model.fit(X, y)
        return self.model.score(X, y)

    def predict_content(self, html):
        """Predict main content using trained model"""
        if not self.model:
            raise ValueError("Model not trained yet")
        soup = BeautifulSoup(html, 'html.parser')
        content_candidates = []
        for element in soup.find_all(['article', 'main', 'section', 'div']):
            if len(element.get_text(strip=True)) > 50:  # Minimum content length
                features = self.extract_features(element, soup)
                probability = self.model.predict_proba([features])[0][1]  # Probability of being content
                content_candidates.append({
                    'element': element,
                    'probability': probability,
                    'text_length': len(element.get_text(strip=True))
                })
        # Sort by probability weighted by (log) text length
        content_candidates.sort(
            key=lambda x: x['probability'] * np.log(x['text_length'] + 1),
            reverse=True
        )
        if content_candidates and content_candidates[0]['probability'] > 0.6:
            return {
                'content': content_candidates[0]['element'].get_text(strip=True),
                'confidence': content_candidates[0]['probability'],
                'html': str(content_candidates[0]['element'])
            }
        return None

    def save_model(self, filepath):
        """Save trained model to file"""
        with open(filepath, 'wb') as f:
            pickle.dump({'model': self.model, 'vectorizer': self.vectorizer}, f)

    def load_model(self, filepath):
        """Load trained model from file"""
        with open(filepath, 'rb') as f:
            data = pickle.load(f)
        self.model = data['model']
        self.vectorizer = data['vectorizer']

# Usage example
ml_detector = MLBoilerplateDetector()
# Train on labeled data (you would need to collect this)
# training_data = [(html_fragment, is_main_content), ...]
# accuracy = ml_detector.train_model(training_data)
# Save the trained model
# ml_detector.save_model('boilerplate_model.pkl')
# Use for content extraction
# result = ml_detector.predict_content(html_content)
# if result:
#     print(f"Extracted content with {result['confidence']:.2f} confidence")
#     print(result['content'][:200] + "...")
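Visual Rendering-based Extraction (Sketch)
The visual-rendering technique listed earlier relies on an actual browser rather than raw HTML. The sketch below is one minimal way to approximate it, assuming Playwright is installed; the function name, viewport size, and 200-character cutoff are illustrative choices, and it simply ranks rendered blocks by on-screen area times visible text length, whereas production systems combine many more layout signals.

from playwright.sync_api import sync_playwright

def largest_visible_block(url):
    """Rank rendered blocks by on-screen area times visible text length (rough visual heuristic)."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={'width': 1280, 'height': 900})
        page.goto(url, wait_until='networkidle')
        best = None
        for handle in page.query_selector_all('article, main, section, div'):
            box = handle.bounding_box()  # None if the element is not rendered
            if not box or box['width'] == 0 or box['height'] == 0:
                continue
            text = handle.inner_text().strip()
            if len(text) < 200:  # arbitrary cutoff to skip nav strips and small widgets
                continue
            score = box['width'] * box['height'] * len(text)
            if best is None or score > best[0]:
                best = (score, text)
        browser.close()
        return best[1] if best else None

# print(largest_visible_block('https://example-news-site.com/article'))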
Real-World Boilerplate Detection Implementations
Mozilla Readability
Powers Firefox Reader Mode and is used across many content extraction applications.
- Heuristic-based content scoring algorithm
- Handles diverse website layouts and CMSs
- Optimized for speed and browser integration
- Open-source with wide adoption across tools
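Mozilla Readability itself is a JavaScript library, but a Python port of the same original Readability heuristics, readability-lxml, exposes a similar one-call interface; a minimal usage sketch, assuming that package is installed:

from readability import Document  # pip install readability-lxml
import requests

html = requests.get('https://example-news-site.com/article', timeout=10).text
doc = Document(html)
print(doc.title())    # best-guess article title
print(doc.summary())  # cleaned HTML of the detected main content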
Google Web Crawler
Uses advanced boilerplate detection for search indexing and content understanding at web scale.
- ML-based content quality assessment
- Visual rendering analysis for content blocks
- Handles JavaScript-generated content
- Processes billions of pages daily
Pocket (Read Later Service)
Extracts clean article content for distraction-free reading across millions of saved articles.
- Hybrid approach combining multiple techniques
- User feedback loop for improving accuracy
- Mobile-optimized content formatting
- Real-time processing for instant saves
Mercury Parser (Postlight)
Content extraction engine from Postlight, originally offered as a hosted API and now open source, used by news aggregators and content management systems.
- Site-specific extraction rules and fallbacks
- High accuracy across news and blog sites
- Metadata extraction (author, publish date)
- Enterprise-grade reliability and support
Boilerplate Detection Best Practices
✅ Do
- Combine multiple detection techniques for robustness
- Test extensively across diverse website types
- Handle dynamic content loading and SPAs (render JavaScript-heavy pages before extraction)
- Implement fallback mechanisms for edge cases
- Monitor extraction quality and user feedback
- Cache results to improve processing efficiency (see the sketch after these lists)
❌ Don't
- Rely solely on CSS selectors or class names
- Ignore website-specific extraction patterns
- Process pages without considering load performance
- Skip validation of extracted content quality
- Assume all sites follow similar layout patterns
- Forget to handle malformed or unusual HTML
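The caching advice above can be as simple as memoizing extraction results keyed by a hash of the fetched HTML, so unchanged pages are never re-processed. A minimal sketch using the BoilerplateDetector defined earlier (the in-memory dict and function name are illustrative; a real deployment would use Redis or a similar shared store):

import hashlib

_extraction_cache = {}  # HTML hash -> extraction result

def extract_with_cache(html, detector):
    """Memoize extraction results keyed by a hash of the raw HTML."""
    key = hashlib.sha256(html.encode('utf-8')).hexdigest()
    if key not in _extraction_cache:
        _extraction_cache[key] = detector.extract_main_content(html)
    return _extraction_cache[key]

# detector = BoilerplateDetector()
# result = extract_with_cache(page_html, detector)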