
SimHash & MinHash

Master SimHash & MinHash: locality-sensitive hashing, duplicate detection, and efficient similarity estimation.

45 min read · Advanced

What are SimHash and MinHash?

SimHash and MinHash are locality-sensitive hashing algorithms designed for efficient similarity detection in large document collections. SimHash creates compact fingerprints where similar documents produce similar hash values, making it ideal for near-duplicate detection. MinHash estimates Jaccard similarity between sets by computing signatures from minimum hash values, enabling fast approximate similarity calculations without pairwise comparisons.

Both algorithms solve the fundamental problem of finding similar items in massive datasets - from web crawling and plagiarism detection to recommendation systems and data deduplication. They trade perfect accuracy for tremendous speed and memory efficiency, making similarity search tractable at web scale with billions of documents.

SimHash & MinHash Performance Calculator

Example output from the interactive calculator:

SimHash: 781 KB memory, 95% accuracy, 625:1 compression
MinHash: 50,000 KB memory, 98% accuracy, 10:1 compression
Brute-force pairwise comparisons: ~5,000M
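The calculator's memory figures follow from simple arithmetic. As a sketch (the corpus size and average document size below are assumptions, not stated on this page): with 100,000 documents averaging ~5 KB each, a 64-bit SimHash and a 128-integer (32-bit) MinHash signature give:

```python
docs = 100_000      # assumed corpus size
doc_bytes = 5_000   # assumed average document size

simhash_kb = docs * 8 / 1024        # one 64-bit (8-byte) fingerprint per document
minhash_kb = docs * 128 * 4 / 1024  # 128 x 32-bit integers per signature

print(round(simhash_kb), round(minhash_kb))          # memory in KB
print(round(doc_bytes / 8), round(doc_bytes / 512))  # compression ratios
```

This reproduces the 781 KB vs 50,000 KB memory figures and the 625:1 vs 10:1 compression ratios above.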

SimHash vs MinHash Comparison

SimHash

Locality-sensitive hashing for near-duplicate detection with Hamming distance comparison.

Best for: Near-duplicate detection, web crawling, content fingerprinting
Output: Fixed-size binary hash (64-256 bits)
Comparison: Hamming distance < threshold
Memory: Very compact, scales linearly

MinHash

Jaccard similarity estimation using minimum hash values from random permutations.

Best for: Set similarity, recommendation systems, LSH bucketing
Output: Variable signature size (64-1024+ integers)
Comparison: fraction of matching signature positions estimates Jaccard similarity
Memory: Larger signatures, configurable precision

Key Differences

Similarity Metric: SimHash approximates cosine similarity, compared via Hamming distance on the fingerprints; MinHash estimates Jaccard similarity between sets.

Memory Usage: SimHash is more memory-efficient, with small fixed-size hashes.

Accuracy Trade-offs: MinHash provides better similarity estimation as signature size grows.
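To make the Jaccard side concrete, this is the exact quantity MinHash approximates, computed on two tiny token sets (a minimal sketch; real systems compare shingle sets rather than raw tokens):

```python
def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

s1 = set("the quick brown fox jumps".split())
s2 = set("the quick brown fox leaps".split())
print(jaccard(s1, s2))  # 4 shared tokens out of 6 distinct tokens
```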

Implementation Examples

SimHash Implementation

Python SimHash Example
import hashlib

class SimHash:
    def __init__(self, hash_bits=64):
        self.hash_bits = hash_bits
    
    def compute_simhash(self, text):
        """Compute SimHash for a text document"""
        # Tokenize and compute hash for each feature
        tokens = text.lower().split()
        feature_hashes = []
        
        for token in tokens:
            # Hash each token to get bit positions
            hash_val = int(hashlib.md5(token.encode()).hexdigest(), 16)
            feature_hashes.append(hash_val)
        
        # Initialize bit vector
        bit_vector = [0] * self.hash_bits
        
        # For each feature hash, update bit vector
        for hash_val in feature_hashes:
            for i in range(self.hash_bits):
                bit = (hash_val >> i) & 1
                if bit:
                    bit_vector[i] += 1
                else:
                    bit_vector[i] -= 1
        
        # Convert to final hash
        simhash = 0
        for i in range(self.hash_bits):
            if bit_vector[i] > 0:
                simhash |= (1 << i)
        
        return simhash
    
    def hamming_distance(self, hash1, hash2):
        """Calculate Hamming distance between two hashes"""
        return bin(hash1 ^ hash2).count('1')
    
    def is_near_duplicate(self, hash1, hash2, threshold=3):
        """Check if two documents are near-duplicates"""
        return self.hamming_distance(hash1, hash2) <= threshold

# Usage example
simhasher = SimHash(64)
doc1_hash = simhasher.compute_simhash("The quick brown fox jumps")
doc2_hash = simhasher.compute_simhash("The quick brown fox leaps")
is_similar = simhasher.is_near_duplicate(doc1_hash, doc2_hash)

MinHash with LSH

Python MinHash + LSH Implementation
import hashlib
import random
from collections import defaultdict

class MinHashLSH:
    def __init__(self, signature_size=128, bands=16, rows_per_band=8):
        # Banding must tile the signature exactly
        assert bands * rows_per_band == signature_size
        self.signature_size = signature_size
        self.bands = bands
        self.rows_per_band = rows_per_band
        self.hash_functions = self._generate_hash_functions()
        self.buckets = defaultdict(set)
    
    def _generate_hash_functions(self):
        """Generate random hash functions for MinHash"""
        return [(random.randint(1, 2**32), random.randint(0, 2**32)) 
                for _ in range(self.signature_size)]
    
    def compute_minhash(self, shingles):
        """Compute MinHash signature for a set of shingles"""
        signature = []
        
        for a, b in self.hash_functions:
            min_hash = float('inf')
            for shingle in shingles:
                # Use a stable hash: Python's built-in hash() is randomized
                # per process for strings, so signatures would not be
                # reproducible across runs
                h = int(hashlib.md5(shingle.encode()).hexdigest(), 16)
                # Apply the random linear hash function
                hash_val = (a * h + b) % (2**32)
                min_hash = min(min_hash, hash_val)
            signature.append(min_hash)
        
        return signature
    
    def get_shingles(self, text, k=3):
        """Generate k-shingles from text"""
        tokens = text.lower().split()
        return set([' '.join(tokens[i:i+k]) 
                   for i in range(len(tokens) - k + 1)])
    
    def add_document(self, doc_id, text):
        """Add document to LSH index"""
        shingles = self.get_shingles(text)
        signature = self.compute_minhash(shingles)
        
        # Add to LSH buckets
        for band in range(self.bands):
            start = band * self.rows_per_band
            end = start + self.rows_per_band
            band_signature = tuple(signature[start:end])
            bucket_key = (band, hash(band_signature))
            self.buckets[bucket_key].add(doc_id)
    
    def get_candidates(self, text):
        """Get candidate similar documents using LSH"""
        shingles = self.get_shingles(text)
        signature = self.compute_minhash(shingles)
        candidates = set()
        
        for band in range(self.bands):
            start = band * self.rows_per_band
            end = start + self.rows_per_band
            band_signature = tuple(signature[start:end])
            bucket_key = (band, hash(band_signature))
            candidates.update(self.buckets.get(bucket_key, set()))
        
        return candidates
    
    def jaccard_similarity(self, sig1, sig2):
        """Estimate Jaccard similarity from MinHash signatures"""
        matches = sum(1 for a, b in zip(sig1, sig2) if a == b)
        return matches / len(sig1)

# Usage example
lsh = MinHashLSH(signature_size=128, bands=16)
lsh.add_document("doc1", "The quick brown fox jumps over lazy dog")
lsh.add_document("doc2", "A quick brown fox leaps over the lazy dog")
candidates = lsh.get_candidates("The fast brown fox jumps over lazy dog")
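The core MinHash property — the probability that two signatures agree at a given position equals the sets' Jaccard similarity — can be checked directly. A self-contained sketch with 512 hash functions (the sentences, seed, and function names are arbitrary choices for illustration):

```python
import hashlib
import random

def jaccard(a, b):
    return len(a & b) / len(a | b)

def stable_hash(s):
    # Deterministic string hash (built-in hash() is randomized per process)
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

random.seed(7)
params = [(random.randint(1, 2**32), random.randint(0, 2**32))
          for _ in range(512)]

s1 = set("the quick brown fox jumps over the lazy dog".split())
s2 = set("the quick brown fox leaps over a lazy dog".split())

sig1 = [min((a * stable_hash(x) + b) % 2**32 for x in s1) for a, b in params]
sig2 = [min((a * stable_hash(x) + b) % 2**32 for x in s2) for a, b in params]

estimate = sum(x == y for x, y in zip(sig1, sig2)) / len(params)
print(jaccard(s1, s2), estimate)  # exact 0.7 vs an estimate near 0.7
```

The standard error of the estimate shrinks as 1/√n in the number of hash functions, which is why larger signatures buy accuracy.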

Real-World SimHash & MinHash Implementations

Google (Web Search)

Uses SimHash for detecting near-duplicate web pages during crawling and indexing billions of documents.

  • 64-bit SimHash for web page fingerprinting
  • Hamming distance < 3 for near-duplicate detection
  • Reduces index size and improves search quality
  • Processes billions of pages with minimal memory

Spotify (Music Recommendations)

Implements MinHash to find similar users and songs for collaborative filtering and music discovery.

  • User listening history similarity via MinHash
  • LSH for efficient candidate generation
  • 256-signature MinHash for high accuracy
  • Real-time playlist and discovery features

Twitter (Content Deduplication)

Uses both SimHash and MinHash for detecting duplicate tweets and spam content at massive scale.

  • SimHash for tweet content fingerprinting
  • MinHash for user behavior similarity
  • Real-time duplicate detection pipeline
  • Spam and bot account identification

Pinterest (Visual Discovery)

Applies MinHash to user pin collections and board similarities for personalized content recommendations.

  • Board similarity using MinHash on pin sets
  • User interest profiling and segmentation
  • Related pin recommendations
  • Large-scale similarity computations

SimHash & MinHash Best Practices

✅ Do

  • Choose hash/signature size based on accuracy requirements
  • Use LSH with MinHash for sublinear similarity search
  • Preprocess text consistently (lowercase, tokenization)
  • Monitor false positive/negative rates in production
  • Cache signatures to avoid recomputation
  • Use appropriate similarity thresholds for your domain

❌ Don't

  • Use tiny hash sizes for high-precision requirements
  • Compare raw documents when signatures exist
  • Ignore the trade-off between speed and accuracy
  • Use SimHash for exact Jaccard similarity needs
  • Forget to validate algorithm choice with real data
  • Skip tuning LSH parameters for your workload
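Tuning LSH parameters has a closed form: with b bands of r rows, two documents whose signatures agree in a fraction s of positions become candidates with probability 1 − (1 − s^r)^b. A quick sketch using the parameters from the implementation above (b = 16, r = 8):

```python
def candidate_probability(s: float, bands: int = 16, rows: int = 8) -> float:
    # Probability that two documents share at least one identical band
    return 1 - (1 - s ** rows) ** bands

for s in (0.3, 0.5, 0.7, 0.9):
    print(f"similarity {s}: P(candidate) = {candidate_probability(s):.3f}")
```

The curve is steep around the threshold (1/b)^(1/r) ≈ 0.71 for these parameters: dissimilar pairs are almost never candidates, while pairs above ~0.9 similarity almost always are. Adjusting bands vs rows shifts this threshold to match your workload.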