What are SimHash and MinHash?
SimHash and MinHash are locality-sensitive hashing algorithms designed for efficient similarity detection in large document collections. SimHash creates compact fingerprints where similar documents produce similar hash values, making it ideal for near-duplicate detection. MinHash estimates Jaccard similarity between sets by computing signatures from minimum hash values, enabling fast approximate similarity calculations without pairwise comparisons.
Both algorithms solve the fundamental problem of finding similar items in massive datasets, from web crawling and plagiarism detection to recommendation systems and data deduplication. They trade perfect accuracy for tremendous speed and memory efficiency, making similarity search tractable at web scale with billions of documents.
SimHash vs MinHash Comparison
SimHash
Locality-sensitive hashing for near-duplicate detection with Hamming distance comparison.
MinHash
Jaccard similarity estimation using minimum hash values from random permutations.
Key Differences
SimHash approximates cosine similarity via Hamming distance on fingerprints; MinHash estimates Jaccard similarity between sets
SimHash is more memory-efficient, storing one small fixed-size fingerprint (typically 64 bits) per document
MinHash gives more accurate similarity estimates as signature size grows, at the cost of larger signatures
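To make the difference concrete, here is a minimal sketch (plain Python, names are illustrative) of the exact Jaccard similarity that MinHash approximates:

```python
def jaccard(a, b):
    """Exact Jaccard similarity: |A intersection B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

tokens1 = "the quick brown fox jumps".split()
tokens2 = "the quick brown fox leaps".split()
# 4 shared tokens out of 6 distinct tokens overall -> 4/6
print(jaccard(tokens1, tokens2))
```

Computing this exactly for every pair of documents is quadratic in the collection size; MinHash approximates the same quantity from fixed-size signatures instead.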
Implementation Examples
SimHash Implementation
import hashlib

class SimHash:
    def __init__(self, hash_bits=64):
        self.hash_bits = hash_bits

    def compute_simhash(self, text):
        """Compute SimHash for a text document"""
        # Tokenize and compute a hash for each feature
        tokens = text.lower().split()
        feature_hashes = []
        for token in tokens:
            # Hash each token; only the low hash_bits bits are used below
            hash_val = int(hashlib.md5(token.encode()).hexdigest(), 16)
            feature_hashes.append(hash_val)
        # Initialize the per-bit vote vector
        bit_vector = [0] * self.hash_bits
        # For each feature hash, vote each bit position up or down
        for hash_val in feature_hashes:
            for i in range(self.hash_bits):
                bit = (hash_val >> i) & 1
                if bit:
                    bit_vector[i] += 1
                else:
                    bit_vector[i] -= 1
        # Convert the vote counts to the final fingerprint
        simhash = 0
        for i in range(self.hash_bits):
            if bit_vector[i] > 0:
                simhash |= (1 << i)
        return simhash

    def hamming_distance(self, hash1, hash2):
        """Calculate Hamming distance between two hashes"""
        return bin(hash1 ^ hash2).count('1')

    def is_near_duplicate(self, hash1, hash2, threshold=3):
        """Check if two documents are near-duplicates"""
        return self.hamming_distance(hash1, hash2) <= threshold

# Usage example
simhasher = SimHash(64)
doc1_hash = simhasher.compute_simhash("The quick brown fox jumps")
doc2_hash = simhasher.compute_simhash("The quick brown fox leaps")
is_similar = simhasher.is_near_duplicate(doc1_hash, doc2_hash)
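Checking Hamming distance pairwise is still quadratic in the number of documents. A common trick (described in Manku et al.'s near-duplicate detection paper) splits the 64-bit fingerprint into blocks: by the pigeonhole principle, two hashes within Hamming distance 3 must agree exactly on at least one of 4 blocks, so candidates can be found with exact-match lookups. A minimal sketch (function name is illustrative):

```python
def block_keys(simhash, num_blocks=4, bits=64):
    """Split a SimHash into equal-width blocks for exact-match indexing.

    Two 64-bit hashes within Hamming distance 3 differ in at most 3 of the
    4 blocks, so they must agree on at least one (block_index, value) key.
    """
    width = bits // num_blocks
    mask = (1 << width) - 1
    return [(i, (simhash >> (i * width)) & mask) for i in range(num_blocks)]

# Index documents under each key, then verify the few candidates that
# share a key with a full Hamming-distance check.
h1 = 0x0123456789ABCDEF
h2 = h1 ^ 0b101  # flip 2 bits, both inside the lowest block
shared = set(block_keys(h1)) & set(block_keys(h2))
# The three untouched blocks still match, so `shared` is non-empty.
```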
MinHash with LSH
import hashlib
import random
from collections import defaultdict

class MinHashLSH:
    def __init__(self, signature_size=128, bands=16, rows_per_band=8):
        assert bands * rows_per_band == signature_size, \
            "bands * rows_per_band must equal signature_size"
        self.signature_size = signature_size
        self.bands = bands
        self.rows_per_band = rows_per_band
        self.hash_functions = self._generate_hash_functions()
        self.buckets = defaultdict(set)

    def _generate_hash_functions(self):
        """Generate random (a, b) pairs for the family h(x) = (a*x + b) mod 2**32"""
        return [(random.randint(1, 2**32), random.randint(0, 2**32))
                for _ in range(self.signature_size)]

    @staticmethod
    def _stable_hash(shingle):
        """Deterministic shingle hash (built-in hash() is salted per process)"""
        return int(hashlib.md5(shingle.encode()).hexdigest()[:8], 16)

    def compute_minhash(self, shingles):
        """Compute MinHash signature for a set of shingles"""
        signature = []
        for a, b in self.hash_functions:
            min_hash = float('inf')
            for shingle in shingles:
                h = self._stable_hash(shingle)
                # Apply the random hash function
                hash_val = (a * h + b) % (2**32)
                min_hash = min(min_hash, hash_val)
            signature.append(min_hash)
        return signature

    def get_shingles(self, text, k=3):
        """Generate k-shingles (word k-grams) from text"""
        tokens = text.lower().split()
        return set(' '.join(tokens[i:i+k])
                   for i in range(len(tokens) - k + 1))

    def add_document(self, doc_id, text):
        """Add a document to the LSH index"""
        shingles = self.get_shingles(text)
        signature = self.compute_minhash(shingles)
        # Hash each band of the signature into its own bucket
        for band in range(self.bands):
            start = band * self.rows_per_band
            end = start + self.rows_per_band
            band_signature = tuple(signature[start:end])
            bucket_key = (band, band_signature)
            self.buckets[bucket_key].add(doc_id)

    def get_candidates(self, text):
        """Get candidate similar documents using LSH"""
        shingles = self.get_shingles(text)
        signature = self.compute_minhash(shingles)
        candidates = set()
        for band in range(self.bands):
            start = band * self.rows_per_band
            end = start + self.rows_per_band
            band_signature = tuple(signature[start:end])
            bucket_key = (band, band_signature)
            candidates.update(self.buckets.get(bucket_key, set()))
        return candidates

    def jaccard_similarity(self, sig1, sig2):
        """Estimate Jaccard similarity from MinHash signatures"""
        matches = sum(1 for a, b in zip(sig1, sig2) if a == b)
        return matches / len(sig1)

# Usage example
lsh = MinHashLSH(signature_size=128, bands=16, rows_per_band=8)
lsh.add_document("doc1", "The quick brown fox jumps over lazy dog")
lsh.add_document("doc2", "A quick brown fox leaps over the lazy dog")
candidates = lsh.get_candidates("The fast brown fox jumps over lazy dog")
Real-World SimHash & MinHash Implementations
Google (Web Search)
Uses SimHash for detecting near-duplicate web pages during crawling and indexing billions of documents.
- 64-bit SimHash for web page fingerprinting
- Hamming distance ≤ 3 for near-duplicate detection
- Reduces index size and improves search quality
- Processes billions of pages with minimal memory
Spotify (Music Recommendations)
Implements MinHash to find similar users and songs for collaborative filtering and music discovery.
- User listening history similarity via MinHash
- LSH for efficient candidate generation
- 256-hash MinHash signatures for high accuracy
- Real-time playlist and discovery features
Twitter (Content Deduplication)
Uses both SimHash and MinHash for detecting duplicate tweets and spam content at massive scale.
- SimHash for tweet content fingerprinting
- MinHash for user behavior similarity
- Real-time duplicate detection pipeline
- Spam and bot account identification
Pinterest (Visual Discovery)
Applies MinHash to user pin collections and board similarities for personalized content recommendations.
- Board similarity using MinHash on pin sets
- User interest profiling and segmentation
- Related pin recommendations
- Large-scale similarity computations
SimHash & MinHash Best Practices
✅ Do
- Choose hash/signature size based on accuracy requirements
- Use LSH with MinHash for sublinear similarity search
- Preprocess text consistently (lowercase, tokenization)
- Monitor false positive/negative rates in production
- Cache signatures to avoid recomputation
- Use appropriate similarity thresholds for your domain
❌ Don't
- Use tiny hash sizes for high-precision requirements
- Compare raw documents when signatures exist
- Ignore the trade-off between speed and accuracy
- Use SimHash for exact Jaccard similarity needs
- Forget to validate algorithm choice with real data
- Skip tuning LSH parameters for your workload
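On the last point, the banding parameters can be tuned analytically: with b bands of r rows each, two documents with Jaccard similarity s become candidates with probability 1 − (1 − s^r)^b. A quick sketch for the 16-band, 8-row configuration used in the example above:

```python
def candidate_probability(s, bands=16, rows=8):
    """Probability that two docs with Jaccard similarity s share >= 1 band."""
    return 1 - (1 - s ** rows) ** bands

for s in (0.3, 0.5, 0.7, 0.9):
    print(f"similarity {s:.1f} -> "
          f"candidate probability {candidate_probability(s):.3f}")
# The S-curve is steepest near the threshold (1/bands)**(1/rows) ~ 0.71,
# so pairs below ~0.5 similarity are almost never compared at all.
```

Raising the number of rows per band pushes the threshold up (fewer false positives); raising the number of bands pushes it down (fewer false negatives).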