Perceptual Hashing (pHash/dHash)

Master perceptual hashing: pHash, dHash, image similarity detection, and visual duplicate identification.

35 min read · Intermediate

What is Perceptual Hashing?

Perceptual hashing creates compact fingerprints of multimedia content (primarily images) that remain similar even when the content undergoes common transformations such as compression, resizing, or minor edits. Unlike cryptographic hashes, whose output changes completely after any alteration, perceptual hashes capture the essential visual characteristics of an image, so two hashes can be compared (typically by Hamming distance) to estimate how alike the images are. This makes them valuable for duplicate detection, content identification, and similarity search.
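To make the contrast with cryptographic hashes concrete, the following sketch hashes two byte strings that differ in a single bit with SHA-256 and counts how many output bits change. The input bytes are arbitrary placeholders rather than real image data, and `bit_difference` is a helper name introduced only for this illustration.

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    """Count differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Two inputs that differ in exactly one bit (placeholder bytes, not a real image)
original = bytes(range(256))
modified = bytes([original[0] ^ 0x01]) + original[1:]

digest_a = hashlib.sha256(original).digest()
digest_b = hashlib.sha256(modified).digest()

# A cryptographic hash is expected to flip roughly half of its 256 output bits
changed_bits = bit_difference(digest_a, digest_b)
print(f"Input bits changed: 1, SHA-256 bits changed: {changed_bits} of 256")
```

A perceptual hash aims for the opposite behavior: a small visual change should move only a few bits, which is what makes a Hamming-distance threshold meaningful.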

The technology powers reverse image search engines, copyright detection systems, photo organization tools, and content moderation platforms. By converting high-dimensional image data into compact binary signatures, perceptual hashing enables efficient similarity comparisons across massive image collections while being robust to the variations that commonly occur in real-world scenarios.
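As a rough illustration of why compact signatures matter at scale, the sketch below estimates hash storage and the number of pairs in a naive all-against-all comparison. The collection size of 100,000 images and the 64-bit hash length are assumptions chosen for the example, and both function names are hypothetical.

```python
def hash_storage_bytes(num_images: int, hash_bits: int = 64) -> int:
    """Bytes needed to store one fixed-size hash per image."""
    return num_images * hash_bits // 8

def pairwise_comparisons(num_images: int) -> int:
    """Number of unordered image pairs in an all-against-all comparison."""
    return num_images * (num_images - 1) // 2

# Illustrative collection size (an assumption, not a measured figure)
n = 100_000
print(f"Storage: {hash_storage_bytes(n) / 1024:.0f} KiB")
print(f"All-pairs comparisons: {pairwise_comparisons(n):,}")
```

Storage grows linearly with the collection, while exhaustive pairwise comparison grows quadratically, which is why large systems usually index hashes instead of comparing every pair.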

Perceptual Hashing Algorithms

Average Hash (aHash)

Simplest and fastest algorithm comparing each pixel to the image average.

Process: Scale → Grayscale → Average → Binary comparison
Speed: Very fast, minimal computation
Robustness: Basic, handles simple modifications

Difference Hash (dHash)

Compares adjacent pixels to capture gradient information and structure.

Process: Scale → Grayscale → Adjacent pixel comparison
Speed: Fast, slightly more complex than aHash
Robustness: Better than aHash for brightness changes

Perceptual Hash (pHash)

Uses Discrete Cosine Transform to capture frequency domain characteristics.

Process: Scale → Grayscale → DCT → Low frequencies → Hash
Speed: Moderate, DCT computation overhead
Robustness: Strong against compression, resizing, and moderate brightness or contrast changes
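The pHash process above can be sketched with NumPy alone. This is a minimal illustration, not a reference implementation: it assumes the image has already been scaled to a square grayscale array (32x32 here), builds the DCT-II by matrix multiplication, and thresholds the low-frequency block at its median, which is one common choice among several. The names `dct_matrix` and `phash_from_pixels` are introduced for this example.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)   # frequency index (rows)
    x = np.arange(n).reshape(1, -1)   # sample index (columns)
    basis = np.cos(np.pi * (2 * x + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    basis[0, :] = np.sqrt(1.0 / n)
    return basis

def phash_from_pixels(pixels, hash_size: int = 8) -> int:
    """DCT-based hash of a square grayscale array that was resized beforehand."""
    pixels = np.asarray(pixels, dtype=np.float64)
    n = pixels.shape[0]
    if pixels.shape != (n, n) or n < hash_size:
        raise ValueError("expected a square array at least hash_size wide")
    d = dct_matrix(n)
    coeffs = d @ pixels @ d.T                 # 2D DCT-II
    low = coeffs[:hash_size, :hash_size]      # keep the low-frequency corner
    bits = (low > np.median(low)).flatten()   # 1 where above the median
    hash_value = 0
    for i, bit in enumerate(bits):
        if bit:
            hash_value |= 1 << i
    return hash_value

# Random array standing in for a 32x32 grayscale image (an assumption for the demo)
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32)).astype(np.float64)
brighter = image + 40   # uniform brightness shift, no clipping applied

h1, h2 = phash_from_pixels(image), phash_from_pixels(brighter)
print(f"Hamming distance after brightness shift: {bin(h1 ^ h2).count('1')}")
```

For a real file, the input array could be produced the same way as in the other examples on this page, for instance by opening the image with PIL, converting it to `'L'`, and resizing it to 32x32 before calling `np.array`.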

Wavelet Hash (wHash)

Applies Discrete Wavelet Transform for multi-resolution analysis.

Process: Scale → Grayscale → DWT → Coefficients → Hash
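Speed: Slower than aHash and dHash; cost depends on the wavelet and decomposition depth
Robustness: Good for scaling and compression; like the other hashes here, not reliable under rotation or heavy cropping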
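A simplified version of the wHash process can be sketched with a Haar wavelet and NumPy. This example makes several assumptions: the input is a square grayscale array whose side is `hash_size` times a power of two, only the approximation band is kept, and bits are set by comparing against the median. Because the Haar approximation band is a scaled block average, this variant behaves much like aHash with a median threshold; fuller implementations may process the coefficients further. `haar_step` and `whash_from_pixels` are names introduced for this illustration.

```python
import numpy as np

def haar_step(a: np.ndarray):
    """One level of the orthonormal 2D Haar transform on an even-sized array.

    Returns (approx, row_detail, col_detail, diag_detail), each half the size.
    """
    p00, p01 = a[0::2, 0::2], a[0::2, 1::2]   # top-left, top-right of each 2x2 block
    p10, p11 = a[1::2, 0::2], a[1::2, 1::2]   # bottom-left, bottom-right
    approx = (p00 + p01 + p10 + p11) / 2.0        # scaled block average
    row_detail = (p00 + p01 - p10 - p11) / 2.0    # top row minus bottom row
    col_detail = (p00 - p01 + p10 - p11) / 2.0    # left column minus right column
    diag_detail = (p00 - p01 - p10 + p11) / 2.0   # diagonal difference
    return approx, row_detail, col_detail, diag_detail

def whash_from_pixels(pixels, hash_size: int = 8) -> int:
    """Haar-based hash of a square grayscale array that was resized beforehand."""
    pixels = np.asarray(pixels, dtype=np.float64)
    n = pixels.shape[0]
    if pixels.shape != (n, n):
        raise ValueError("expected a square array")
    # Count how many halvings take the side length down to hash_size
    size, levels = n, 0
    while size > hash_size and size % 2 == 0:
        size //= 2
        levels += 1
    if size != hash_size:
        raise ValueError("side length must be hash_size times a power of two")
    approx = pixels
    for _ in range(levels):
        approx = haar_step(approx)[0]   # keep only the approximation band
    bits = (approx > np.median(approx)).flatten()
    hash_value = 0
    for i, bit in enumerate(bits):
        if bit:
            hash_value |= 1 << i
    return hash_value

# Random array standing in for a 32x32 grayscale image (an assumption for the demo)
rng = np.random.default_rng(1)
print(f"{whash_from_pixels(rng.random((32, 32))):016x}")
```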

Implementation Examples

Average Hash (aHash) Implementation

Python Average Hash Example
from PIL import Image
import numpy as np

def average_hash(image_path, hash_size=8):
    """
    Compute Average Hash for an image
    
    Args:
        image_path: Path to the image file
        hash_size: Size of the hash (default 8x8 = 64 bits)
    
    Returns:
        Binary hash as integer
    """
    # Open the file, convert to grayscale, and shrink to hash_size x hash_size.
    # The context manager closes the file handle once the pixels are loaded.
    with Image.open(image_path) as img:
        image = img.convert('L').resize((hash_size, hash_size), Image.LANCZOS)
    
    # Convert to numpy array
    pixels = np.array(image)
    
    # Calculate average pixel value
    avg = pixels.mean()
    
    # Build the hash: bit i is set when pixel i (row-major order)
    # is brighter than the average
    hash_value = 0
    for i, pixel in enumerate(pixels.flatten()):
        if pixel > avg:
            hash_value |= 1 << i
    
    return hash_value

def hamming_distance(hash1, hash2):
    """Calculate Hamming distance between two hashes"""
    return bin(hash1 ^ hash2).count('1')

def are_similar(image1_path, image2_path, threshold=5):
    """Check if two images are similar"""
    hash1 = average_hash(image1_path)
    hash2 = average_hash(image2_path)
    distance = hamming_distance(hash1, hash2)
    return distance <= threshold

# Usage example
similarity = are_similar('image1.jpg', 'image2.jpg', threshold=5)
print(f"Images are similar: {similarity}")

Difference Hash (dHash) Implementation

Python Difference Hash Example
from PIL import Image
import numpy as np

def difference_hash(image_path, hash_size=8):
    """
    Compute Difference Hash for an image
    
    Args:
        image_path: Path to the image file
        hash_size: Side length of the hash grid (default 8 gives 8x8 = 64 bits)
    
    Returns:
        Binary hash as integer
    """
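    # Open the file, convert to grayscale, and resize to (hash_size + 1) wide
    # by hash_size tall so each row yields hash_size horizontal differences.
    # The context manager closes the file handle once the pixels are loaded.
    with Image.open(image_path) as img:
        image = img.convert('L').resize((hash_size + 1, hash_size), Image.LANCZOS)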
    
    # Convert to numpy array
    pixels = np.array(image)
    
    # Build the hash: one bit per horizontally adjacent pixel pair, row by row.
    # A bit is set when the left pixel is brighter than its right neighbor.
    hash_value = 0
    bit_index = 0
    for row in pixels:
        for left, right in zip(row[:-1], row[1:]):
            if left > right:
                hash_value |= 1 << bit_index
            bit_index += 1
    
    return hash_value

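# Note: ImageHasher relies on average_hash() and hamming_distance() from the
# Average Hash example above; define or import them before using this class.
class ImageHasher:
    """Caches image hashes and finds similar images by Hamming distance"""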
    
    def __init__(self, algorithm='dhash', hash_size=8):
        self.algorithm = algorithm
        self.hash_size = hash_size
        self.hash_cache = {}
    
    def hash_image(self, image_path):
        """Hash an image using the selected algorithm"""
        if image_path in self.hash_cache:
            return self.hash_cache[image_path]
        
        if self.algorithm == 'ahash':
            hash_value = average_hash(image_path, self.hash_size)
        elif self.algorithm == 'dhash':
            hash_value = difference_hash(image_path, self.hash_size)
        else:
            raise ValueError(f"Unknown algorithm: {self.algorithm}")
        
        self.hash_cache[image_path] = hash_value
        return hash_value
    
    def find_similar_images(self, target_image, image_list, threshold=5):
        """Find similar images from a list"""
        target_hash = self.hash_image(target_image)
        similar_images = []
        
        for image_path in image_list:
            image_hash = self.hash_image(image_path)
            distance = hamming_distance(target_hash, image_hash)
            
            if distance <= threshold:
                similar_images.append({
                    'path': image_path,
                    'distance': distance,
                    'similarity': 1 - (distance / (self.hash_size ** 2))
                })
        
        # Sort by similarity (lowest distance first)
        similar_images.sort(key=lambda x: x['distance'])
        return similar_images

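# Usage example (the image paths are placeholders)
hasher = ImageHasher(algorithm='dhash', hash_size=8)
candidates = ['img1.jpg', 'img2.jpg', 'img3.jpg']
similar = hasher.find_similar_images('target.jpg', candidates, threshold=5)
for match in similar:
    print(f"{match['path']}: distance={match['distance']}, "
          f"similarity={match['similarity']:.2f}")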
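The search above compares hashes one at a time in a Python loop. For larger collections, one possible speed-up is to keep the hashes in a NumPy array and compute every Hamming distance in a single vectorized pass. The sketch below assumes 64-bit hashes (that is, `hash_size=8`) stored as `uint64`; `hamming_to_all` is a helper name introduced here, and the sample hash values are made up.

```python
import numpy as np

def hamming_to_all(target_hash: int, hashes: np.ndarray) -> np.ndarray:
    """Hamming distance from one 64-bit hash to every hash in a uint64 array."""
    xor = np.bitwise_xor(hashes, np.uint64(target_hash))
    # View each 64-bit value as 8 bytes, expand the bytes to bits, count the ones
    as_bytes = xor.view(np.uint8).reshape(-1, 8)
    return np.unpackbits(as_bytes, axis=1).sum(axis=1)

# Made-up hash values for illustration
hashes = np.array([0x0, 0xFF, 0xFFFFFFFFFFFFFFFF, 0x0F0F], dtype=np.uint64)
distances = hamming_to_all(0x0F, hashes)
matches = np.flatnonzero(distances <= 5)   # indices within the threshold

print(distances.tolist())   # [4, 4, 60, 4]
print(matches.tolist())     # [0, 1, 3]
```

This is still a linear scan over the collection; it only removes the per-item Python overhead.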

Real-World Perceptual Hashing Implementations

Google (Image Search)

Uses perceptual hashing for reverse image search and duplicate detection across billions of indexed images.

  • Custom pHash variants for web-scale image indexing
  • Real-time similar image suggestions
  • Robust to JPEG compression and resizing
  • Integration with machine learning for accuracy

Facebook (Content Moderation)

Implements perceptual hashing to detect and remove duplicate harmful content and spam images.

  • PhotoDNA (developed by Microsoft) and the open-source PDQ hash for harmful content detection
  • Real-time upload screening pipeline
  • Variant detection for modified prohibited images
  • 99.5% accuracy with sub-second processing

Adobe (Stock Photos)

Uses perceptual hashing for duplicate detection and similar image recommendations in their stock library.

  • Duplicate prevention in contributor uploads
  • Similar image clustering and recommendations
  • Multi-algorithm approach for accuracy
  • 200+ million image collection management

Apple (Photos App)

Leverages perceptual hashing for duplicate photo detection and organization in the Photos app.

  • On-device duplicate detection and merging
  • Similar photo grouping and suggestions
  • Live Photos and burst mode organization
  • Privacy-preserving local processing

Perceptual Hashing Best Practices

✅ Do

  • Choose hash size based on your similarity requirements
  • Test different algorithms with your specific image types
  • Cache computed hashes to avoid reprocessing
  • Use appropriate Hamming distance thresholds
  • Preprocess images consistently (grayscale, resize)
  • Validate results with human evaluation for tuning
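One way to pick a Hamming distance threshold and validate it against human judgment is to label a sample of image pairs by hand and measure precision and recall at several candidate thresholds. The sketch below assumes such labels exist; the sample data is hypothetical and `evaluate_thresholds` is a name introduced for this example.

```python
def evaluate_thresholds(labeled_distances, thresholds):
    """
    labeled_distances: iterable of (hamming_distance, is_true_match) pairs
    Returns {threshold: (precision, recall)}.
    """
    pairs = list(labeled_distances)
    total_matches = sum(1 for _, is_match in pairs if is_match)
    results = {}
    for t in thresholds:
        # Labels of all pairs the threshold would report as similar
        predicted = [is_match for distance, is_match in pairs if distance <= t]
        true_pos = sum(predicted)
        precision = true_pos / len(predicted) if predicted else 1.0
        recall = true_pos / total_matches if total_matches else 1.0
        results[t] = (precision, recall)
    return results

# Hypothetical hand-labeled pairs: (Hamming distance, reviewer says "same image")
labeled = [(0, True), (2, True), (4, True), (6, True),
           (5, False), (9, False), (20, False), (31, False)]

for t, (precision, recall) in evaluate_thresholds(labeled, [2, 5, 10]).items():
    print(f"threshold={t}: precision={precision:.2f}, recall={recall:.2f}")
```

A stricter threshold tends to raise precision and lower recall, so the right value depends on whether missed duplicates or false matches are more costly for the application.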

❌ Don't

  • Use very small hash sizes for large-scale applications
  • Assume one algorithm works for all image types
  • Ignore the trade-off between speed and accuracy
  • Use overly strict thresholds that miss valid matches
  • Process original high-resolution images unnecessarily
  • Forget to handle edge cases like solid color images
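The solid-color edge case is easy to reproduce. With the "brighter than average" rule used by aHash, a completely black image and a completely white image both produce an all-zero hash, so they would be reported as identical. The sketch below shows this on 8x8 arrays and adds a simple guard; the standard-deviation cutoff `min_std=1.0` is an arbitrary assumption that would need tuning, and both function names are introduced for this illustration.

```python
import numpy as np

def average_hash_bits(pixels: np.ndarray) -> np.ndarray:
    """aHash rule on an already-resized grayscale array: 1 where pixel > mean."""
    return (pixels > pixels.mean()).flatten()

def is_low_detail(pixels: np.ndarray, min_std: float = 1.0) -> bool:
    """True when the array is nearly uniform, so its hash carries little information."""
    return float(np.std(pixels)) < min_std

black = np.zeros((8, 8), dtype=np.uint8)
white = np.full((8, 8), 255, dtype=np.uint8)
gradient = np.arange(64, dtype=np.uint8).reshape(8, 8)

# No pixel is strictly above the mean in a solid image, so every bit is 0
print(average_hash_bits(black).any(), average_hash_bits(white).any())

for name, img in [("black", black), ("white", white), ("gradient", gradient)]:
    print(name, "low detail" if is_low_detail(img) else "ok to hash")
```

Images flagged this way could be skipped or compared by another signal, such as mean color, instead of by hash.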