Design ML System: Duplicate Detection

Problem Statement

What is Near-Duplicate Detection?

Near-duplicate detection identifies content that is semantically similar but not identical. Unlike exact matching, it uses ML to find variations, rephrases, or slightly modified versions of content. This is crucial for data quality in ML pipelines where duplicate examples can bias model training and reduce dataset value.

Your Task: Design an ML system for Scale AI that detects near-duplicate content in datasets. The system should handle both text (prompts/responses) and images, providing real-time duplicate detection when new items are added to datasets. Focus on the complete ML pipeline from ingestion to feedback learning.

Duration: 60 minutes

Difficulty: Advanced

Focus: ML System Design

Company: Scale AI

Q: What types of content are we detecting duplicates for?

A: Primarily text data (prompts/responses) and images. Text data includes customer datasets with prompts and responses for generative AI training. Images include vision labeling datasets.

Analysis: This affects our choice of embedding models and similarity metrics. We'll need separate pipelines for text and image content.

Q: What constitutes a 'near-duplicate'? How similar is too similar?

A: Near-duplicates are items that are semantically similar but not identical. For text: similar meaning/intent but different wording (e.g., 'How to cook pasta?' vs 'What's the best way to make pasta?'). For images: visually similar but potentially different angles, lighting, or minor variations.

Analysis: We need configurable similarity thresholds based on use case. Different domains may have different tolerance levels.
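A minimal sketch of what a configurable similarity threshold could look like, assuming embeddings are compared with cosine similarity. The domain names and threshold values in `THRESHOLDS` are hypothetical and only illustrate that different domains can carry different tolerances.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)

# Hypothetical per-domain thresholds; the values are illustrative, not tuned.
THRESHOLDS = {"default": 0.85, "legal_text": 0.95, "product_images": 0.80}

def is_near_duplicate(a, b, domain="default"):
    """Flag a pair when its similarity reaches the domain's threshold."""
    threshold = THRESHOLDS.get(domain, THRESHOLDS["default"])
    return cosine_similarity(a, b) >= threshold
```

A stricter domain simply raises its entry in the table, so the same pair can be a duplicate under one configuration and distinct under another.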

Q: What's the expected scale - how many items per dataset and how many datasets?

A: Individual datasets range from 1,000 to 1 million items. We handle hundreds of active datasets simultaneously. New items are added continuously throughout the day.

Analysis: This scale requires efficient indexing strategies and distributed processing. We'll need to partition by dataset and use approximate nearest neighbor search.
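To get a feel for the scale, here is a rough storage estimate for raw embeddings. It uses the 50M-vector figure that appears in the architecture diagram; the 768-dimension float32 embedding is an assumption, and index overhead (graph links, metadata) is not included.

```python
NUM_VECTORS = 50_000_000   # "50M vectors" from the architecture diagram
EMBEDDING_DIM = 768        # assumption: a common sentence-encoder width
BYTES_PER_FLOAT = 4        # float32

def raw_vector_bytes(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT):
    """Bytes needed to store the vectors alone, without index overhead."""
    return num_vectors * dim * bytes_per_value

total = raw_vector_bytes(NUM_VECTORS, EMBEDDING_DIM)
largest_dataset = raw_vector_bytes(1_000_000, EMBEDDING_DIM)

print(f"all vectors: {total / 1e9:.1f} GB")                  # all vectors: 153.6 GB
print(f"largest dataset: {largest_dataset / 1e9:.2f} GB")    # largest dataset: 3.07 GB
```

Under these assumptions a single 1M-item dataset fits comfortably in memory on one node, which is one argument for partitioning the index by dataset.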

Q: What's the performance requirement for duplicate detection?

A: Near-real-time detection: when a new item is added, the duplicate check should complete in under 1 second at the 95th percentile (p95). The candidate pool can be refreshed every few hours, and models are updated weekly.

Analysis: Real-time requirements suggest we need a serving layer with pre-computed embeddings and fast similarity search infrastructure.

Q: How do we handle the continuous learning aspect?

A: We receive a stream of human-verified positive and negative pairs. These should improve our similarity thresholds and potentially fine-tune our models over time.

Analysis: We need a feedback loop system that can adapt thresholds dynamically and potentially retrain models with new labeled data.

Q: What are the consequences of false positives vs false negatives?

A: False positives (flagging unique content as duplicate) waste human reviewer time, but they keep questionable items out of the dataset. False negatives (missing actual duplicates) reduce dataset quality and may disappoint customers.

Analysis: We should err on the side of false positives for data quality, but need configurable precision/recall trade-offs per customer.
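One way to make the precision/recall trade-off configurable is to pick the threshold that maximizes an F-beta score on the human-verified pairs. This is a sketch under assumptions: the per-customer `beta` knob and the candidate grid are illustrative, and `beta > 1` weights recall more heavily, which matches erring on the side of false positives.

```python
def precision_recall(pairs, threshold):
    """pairs: list of (similarity_score, is_duplicate) from human review."""
    tp = sum(1 for s, y in pairs if s >= threshold and y)
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

def f_beta(precision, recall, beta):
    """Weighted harmonic mean; beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def pick_threshold(pairs, beta=2.0, candidates=None):
    """Return the candidate threshold with the best F-beta (ties go to the higher one)."""
    pairs = list(pairs)
    if candidates is None:
        candidates = [i / 100 for i in range(50, 100)]  # 0.50 .. 0.99
    return max(candidates,
               key=lambda t: (f_beta(*precision_recall(pairs, t), beta), t))

reviewed = [(0.95, True), (0.90, True), (0.80, True), (0.75, False), (0.60, False)]
print(pick_threshold(reviewed))  # 0.8
```

A customer who prefers fewer reviewer interruptions could be given a `beta` below 1 instead, without changing any other part of the pipeline.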

Real-Time Duplicate Detection Architecture

┌───────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ API Gateway   │───>│ Embedding       │───>│ Vector Database │
│               │    │ Service         │    │ (HNSW Index)    │
│ Load Balancer │    │ Text/Image      │    │ 50M vectors     │
│ 50 req/sec    │    │ 100ms/200ms     │    │ <50ms search    │
└───────────────┘    └─────────────────┘    └─────────────────┘
        │                     │                      │
        v                     v                      v
┌───────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ Threshold     │    │ Feedback        │    │ Metadata Store  │
│ Manager       │    │ Collection      │    │ PostgreSQL      │
│ Configurable  │    │ Human Labels    │    │ Dataset Info    │
└───────────────┘    └─────────────────┘    └─────────────────┘

Data Flow: New content → Embedding generation → Vector similarity search → Threshold evaluation → Duplicate candidates
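The data flow can be sketched end to end with stand-ins for each component. Everything here is illustrative: `toy_embed` hashes character trigrams, so it captures lexical overlap rather than semantic similarity, and `search` is an exact brute-force scan standing in for the HNSW index. A real system would swap in a learned encoder and an approximate nearest neighbor library.

```python
import math
import zlib

DIM = 64  # toy dimensionality; real encoders are much wider

def toy_embed(text):
    """Stand-in for the embedding service: hashed character trigrams, L2-normalized."""
    vec = [0.0] * DIM
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        bucket = zlib.crc32(padded[i:i + 3].encode("utf-8")) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def search(index, query_vec, k):
    """Stand-in for the vector database: exact top-k by dot product."""
    scored = [(sum(q * v for q, v in zip(query_vec, vec)), item_id)
              for item_id, vec in index.items()]
    scored.sort(reverse=True)
    return scored[:k]

def detect(index, text, threshold=0.85, max_candidates=50):
    """New content -> embedding -> similarity search -> threshold -> candidates."""
    hits = search(index, toy_embed(text), max_candidates)
    return [(item_id, score) for score, item_id in hits if score >= threshold]

index = {
    "item_1": toy_embed("How to cook pasta?"),
    "item_2": toy_embed("Quarterly revenue report"),
}
candidates = detect(index, "How to cook pasta?")
```

Because the vectors are unit length, the dot product in `search` equals cosine similarity, so the threshold has the same meaning as elsewhere in the design.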

Data Processing Pipeline

1. Ingestion: Raw content validation

2. Embedding: Feature extraction

3. Search: Similarity matching

4. Decision: Threshold application

↓

Output: Similarity Report

Duplicate candidates with confidence scores, similar items preview, and actionable recommendations

High-Level Architecture

Real-Time Serving Layer

• API Gateway for duplicate detection requests
• Embedding service (text & image models)
• Vector similarity search engine
• Configurable threshold manager

Batch Processing Layer

• Periodic candidate pool updates
• Embedding recomputation pipeline
• Model training & evaluation
• Data quality metrics computation

Storage Layer

• Vector database (embeddings)
• Metadata store (PostgreSQL)
• Model registry & artifacts
• Feedback data store

Feedback & Learning

• Human feedback collection API
• Threshold adaptation engine
• Model fine-tuning pipeline
• Performance monitoring

Core APIs

POST /api/v1/detect-duplicates

{
  "dataset_id": "dataset_123",
  "content": {
    "type": "text|image",
    "data": "base64_content_or_text"
  },
  "threshold_config": {
    "similarity_threshold": 0.85,
    "max_candidates": 50
  }
}

Response:
{
  "is_duplicate": true,
  "confidence": 0.92,
  "similar_items": [
    {
      "item_id": "item_456",
      "similarity_score": 0.92,
      "content_preview": "..."
    }
  ]
}
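A hedged sketch of how a handler might validate this request and assemble the response shape above. The `DetectRequest` class and both function names are hypothetical, and reporting `confidence` as the top similarity score is an assumption (it matches the example, where both values are 0.92).

```python
from dataclasses import dataclass

@dataclass
class DetectRequest:
    dataset_id: str
    content_type: str
    data: str
    similarity_threshold: float = 0.85
    max_candidates: int = 50

def parse_detect_request(payload):
    """Validate the JSON body of POST /api/v1/detect-duplicates."""
    content = payload["content"]
    if content["type"] not in ("text", "image"):
        raise ValueError("content.type must be 'text' or 'image'")
    cfg = payload.get("threshold_config", {})
    req = DetectRequest(
        dataset_id=payload["dataset_id"],
        content_type=content["type"],
        data=content["data"],
        similarity_threshold=cfg.get("similarity_threshold", 0.85),
        max_candidates=cfg.get("max_candidates", 50),
    )
    if not 0.0 <= req.similarity_threshold <= 1.0:
        raise ValueError("similarity_threshold must be in [0, 1]")
    if req.max_candidates < 1:
        raise ValueError("max_candidates must be at least 1")
    return req

def build_response(req, scored_items):
    """scored_items: list of (item_id, similarity_score, content_preview)."""
    matches = sorted(
        (item for item in scored_items if item[1] >= req.similarity_threshold),
        key=lambda item: item[1],
        reverse=True,
    )[: req.max_candidates]
    return {
        "is_duplicate": bool(matches),
        # Assumption: confidence is the best match's similarity score.
        "confidence": matches[0][1] if matches else 0.0,
        "similar_items": [
            {"item_id": i, "similarity_score": s, "content_preview": p}
            for i, s, p in matches
        ],
    }
```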

POST /api/v1/feedback

{
  "dataset_id": "dataset_123",
  "feedback_type": "positive|negative",
  "item_pair": {
    "item_1": "item_123",
    "item_2": "item_456"
  },
  "similarity_judgment": "duplicate|not_duplicate",
  "reviewer_id": "reviewer_789"
}
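The feedback payload carries the label twice (`feedback_type` and `similarity_judgment`), so a consumer may want to check that the two agree before storing a training pair. The sketch below assumes that "positive" means "duplicate" and "negative" means "not_duplicate"; the source does not spell out that mapping, and the function name is hypothetical.

```python
def to_labeled_pair(feedback):
    """Turn a feedback payload into (dataset_id, item_a, item_b, is_duplicate)."""
    judgment = feedback["similarity_judgment"]
    if judgment not in ("duplicate", "not_duplicate"):
        raise ValueError("unknown similarity_judgment")
    # Assumption: positive pairs are duplicates, negative pairs are not.
    expected_type = "positive" if judgment == "duplicate" else "negative"
    if feedback["feedback_type"] != expected_type:
        raise ValueError("feedback_type and similarity_judgment disagree")
    item_a = feedback["item_pair"]["item_1"]
    item_b = feedback["item_pair"]["item_2"]
    if item_a == item_b:
        raise ValueError("a pair must contain two different items")
    # Sort the ids so (a, b) and (b, a) map to the same stored pair.
    first, second = sorted((item_a, item_b))
    return feedback["dataset_id"], first, second, judgment == "duplicate"
```

Storing pairs in a canonical order makes it straightforward to detect when two reviewers have judged the same pair.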

Model Deployment & Serving

Deployment Strategy

• Blue-Green Deployment: Zero-downtime model updates
• Canary Releases: Gradual rollout with monitoring
• Shadow Mode: Test new models without affecting users
• Feature Flags: Dynamic model switching
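A canary release needs a stable way to send a fixed share of traffic to the new model. One simple option, sketched here with hypothetical names, is to hash a request key into 100 buckets so that the same key always lands on the same model version.

```python
import zlib

def route_model(request_key, canary_percent):
    """Return 'canary' for roughly canary_percent of keys, else 'stable'."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # crc32 is deterministic across processes, unlike Python's built-in hash().
    bucket = zlib.crc32(request_key.encode("utf-8")) % 100
    return "canary" if bucket < canary_percent else "stable"
```

Keying on `dataset_id` rather than a per-request id would keep each dataset on a single model version, which avoids mixing similarity scores from two models inside one dataset.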

Serving Infrastructure

• Model Registry: Versioned model artifacts (MLflow)
• Inference Service: TorchServe/TensorFlow Serving
• Auto-scaling: Kubernetes Horizontal Pod Autoscaler (HPA) driven by request volume
• Load Balancing: Distribute requests across replicas

Monitoring & Observability

Model Performance

• Accuracy Metrics: Daily precision/recall tracking
• Distribution Drift: Input feature drift detection
• Prediction Drift: Output distribution changes
• Confidence Calibration: Similarity score reliability
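Prediction drift can be tracked by comparing the distribution of similarity scores between a baseline window and a recent window. The sketch below uses the Population Stability Index (PSI) as one possible drift statistic; the choice of PSI, the 10 equal-width bins, and the assumption that scores lie in [0, 1] are all illustrative.

```python
import math

def bin_fractions(scores, bins):
    """Share of scores falling in each equal-width bin over [0, 1]."""
    if not scores:
        raise ValueError("need at least one score")
    counts = [0] * bins
    for s in scores:
        clipped = min(max(s, 0.0), 1.0)
        counts[min(int(clipped * bins), bins - 1)] += 1
    return [c / len(scores) for c in counts]

def psi(baseline, recent, bins=10, eps=1e-6):
    """Population Stability Index: sum of (q - p) * ln(q / p) over bins."""
    p = bin_fractions(baseline, bins)
    q = bin_fractions(recent, bins)
    total = 0.0
    for pi, qi in zip(p, q):
        pi, qi = max(pi, eps), max(qi, eps)  # avoid log(0) for empty bins
        total += (qi - pi) * math.log(qi / pi)
    return total
```

Identical windows give a PSI of 0, and the value grows as the two score distributions move apart, so an alert threshold on it can be tuned from historical data.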

System Health

• Latency Monitoring: P95/P99 response times
• Error Rates: 4xx/5xx HTTP responses
• Resource Usage: CPU/GPU/memory utilization
• Dependency Health: Vector DB, model service status
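For latency monitoring, P95 and P99 can be computed from raw samples with the nearest-rank method, then compared against the 1-second target from the requirements. The function names are hypothetical; production systems usually compute these from histograms in the metrics backend rather than from raw lists.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]

def meets_slo(latencies_ms, p=95, budget_ms=1000):
    """True when the p-th percentile latency is within the budget."""
    return percentile(latencies_ms, p) <= budget_ms

samples = list(range(1, 101))   # 1 ms .. 100 ms
print(percentile(samples, 95))  # 95
print(percentile(samples, 99))  # 99
```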

Business Metrics

• Customer Feedback: Manual review rates
• Dataset Quality: Acceptance vs rejection rates
• Cost Efficiency: Cost per duplicate detection
• User Experience: Time to review completion

Continuous Learning & Improvement

Data Collection Loop

• Feedback Aggregation: Collect human judgments systematically
• Hard Example Mining: Focus on model failure cases
• Label Quality Control: Inter-annotator agreement validation
• Privacy Compliance: Anonymize sensitive content
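For label quality control, inter-annotator agreement on duplicate judgments can be measured with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. This is a small two-reviewer sketch; the function name is illustrative.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labeling the same pairs in the same order."""
    labels_a, labels_b = list(labels_a), list(labels_b)
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)

print(cohens_kappa([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
print(cohens_kappa([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.0
```

A kappa near 0 means reviewers agree no more often than chance would predict, which suggests the duplicate guidelines need tightening before the labels are used for retraining.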

Model Retraining Pipeline

• Trigger Conditions: Performance degradation, new data availability
• Automated Pipeline: Data preprocessing → training → validation
• Model Comparison: Champion vs challenger evaluation
• Safe Deployment: Automated rollback on performance drops

Incident Response & Recovery

Common Failure Modes

• Model Degradation: Accuracy drops below threshold
• Latency Spikes: Vector search timeouts
• Memory Leaks: Embedding service crashes
• Data Poisoning: Bad training data affects model

Recovery Procedures

• Automatic Rollback: Revert to previous model version
• Circuit Breaker: Fall back to rule-based detection
• Traffic Rerouting: Redirect to healthy instances
• Manual Override: Emergency threshold adjustments
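The circuit-breaker fallback can be sketched as a small wrapper that stops calling the ML path after repeated failures and serves a rule-based check instead. All names and defaults are hypothetical, and the rule used here (exact match after lowercasing and whitespace normalization) is only one possible fallback.

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, primary, fallback, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback(*args)  # open: skip the primary entirely
            # Half-open: allow one trial call; a single failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = primary(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback(*args)
        self.failures = 0
        return result

def rule_based_is_duplicate(text, known_texts):
    """Fallback rule: exact match after lowercasing and collapsing whitespace."""
    def normalize(s):
        return " ".join(s.lower().split())
    return normalize(text) in {normalize(t) for t in known_texts}
```

The fallback only catches trivial duplicates, so items checked while the circuit is open are good candidates for re-checking once the ML path recovers.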


🎯 Interview Practice Questions

Practice these follow-up questions to demonstrate deep understanding of ML duplicate detection systems in interviews.

1. Embedding Model Selection Strategy

"Your duplicate detection system needs to handle both short social media posts and long-form articles. How do you design an embedding strategy that captures semantic similarity across different content lengths while maintaining consistent similarity thresholds?"

2. Real-time Duplicate Prevention

"Design a system that prevents duplicate content submission in real-time during content creation. How do you handle partial matches, typing delays, and ensure <200ms response time while maintaining 99.9% accuracy for 10M+ content submissions per day?"

3. Multi-Modal Duplicate Detection

"Extend your system to detect duplicates across text, images, and videos. How do you design unified similarity scoring, handle cross-modal duplicates (same video with different titles), and balance computational costs across different media types?"

4. Threshold Optimization and Feedback Loop

"Your system has 85% precision but reviewers complain about false positives. Design an adaptive threshold system that learns from human feedback, handles concept drift, and automatically adjusts similarity thresholds per content category while maintaining audit logs."

5. Scalable Vector Search Architecture

"Your embedding database grows from 10M to 1B vectors. Design a sharding and indexing strategy for HNSW that maintains sub-second search latency, handles hot-partitioning, and supports incremental index updates without service interruption."

6. Cross-Language and Localization Challenges

"Design duplicate detection for a global platform supporting 50+ languages. How do you handle cross-language duplicates (same content in different languages), cultural context variations, and ensure consistent quality across different language embeddings?"
