Data Pipeline Architecture for ML

Design robust data pipelines that feed high-quality features to ML models in batch and real-time scenarios.

30 min read · Intermediate

Data pipelines are the foundation of ML systems: they extract, transform, and deliver high-quality features to models. Poor pipeline design is one of the most common causes of ML system failures in production.

Unlike traditional ETL, ML pipelines must handle feature engineering, data validation, versioning, and the unique challenge of maintaining training/serving consistency across batch and real-time systems.
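
One common way to maintain that consistency is to define each feature transformation once and call the same code from both the batch training job and the online serving path. The sketch below is illustrative; the function and field names are assumptions, not any specific library's API.

```python
# A minimal sketch of one way to keep training/serving features consistent:
# define each transformation once and reuse it in both paths.
import math

def transform(raw: dict) -> dict:
    """Single source of truth for feature logic."""
    return {
        "log_price": math.log1p(raw.get("price", 0.0)),
        "is_weekend": 1 if raw.get("day_of_week", 0) >= 5 else 0,
    }

# Batch path: applied when building training data.
training_rows = [transform(r) for r in [{"price": 10.0, "day_of_week": 6}]]

# Online path: the serving endpoint reuses the exact same function.
def features_for_request(request: dict) -> dict:
    return transform(request)
```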

⚡ Quick Decision

Choose Batch When:

  • Daily/hourly predictions acceptable
  • Large historical datasets
  • Cost optimization priority

Choose Streaming When:

  • <100ms inference latency needed
  • Real-time decision making
  • Event-driven business logic

Choose Hybrid When:

  • Complex feature engineering
  • Multiple latency requirements
  • Different model types

ML Data Pipeline Patterns

📦 Batch Processing Pipeline

Process large volumes of data in scheduled chunks

Use Cases: training data preparation, daily model scoring, feature engineering

Key Components

  • 🗄️ Data Sources: databases, files, APIs
  • ⏰ Scheduler: Airflow, Cron, Kubernetes Jobs
  • ⚙️ Processing Engine: Spark, Dask, BigQuery
  • 🏪 Feature Store: store processed features
  • 🎯 Model Training: scheduled retraining
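
As a rough illustration of how these components fit together, the sketch below wires extract, transform, and load steps into a daily Airflow DAG (assuming a recent Airflow 2.x with the TaskFlow API); the task bodies and storage paths are placeholders, not a working pipeline.

```python
# A minimal sketch of a daily batch feature pipeline as an Airflow DAG.
from datetime import datetime
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def daily_feature_pipeline():
    @task
    def extract() -> str:
        # Pull the latest raw events from source databases, files, or APIs.
        return "s3://raw/events/"  # illustrative location

    @task
    def transform(raw_path: str) -> str:
        # Feature engineering with the processing engine of your choice
        # (Spark, Dask, BigQuery, ...); here it is just a placeholder.
        return "s3://features/daily/"

    @task
    def load(feature_path: str) -> None:
        # Register the resulting feature table in the offline feature store.
        print(f"features written to {feature_path}")

    load(transform(extract()))


daily_feature_pipeline()
```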

✅ Advantages

  • High throughput
  • Cost effective
  • Simple to debug
  • Good for historical analysis

⚠️ Challenges

  • High latency
  • Not real-time
  • Resource scheduling complexity

Common Architecture Patterns

1. Lambda Architecture (batch + speed layer)
2. Data Lake → Feature Store → Model
3. ETL → Training → Batch Inference

Essential Data Quality Checks

Schema Validation (Critical)
Ensure data types, columns, and structure match expectations.
Example: Detect when a string field suddenly contains numbers.
Tools: Great Expectations, Deequ, custom validators

Distribution Drift (Critical)
Run statistical tests to detect changes in the data distribution.
Example: User age distribution shifts from 20-40 to 40-60.
Tools: Evidently AI, MLflow, custom KS tests

Data Freshness (High)
Verify data is recent enough to sustain model performance.
Example: The feature pipeline stops and the model silently uses week-old features.
Tools: Airflow SLAs, custom timestamp checks

Missing Values (High)
Track the percentage of null/missing values over time.
Example: An integration breaks and 50% of features become null.
Tools: Pandas profiling, custom null-rate monitoring

Value Ranges (Medium)
Detect outliers and impossible values.
Example: Age = -5, price = $999,999,999.
Tools: Statistical outlier detection, domain rules

Feature Correlation (Medium)
Monitor changes in correlations between features.
Example: Previously correlated features become independent.
Tools: Custom correlation monitoring, drift detection
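
A minimal sketch of how several of these checks might run together inside a pipeline step, using pandas and SciPy; the expected schema, thresholds, and column names are illustrative assumptions, not recommended defaults.

```python
# Schema validation, null-rate monitoring, range checks, and a KS test for
# distribution drift on a single batch of data.
import pandas as pd
from scipy.stats import ks_2samp

EXPECTED_SCHEMA = {"user_id": "int64", "age": "int64", "price": "float64"}

def validate_batch(df: pd.DataFrame, reference: pd.DataFrame) -> list[str]:
    issues: list[str] = []

    # 1. Schema validation: columns and dtypes match expectations.
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            issues.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            issues.append(f"{col}: expected {dtype}, got {df[col].dtype}")

    # 2. Missing values: alert if the null rate exceeds a threshold.
    null_rates = df.isna().mean()
    issues += [f"{c}: {r:.0%} null" for c, r in null_rates.items() if r > 0.05]

    # 3. Value ranges: catch impossible values with simple domain rules.
    if "age" in df and ((df["age"] < 0) | (df["age"] > 120)).any():
        issues.append("age outside [0, 120]")

    # 4. Distribution drift: two-sample KS test against a reference window.
    if "price" in df and "price" in reference:
        stat, p_value = ks_2samp(df["price"].dropna(), reference["price"].dropna())
        if p_value < 0.01:
            issues.append(f"price distribution drift (KS p={p_value:.4f})")

    return issues
```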

🏗️ Data Architecture Patterns

Medallion Architecture

Bronze (raw) → Silver (cleaned) → Gold (business-ready)

Best for: Data lakes with multiple consumers

Pros:
  • Clear data lineage
  • Incremental quality improvement
  • Multiple abstraction levels
Cons:
  • Storage overhead
  • Complexity for simple use cases
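
A minimal sketch of the Bronze → Silver → Gold flow using pandas; the file paths, cleaning rules, and aggregates are illustrative, and in practice each layer is typically a lakehouse table written by an engine such as Spark.

```python
import pandas as pd

# Bronze: raw data landed as-is, no transformations.
bronze = pd.read_json("raw_events.jsonl", lines=True)

# Silver: cleaned and conformed -- fix types, drop duplicates and bad rows.
silver = (
    bronze.drop_duplicates(subset="event_id")
          .assign(event_time=lambda d: pd.to_datetime(d["event_time"], errors="coerce"))
          .dropna(subset=["user_id", "event_time"])
)

# Gold: business-ready aggregates, e.g. per-user features for the model.
gold = (
    silver.groupby("user_id")
          .agg(purchase_count=("event_id", "count"),
               last_seen=("event_time", "max"))
          .reset_index()
)
gold.to_parquet("user_features.parquet", index=False)
```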

Event Sourcing

Store events rather than state; replay the event log to rebuild features.

Best for: Complex business logic, audit requirements

Pros:
  • Complete history
  • Replayable
  • Time travel debugging
Cons:
  • Storage growth
  • Complexity
  • Eventual consistency
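
A minimal sketch of event sourcing applied to feature building: store immutable events and rebuild per-user feature state by replaying them, optionally only up to a point in time. The event shape and aggregation are illustrative.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class Event:
    user_id: str
    kind: str      # e.g. "purchase" or "refund"
    amount: float
    ts: int        # epoch seconds; enables point-in-time replay

def replay(events: list[Event], as_of: int | None = None) -> dict[str, float]:
    """Rebuild per-user net spend by replaying the event log."""
    state: dict[str, float] = defaultdict(float)
    for e in sorted(events, key=lambda e: e.ts):
        if as_of is not None and e.ts > as_of:
            break  # stop early for "time travel" / point-in-time features
        state[e.user_id] += e.amount if e.kind == "purchase" else -e.amount
    return dict(state)
```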

CQRS (Command Query Responsibility Segregation)

Separate read and write models, each optimized for its own access pattern.

Best for: High-scale systems with different read/write patterns

Pros:
  • Optimized for use case
  • Independent scaling
  • Better performance
Cons:
  • Complexity
  • Data synchronization
  • Learning curve
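
A minimal sketch of the CQRS idea applied to features: writes append to a log on the command side, while reads are served from a separately maintained, denormalized view. The class and method names are illustrative, and the refresh step stands in for an asynchronous projection (stream job, CDC, etc.).

```python
class FeatureWriteModel:
    """Command side: accepts updates, optimized for ingestion."""
    def __init__(self) -> None:
        self.log: list[tuple[str, str, float]] = []

    def record(self, user_id: str, feature: str, value: float) -> None:
        self.log.append((user_id, feature, value))


class FeatureReadModel:
    """Query side: serves low-latency lookups from a materialized view."""
    def __init__(self) -> None:
        self.view: dict[str, dict[str, float]] = {}

    def refresh(self, write_model: FeatureWriteModel) -> None:
        # Stand-in for an asynchronous projection of the write-side log.
        for user_id, feature, value in write_model.log:
            self.view.setdefault(user_id, {})[feature] = value

    def get_features(self, user_id: str) -> dict[str, float]:
        return self.view.get(user_id, {})
```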

🎯 Pipeline Design Best Practices

✅ Do This

Version everything: Data schemas, transformations, and feature definitions
Idempotent operations: Rerunning the pipeline produces the same results (see the sketch after this list)
Lineage tracking: Know exactly where each feature comes from
Graceful degradation: System works with partial data
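
A minimal sketch of an idempotent batch write, as referenced above: output is keyed by the run's logical date and fully overwritten, so re-running the same date replaces rather than duplicates data. The paths and helper name are illustrative.

```python
import shutil
from pathlib import Path
import pandas as pd

def write_features(features: pd.DataFrame, run_date: str,
                   root: str = "features") -> Path:
    """Write one partition per logical date; safe to re-run."""
    partition = Path(root) / f"dt={run_date}"
    if partition.exists():
        shutil.rmtree(partition)  # drop any previous (possibly partial) output
    partition.mkdir(parents=True)
    features.to_parquet(partition / "part-000.parquet", index=False)
    return partition
```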

❌ Avoid This

Hard-coded values: Use configuration for thresholds and parameters
Silent failures: Every step should have monitoring and alerts
Tight coupling: Changes to one part shouldn't break everything
No backfill strategy: Plan for historical data reprocessing

📝 Data Pipeline Design Mastery Check


When should you use ELT instead of ETL for ML?