Apache Kafka Deep Dive

Master distributed event streaming with real-world patterns, architectural decisions, and performance optimization

45 min read · Advanced

What is Apache Kafka?

Apache Kafka is a distributed event streaming platform originally developed by LinkedIn, now maintained by the Apache Software Foundation. It's designed to handle high-throughput, real-time data feeds with fault tolerance and horizontal scalability.

  • 2M+ messages/sec capacity
  • <10 ms end-to-end latency
  • 99.9% availability
  • Petabyte-scale data volumes

Key Features & Use Cases

Core Capabilities

  • Distributed by Design: horizontal scaling across multiple brokers
  • Fault Tolerant: automatic failover and data replication
  • Persistent Storage: configurable retention and replay capability
  • High Throughput: millions of messages per second

Common Use Cases

  • Real-time data pipelines
  • Event-driven microservices
  • Stream processing and analytics
  • Log aggregation and monitoring
  • Change data capture (CDC)

Core Concepts

Topic

A category or feed name to which events are written

Key Features

  • Ordered within partitions
  • Immutable log
  • Configurable retention
  • Multi-producer/consumer

Example

user-events, order-updates, payment-notifications
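
Topics can be created with the kafka-topics.sh CLI or programmatically. Below is a minimal sketch using Kafka's Java AdminClient; the broker address, topic name, partition count, replication factor, and retention value are illustrative assumptions, not recommendations.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder broker address -- point this at your cluster
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions for consumer parallelism, 3 replicas for fault tolerance
            NewTopic topic = new NewTopic("user-events", 6, (short) 3);
            // Optional per-topic retention override: 7 days, in milliseconds
            topic.configs(Map.of("retention.ms", "604800000"));
            admin.createTopics(Set.of(topic)).all().get(); // blocks until complete
        }
    }
}
```

With 6 partitions, up to 6 consumers in one group can read the topic in parallel, and replication factor 3 keeps the data available if a broker fails.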

🧮 Kafka Cluster Calculator

Example output for one cluster configuration:

  • Throughput: 54,000 msg/sec
  • Storage/day: 8,899 GB
  • Fault tolerance: survives 1 broker failure
  • Parallelism: 6x (maximum consumer parallelism, one consumer per partition)
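
These figures follow from simple arithmetic. The sketch below reproduces them under one set of hypothetical inputs (54,000 msg/sec, 1 KB average messages, replication factor 2, 6 partitions); substitute your own numbers.

```java
// Back-of-the-envelope cluster sizing, mirroring the calculator above.
public class ClusterSizing {
    public static void main(String[] args) {
        long messagesPerSec = 54_000;  // hypothetical ingest rate
        int avgMessageBytes = 1_024;   // hypothetical average message size
        int replicationFactor = 2;     // copies kept of each partition
        int partitions = 6;            // partitions in the topic

        // Storage per day = rate * size * replication * seconds per day
        double gbPerDay = messagesPerSec * (double) avgMessageBytes
                * replicationFactor * 86_400 / (1024.0 * 1024 * 1024);

        // A topic with replication factor N tolerates N - 1 broker failures
        int survivableFailures = replicationFactor - 1;

        System.out.printf("Storage/day: %,.0f GB%n", gbPerDay);  // prints 8,899 GB
        System.out.printf("Survives %d broker failure(s)%n", survivableFailures);
        System.out.printf("Max consumer parallelism: %dx%n", partitions);
    }
}
```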

Event Streaming Patterns

Event Sourcing

Store every change as a sequence of events instead of only the current state

Implementation
topic: account-events, events: AccountCreated, MoneyDeposited, MoneyWithdrawn

✅ Advantages

  • Complete audit trail
  • Temporal queries
  • System replay capability
  • Natural fit for Kafka

⚠️ Challenges

  • Storage overhead
  • Query complexity
  • Snapshot overhead
  • Learning curve
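
As a concrete illustration of the account-events example above, here is a minimal producer sketch. The broker address and account ID are placeholders, and plain JSON strings stand in for a real schema (production systems typically use a schema registry with Avro or Protobuf); keying by account ID keeps each account's events ordered within one partition.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class AccountEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String accountId = "acct-42"; // hypothetical account
            // Append events in order, keyed by account ID so they all land
            // in the same partition and preserve their sequence.
            producer.send(new ProducerRecord<>("account-events", accountId,
                    "{\"type\":\"AccountCreated\"}"));
            producer.send(new ProducerRecord<>("account-events", accountId,
                    "{\"type\":\"MoneyDeposited\",\"amount\":100}"));
            producer.send(new ProducerRecord<>("account-events", accountId,
                    "{\"type\":\"MoneyWithdrawn\",\"amount\":40}"));
            producer.flush(); // ensure all events are written before exiting
        }
    }
}
```

Replaying the topic from the beginning rebuilds any account's state (here, a balance of 60), which is the replay capability listed among the advantages above.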

Performance & Best Practices

Producer Optimization

Batching

Group messages together to improve throughput

batch.size=16384, linger.ms=5

Compression

Reduce network bandwidth usage

compression.type=snappy

Partitioning

Distribute load evenly across partitions

default partitioner: partition = murmur2(key) % numPartitions
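
Putting the three settings together, a tuned producer configuration might look like the sketch below (Java client assumed; the values are the ones quoted above, intended as starting points to benchmark rather than universal defaults).

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class TunedProducerConfig {
    static KafkaProducer<String, String> build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384);   // bytes per batch
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5);        // wait up to 5 ms to fill a batch
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy"); // compress whole batches
        // Partitioning: with no custom partitioner, keyed records go to
        // murmur2(key) % numPartitions, so a well-distributed key spreads load.
        return new KafkaProducer<>(props);
    }
}
```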

Consumer Optimization

Fetch Size

Balance between latency and throughput

fetch.min.bytes=1, max.poll.records=500

Parallel Processing

Scale consumers to match partitions

consumers per group ≤ partitions (extra consumers sit idle)

Offset Management

Handle failures gracefully

enable.auto.commit=false
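
A consumer applying all three settings, with manual offset commits for failure handling, might look like this sketch (broker address, group ID, and topic name are placeholders):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");        // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1);      // return fetches immediately
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);   // cap records per poll
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false); // commit manually

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("order-updates"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // your handler; may throw on failure
                }
                // Commit only after the whole batch is processed, so a crash
                // re-delivers unprocessed records instead of losing them.
                consumer.commitSync();
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.printf("%s -> %s%n", record.key(), record.value());
    }
}
```

Committing after processing gives at-least-once delivery: records may be redelivered after a crash, so handlers should be idempotent.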

🏢 Real-world Implementations

LinkedIn: Activity Streams

• 1+ trillion messages per day
• 2000+ Kafka clusters
• User activity, feed updates, recommendations
• 7 million messages/second peak
Pattern: Event sourcing for user activity, CQRS for feed generation

Netflix: Stream Processing

• 700+ billion events per day
• Real-time recommendations
• A/B testing data pipeline
• 8+ petabytes of data daily
Pattern: Stream processing for real-time analytics and recommendations

Uber: Real-time Updates

• Ride tracking and ETAs
• Driver location updates
• Surge pricing calculations
• 100+ microservices coordination
Pattern: Event-driven microservices with real-time location streaming

Shopify: E-commerce Events

• Order processing pipeline
• Inventory updates
• Payment processing events
• 1+ million merchants supported
Pattern: Saga pattern for distributed transactions, CDC for data sync

💡 Key Takeaways

  • Start Simple: Begin with basic pub/sub, evolve to complex patterns as needed
  • Plan Partitions: More partitions = more parallelism, but also more complexity
  • Monitor Everything: Lag, throughput, error rates are critical metrics
  • Schema Evolution: Plan for backwards-compatible message formats
  • Operational Excellence: Kafka requires dedicated expertise to run at scale

📝 Apache Kafka Mastery Quiz


What is the primary unit of parallelism in Apache Kafka?