Design a Chat System (WhatsApp/Slack)

Build a real-time messaging system supporting millions of users with low latency, message ordering, and reliable delivery across web and mobile platforms.

System Requirements

Functional Requirements

  • Send and receive messages in real time
  • 1:1 chats and group chats (up to 256 members)
  • Online presence and typing indicators
  • Message history and search
  • Media sharing (photos, videos, files)
  • Push notifications when offline
  • Message delivery receipts (sent, delivered, read)

Non-Functional Requirements

  • 100M daily active users
  • 1B messages sent per day
  • End-to-end delivery latency < 100ms
  • 99.99% availability
  • Message ordering within conversations
  • Eventual consistency acceptable
  • Scale to support 10M concurrent connections

Capacity Estimation

Traffic Patterns

Peak vs Average
  • Average: 30% load
  • Peak: 100% load

Message Types
  • Text: 80%
  • Media: 20%

User Activity
  • Lurkers: 70% of users
  • Active: 30% of users

Core Metrics

  • Daily Active Users: 100M (peak: 50M concurrent)
  • Messages per Day: 1B (average: 10 per user)
  • Message Size: 100 bytes (text message average)
  • Storage Growth: 36TB/year (including media and metadata)

Infrastructure Requirements

  • WebSocket Servers: 5K servers for 50M concurrent connections
  • Message Throughput: 50K messages/second at peak
  • Database Load: write-heavy (80% writes, 20% reads)
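
The figures above can be cross-checked with a few lines of arithmetic. This is a rough sketch, not a sizing tool: it assumes the "30% load" average is a fraction of peak load and reuses the 10K-connections-per-server figure quoted for the WebSocket Gateway later in this document. The variable names are illustrative.

```python
# Back-of-envelope capacity check using the figures quoted in this section.
SECONDS_PER_DAY = 24 * 60 * 60

daily_messages = 1_000_000_000   # 1B messages per day
avg_text_bytes = 100             # average text message size
peak_connections = 50_000_000    # 50M concurrent connections at peak
conns_per_server = 10_000        # per-gateway capacity (from the architecture section)
avg_to_peak_ratio = 0.30         # assumption: average load is 30% of peak

avg_msgs_per_sec = daily_messages / SECONDS_PER_DAY
peak_msgs_per_sec = avg_msgs_per_sec / avg_to_peak_ratio
text_bytes_per_year = daily_messages * avg_text_bytes * 365
gateway_servers = peak_connections // conns_per_server

print(f"average throughput: {avg_msgs_per_sec:,.0f} msg/s")
print(f"peak throughput:    {peak_msgs_per_sec:,.0f} msg/s")
print(f"text storage:       {text_bytes_per_year / 1e12:.1f} TB/year")
print(f"gateway servers:    {gateway_servers:,}")
```

Under these assumptions the peak lands a little under the 50K messages/second quoted above, which leaves some headroom. Note that 1B messages × 100 bytes × 365 days already comes to about 36.5 TB, so the 36TB/year figure appears to match text payloads alone; media and metadata would add to it.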

Message Delivery Strategies

1. Push Model (WebSockets)

Server pushes messages to clients via persistent connections

Pros

  • Real-time delivery
  • Low latency
  • Bidirectional communication

Cons

  • Connection management complexity
  • Resource intensive
  • Firewall issues

Best For

Online users with active sessions

2. Pull Model (HTTP Polling)

Clients periodically poll server for new messages

Pros

  • Simple implementation
  • Firewall friendly
  • Stateless servers

Cons

  • Higher latency
  • Wasted requests
  • Battery drain on mobile

Best For

Fallback for WebSocket failures

3. Hybrid Model

WebSockets for active users, push notifications for offline

Pros

  • Best of both worlds
  • Optimized battery usage
  • Reliable delivery

Cons

  • Implementation complexity
  • Multiple delivery paths

Best For

Production chat systems (WhatsApp, Slack)
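
The hybrid model comes down to a per-recipient routing decision. The sketch below illustrates that decision with in-memory stand-ins: plain lists play the role of WebSocket connections and of the APNs/FCM request queue. The class and field names are hypothetical and do not refer to any real library.

```python
from dataclasses import dataclass, field


@dataclass
class HybridDeliverer:
    """Chooses a delivery path per recipient (all names are hypothetical)."""
    online: dict = field(default_factory=dict)      # user_id -> list standing in for a socket
    push_queue: list = field(default_factory=list)  # stand-in for APNs/FCM requests
    inbox: dict = field(default_factory=dict)       # user_id -> messages awaiting sync

    def deliver(self, user_id: str, message: str) -> str:
        socket = self.online.get(user_id)
        if socket is not None:
            socket.append(message)  # push over the persistent connection
            return "websocket"
        # Offline: keep the message for later sync and wake the device.
        self.inbox.setdefault(user_id, []).append(message)
        self.push_queue.append((user_id, message))
        return "push_notification"

    def connect(self, user_id: str) -> list:
        """Register a connection and flush anything stored while offline."""
        socket = self.online.setdefault(user_id, [])
        socket.extend(self.inbox.pop(user_id, []))
        return socket


deliverer = HybridDeliverer()
print(deliverer.deliver("bob", "hello"))   # bob is offline at this point
bob_socket = deliverer.connect("bob")      # reconnect flushes the stored message
print(deliverer.deliver("bob", "again"))   # now delivered over the socket
```

Storing the message before sending the notification matters here: the push notification only wakes the client, and the client still fetches the actual content when it reconnects.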

System Architecture

Mobile/Web Clients ↔ CDN ↔ Load Balancer
WebSocket Gateway Cluster (Connection Management)
↓ ↑ (Redis Pub/Sub)
Message Router (Kafka Partitioned by user_id)
Chat Service Cluster ↔ Presence Service
Database Cluster (Cassandra) ↔ Cache (Redis)
Push Notification Service → APNs/FCM

WebSocket Gateway

Node.js, Socket.io, Redis pub/sub
Purpose: Maintain persistent connections with clients
Scale: 10K connections per server

Message Router

Apache Kafka, partitioned by user_id
Purpose: Route messages between users and services
Scale: 1M messages/second throughput
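
Partitioning by user_id means every message with the same key lands on the same partition, which is what preserves per-key ordering. The sketch below shows the idea with a stable hash. It is not Kafka's partitioner (Kafka's default hashes the key bytes with murmur2); MD5 is used here only as a stand-in from the standard library, and the partition count is a made-up value.

```python
import hashlib

NUM_PARTITIONS = 64  # hypothetical partition count


def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (MD5 as a stand-in)."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# The same key always maps to the same partition, across processes and restarts.
print(partition_for("user-42") == partition_for("user-42"))
```

Python's built-in `hash()` would not work for this purpose, because string hashes are randomized per process by default.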

Chat Service

Java/Go, stateless, auto-scaling
Purpose: Business logic for message processing
Scale: Horizontally scalable microservices

Presence Service

Redis, Bloom filters, heartbeat system
Purpose: Track user online status and typing indicators
Scale: 100M user status updates/day

Scaling Challenges & Solutions

1. Connection Management

Problem: 10M+ concurrent WebSocket connections across servers
Solution: Connection pooling with sticky sessions and load balancing
Implementation: HAProxy with consistent hashing, connection draining
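
Consistent hashing keeps most clients on the same gateway when a server is added or removed. The following is a minimal, self-contained hash ring with virtual nodes, meant only to illustrate the idea; it is not how HAProxy implements it, and the server names and vnode count are assumptions.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash derived from SHA-256."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent hash ring with virtual nodes (illustrative only)."""

    def __init__(self, servers, vnodes: int = 100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, server) points
        for server in servers:
            self.add(server)

    def add(self, server: str) -> None:
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{server}#{i}"), server))

    def remove(self, server: str) -> None:
        self._ring = [(h, s) for h, s in self._ring if s != server]

    def lookup(self, client_id: str) -> str:
        if not self._ring:
            raise LookupError("no gateway servers registered")
        # First ring point at or after the client's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (_hash(client_id), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["gw-1", "gw-2", "gw-3"])
print(ring.lookup("client-123"))
```

When a server is removed, only the clients that were mapped to it move elsewhere; everyone else keeps the same gateway, which is the property that makes connection draining manageable.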

2. Message Ordering

Problem: Ensure messages appear in sent order within conversations
Solution: Conversation-level sequence numbers with vector clocks
Implementation: Monotonic counters per chat, conflict resolution
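
A per-conversation monotonic counter can be sketched as follows. This covers only the sequence-number part of the solution, not vector clocks or conflict resolution, and it keeps the counters in process memory. A real deployment would need a shared store with an atomic increment (Redis `INCR` is one possibility, stated here as an assumption). The class and function names are hypothetical.

```python
import itertools
import threading
from collections import defaultdict


class ConversationSequencer:
    """Hands out increasing sequence numbers, one counter per conversation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(lambda: itertools.count(1))

    def next_seq(self, conversation_id: str) -> int:
        with self._lock:  # serialize increments across threads
            return next(self._counters[conversation_id])


def order_for_display(messages):
    """Clients sort by sequence number, not by wall-clock timestamp."""
    return sorted(messages, key=lambda m: m["sequence_number"])


sequencer = ConversationSequencer()
first = {"content": "hi", "sequence_number": sequencer.next_seq("chat-1")}
second = {"content": "there", "sequence_number": sequencer.next_seq("chat-1")}
# Even if the second message arrives first, display order is restored.
print([m["content"] for m in order_for_display([second, first])])
```

Because ordering depends on the counter rather than on `created_at`, clock skew between servers does not reorder messages within a conversation.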

3. Group Message Fanout

Problem: Efficiently deliver messages to large group members
Solution: Tree-based fanout with intelligent batching
Implementation: Message queue fanout, async delivery with backpressure
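
Batching and backpressure can be illustrated with a bounded queue: recipients are split into batches, and batches that do not fit are handed back to the caller instead of blocking. This sketch leaves out the tree-based part of the design, and the batch size and queue capacity are arbitrary example values.

```python
import queue


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fan_out(message, member_ids, delivery_queue, batch_size=100):
    """Enqueue one delivery task per batch; return the batches that did not fit."""
    deferred = []
    for batch in batched(member_ids, batch_size):
        try:
            delivery_queue.put_nowait((message, batch))
        except queue.Full:
            deferred.append(batch)  # backpressure: caller retries these later
    return deferred


delivery_queue = queue.Queue(maxsize=2)           # deliberately small capacity
members = [f"user-{i}" for i in range(250)]
leftover = fan_out("hello group", members, delivery_queue)
print(delivery_queue.qsize(), len(leftover))
```

Returning the deferred batches makes the overload visible to the caller, which can then retry with a delay or shed load rather than letting the queue grow without bound.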

4. Presence at Scale

Problem: Track online status for 100M users in real-time
Solution: Distributed presence service with bloom filters
Implementation: Redis clusters, periodic heartbeats, smart aggregation
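
The heartbeat part of this design can be sketched as a last-seen timestamp with a time-to-live: a user counts as online only while heartbeats keep arriving. The version below is an in-memory illustration with an injectable clock. With Redis the same effect is commonly achieved by writing a key with an expiry on each heartbeat, which is an assumption about the deployment rather than something this document specifies. Bloom filters and aggregation are not covered.

```python
import time


class PresenceTracker:
    """Users are online while their last heartbeat is newer than the TTL."""

    def __init__(self, ttl_seconds: float = 30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_seen = {}  # user_id -> time of last heartbeat

    def heartbeat(self, user_id: str) -> None:
        self._last_seen[user_id] = self._clock()

    def is_online(self, user_id: str) -> bool:
        seen = self._last_seen.get(user_id)
        return seen is not None and self._clock() - seen < self._ttl

    def sweep(self) -> list:
        """Drop expired entries and return the users who just went offline."""
        now = self._clock()
        expired = [u for u, t in self._last_seen.items() if now - t >= self._ttl]
        for user_id in expired:
            del self._last_seen[user_id]
        return expired


tracker = PresenceTracker(ttl_seconds=30.0)
tracker.heartbeat("alice")
print(tracker.is_online("alice"), tracker.is_online("bob"))
```

Only the transitions returned by `sweep` (and first heartbeats) need to be published to interested contacts, which avoids broadcasting every heartbeat.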

Database Design

Message Schema

messages table:
  • message_id (UUID, PK)
  • conversation_id (UUID, indexed)
  • sender_id (UUID, indexed)
  • content (text/media_url)
  • message_type (text/image/file)
  • created_at (timestamp)
  • sequence_number (bigint)
  • status (sent/delivered/read)

Conversation Schema

conversations table:
  • conversation_id (UUID, PK)
  • type (direct/group)
  • created_at (timestamp)
  • updated_at (timestamp)
  • last_message_id (UUID)

participants table:
  • conversation_id (UUID)
  • user_id (UUID)
  • joined_at (timestamp)
  • last_read_message_id (UUID)

Partitioning Strategy

By Conversation ID

• Keep all messages together
• Efficient range queries
• Natural message ordering

By Time Window

• Archive old messages
• Hot/cold data separation
• Cost optimization
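
The two strategies above are often combined into a composite partition key: the conversation ID keeps a chat's messages together, and a time bucket bounds partition size and separates hot from cold data. The sketch below assumes monthly buckets and timezone-aware timestamps; both the bucket size and the function name are illustrative choices, not something this document prescribes.

```python
from datetime import datetime, timezone


def partition_key(conversation_id: str, created_at: datetime) -> tuple:
    """Composite key: conversation plus a monthly bucket (bucket size is an assumption)."""
    if created_at.tzinfo is None:
        raise ValueError("created_at must be timezone-aware")
    ts = created_at.astimezone(timezone.utc)  # bucket boundaries are in UTC
    return (conversation_id, f"{ts.year:04d}-{ts.month:02d}")


key = partition_key("chat-1", datetime(2024, 3, 15, 12, 0, tzinfo=timezone.utc))
print(key)
```

Reading recent history then touches only the current bucket, while older buckets can be moved to cheaper storage as a unit.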

Replication

• RF=3 for availability
• Cross-datacenter async
• Eventual consistency

Practice Questions

1. How would you handle message ordering in a distributed system? What happens when servers have different clocks?

2. Design a presence system that can handle 100M users. How do you avoid broadcasting every status change?

3. How would you implement end-to-end encryption while maintaining search functionality?

4. Design group chat for 10,000 members. How do you handle message fanout without overwhelming the system?

5. How would you implement message search across billions of messages? Design the indexing strategy.