Design a Chat System (WhatsApp/Slack)
Build a real-time messaging system supporting millions of users with low latency, message ordering, and reliable delivery across web and mobile platforms.
System Requirements
Functional Requirements
- Send and receive messages in real time
- 1:1 chats and group chats (up to 256 members)
- Online presence and typing indicators
- Message history and search
- Media sharing (photos, videos, files)
- Push notifications when offline
- Message delivery receipts (sent, delivered, read)
Non-Functional Requirements
- 100M daily active users
- 1B messages sent per day
- End-to-end delivery latency < 100ms
- 99.99% availability
- Message ordering within conversations
- Eventual consistency acceptable
- Scale to support 10M concurrent connections
Capacity Estimation
Traffic Patterns
Core Metrics
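As a rough sketch, the average load can be derived directly from the non-functional requirements. The peak multiplier and average message size below are illustrative assumptions, not figures from the requirements, and should be replaced with measured values.

```python
# Back-of-envelope load estimate from the stated targets.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

DAILY_ACTIVE_USERS = 100_000_000     # from the requirements
MESSAGES_PER_DAY = 1_000_000_000     # from the requirements
CONCURRENT_CONNECTIONS = 10_000_000  # from the requirements

# Assumptions (hypothetical; adjust to observed traffic):
PEAK_TO_AVERAGE = 3       # assumed ratio of peak to average throughput
AVG_MESSAGE_BYTES = 200   # assumed average payload plus metadata

avg_messages_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY
peak_messages_per_second = avg_messages_per_second * PEAK_TO_AVERAGE
messages_per_user_per_day = MESSAGES_PER_DAY / DAILY_ACTIVE_USERS
daily_storage_gb = MESSAGES_PER_DAY * AVG_MESSAGE_BYTES / 1e9

print(f"average throughput: {avg_messages_per_second:,.0f} msg/s")
print(f"assumed peak:       {peak_messages_per_second:,.0f} msg/s")
print(f"per user per day:   {messages_per_user_per_day:.0f} messages")
print(f"text storage/day:   {daily_storage_gb:,.0f} GB (under the size assumption)")
```

With the stated targets this works out to about 11,574 messages per second on average and 10 messages per active user per day; the peak and storage figures depend entirely on the two assumed constants.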
Infrastructure Requirements
Message Delivery Strategies
Push Model (WebSockets)
Server pushes messages to clients via persistent connections
Pros
- Real-time delivery
- Low latency
- Bidirectional communication
Cons
- Connection management complexity
- Resource intensive
- Firewall issues
Best For
Pull Model (HTTP Polling)
Clients periodically poll server for new messages
Pros
- Simple implementation
- Firewall friendly
- Stateless servers
Cons
- Higher latency
- Wasted requests
- Battery drain on mobile
Best For
Hybrid Model
WebSockets for active users, push notifications for offline users
Pros
- Best of both worlds
- Optimized battery usage
- Reliable delivery
Cons
- Implementation complexity
- Multiple delivery paths
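The hybrid routing decision can be sketched as follows. This is a minimal in-memory illustration, not a production design: `HybridDispatcher`, its method names, and the use of plain lists to stand in for WebSocket connections and a push-notification queue are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class HybridDispatcher:
    """Routes each message to a live connection or to a push queue."""

    # user_id -> list standing in for that user's open WebSocket (hypothetical)
    live_connections: dict[str, list[str]] = field(default_factory=dict)
    # (user_id, payload) pairs waiting to be handed to a mobile push provider
    push_queue: list[tuple[str, str]] = field(default_factory=list)

    def connect(self, user_id: str) -> None:
        """Register an open connection for the user."""
        self.live_connections[user_id] = []

    def disconnect(self, user_id: str) -> None:
        """Forget the user's connection; later messages go to push."""
        self.live_connections.pop(user_id, None)

    def deliver(self, user_id: str, payload: str) -> str:
        """Deliver over the live connection if present, otherwise enqueue a push."""
        socket = self.live_connections.get(user_id)
        if socket is not None:
            socket.append(payload)  # stands in for a WebSocket send
            return "websocket"
        self.push_queue.append((user_id, payload))
        return "push"


dispatcher = HybridDispatcher()
dispatcher.connect("alice")
print(dispatcher.deliver("alice", "hello"))  # alice is connected
print(dispatcher.deliver("bob", "hello"))    # bob has no connection
```

The two return values make the "multiple delivery paths" cost visible: every caller has to account for both outcomes, including retries and deduplication when a user reconnects while a push is still pending.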
Best For
System Architecture
WebSocket Gateway
Node.js, Socket.io, Redis pub/sub
Message Router
Apache Kafka, partitioned by user_id
Chat Service
Java/Go, stateless, auto-scaling
Presence Service
Redis, Bloom filters, heartbeat system
Scaling Challenges & Solutions
Connection Management
Message Ordering
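One common way to get ordering within a conversation without relying on synchronized server clocks is a per-conversation sequence number, which is what the `sequence_number` column in the message schema suggests. The sketch below assumes a single allocator per conversation and a client-side buffer that holds back out-of-order arrivals; `SequenceAllocator` and `ReorderBuffer` are hypothetical names, and a real allocator would need durable, single-writer storage.

```python
from collections import defaultdict


class SequenceAllocator:
    """Hands out a monotonically increasing number per conversation."""

    def __init__(self) -> None:
        self._last = defaultdict(int)  # conversation_id -> last number issued

    def allocate(self, conversation_id: str) -> int:
        self._last[conversation_id] += 1
        return self._last[conversation_id]


class ReorderBuffer:
    """Client-side buffer that releases messages strictly in sequence order."""

    def __init__(self) -> None:
        self._expected = 1                   # next sequence number to release
        self._pending: dict[int, str] = {}   # early arrivals, keyed by number

    def receive(self, sequence_number: int, content: str) -> list[str]:
        """Return the messages that became deliverable, in order."""
        if sequence_number < self._expected:
            return []  # already released: treat as a duplicate
        self._pending[sequence_number] = content
        ready = []
        while self._expected in self._pending:
            ready.append(self._pending.pop(self._expected))
            self._expected += 1
        return ready


allocator = SequenceAllocator()
first = allocator.allocate("conv-1")
second = allocator.allocate("conv-1")

buffer = ReorderBuffer()
print(buffer.receive(second, "world"))  # held back until the gap is filled
print(buffer.receive(first, "hello"))   # releases both, in order
```

Because ordering comes from the counter rather than from `created_at`, clock skew between servers affects only the displayed timestamp, not the order in which messages appear.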
Group Message Fanout
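For groups capped at 256 members, fanout on write (copying a reference to the message into each recipient's inbox at send time) is one plausible approach. The sketch below is an in-memory illustration under that assumption; `GroupFanout` and its methods are hypothetical, and much larger groups would more likely call for fanout on read instead.

```python
from collections import defaultdict, deque

MAX_GROUP_SIZE = 256  # group limit from the functional requirements


class GroupFanout:
    """Fanout on write: push each message ID into every recipient's inbox."""

    def __init__(self) -> None:
        self.members: dict[str, set[str]] = defaultdict(set)
        self.inboxes: dict[str, deque[str]] = defaultdict(deque)

    def add_member(self, conversation_id: str, user_id: str) -> None:
        group = self.members[conversation_id]
        if user_id not in group and len(group) >= MAX_GROUP_SIZE:
            raise ValueError("group is full")
        group.add(user_id)

    def fan_out(self, conversation_id: str, sender_id: str, message_id: str) -> int:
        """Enqueue message_id for every member except the sender."""
        recipients = self.members[conversation_id] - {sender_id}
        for user_id in recipients:
            # Store only the ID; the message body lives once in the messages table.
            self.inboxes[user_id].append(message_id)
        return len(recipients)


fanout = GroupFanout()
for user in ("alice", "bob", "carol"):
    fanout.add_member("team-chat", user)
delivered = fanout.fan_out("team-chat", "alice", "msg-1")
print(delivered, list(fanout.inboxes["bob"]))
```

Storing only message IDs in the inboxes keeps the write amplification proportional to group size times a small constant, rather than group size times message size.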
Presence at Scale
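A heartbeat-based presence check can be sketched as a last-seen timestamp with a time-to-live, queried on demand for a user's contacts instead of broadcasting every status change. The class below is a hypothetical in-memory stand-in; in the architecture listed above this state would presumably live in Redis, for example as keys with an expiry. The 30-second TTL is an assumed value.

```python
import time
from typing import Callable, Iterable


class PresenceTracker:
    """Tracks presence from heartbeats; a user is online if seen within the TTL."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds   # assumed value; tune against heartbeat interval
        self._clock = clock       # injectable so tests can control time
        self._last_seen: dict[str, float] = {}

    def heartbeat(self, user_id: str) -> None:
        """Record that the user's client checked in just now."""
        self._last_seen[user_id] = self._clock()

    def is_online(self, user_id: str) -> bool:
        last = self._last_seen.get(user_id)
        return last is not None and self._clock() - last < self._ttl

    def online_among(self, user_ids: Iterable[str]) -> set[str]:
        """Pull-style query: which of these contacts are currently online?"""
        return {user_id for user_id in user_ids if self.is_online(user_id)}


tracker = PresenceTracker()
tracker.heartbeat("alice")
print(tracker.online_among(["alice", "bob"]))
```

Querying presence only for the contacts a client is actually displaying avoids a broadcast on every status change, at the cost of presence being stale by up to one TTL.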
Database Design
Message Schema
messages table:
- message_id (UUID, PK)
- conversation_id (UUID, indexed)
- sender_id (UUID, indexed)
- content (text/media_url)
- message_type (text/image/file)
- created_at (timestamp)
- sequence_number (bigint)
- status (sent/delivered/read)
Conversation Schema
conversations table:
- conversation_id (UUID, PK)
- type (direct/group)
- created_at (timestamp)
- updated_at (timestamp)
- last_message_id (UUID)
participants table:
- conversation_id (UUID)
- user_id (UUID)
- joined_at (timestamp)
- last_read_message_id (UUID)
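The three tables above can be sketched as runnable DDL. This example uses SQLite through Python's standard library purely for illustration; a production deployment would more likely use a distributed store. UUIDs and timestamps are stored as text, and the composite primary key on `participants` and the uniqueness constraint on `(conversation_id, sequence_number)` are assumptions that the listing above does not state.

```python
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE conversations (
    conversation_id TEXT PRIMARY KEY,
    type            TEXT NOT NULL CHECK (type IN ('direct', 'group')),
    created_at      TEXT NOT NULL,
    updated_at      TEXT NOT NULL,
    last_message_id TEXT
);
CREATE TABLE participants (
    conversation_id      TEXT NOT NULL REFERENCES conversations (conversation_id),
    user_id              TEXT NOT NULL,
    joined_at            TEXT NOT NULL,
    last_read_message_id TEXT,
    PRIMARY KEY (conversation_id, user_id)          -- assumption
);
CREATE TABLE messages (
    message_id      TEXT PRIMARY KEY,
    conversation_id TEXT NOT NULL REFERENCES conversations (conversation_id),
    sender_id       TEXT NOT NULL,
    content         TEXT,
    message_type    TEXT NOT NULL CHECK (message_type IN ('text', 'image', 'file')),
    created_at      TEXT NOT NULL,
    sequence_number INTEGER NOT NULL,
    status          TEXT NOT NULL CHECK (status IN ('sent', 'delivered', 'read')),
    UNIQUE (conversation_id, sequence_number)       -- assumption; also indexes conversation_id
);
CREATE INDEX idx_messages_sender ON messages (sender_id);
"""

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)

now = "2024-01-01T00:00:00Z"  # sample timestamp for the demo rows
conversation_id = str(uuid.uuid4())
db.execute("INSERT INTO conversations VALUES (?, 'direct', ?, ?, NULL)",
           (conversation_id, now, now))

# Insert out of order to show that reads sort by sequence_number.
for sequence_number, text in [(2, "second"), (1, "first")]:
    db.execute(
        "INSERT INTO messages VALUES (?, ?, ?, ?, 'text', ?, ?, 'sent')",
        (str(uuid.uuid4()), conversation_id, "user-a", text, now, sequence_number),
    )

history = [row[0] for row in db.execute(
    "SELECT content FROM messages WHERE conversation_id = ? ORDER BY sequence_number",
    (conversation_id,),
)]
print(history)
```

Reading history by `conversation_id` ordered by `sequence_number` is the access pattern the indexes are meant to serve.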
Partitioning Strategy
By Conversation ID
By Time Window
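The two partitioning dimensions can be combined into a single key: a shard chosen by hashing the conversation ID, plus a time bucket derived from the message timestamp. The sketch below is one way to do this; the shard count of 64, the monthly bucket size, and the function names are all hypothetical choices.

```python
import hashlib
from datetime import datetime, timezone

NUM_SHARDS = 64  # hypothetical shard count


def shard_for(conversation_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a conversation to a shard with a hash that is stable across processes.

    Python's built-in hash() is randomized per process for strings, so a
    cryptographic digest is used here to keep the mapping deterministic.
    """
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def time_bucket(created_at: datetime) -> str:
    """Return a monthly bucket label; expects a timezone-aware datetime."""
    return created_at.astimezone(timezone.utc).strftime("%Y-%m")


def partition_key(conversation_id: str, created_at: datetime) -> tuple[int, str]:
    """Combine both dimensions: all messages of one conversation in one month
    land in the same partition."""
    return shard_for(conversation_id), time_bucket(created_at)


sent_at = datetime(2024, 3, 15, 12, 0, tzinfo=timezone.utc)
print(partition_key("conv-123", sent_at))
```

Keeping a conversation on one shard means history reads touch a single node, while the time bucket lets old partitions be archived or moved to cheaper storage. Simple modulo hashing reshuffles most keys when the shard count changes, so consistent hashing is a common alternative if shards are added often.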
Replication
Practice Questions
How would you handle message ordering in a distributed system? What happens when servers have different clocks?
Design a presence system that can handle 100M users. How do you avoid broadcasting every status change?
How would you implement end-to-end encryption while maintaining search functionality?
Design group chat for 10,000 members. How do you handle message fanout without overwhelming the system?
How would you implement message search across billions of messages? Design the indexing strategy.