Design a Distributed Cache System

Build a high-performance, fault-tolerant distributed cache that can handle millions of operations per second with consistent hashing and automatic failover.

System Requirements

Functional Requirements

  • In-memory key-value storage with TTL support
  • GET, SET, DELETE, INCREMENT operations
  • Support for data structures (strings, lists, sets, hashes)
  • Atomic operations and transactions
  • Pub/sub messaging capabilities
  • Lua scripting support
  • Backup and restore functionality
  • Multi-tenant access with authentication
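
As a rough sketch of the core command surface, the snippet below implements GET, SET with optional TTL, DELETE, and INCREMENT for a single node; the class and method names are illustrative rather than any particular product's API, and data structures, transactions, and pub/sub are omitted.

```python
import time
from typing import Optional

class CacheNode:
    """Minimal in-memory key-value store with TTL support (illustrative only)."""

    def __init__(self):
        self._data = {}      # key -> value
        self._expiry = {}    # key -> absolute expiration timestamp

    def _expired(self, key: str) -> bool:
        # Lazily drop a key whose TTL has elapsed.
        exp = self._expiry.get(key)
        if exp is not None and time.monotonic() >= exp:
            self._data.pop(key, None)
            self._expiry.pop(key, None)
            return True
        return False

    def set(self, key: str, value, ttl: Optional[float] = None) -> None:
        self._data[key] = value
        if ttl is not None:
            self._expiry[key] = time.monotonic() + ttl
        else:
            self._expiry.pop(key, None)

    def get(self, key: str):
        if self._expired(key):
            return None
        return self._data.get(key)

    def delete(self, key: str) -> bool:
        self._expiry.pop(key, None)
        return self._data.pop(key, None) is not None

    def increment(self, key: str, delta: int = 1) -> int:
        # Like a Redis-style INCR, this errors if the value is not an integer.
        if self._expired(key):
            pass
        value = int(self._data.get(key, 0)) + delta
        self._data[key] = value
        return value
```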

Non-Functional Requirements

  • Sub-millisecond read/write latency
  • 1M+ operations per second per node
  • 99.9% availability with automatic failover
  • Horizontal scaling to 1000+ nodes
  • Memory optimization with configurable eviction
  • Cross-region replication support
  • Zero-downtime node addition/removal
  • Consistent performance under high load

Data Partitioning Strategies

Consistent Hashing

Ring-based hashing with virtual nodes for balanced distribution; a minimal ring implementation is sketched after the pros and cons below.

Lookup complexity: O(log N)
Best for: Large clusters with frequent scaling
Pros:
  • Minimal reshuffling on scale
  • Hot spot mitigation
  • Fault tolerance
Cons:
  • Complex implementation
  • Potential load imbalance
  • Metadata overhead
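
A minimal consistent-hash ring with virtual nodes might look like the following sketch; the hash function, virtual-node count, and node names are illustrative choices, and a production ring would also track replica ownership and data migration.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes; lookups are O(log N) via binary search."""

    def __init__(self, vnodes_per_node: int = 100):
        self.vnodes = vnodes_per_node
        self._ring = []          # sorted list of virtual-node hashes
        self._owners = {}        # virtual-node hash -> physical node

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            h = self._hash(f"{node}#{i}")
            bisect.insort(self._ring, h)
            self._owners[h] = node

    def remove_node(self, node: str) -> None:
        # Only keys owned by this node's virtual nodes move to new owners.
        for i in range(self.vnodes):
            h = self._hash(f"{node}#{i}")
            self._ring.remove(h)
            del self._owners[h]

    def get_node(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        h = self._hash(key)
        idx = bisect.bisect(self._ring, h) % len(self._ring)
        return self._owners[self._ring[idx]]

# Example: node names are made up for illustration.
ring = ConsistentHashRing()
for n in ("cache-1", "cache-2", "cache-3"):
    ring.add_node(n)
print(ring.get_node("user:42"))
```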

Hash Slots (Redis Cluster)

16384 fixed slots distributed across nodes, with each key hashed to a slot; a slot-mapping sketch follows the pros and cons below.

Lookup complexity: O(1)
Best for: Redis-compatible distributed caching
Pros:
  • Simple slot assignment
  • Easy rebalancing
  • Predictable behavior
Cons:
  • Fixed slot count
  • Manual rebalancing
  • Cross-slot operation limits
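
The slot-mapping idea can be sketched as follows; Redis Cluster itself uses CRC16 with hash-tag handling, while this illustration substitutes CRC32 from the standard library and a hypothetical slot-to-node table.

```python
import zlib

SLOT_COUNT = 16384  # fixed number of hash slots, as in Redis Cluster

def key_slot(key: str) -> int:
    """Map a key to one of 16384 slots.

    Redis Cluster computes CRC16(key) mod 16384 and honors '{...}' hash tags so
    that related keys land in the same slot; CRC32 stands in here for brevity.
    """
    # Hash-tag handling: only the non-empty substring between the first '{'
    # and the following '}' is hashed, if one exists.
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end > start + 1:
            key = key[start + 1:end]
    return zlib.crc32(key.encode()) % SLOT_COUNT

# Slot-to-node assignment is a plain table; rebalancing migrates whole slots.
slot_owner = {s: f"node-{s % 3}" for s in range(SLOT_COUNT)}
print(key_slot("user:{42}:profile"), key_slot("user:{42}:sessions"))  # same slot
```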

Range Partitioning

Partition data by contiguous key ranges across nodes; a routing sketch follows the pros and cons below.

Lookup complexity: O(log N)
Best for: Ordered data with range queries
Pros:
  • Simple range queries
  • Sequential access
  • Easy implementation
Cons:
  • Hot spotting risk
  • Uneven distribution
  • Complex rebalancing
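
A simple range router keeps sorted upper-bound boundaries and answers both point and range lookups; the boundaries and node names below are made up for illustration.

```python
import bisect

class RangePartitioner:
    """Route keys to nodes by sorted, exclusive upper-bound key ranges."""

    def __init__(self, boundaries, nodes):
        # boundaries[i] is the exclusive upper bound of nodes[i]'s range;
        # the last node owns everything at or above the final boundary.
        assert len(nodes) == len(boundaries) + 1
        self.boundaries = boundaries
        self.nodes = nodes

    def get_node(self, key: str) -> str:
        return self.nodes[bisect.bisect_right(self.boundaries, key)]

    def get_range(self, start: str, end: str):
        # Range scans touch only the nodes whose ranges overlap [start, end].
        lo = bisect.bisect_right(self.boundaries, start)
        hi = bisect.bisect_right(self.boundaries, end)
        return self.nodes[lo:hi + 1]

parts = RangePartitioner(boundaries=["g", "p"], nodes=["node-a", "node-b", "node-c"])
print(parts.get_node("alice"), parts.get_node("zara"))   # node-a node-c
print(parts.get_range("h", "q"))                         # ['node-b', 'node-c']
```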

Replication & Consistency Models

Master-Slave Async

  • Consistency: Eventually consistent
  • Write latency: < 1ms
  • Availability: 99.9%
  • Durability: At risk during failures
  • Best for: High-performance caching

Master-Slave Sync

  • Consistency: Strong consistency
  • Write latency: 2-5ms
  • Availability: 99.5%
  • Durability: Guaranteed
  • Best for: Critical data storage

Multi-Master

  • Consistency: Eventual, with concurrent writes requiring conflict resolution
  • Write latency: < 1ms
  • Availability: 99.99%
  • Durability: High, at the cost of added complexity
  • Best for: Global distribution
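
To make the latency/durability trade-off concrete, here is a hedged sketch of a master's write path supporting both modes; the class names and the apply() replica interface are assumptions for illustration, not a real product's API.

```python
import queue
import threading

class Replica:
    def __init__(self):
        self.store = {}

    def apply(self, key, value):
        self.store[key] = value

class ReplicatingMaster:
    """Write path sketch for synchronous vs. asynchronous replication."""

    def __init__(self, replicas, mode: str = "async"):
        self.store = {}
        self.replicas = replicas
        self.mode = mode
        self._log = queue.Queue()
        threading.Thread(target=self._ship_async, daemon=True).start()

    def set(self, key, value) -> None:
        self.store[key] = value
        if self.mode == "sync":
            # Strong consistency: acknowledge only after every replica applies.
            for r in self.replicas:
                r.apply(key, value)
        else:
            # Eventual consistency: enqueue and return immediately; writes not
            # yet shipped are lost if the master fails.
            self._log.put((key, value))

    def _ship_async(self):
        while True:
            key, value = self._log.get()
            for r in self.replicas:
                r.apply(key, value)
```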

System Architecture Components

Client Layer

  • Smart client routing
  • Connection pooling
  • Automatic failover
  • Load balancing
  • Circuit breaker pattern
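
A smart client combining partition-aware routing, per-connection circuit breakers, and replica failover might be sketched as follows; the router and connection objects are assumed interfaces (for example the hash ring from the partitioning section), not a concrete client library.

```python
import random
import time

class CircuitBreaker:
    """Opens after repeated failures and allows a retry after a cool-down."""

    def __init__(self, threshold: int = 5, reset_after: float = 30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow one attempt once the cool-down has elapsed.
        return time.monotonic() - self.opened_at >= self.reset_after

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()

class SmartClient:
    """Routes each key to its owner's replica set, skipping open circuits."""

    def __init__(self, router, replica_sets):
        self.router = router              # assumed object exposing get_node(key)
        self.replica_sets = replica_sets  # node name -> list of connection objects
        self.breakers = {}                # connection -> CircuitBreaker

    def get(self, key: str):
        owner = self.router.get_node(key)
        candidates = list(self.replica_sets[owner])
        random.shuffle(candidates)        # naive read load balancing across replicas
        for conn in candidates:
            breaker = self.breakers.setdefault(conn, CircuitBreaker())
            if not breaker.allow():
                continue                  # circuit open: fail over to the next replica
            try:
                value = conn.get(key)
                breaker.record(ok=True)
                return value
            except ConnectionError:
                breaker.record(ok=False)
        raise ConnectionError(f"no healthy replica for key {key!r}")
```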

Cluster Manager

  • Node discovery
  • Health monitoring
  • Cluster state management
  • Automatic failover
  • Rebalancing coordination
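
One way to sketch heartbeat-based health monitoring with replica promotion is shown below; the timeout, shard bookkeeping, and promotion rule are simplified assumptions, and real failover protocols also compare replication offsets and require quorum agreement before promoting.

```python
import time

class ClusterManager:
    """Heartbeat-based failure detection with replica promotion (sketch)."""

    def __init__(self, failure_timeout: float = 5.0):
        self.failure_timeout = failure_timeout
        self.last_heartbeat = {}     # node -> timestamp of last heartbeat
        self.masters = {}            # shard id -> master node
        self.replicas = {}           # shard id -> list of replica nodes

    def register_shard(self, shard: int, master: str, replicas) -> None:
        self.masters[shard] = master
        self.replicas[shard] = list(replicas)

    def heartbeat(self, node: str) -> None:
        self.last_heartbeat[node] = time.monotonic()

    def check_health(self) -> None:
        now = time.monotonic()
        for shard, master in list(self.masters.items()):
            if now - self.last_heartbeat.get(master, 0.0) > self.failure_timeout:
                self._failover(shard)

    def _failover(self, shard: int) -> None:
        # Promote the most recently seen replica; a real system would pick the
        # replica with the highest replication offset to avoid losing writes.
        candidates = self.replicas.get(shard, [])
        if not candidates:
            return
        new_master = max(candidates, key=lambda n: self.last_heartbeat.get(n, 0.0))
        candidates.remove(new_master)
        candidates.append(self.masters[shard])   # old master rejoins as a replica
        self.masters[shard] = new_master
```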

Cache Nodes

  • In-memory storage
  • Eviction policies
  • Persistence options
  • Replication handling
  • Command processing
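
As one example of an eviction policy, the sketch below bounds a node by entry count and evicts the least recently used key; a production node would bound by memory bytes and typically combine LRU or LFU with TTL expiry.

```python
from collections import OrderedDict

class LRUCache:
    """Bounded cache that evicts the least recently used entry when full."""

    def __init__(self, max_entries: int = 1024):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def set(self, key, value) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)   # evict the least recently used entry
```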

Monitoring

  • Performance metrics
  • Memory usage tracking
  • Hit/miss ratios
  • Latency monitoring
  • Alert management
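
A minimal in-process collector for hit/miss ratio and latency percentiles could look like this; real deployments export histograms to an external monitoring system rather than keeping raw samples in memory.

```python
import statistics

class CacheMetrics:
    """Tracks hit ratio and request latency percentiles in memory (sketch)."""

    def __init__(self):
        self.hits = 0
        self.misses = 0
        self.latencies_ms = []   # real systems use histograms, not raw samples

    def observe(self, hit: bool, latency_ms: float) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1
        self.latencies_ms.append(latency_ms)

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def p99_latency_ms(self) -> float:
        if len(self.latencies_ms) < 2:
            return self.latencies_ms[0] if self.latencies_ms else 0.0
        return statistics.quantiles(self.latencies_ms, n=100)[98]
```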

Persistence Layer

  • Snapshot creation
  • Write-ahead logging
  • Backup scheduling
  • Point-in-time recovery
  • Cross-region replication
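
The snapshot-plus-WAL idea can be sketched as follows; the file formats, paths, and fsync-per-write policy are simplifying assumptions, and real engines batch syncs and compact the log in the background.

```python
import json
import os

class PersistentStore:
    """Write-ahead log plus periodic snapshot (single-process sketch)."""

    def __init__(self, wal_path: str = "cache.wal", snap_path: str = "cache.snap"):
        self.wal_path = wal_path
        self.snap_path = snap_path
        self.data = {}
        self._recover()
        self._wal = open(self.wal_path, "a", encoding="utf-8")

    def set(self, key, value) -> None:
        # Log first, then apply: replaying the WAL on top of the latest
        # snapshot reproduces the in-memory state after a crash.
        self._wal.write(json.dumps({"op": "set", "key": key, "value": value}) + "\n")
        self._wal.flush()
        os.fsync(self._wal.fileno())
        self.data[key] = value

    def snapshot(self) -> None:
        # Point-in-time copy; the WAL can then be truncated.
        with open(self.snap_path, "w", encoding="utf-8") as f:
            json.dump(self.data, f)
        open(self.wal_path, "w").close()

    def _recover(self) -> None:
        if os.path.exists(self.snap_path):
            with open(self.snap_path, encoding="utf-8") as f:
                self.data = json.load(f)
        if os.path.exists(self.wal_path):
            with open(self.wal_path, encoding="utf-8") as f:
                for line in f:
                    entry = json.loads(line)
                    if entry["op"] == "set":
                        self.data[entry["key"]] = entry["value"]
```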

Security

  • Authentication (ACL)
  • TLS encryption
  • Network security
  • Audit logging
  • Access control

Capacity Estimation

Cache Performance & Distribution

  • Operation types: 80% reads, 20% writes
  • Cache hit ratio: 85% hits, 15% misses
  • Data distribution: 20% hot data, 80% cold data

Performance Metrics

  • Operations per second: 1M+ per node
  • Read latency (P99): < 1ms for in-memory access
  • Write latency (P99): < 2ms with replication
  • Memory efficiency: 85% after compression
  • Availability: 99.9% with automatic failover

Infrastructure Requirements

  • Memory per node: 64GB RAM + 20% overhead
  • Network: 10Gbps for replication
  • Cluster size: 50-1000 nodes typical
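
A back-of-the-envelope sizing calculation using the per-node figures above might look like this; the aggregate throughput target, dataset size, and replication factor are hypothetical inputs chosen only to show the arithmetic.

```python
# Hypothetical workload, used only to illustrate the sizing arithmetic.
target_ops_per_sec = 20_000_000   # assumed aggregate load
dataset_gb = 10_000               # assumed hot dataset size (10 TB)
replication_factor = 2            # one replica per master (assumption)

# Per-node figures from the section above.
node_ops_capacity = 1_000_000     # 1M+ ops/s per node
node_memory_gb = 64               # 64GB data per node; machines sized with +20% overhead

nodes_for_throughput = -(-target_ops_per_sec // node_ops_capacity)            # ceiling division
nodes_for_memory = -(-(dataset_gb * replication_factor) // node_memory_gb)    # ceiling division

print(f"Throughput requires ~{nodes_for_throughput} nodes")
print(f"Memory requires ~{nodes_for_memory} nodes")
print(f"Provision the larger of the two: {max(nodes_for_throughput, nodes_for_memory)} nodes")
```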

Practice Questions

1. How would you handle node failures and implement automatic failover without data loss in a distributed cache?

2. Design a consistent hashing algorithm that minimizes data movement during cluster rebalancing operations.

3. How would you prevent and mitigate hot keys that could overwhelm individual cache nodes?

4. Compare write-through, write-behind, and write-around caching strategies. When would you use each?

5. Design an eviction policy system that balances LRU, LFU, and TTL-based eviction for optimal memory usage.