SSTables
Master the sorted string table data structure used in LSM-tree storage engines
What are SSTables?
Sorted String Tables (SSTables) are immutable, sorted files that store key-value pairs. They form the foundation of many modern NoSQL databases and storage engines.
Key Properties
- Immutable: Never modified after creation
- Sorted: Keys are stored in lexicographic order
- Compressed: Uses block-level compression
- Indexed: Sparse index for efficient lookups
- Self-contained: Includes metadata and bloom filters
Advantages
- Excellent sequential write performance
- High compression ratios
- Efficient range queries
- Simple crash recovery
- Predictable performance characteristics
SSTable Internal Structure
📁 File Layout (in file order, start to end)
1. Data Blocks
Compressed blocks containing the actual key-value pairs in sorted order
- Each block contains multiple records
- Block size typically 4KB-64KB
- Compression applied per block
- Checksum for integrity verification
2. Meta Blocks
Optional blocks for metadata like bloom filters and statistics
- Bloom filter for existence checks
- Column family metadata
- Statistics and histograms
- Custom application metadata
3. Index Block
Sparse index mapping keys to data block offsets
- One entry per data block
- Contains last key of each block
- Enables binary search for range queries
- Typically fits in memory
4. Footer
Fixed-size footer containing essential metadata
- Offset to index block
- Offset to meta blocks
- File format version
- Magic number for validation
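The four sections above can be sketched as a minimal file writer. This is an illustrative toy, not any real SSTable format: the record encoding, index entry layout, and `MAGIC` constant are all assumptions, and real implementations add meta blocks, per-block checksums, and restart points.

```python
import struct
import zlib

MAGIC = 0xDB4775AA  # hypothetical magic number for footer validation
BLOCK_SIZE = 4096   # 4 KB data blocks (low end of the typical 4KB-64KB range)

def write_sstable(path, sorted_items):
    """Write (key, value) byte pairs, already sorted by key, as a minimal
    SSTable: compressed data blocks, then a sparse index, then a footer."""
    index = []  # one (last_key, block_offset, block_length) entry per block
    with open(path, "wb") as f:
        block, last_key, size = [], None, 0

        def flush_block():
            nonlocal block
            if not block:
                return
            compressed = zlib.compress(b"".join(block))  # compression per block
            offset = f.tell()
            f.write(compressed)
            index.append((last_key, offset, len(compressed)))
            block = []

        for key, value in sorted_items:
            rec = struct.pack("<II", len(key), len(value)) + key + value
            if size + len(rec) > BLOCK_SIZE and block:
                flush_block()
                size = 0
            block.append(rec)
            size += len(rec)
            last_key = key
        flush_block()

        # index block: one entry per data block, keyed by the block's last key
        index_offset = f.tell()
        for last, off, length in index:
            f.write(struct.pack("<III", len(last), off, length) + last)

        # fixed-size footer: index offset, index length, magic number
        f.write(struct.pack("<QQI", index_offset,
                            f.tell() - index_offset, MAGIC))
```

A reader would open the file, read the fixed-size footer from the end, validate the magic number, load the index block into memory, and only then touch data blocks on demand.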
SSTable Read Operations
🔍 Point Lookups
Step 1: Bloom Filter Check
Quick check if key might exist (eliminates ~99% of negative lookups)
Step 2: Index Search
Binary search in sparse index to find relevant data block
Step 3: Block Read
Read and decompress the target data block
Step 4: Block Search
Binary search within block to find exact key
Time Complexity: O(log B + log(R/B)), where B = number of blocks and R = total records
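The four lookup steps map directly onto code. A minimal in-memory sketch, assuming blocks are zlib-compressed JSON arrays of sorted `[key, value]` pairs and the sparse index is a list of each block's last key (both are simplifications of the on-disk layout):

```python
import bisect
import json
import zlib

def sstable_get(key, bloom_might_contain, index_keys, compressed_blocks):
    # Step 1: bloom filter check -- a definite "no" skips all further work
    if not bloom_might_contain(key):
        return None
    # Step 2: binary search the sparse index (each entry = a block's last key)
    i = bisect.bisect_left(index_keys, key)
    if i == len(index_keys):
        return None  # key is greater than every key in the file
    # Step 3: read and decompress the single candidate block
    records = json.loads(zlib.decompress(compressed_blocks[i]))
    # Step 4: binary search within the block for the exact key
    keys = [k for k, _ in records]
    j = bisect.bisect_left(keys, key)
    if j < len(keys) and keys[j] == key:
        return records[j][1]
    return None
```

Only one block is decompressed per lookup, which is why block size is a direct knob on point-read latency.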
📊 Range Queries
Step 1: Find Start Block
Use index to locate first block containing start key
Step 2: Sequential Scan
Read consecutive blocks until end key is reached
Step 3: Filter Results
Apply additional predicates and collect matching records
Optimization:
Block-level filtering using metadata can skip entire blocks
Performance: Excellent for range scans
Minimal seeks, sequential I/O
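The scan steps above can be sketched as a generator over the same in-memory layout used for point lookups (zlib-compressed JSON blocks plus a last-key sparse index, both simplifications):

```python
import bisect
import json
import zlib

def sstable_range(start, end, index_keys, compressed_blocks):
    """Yield (key, value) pairs with start <= key < end, reading
    consecutive blocks in order (minimal seeks, sequential I/O)."""
    # Step 1: index lookup finds the first block that can contain `start`
    i = bisect.bisect_left(index_keys, start)
    # Step 2: sequential scan over consecutive blocks
    for block in compressed_blocks[i:]:
        for k, v in json.loads(zlib.decompress(block)):
            if k >= end:
                return          # Step 3: stop once past the end key
            if k >= start:
                yield k, v
```

Because keys are sorted both across and within blocks, the scan never revisits a block and can stop at the first key past `end`.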
LSM Trees and Write Path
🌳 Log-Structured Merge Trees
📝 MemTable (L0)
In-memory buffer
Sorted structure (red-black tree or skip list)
Fast writes
Volatile
→
💾 SSTables (L1+)
Immutable files
Sorted on disk
Compressed
Persistent
📥 Write Process
1. Writes go to WAL (Write-Ahead Log) for durability
2. Data inserted into in-memory MemTable
3. When MemTable is full, flush to L0 SSTable
4. Background compaction merges SSTables across levels
5. Each level has size limits, triggering cascading compactions
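The write process above, sketched with stand-ins: a plain file for the WAL, a Python dict for the MemTable (a real engine uses a skip list or red-black tree), and in-memory sorted lists for flushed L0 runs. Compaction (steps 4-5) is left out:

```python
import json

class MiniLSM:
    """Sketch of the LSM write path: WAL first, then MemTable, flush when full."""

    def __init__(self, wal_path, memtable_limit=4):
        self.wal = open(wal_path, "a")
        self.memtable = {}       # stand-in for a skip list / red-black tree
        self.limit = memtable_limit
        self.sstables = []       # flushed, immutable sorted runs (L0)

    def put(self, key, value):
        # 1. Append to the write-ahead log for durability
        self.wal.write(json.dumps([key, value]) + "\n")
        self.wal.flush()
        # 2. Insert into the in-memory MemTable
        self.memtable[key] = value
        # 3. When the MemTable is full, flush it as a sorted L0 run
        if len(self.memtable) >= self.limit:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}   # 4./5. background compaction would merge runs
```

Note that the flush writes keys in sorted order even though they arrived out of order; that sorting step is what makes every on-disk write sequential.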
Compaction Strategies
🎯 Size-Tiered Compaction
Merge SSTables of similar sizes when threshold is reached
Pros:
- Good for write-heavy workloads
- Lower write amplification
- Simpler implementation
Cons:
- Higher space amplification
- Temporary space spikes during compaction
- Less predictable read performance
Used by: Cassandra (default), ScyllaDB
📊 Leveled Compaction
Maintain fixed levels with non-overlapping key ranges
Pros:
- Better space utilization
- Predictable read performance
- Lower space amplification
Cons:
- Higher write amplification
- More complex implementation
- Can hurt write-heavy workloads
Used by: LevelDB, RocksDB, Cassandra (option)
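Both strategies reduce to the same core operation: a k-way merge of sorted runs in which the newest version of each key wins and tombstones (delete markers) are dropped. A minimal sketch using `heapq.merge`, assuming `runs` is ordered newest-first:

```python
import heapq

TOMBSTONE = None  # deletion marker written in place of a value

def compact(runs):
    """Merge sorted runs into one sorted run. `runs` is ordered newest-first;
    on duplicate keys the newest version wins, and tombstones are dropped
    from the merged output (deletes become permanent at compaction)."""
    # tag each record with its run's age so ties sort toward newer data
    tagged = ([(k, age, v) for k, v in run] for age, run in enumerate(runs))
    out, last_key = [], object()
    for k, age, v in heapq.merge(*tagged):
        if k == last_key:
            continue             # an older version of a key already handled
        last_key = k
        if v is not TOMBSTONE:
            out.append((k, v))
    return out
```

Size-tiered and leveled compaction differ only in *which* runs they pick for this merge and how often, which is what drives their opposite write/space amplification trade-offs.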
Bloom Filters in SSTables
🌸 How Bloom Filters Work
Probabilistic Data Structure:
Can definitively say "NOT present" but only "MAYBE present"
Bit Array + Hash Functions:
Multiple hash functions set bits in array for each key
Query Process:
Check if all hash positions are set; if not, key definitely absent
False Positive Rate:
Tunable based on bits per key and hash functions
📈 Performance Impact
Read Optimization:
Eliminates 90-99% of unnecessary disk I/O
Memory Cost:
~10 bits per key for 1% false positive rate
CPU Cost:
Multiple hash computations per lookup
Trade-off: Small memory cost for massive I/O savings
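The sizing rules above follow from the standard bloom filter formulas: for target false positive rate p, m/n = -ln(p)/ln²(2) bits per key and k = ln(2)·m/n hash functions, so p = 1% gives ~9.6 bits per key and k = 7. A sketch using double hashing to derive k positions from one SHA-256 digest (the hash choice is illustrative; real engines use faster non-cryptographic hashes):

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, n, p=0.01):
        # m/n = -ln(p) / ln(2)^2 bits per key;  k = ln(2) * m/n hashes
        self.m = max(1, int(-n * math.log(p) / math.log(2) ** 2))
        self.k = max(1, round(math.log(2) * self.m / n))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        # double hashing: h1 + i*h2 approximates k independent hash functions
        d = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(d[:8], "little")
        h2 = int.from_bytes(d[8:16], "little") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # any clear bit -> definitely absent; all set -> "maybe present"
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(key))
```

A negative `might_contain` answer is what lets a read skip an SSTable entirely without touching disk.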
SSTable Implementations
🏛️ Apache Cassandra
SSTable Format:
Custom format optimized for wide rows and time-series data
Compaction:
Size-tiered by default, leveled optional
Features:
- Partition key bloom filters
- Compression per-block
- Tombstone markers for deletes
- TTL support
🪨 RocksDB/LevelDB
SSTable Format:
Highly optimized for key-value workloads
Compaction:
Leveled compaction with non-overlapping levels
Features:
- Prefix bloom filters
- Block-level checksums
- Column family support
- Write-ahead logging
📊 Apache HBase
SSTable Format:
HFile format with strong consistency guarantees
Compaction:
Region-based compaction with automatic splits
Features:
- Bloom filters per column family
- Block cache for hot data
- Data block encoding
- HDFS integration
⚡ ScyllaDB
SSTable Format:
Enhanced format with C++ optimizations
Compaction:
Incremental compaction to reduce latency spikes
Features:
- NUMA-aware processing
- Seastar async framework
- Advanced caching
- Workload conditioning
Performance Characteristics
📝 Write Performance
- Excellent: Sequential writes only
- Batching: Amortizes I/O costs
- Write Amplification: 2-10x due to compaction
- Latency: Bounded by memtable size
📖 Read Performance
- Point reads: O(log N) with bloom filters
- Range scans: Very efficient
- Read Amplification: Multiple SSTable checks
- Cache friendly: Predictable access patterns
💾 Space Efficiency
- Compression: 50-90% space savings
- Space Amplification: 1.1-2x of raw data
- Overhead: Indexes, bloom filters, metadata
- Compaction: Requires temporary space
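These characteristics can be turned into a back-of-envelope sizing helper. Every default here (block size, compression ratio, per-entry index cost, bits per key) is an illustrative assumption, not a measurement; plug in figures from your own workload:

```python
def storage_estimate(records, avg_record_bytes, block_bytes=4096,
                     compression_ratio=0.7, bloom_bits_per_key=10):
    """Rough SSTable sizing. compression_ratio is compressed/raw size;
    index cost assumes ~32 bytes per sparse-index entry (one per block)."""
    raw = records * avg_record_bytes
    compressed = raw * compression_ratio
    blocks = -(-raw // block_bytes)          # ceil division
    index_bytes = blocks * 32                # one small entry per data block
    bloom_bytes = records * bloom_bits_per_key // 8
    return {"compressed_bytes": int(compressed), "blocks": int(blocks),
            "index_bytes": int(index_bytes), "bloom_bytes": int(bloom_bytes)}
```

The point of the exercise: index and bloom filter overhead stay tiny relative to data size, which is why both can usually be held in memory.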
SSTable Best Practices
✅ Optimization Strategies
- Tune bloom filter false positive rates
- Choose appropriate block sizes for workload
- Monitor compaction metrics and adjust strategy
- Use compression algorithms suited to data type
- Size MemTables based on available memory
- Partition data to avoid hotspots
- Monitor write amplification
⚠️ Common Pitfalls
- Ignoring compaction strategy for workload
- Insufficient disk space for compaction
- Poor key design causing hotspots
- Bloom filter false positive rate too high
- Block size not aligned with read patterns
- Not monitoring write amplification
- Forgetting to tune compression settings