SSTables

Master the sorted string table data structure used in LSM-tree storage engines

What are SSTables?

Sorted String Tables (SSTables) are immutable, sorted files that store key-value pairs. They form the foundation of many modern NoSQL databases and storage engines.

Key Properties

  • Immutable: Never modified after creation
  • Sorted: Keys are stored in lexicographic order
  • Compressed: Uses block-level compression
  • Indexed: Sparse index for efficient lookups
  • Self-contained: Includes metadata and bloom filters

Advantages

  • Excellent sequential write performance
  • High compression ratios
  • Efficient range queries
  • Simple crash recovery
  • Predictable performance characteristics

SSTable Internal Structure

📁 File Layout (in on-disk order, footer last)

1. Data Blocks

Compressed blocks containing the actual key-value pairs in sorted order

  • Each block contains multiple records
  • Block size typically 4KB-64KB
  • Compression applied per block
  • Checksum for integrity verification

2. Meta Blocks

Optional blocks for metadata like bloom filters and statistics

  • Bloom filter for existence checks
  • Column family metadata
  • Statistics and histograms
  • Custom application metadata

3. Index Block

Sparse index mapping keys to data block offsets

  • One entry per data block
  • Contains last key of each block
  • Enables binary search for range queries
  • Typically fits in memory

4. Footer

Fixed-size footer containing essential metadata

  • Offset to index block
  • Offset to meta blocks
  • File format version
  • Magic number for validation
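
To make this layout concrete, here is a minimal Python sketch that writes a toy SSTable: per-block compression with checksums, a sparse index with one entry per block, and a fixed-size footer. The format, block size, and magic number are illustrative choices, not the on-disk layout of any real engine.

```python
import json
import struct
import zlib

BLOCK_SIZE = 4096          # target uncompressed block size (illustrative)
MAGIC = 0x55AA55AA         # arbitrary magic number for file validation

def write_sstable(path, items):
    """Write sorted (key, value) string pairs as a toy SSTable:
    [data blocks][index block][footer]."""
    index = []                                  # (last_key, offset, length) per block
    with open(path, "wb") as f:
        block, block_start, block_bytes = [], 0, 0

        def flush_block():
            nonlocal block, block_start, block_bytes
            if not block:
                return
            raw = json.dumps(block).encode()
            comp = zlib.compress(raw)           # compression applied per block
            f.write(comp)
            f.write(struct.pack("<I", zlib.crc32(comp)))   # per-block checksum
            index.append((block[-1][0], block_start, len(comp)))
            block, block_start, block_bytes = [], f.tell(), 0

        for key, value in sorted(items):        # keys must be written in sorted order
            block.append((key, value))
            block_bytes += len(key) + len(value)
            if block_bytes >= BLOCK_SIZE:
                flush_block()
        flush_block()

        index_offset = f.tell()                 # sparse index: last key of each block
        index_bytes = json.dumps(index).encode()
        f.write(index_bytes)
        # fixed-size footer: index offset, index length, magic number
        f.write(struct.pack("<QQI", index_offset, len(index_bytes), MAGIC))

write_sstable("example.sst", [("banana", "yellow"), ("apple", "red"), ("cherry", "dark")])
```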

SSTable Read Operations

🔍 Point Lookups

Step 1: Bloom Filter Check
Quick check if key might exist (eliminates ~99% of negative lookups)
Step 2: Index Search
Binary search in sparse index to find relevant data block
Step 3: Block Read
Read and decompress the target data block
Step 4: Block Search
Binary search within block to find exact key
Time Complexity: O(log B + log(R/B))
B = number of data blocks, R = total records in the SSTable
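
The four steps map directly onto a few lines of code. The sketch below is hedged: it assumes a hypothetical in-memory SSTable handle with a `bloom` filter, a sparse `index` of (last_key, offset, length) entries, and a `read_block(offset, length)` helper returning sorted (key, value) pairs; none of these names come from a real library.

```python
import bisect

def sstable_get(table, key):
    """Point lookup against a hypothetical in-memory SSTable handle."""
    # Step 1: bloom filter check -- a definite "no" avoids any disk I/O
    if not table.bloom.might_contain(key):
        return None
    # Step 2: binary search the sparse index; each entry holds the last
    # key of a data block, so find the first block whose last key >= key
    last_keys = [entry[0] for entry in table.index]
    i = bisect.bisect_left(last_keys, key)
    if i == len(table.index):
        return None                              # key is past the end of the file
    _, offset, length = table.index[i]
    # Step 3: read and decompress the single candidate data block
    records = table.read_block(offset, length)   # sorted (key, value) pairs
    # Step 4: binary search within the block for the exact key
    keys = [k for k, _ in records]
    j = bisect.bisect_left(keys, key)
    if j < len(records) and records[j][0] == key:
        return records[j][1]
    return None                                  # bloom filter false positive
```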

📊 Range Queries

Step 1: Find Start Block
Use index to locate first block containing start key
Step 2: Sequential Scan
Read consecutive blocks until end key is reached
Step 3: Filter Results
Apply additional predicates and collect matching records
Optimization:
Block-level filtering using metadata can skip entire blocks
Performance: Excellent for range scans
Minimal seeks, sequential I/O
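
A range scan uses the same sparse index to locate its starting block and then streams blocks sequentially. A sketch under the same hypothetical table handle as the point-lookup example:

```python
import bisect

def sstable_range(table, start_key, end_key):
    """Yield (key, value) pairs with start_key <= key < end_key."""
    # Step 1: use the sparse index to locate the first candidate block
    last_keys = [entry[0] for entry in table.index]
    i = bisect.bisect_left(last_keys, start_key)
    # Step 2: scan consecutive blocks -- sequential I/O, minimal seeks
    while i < len(table.index):
        _, offset, length = table.index[i]
        for key, value in table.read_block(offset, length):
            if key >= end_key:
                return                   # past the end of the range
            # Step 3: filter within the block
            if key >= start_key:
                yield key, value
        i += 1
```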

LSM Trees and Write Path

🌳 Log-Structured Merge Trees

📝 MemTable (L0)

In-memory buffer
Sorted structure (red-black tree or skip list)
Fast writes
Volatile

💾 SSTables (L1+)

Immutable files
Sorted on disk
Compressed
Persistent

📥 Write Process

1. Writes go to WAL (Write-Ahead Log) for durability
2. Data inserted into in-memory MemTable
3. When MemTable is full, flush to L0 SSTable
4. Background compaction merges SSTables across levels
5. Each level has size limits, triggering cascading compactions
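
A toy version of this write path is sketched below; a plain sorted text file stands in for the real block format, and compaction, crash recovery, and concurrency are omitted.

```python
import os

class MiniLSM:
    """Toy write path: WAL append -> memtable insert -> flush to SSTable."""

    def __init__(self, data_dir, memtable_limit=4 * 1024 * 1024):
        self.data_dir = data_dir
        self.memtable_limit = memtable_limit
        self.memtable = {}               # stands in for a skip list / red-black tree
        self.memtable_bytes = 0
        self.sstable_seq = 0
        os.makedirs(data_dir, exist_ok=True)
        self.wal = open(os.path.join(data_dir, "wal.log"), "ab")

    def put(self, key, value):
        # 1. append to the WAL and fsync for durability
        self.wal.write(f"{key}\t{value}\n".encode())
        self.wal.flush()
        os.fsync(self.wal.fileno())
        # 2. insert into the in-memory memtable
        self.memtable[key] = value
        self.memtable_bytes += len(key) + len(value)
        # 3. flush to an immutable SSTable once the memtable is full
        if self.memtable_bytes >= self.memtable_limit:
            self.flush()

    def flush(self):
        path = os.path.join(self.data_dir, f"sstable-{self.sstable_seq:06d}.sst")
        # write an immutable, sorted file (stand-in for the block format above)
        with open(path, "w") as out:
            for key in sorted(self.memtable):
                out.write(f"{key}\t{self.memtable[key]}\n")
        self.sstable_seq += 1
        self.memtable.clear()
        self.memtable_bytes = 0
        self.wal.truncate(0)             # memtable contents are now durable on disk

db = MiniLSM("toy_db")
db.put("user:1", "alice")
db.put("user:2", "bob")
db.flush()   # force a flush instead of waiting for the size threshold
```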

Compaction Strategies

🎯 Size-Tiered Compaction

Merge SSTables of similar sizes when threshold is reached
Pros:
  • Good for write-heavy workloads
  • Lower write amplification
  • Simpler implementation
Cons:
  • Higher space amplification
  • Temporary space spikes during compaction
  • Less predictable read performance
Used by: Cassandra (default), ScyllaDB

📊 Leveled Compaction

Maintain fixed levels with non-overlapping key ranges
Pros:
  • Better space utilization
  • Predictable read performance
  • Lower space amplification
Cons:
  • Higher write amplification
  • More complex implementation
  • Can hurt write-heavy workloads
Used by: LevelDB, RocksDB, Cassandra (optional)
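
Whichever strategy triggers it, the compaction itself is a k-way merge of sorted runs that keeps only the newest version of each key and drops tombstoned entries. A simplified sketch (the newest-first ordering and the tombstone convention are assumptions):

```python
import heapq

TOMBSTONE = None   # assumed marker for deleted keys

def compact(sstables):
    """Merge sorted runs, keeping only the newest version of each key.

    `sstables` is a list of sorted (key, value) sequences, ordered
    newest-first; tombstoned keys are dropped from the output entirely.
    """
    # Tag each record with its source's age so duplicates sort newest-first.
    runs = [
        [(key, age, value) for key, value in table]
        for age, table in enumerate(sstables)
    ]
    last_key = object()                      # sentinel: matches no real key
    for key, age, value in heapq.merge(*runs):
        if key == last_key:
            continue                         # older duplicate: discard
        last_key = key
        if value is not TOMBSTONE:           # drop deleted keys during compaction
            yield key, value

# Example: newest table first; "banana" is overwritten, "cherry" deleted.
newest = [("banana", "green"), ("cherry", TOMBSTONE)]
oldest = [("apple", "red"), ("banana", "yellow"), ("cherry", "dark")]
print(list(compact([newest, oldest])))
# [('apple', 'red'), ('banana', 'green')]
```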

Bloom Filters in SSTables

🌸 How Bloom Filters Work

Probabilistic Data Structure:
Can definitively say "NOT present" but only "MAYBE present"
Bit Array + Hash Functions:
Multiple hash functions set bits in array for each key
Query Process:
Check if all hash positions are set; if not, key definitely absent
False Positive Rate:
Tunable based on bits per key and hash functions
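
A minimal bloom filter sketch, using double hashing to derive the k bit positions; the hash scheme and sizes are illustrative rather than any engine's exact implementation.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k positions from two base hash values.
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # All k bits set  -> "maybe present" (could be a false positive).
        # Any bit unset   -> "definitely not present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = BloomFilter(num_bits=10_000, num_hashes=7)   # ~10 bits/key budget for ~1,000 keys
bf.add("user:42")
print(bf.might_contain("user:42"))   # True
print(bf.might_contain("user:99"))   # almost certainly False
```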

📈 Performance Impact

Read Optimization:
Eliminates 90-99% of unnecessary disk I/O
Memory Cost:
~10 bits per key for 1% false positive rate
CPU Cost:
Multiple hash computations per lookup
Trade-off: Small memory cost for massive I/O savings
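
The "~10 bits per key" rule of thumb follows from the standard sizing formulas: bits per key m/n = -ln(p) / (ln 2)^2 and optimal hash count k = (m/n) · ln 2. A quick check:

```python
import math

def bloom_parameters(false_positive_rate):
    """Bits per key and optimal hash count for a target false positive rate."""
    bits_per_key = -math.log(false_positive_rate) / (math.log(2) ** 2)
    num_hashes = bits_per_key * math.log(2)
    return bits_per_key, num_hashes

bits, hashes = bloom_parameters(0.01)
print(f"{bits:.1f} bits/key, {round(hashes)} hash functions")   # 9.6 bits/key, 7 hash functions
```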

SSTable Implementations

🏛️ Apache Cassandra

SSTable Format:
Custom format optimized for wide rows and time-series data
Compaction:
Size-tiered by default, leveled optional
Features:
  • Partition key bloom filters
  • Compression per-block
  • Tombstone markers for deletes
  • TTL support

🪨 RocksDB/LevelDB

SSTable Format:
Highly optimized for key-value workloads
Compaction:
Leveled compaction with non-overlapping levels
Features:
  • Prefix bloom filters
  • Block-level checksums
  • Column families
  • Write-ahead logging

📊 Apache HBase

SSTable Format:
HFile format; HBase layers strong consistency guarantees on top
Compaction:
Region-based compaction with automatic splits
Features:
  • Bloom filters per column family
  • Block cache for hot data
  • Data block encoding
  • HDFS integration

⚡ ScyllaDB

SSTable Format:
Enhanced format with C++ optimizations
Compaction:
Incremental compaction to reduce latency spikes
Features:
  • NUMA-aware processing
  • Seastar async framework
  • Advanced caching
  • Workload conditioning

Performance Characteristics

📝 Write Performance

  • Excellent: Sequential writes only
  • Batching: Amortizes I/O costs
  • Write Amplification: 2-10x due to compaction
  • Latency: Bounded by memtable size

📖 Read Performance

  • Point reads: O(log N) per SSTable; bloom filters skip tables that lack the key
  • Range scans: Very efficient
  • Read Amplification: Multiple SSTable checks
  • Cache friendly: Predictable access patterns

💾 Space Efficiency

  • Compression: 50-90% space savings
  • Space Amplification: 1.1-2x of raw data
  • Overhead: Indexes, bloom filters, metadata
  • Compaction: Requires temporary space
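
As a rough back-of-the-envelope combination of these factors (every number below is an assumed example, not a measurement):

```python
# Back-of-the-envelope sizing for 100 GB of raw key-value data.
raw_gb = 100
compression_ratio = 0.4        # assume block compression keeps 40% of the raw size
space_amplification = 1.3      # assume obsolete versions awaiting compaction
write_amplification = 5        # assume mid-range compaction rewrite overhead

on_disk_gb = raw_gb * compression_ratio * space_amplification
lifetime_writes_gb = raw_gb * compression_ratio * write_amplification

print(f"approx. on-disk footprint: {on_disk_gb:.0f} GB")        # 52 GB
print(f"approx. bytes written:     {lifetime_writes_gb:.0f} GB") # 200 GB
```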

SSTable Best Practices

✅ Optimization Strategies

  • Tune bloom filter false positive rates
  • Choose appropriate block sizes for workload
  • Monitor compaction metrics and adjust strategy
  • Use compression algorithms suited to data type
  • Size MemTables based on available memory
  • Partition data to avoid hotspots
  • Monitor write amplification

⚠️ Common Pitfalls

  • Ignoring compaction strategy for workload
  • Insufficient disk space for compaction
  • Poor key design causing hotspots
  • Bloom filter false positive rate too high
  • Block size not aligned with read patterns
  • Not monitoring write amplification
  • Forgetting to tune compression settings
