What is an LSM-Tree?
The Log-Structured Merge Tree (LSM-Tree) is a data structure designed for high write throughput. Instead of updating data in place (like B-Trees), LSM-Trees batch writes in memory and periodically flush them to disk as sorted files.
Because writes become sequential appends rather than random in-place updates, LSM-Trees can deliver dramatically higher write throughput (often quoted as 10-100x) than B-Tree databases, while keeping read performance good enough for most workloads.
Who Uses LSM-Trees?
- LevelDB (Google)
- RocksDB (Facebook)
- Cassandra (Apache)
- HBase (Apache)
- NexaDB (Yes, we use it too!)
- SQLite4 (proposed)
How LSM-Trees Work
The Write Path
Write Request
↓
[1] Write-Ahead Log (WAL) - durability
↓
[2] MemTable (in-memory sorted structure)
↓ (when full)
[3] Flush to SSTable (Sorted String Table)
↓
[4] Background Compaction (merge SSTables)
Let's break down each component:
1. Write-Ahead Log (WAL)
Before any write, the operation is logged to disk:
```python
# NexaDB's WAL implementation (simplified)
def write(self, key, value):
    # First, log to WAL for durability
    self.wal.append(serialize(key, value))
    self.wal.fsync()  # Ensure it's on disk

    # Then, write to MemTable
    self.memtable.put(key, value)
```
If the system crashes, the WAL is replayed on startup to recover data.
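That recovery step can be sketched as replaying each WAL record, in order, into a fresh MemTable. The one-JSON-record-per-line format below is illustrative only; the real encoding is whatever `serialize` produces.

```python
import json

def replay_wal(wal_path, memtable):
    """Rebuild the MemTable by replaying WAL records in write order.

    Assumes one JSON-encoded [key, value] record per line, which is an
    illustrative format, not NexaDB's actual encoding.
    """
    with open(wal_path) as wal:
        for line in wal:
            key, value = json.loads(line)
            memtable[key] = value  # later records overwrite earlier ones
    return memtable
```

Replaying in order matters: if the same key was written twice before the crash, the later record must win, exactly as it would have in the MemTable.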
2. MemTable
An in-memory sorted data structure (usually a skip list or red-black tree):
```python
from sortedcontainers import SortedDict  # keeps keys sorted, O(log n) inserts

class MemTable:
    def __init__(self, max_size=4 * 1024 * 1024):  # 4MB default
        self.data = SortedDict()
        self.size = 0
        self.max_size = max_size

    def put(self, key, value):
        self.data[key] = value
        self.size += len(key) + len(value)
        if self.size >= self.max_size:
            self.flush()
```
NexaDB Enhancement: We use a dual MemTable architecture. While one MemTable is being flushed, writes continue to a second MemTable, achieving 2x write throughput.
3. SSTables (Sorted String Tables)
When the MemTable is full, it's flushed to disk as an immutable SSTable:
SSTable Format:
┌─────────────────────────────────────────────┐
│ Data Block 1: [key1:value1, key2:value2...] │
├─────────────────────────────────────────────┤
│ Data Block 2: [key3:value3, key4:value4...] │
├─────────────────────────────────────────────┤
│ ... │
├─────────────────────────────────────────────┤
│ Index Block: [key1→offset, key3→offset...] │
├─────────────────────────────────────────────┤
│ Bloom Filter (for fast negative lookups) │
├─────────────────────────────────────────────┤
│ Footer: metadata, offsets │
└─────────────────────────────────────────────┘
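Reading from this layout is cheap because the index block is small enough to keep in memory: a lookup binary-searches the index for the one data block that could hold the key, then reads just that block from disk. A sketch of the index search, assuming the index block has been loaded as two parallel sorted lists (a representation I chose for illustration):

```python
import bisect

def find_block_offset(index_keys, index_offsets, key):
    """Locate the data block that could contain `key`.

    `index_keys` holds the first key of each data block (sorted);
    `index_offsets` holds the matching file offsets.
    """
    # Rightmost block whose first key is <= the search key
    i = bisect.bisect_right(index_keys, key) - 1
    if i < 0:
        return None  # key sorts before the first block: not in this SSTable
    return index_offsets[i]
```

So a point lookup touches at most one data block per SSTable, regardless of how large the file grows.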
4. Compaction
Over time, SSTables accumulate. Compaction merges them to:
- Remove deleted/overwritten keys
- Reduce read amplification
- Reclaim disk space
Level 0: [SST-1] [SST-2] [SST-3] (may overlap)
↓ compaction
Level 1: [SST-A──────────] [SST-B──────────] (no overlap)
↓ compaction
Level 2: [SST-X────────────────────────────] (larger files)
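The merge at the heart of compaction can be sketched as a heap-based merge of sorted runs, where the newest run wins on duplicate keys and older versions are dropped. The in-memory list-of-pairs representation is illustrative; real compaction streams blocks from disk.

```python
import heapq

def merge_sstables(*runs):
    """Merge sorted (key, value) runs into one sorted run.

    `runs` are ordered newest to oldest. When the same key appears in
    several runs, the newest value wins and the rest are discarded --
    this is how compaction removes overwritten entries.
    """
    # Tag entries with their run index so newer runs sort first on key ties
    heap = [(key, run_idx, value)
            for run_idx, run in enumerate(runs)
            for key, value in run]
    heapq.heapify(heap)

    merged = []
    while heap:
        key, _, value = heapq.heappop(heap)
        if not merged or merged[-1][0] != key:
            merged.append((key, value))  # first occurrence = newest version
        # else: older duplicate, drop it
    return merged
```

Because every input run is already sorted, the output is produced in sorted order in a single pass, which is what makes compaction a sequential, disk-friendly operation.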
The Read Path
Reading from an LSM-Tree checks multiple locations:
```python
def get(self, key):
    # 1. Check MemTable (newest data)
    if key in self.memtable:
        return self.memtable[key]

    # 2. Check Immutable MemTable (being flushed)
    if key in self.immutable_memtable:
        return self.immutable_memtable[key]

    # 3. Check SSTables (newest to oldest)
    for sstable in reversed(self.sstables):
        # Use Bloom filter for fast negative check
        if not sstable.bloom_filter.might_contain(key):
            continue  # Definitely not here
        if (value := sstable.get(key)) is not None:
            return value

    return None  # Key not found
```
Bloom Filters: The Secret Weapon
Bloom filters prevent unnecessary disk reads:
```python
class BloomFilter:
    def __init__(self, expected_items, false_positive_rate=0.01):
        size = calculate_optimal_size(expected_items, false_positive_rate)
        self.bits = bytearray(size)  # bit array (one byte per bit, for clarity)
        self.hash_count = calculate_optimal_hashes(...)

    def add(self, key):
        for i in range(self.hash_count):
            # Salt the key with i to derive hash_count independent hashes
            index = hash((key, i)) % len(self.bits)
            self.bits[index] = 1

    def might_contain(self, key):
        for i in range(self.hash_count):
            index = hash((key, i)) % len(self.bits)
            if self.bits[index] == 0:
                return False  # Definitely not present
        return True  # Might be present (check SSTable)
```
With a 1% false-positive rate, a Bloom filter lets a read skip roughly 99% of the SSTables that don't actually contain the key, at a memory cost of only about 10 bits per entry.
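The sizing behind that trade-off follows the standard Bloom filter formulas. A small helper (the function name is mine, not NexaDB's API) computes the optimal bit-array size and hash count:

```python
import math

def bloom_parameters(n, p):
    """Optimal bit-array size m and hash count k for n items at
    target false-positive rate p, via the standard formulas:

        m = -n * ln(p) / (ln 2)^2
        k = (m / n) * ln 2
    """
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round((m / n) * math.log(2))
    return m, k
```

For one million keys at p = 0.01, this works out to roughly 9.6 million bits (about 1.2 MB) and 7 hash functions.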
LSM-Tree vs B-Tree
| Aspect | LSM-Tree | B-Tree |
|---|---|---|
| Write Speed | Very Fast (sequential) | Slower (random I/O) |
| Read Speed | Good (with optimizations) | Excellent |
| Write Amplification | Higher (compaction) | Lower |
| Space Amplification | Higher (multiple copies) | Lower |
| Best For | Write-heavy workloads | Read-heavy workloads |
Optimizations in NexaDB
1. Dual MemTable
```python
import threading

class DualMemTable:
    def __init__(self):
        self.active = MemTable()
        self.immutable = None

    def put(self, key, value):
        self.active.put(key, value)
        if self.active.is_full():
            # Swap atomically
            self.immutable = self.active
            self.active = MemTable()
            # Flush in background
            threading.Thread(target=self.flush_immutable).start()
```
2. LRU Block Cache
```python
class BlockCache:
    def __init__(self, capacity=10000):
        self.cache = LRUCache(capacity)

    def get_block(self, sstable_id, block_offset):
        key = (sstable_id, block_offset)
        if (block := self.cache.get(key)) is not None:
            return block
        block = self.load_from_disk(sstable_id, block_offset)
        self.cache.put(key, block)
        return block
```
3. Leveled Compaction
```python
def compact_level(self, level):
    # Pick SSTable with most overlap
    source = self.pick_sstable(level)

    # Find overlapping SSTables in next level
    overlapping = self.find_overlapping(source, level + 1)

    # Merge and write new SSTables
    new_sstables = self.merge(source, overlapping)

    # Atomically update manifest
    self.manifest.update(
        remove=[source] + overlapping,
        add=new_sstables,
    )
```
Real-World Performance
NexaDB's LSM-Tree achieves:
- 89,000 writes/sec (direct API)
- 25,500 ops/sec (at 1M scale via binary protocol)
- 0.039ms average write latency
- 10ms maximum data loss window (WAL fsync interval)
Implementing Your Own LSM-Tree
Here's a minimal implementation:
```python
import time

from sortedcontainers import SortedDict

class SimpleLSMTree:
    def __init__(self, data_dir):
        self.data_dir = data_dir
        self.wal = WriteAheadLog(f"{data_dir}/wal")
        self.memtable = SortedDict()
        self.sstables = []
        self.memtable_size = 0
        self.max_memtable_size = 4 * 1024 * 1024  # 4MB

    def put(self, key, value):
        # WAL first
        self.wal.append(key, value)
        # Then MemTable
        self.memtable[key] = value
        self.memtable_size += len(key) + len(value)
        if self.memtable_size >= self.max_memtable_size:
            self._flush()

    def get(self, key):
        # Check MemTable
        if key in self.memtable:
            return self.memtable[key]
        # Check SSTables (newest first)
        for sstable in reversed(self.sstables):
            if (value := sstable.get(key)) is not None:
                return value
        return None

    def _flush(self):
        # Create new SSTable
        sstable = SSTable.create(
            f"{self.data_dir}/sst_{time.time()}.db",
            self.memtable,
        )
        self.sstables.append(sstable)
        # Clear MemTable
        self.memtable = SortedDict()
        self.memtable_size = 0
        self.wal.clear()
```
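One thing this minimal version cannot do is delete a key: since SSTables are immutable, LSM-Trees write a tombstone marker that shadows older values until compaction removes both (the "remove deleted/overwritten keys" step above). A sketch of the idea, using a sentinel of my own choosing:

```python
TOMBSTONE = object()  # sentinel marking a deleted key

class DeletableMemTable(dict):
    """Minimal MemTable sketch showing tombstone-based deletes.

    A delete is just a put of the TOMBSTONE sentinel; lookups translate
    it back into "not found" so the read path stops searching older
    SSTables that may still hold a stale value.
    """

    def delete(self, key):
        self[key] = TOMBSTONE  # shadows any older value on disk

    def lookup(self, key):
        value = self.get(key)
        if value is TOMBSTONE:
            return None  # deleted: do not fall through to older SSTables
        return value
```

Tombstones themselves are flushed and compacted like normal entries; they can only be dropped once no older SSTable could still contain the key.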
Conclusion
LSM-Trees power some of the world's most demanding databases. By understanding their design, you can:
- Choose the right database for your workload
- Tune performance parameters effectively
- Debug issues with compaction and write amplification
NexaDB builds on LSM-Tree fundamentals with modern enhancements like dual MemTables, HNSW vector indexes, and the TOON format for AI applications.
Want to see LSM-Trees in action? Install NexaDB and explore the architecture yourself.