What is an LSM-Tree?
The Log-Structured Merge Tree (LSM-Tree) is a data structure designed for high write throughput. Instead of updating data in place (like B-Trees), LSM-Trees batch writes in memory and periodically flush them to disk as sorted files.
Because writes become sequential appends rather than random in-place updates, LSM-Trees can deliver dramatically higher write throughput (often quoted as 10-100x) than B-Tree databases, while keeping read performance good enough for most workloads.
Who Uses LSM-Trees?
- LevelDB (Google)
- RocksDB (Facebook)
- Cassandra (Apache)
- HBase (Apache)
- NexaDB (Yes, we use it too!)
- SQLite4 (proposed)
How LSM-Trees Work
The Write Path
Write Request
↓
[1] Write-Ahead Log (WAL) - durability
↓
[2] MemTable (in-memory sorted structure)
↓ (when full)
[3] Flush to SSTable (Sorted String Table)
↓
[4] Background Compaction (merge SSTables)
Let's break down each component:
1. Write-Ahead Log (WAL)
Before any write, the operation is logged to disk:
```python
# NexaDB's WAL implementation (simplified)
def write(self, key, value):
    # First, log to WAL for durability
    self.wal.append(serialize(key, value))
    self.wal.fsync()  # Ensure it's on disk

    # Then, write to MemTable
    self.memtable.put(key, value)
```
If the system crashes, the WAL is replayed on startup to recover data.
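That recovery step can be sketched as replaying each WAL record, in order, into a fresh MemTable. The one-JSON-record-per-line format below is illustrative only; the real encoding is whatever `serialize` produces.

```python
import json

def replay_wal(wal_path, memtable):
    """Rebuild the MemTable by replaying WAL records in write order.

    Assumes one JSON-encoded [key, value] record per line, which is an
    illustrative format, not NexaDB's actual encoding.
    """
    with open(wal_path) as wal:
        for line in wal:
            key, value = json.loads(line)
            memtable[key] = value  # later records overwrite earlier ones
    return memtable
```

Replaying in order matters: if the same key was written twice before the crash, the later record must win, exactly as it would have in the MemTable.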
2. MemTable
An in-memory sorted data structure (usually a skip list or red-black tree):
```python
from sortedcontainers import SortedDict  # keeps keys sorted, O(log n) inserts

class MemTable:
    def __init__(self, max_size=4 * 1024 * 1024):  # 4MB default
        self.data = SortedDict()
        self.size = 0
        self.max_size = max_size

    def put(self, key, value):
        self.data[key] = value
        self.size += len(key) + len(value)
        if self.size >= self.max_size:
            self.flush()
```
NexaDB Enhancement: We use a dual MemTable architecture. While one MemTable is being flushed, writes continue to a second MemTable, achieving 2x write throughput.
3. SSTables (Sorted String Tables)
When the MemTable is full, it's flushed to disk as an immutable SSTable:
SSTable Format:
┌─────────────────────────────────────────────┐
│ Data Block 1: [key1:value1, key2:value2...] │
├─────────────────────────────────────────────┤
│ Data Block 2: [key3:value3, key4:value4...] │
├─────────────────────────────────────────────┤
│ ... │
├─────────────────────────────────────────────┤
│ Index Block: [key1→offset, key3→offset...] │
├─────────────────────────────────────────────┤
│ Bloom Filter (for fast negative lookups) │
├─────────────────────────────────────────────┤
│ Footer: metadata, offsets │
└─────────────────────────────────────────────┘
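Reading from this layout is cheap because the index block is small enough to keep in memory: a lookup binary-searches the index for the one data block that could hold the key, then reads just that block from disk. A sketch of the index search, assuming the index block has been loaded as two parallel sorted lists (a representation I chose for illustration):

```python
import bisect

def find_block_offset(index_keys, index_offsets, key):
    """Locate the data block that could contain `key`.

    `index_keys` holds the first key of each data block (sorted);
    `index_offsets` holds the matching file offsets.
    """
    # Rightmost block whose first key is <= the search key
    i = bisect.bisect_right(index_keys, key) - 1
    if i < 0:
        return None  # key sorts before the first block: not in this SSTable
    return index_offsets[i]
```

So a point lookup touches at most one data block per SSTable, regardless of how large the file grows.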
4. Compaction
Over time, SSTables accumulate. Compaction merges them to:
- Remove deleted/overwritten keys
- Reduce read amplification
- Reclaim disk space
Level 0: [SST-1] [SST-2] [SST-3] (may overlap)
↓ compaction
Level 1: [SST-A──────────] [SST-B──────────] (no overlap)
↓ compaction
Level 2: [SST-X────────────────────────────] (larger files)
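The merge at the heart of compaction can be sketched as a heap-based merge of sorted runs, where the newest run wins on duplicate keys and older versions are dropped. The in-memory list-of-pairs representation is illustrative; real compaction streams blocks from disk.

```python
import heapq

def merge_sstables(*runs):
    """Merge sorted (key, value) runs into one sorted run.

    `runs` are ordered newest to oldest. When the same key appears in
    several runs, the newest value wins and the rest are discarded --
    this is how compaction removes overwritten entries.
    """
    # Tag entries with their run index so newer runs sort first on key ties
    heap = [(key, run_idx, value)
            for run_idx, run in enumerate(runs)
            for key, value in run]
    heapq.heapify(heap)

    merged = []
    while heap:
        key, _, value = heapq.heappop(heap)
        if not merged or merged[-1][0] != key:
            merged.append((key, value))  # first occurrence = newest version
        # else: older duplicate, drop it
    return merged
```

Because every input run is already sorted, the output is produced in sorted order in a single pass, which is what makes compaction a sequential, disk-friendly operation.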
The Read Path
Reading from an LSM-Tree checks multiple locations:
```python
def get(self, key):
    # 1. Check MemTable (newest data)
    if key in self.memtable:
        return self.memtable[key]

    # 2. Check Immutable MemTable (being flushed)
    if key in self.immutable_memtable:
        return self.immutable_memtable[key]

    # 3. Check SSTables (newest to oldest)
    for sstable in reversed(self.sstables):
        # Use Bloom filter for fast negative check
        if not sstable.bloom_filter.might_contain(key):
            continue  # Definitely not here
        if (value := sstable.get(key)) is not None:
            return value

    return None  # Key not found
```
Bloom Filters: The Secret Weapon
Bloom filters prevent unnecessary disk reads:
```python
class BloomFilter:
    def __init__(self, expected_items, false_positive_rate=0.01):
        size = calculate_optimal_size(expected_items, false_positive_rate)
        self.bits = bytearray(size)  # bit array (one byte per bit, for clarity)
        self.hash_count = calculate_optimal_hashes(...)

    def add(self, key):
        for i in range(self.hash_count):
            # Salt the key with i to derive hash_count independent hashes
            index = hash((key, i)) % len(self.bits)
            self.bits[index] = 1

    def might_contain(self, key):
        for i in range(self.hash_count):
            index = hash((key, i)) % len(self.bits)
            if self.bits[index] == 0:
                return False  # Definitely not present
        return True  # Might be present (check SSTable)
```
With a 1% false-positive rate, a Bloom filter lets a read skip roughly 99% of the SSTables that don't actually contain the key, at a memory cost of only about 10 bits per entry.
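The sizing behind that trade-off follows the standard Bloom filter formulas. A small helper (the function name is mine, not NexaDB's API) computes the optimal bit-array size and hash count:

```python
import math

def bloom_parameters(n, p):
    """Optimal bit-array size m and hash count k for n items at
    target false-positive rate p, via the standard formulas:

        m = -n * ln(p) / (ln 2)^2
        k = (m / n) * ln 2
    """
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round((m / n) * math.log(2))
    return m, k
```

For one million keys at p = 0.01, this works out to roughly 9.6 million bits (about 1.2 MB) and 7 hash functions.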
LSM-Tree vs B-Tree
| Aspect | LSM-Tree | B-Tree |
|---|---|---|
| Write Speed | Very Fast (sequential) | Slower (random I/O) |
| Read Speed | Good (with optimizations) | Excellent |
| Write Amplification | Higher (compaction) | Lower |
| Space Amplification | Higher (multiple copies) | Lower |
| Best For | Write-heavy workloads | Read-heavy workloads |
Optimizations in NexaDB
1. Dual MemTable
```python
import threading

class DualMemTable:
    def __init__(self):
        self.active = MemTable()
        self.immutable = None

    def put(self, key, value):
        self.active.put(key, value)
        if self.active.is_full():
            # Swap atomically
            self.immutable = self.active
            self.active = MemTable()
            # Flush in background
            threading.Thread(target=self.flush_immutable).start()
```
2. LRU Block Cache
```python
class BlockCache:
    def __init__(self, capacity=10000):
        self.cache = LRUCache(capacity)

    def get_block(self, sstable_id, block_offset):
        key = (sstable_id, block_offset)
        if (block := self.cache.get(key)) is not None:
            return block
        block = self.load_from_disk(sstable_id, block_offset)
        self.cache.put(key, block)
        return block
```
3. Leveled Compaction
```python
def compact_level(self, level):
    # Pick SSTable with most overlap
    source = self.pick_sstable(level)

    # Find overlapping SSTables in next level
    overlapping = self.find_overlapping(source, level + 1)

    # Merge and write new SSTables
    new_sstables = self.merge(source, overlapping)

    # Atomically update manifest
    self.manifest.update(
        remove=[source] + overlapping,
        add=new_sstables,
    )
```
Real-World Performance
NexaDB's LSM-Tree achieves:
- 89,000 writes/sec (direct API)
- 25,500 ops/sec (at 1M scale via binary protocol)
- 0.039ms average write latency
- 10ms maximum data loss window (WAL fsync interval)
Implementing Your Own LSM-Tree
Here's a minimal implementation:
```python
import time

from sortedcontainers import SortedDict

class SimpleLSMTree:
    def __init__(self, data_dir):
        self.data_dir = data_dir
        self.wal = WriteAheadLog(f"{data_dir}/wal")
        self.memtable = SortedDict()
        self.sstables = []
        self.memtable_size = 0
        self.max_memtable_size = 4 * 1024 * 1024  # 4MB

    def put(self, key, value):
        # WAL first
        self.wal.append(key, value)
        # Then MemTable
        self.memtable[key] = value
        self.memtable_size += len(key) + len(value)
        if self.memtable_size >= self.max_memtable_size:
            self._flush()

    def get(self, key):
        # Check MemTable
        if key in self.memtable:
            return self.memtable[key]
        # Check SSTables (newest first)
        for sstable in reversed(self.sstables):
            if (value := sstable.get(key)) is not None:
                return value
        return None

    def _flush(self):
        # Create new SSTable
        sstable = SSTable.create(
            f"{self.data_dir}/sst_{time.time()}.db",
            self.memtable,
        )
        self.sstables.append(sstable)
        # Clear MemTable
        self.memtable = SortedDict()
        self.memtable_size = 0
        self.wal.clear()
```
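One thing this minimal version cannot do is delete a key: since SSTables are immutable, LSM-Trees write a tombstone marker that shadows older values until compaction removes both (the "remove deleted/overwritten keys" step above). A sketch of the idea, using a sentinel of my own choosing:

```python
TOMBSTONE = object()  # sentinel marking a deleted key

class DeletableMemTable(dict):
    """Minimal MemTable sketch showing tombstone-based deletes.

    A delete is just a put of the TOMBSTONE sentinel; lookups translate
    it back into "not found" so the read path stops searching older
    SSTables that may still hold a stale value.
    """

    def delete(self, key):
        self[key] = TOMBSTONE  # shadows any older value on disk

    def lookup(self, key):
        value = self.get(key)
        if value is TOMBSTONE:
            return None  # deleted: do not fall through to older SSTables
        return value
```

Tombstones themselves are flushed and compacted like normal entries; they can only be dropped once no older SSTable could still contain the key.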
Conclusion
LSM-Trees power some of the world's most demanding databases. By understanding their design, you can:
- Choose the right database for your workload
- Tune performance parameters effectively
- Debug issues with compaction and write amplification
NexaDB builds on LSM-Tree fundamentals with modern enhancements like dual MemTables, HNSW vector indexes, and the TOON format for AI applications.
Want to see LSM-Trees in action? Install NexaDB and explore the architecture yourself.