Tags: storage, databases, data-structures, rocksdb, cassandra, write-optimization, advanced

The Log-Structured Merge-Tree (LSM-Tree)

The data structure that powers Cassandra, RocksDB, LevelDB, and most modern write-heavy databases

Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil | University of Massachusetts Boston | 1996 | 35 min read

Summary

The LSM-Tree is a data structure optimized for write-heavy workloads. Instead of updating data in-place (like B-trees), it buffers writes in memory, then flushes them sequentially to disk in sorted batches. Reads merge data from multiple levels. This design trades read performance for dramatically better write throughput—often 10-100x faster than B-trees for write-intensive workloads. LSM-trees power virtually every modern NoSQL database: Cassandra, HBase, RocksDB, LevelDB, ScyllaDB, and many more.

Key Takeaways

Sequential Writes Beat Random Writes

Disk I/O speed differs dramatically between sequential and random access—10-1000x on HDDs, 10-100x even on SSDs. LSM-trees convert random writes into sequential writes by batching and sorting before flushing to disk.
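
A rough back-of-the-envelope illustration of the batching argument; the IOPS and throughput figures below are assumptions consistent with the table in the Deep Dive section, not measurements:

```python
# Back-of-the-envelope: 100,000 small updates on an HDD that sustains roughly
# 100 seek-limited random writes per second but ~100 MB/s sequentially.
# All figures are illustrative assumptions, not benchmark results.

NUM_UPDATES = 100_000
UPDATE_SIZE = 100                 # bytes per update
RANDOM_IOPS = 100                 # assumed random writes per second (seek-bound)
SEQ_MB_PER_S = 100                # assumed sequential write throughput

random_seconds = NUM_UPDATES / RANDOM_IOPS            # ~1000 s: one seek per update
batch_mb = NUM_UPDATES * UPDATE_SIZE / 1_000_000      # ~10 MB once batched and sorted
sequential_seconds = batch_mb / SEQ_MB_PER_S          # ~0.1 s as one sequential flush

print(f"one-at-a-time random writes: {random_seconds:.0f} s")
print(f"batched sequential flush:    {sequential_seconds:.1f} s")
```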

Memory as Write Buffer

Recent writes live in an in-memory structure (memtable). This absorbs write bursts, enables batching, and allows the system to sort data before writing. The memtable is typically implemented as a skip list or red-black tree.
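
A minimal memtable sketch, assuming a simple flush-on-size policy. The class name and threshold are illustrative; a real engine keeps entries sorted on insert (skip list or red-black tree) rather than sorting only at flush time:

```python
# Illustrative memtable sketch (names are hypothetical, not from the paper).
# A plain dict is used and sorted at flush time to stay dependency-free.

FLUSH_THRESHOLD_BYTES = 4 * 1024 * 1024   # flush once the buffer holds ~4 MB

class Memtable:
    def __init__(self):
        self.entries = {}        # key -> value (latest write wins)
        self.size_bytes = 0

    def put(self, key: str, value: str) -> None:
        self.entries[key] = value
        self.size_bytes += len(key) + len(value)

    def is_full(self) -> bool:
        return self.size_bytes >= FLUSH_THRESHOLD_BYTES

    def flush(self) -> list[tuple[str, str]]:
        """Return all entries in sorted key order, ready to write sequentially."""
        run = sorted(self.entries.items())
        self.entries.clear()
        self.size_bytes = 0
        return run
```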

Immutable Sorted Runs

Once written to disk, data files (SSTables) are immutable. Updates create new entries; deletes create tombstones. Immutability simplifies concurrency, enables compression, and makes backups trivial.
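
A sketch of the append-only discipline, assuming a toy tab-separated text format for the run file; the tombstone sentinel and function names are hypothetical, not from the paper:

```python
# Sketch of the immutability discipline: updates and deletes never touch an
# existing file; they only add new records, with deletes written as tombstone
# markers. The text format and names are illustrative.

TOMBSTONE = object()   # sentinel meaning "this key was deleted"

def put(memtable: dict, key: str, value: str) -> None:
    memtable[key] = value          # a newer entry shadows any older one on disk

def delete(memtable: dict, key: str) -> None:
    memtable[key] = TOMBSTONE      # the delete is itself just another write

def write_sstable(path: str, memtable: dict) -> None:
    """Write a sorted, immutable run; the file is never modified afterwards."""
    with open(path, "w") as f:
        for key in sorted(memtable):
            value = memtable[key]
            f.write(f"{key}\t{'<tombstone>' if value is TOMBSTONE else value}\n")
```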

Compaction Reclaims Space

Background compaction merges sorted runs, eliminating duplicates and tombstones. This is the key maintenance operation—it bounds space amplification and read amplification at the cost of write amplification.
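
A compaction sketch built on a k-way merge, under the simplifying assumption that tombstones may be dropped (safe only when no older level remains below); run and function names are illustrative:

```python
import heapq

# Sketch of a compaction merge: combine several sorted runs into one, keeping
# only the newest version of each key and dropping tombstones. Each run is a
# list of (key, value) pairs sorted by key; runs[0] is the newest run.

TOMBSTONE = None   # simplified delete marker

def compact(runs):
    # Tag every entry with its run index so that, for equal keys, the entry
    # from the newest run sorts first in the merge.
    tagged_runs = [
        [(key, age, value) for key, value in run]
        for age, run in enumerate(runs)
    ]
    merged, last_key = [], object()
    for key, _age, value in heapq.merge(*tagged_runs):
        if key == last_key:
            continue                     # older version of an already-emitted key
        last_key = key
        if value is not TOMBSTONE:
            merged.append((key, value))  # tombstoned keys vanish entirely
    return merged

# Example: the newest run updates "a" and deletes "c".
newest = [("a", "1"), ("c", TOMBSTONE)]
oldest = [("a", "0"), ("b", "2"), ("c", "9")]
print(compact([newest, oldest]))   # [('a', '1'), ('b', '2')]
```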

Bloom Filters Accelerate Reads

Each SSTable has a Bloom filter that answers "is this key definitely NOT here?" with zero disk I/O. This optimization is critical—without it, reads would need to check every level.
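
A sketch of the read path with per-run Bloom filters, assuming a toy filter built on SHA-256; the sizes, hash counts, and the `(bloom, read_from_disk)` pairing are illustrative choices, not any engine's API:

```python
import hashlib

# Read-path sketch: each on-disk run carries a small Bloom filter, and a
# lookup only touches a run whose filter says the key *might* be present.

class BloomFilter:
    def __init__(self, num_bits: int = 8192, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely not present"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

def lookup(key, memtable, sstables):
    """Check memory first, then only those runs whose Bloom filter matches."""
    if key in memtable:
        return memtable[key]
    for bloom, read_from_disk in sstables:        # newest run first
        if bloom.might_contain(key):              # otherwise skip the disk read
            value = read_from_disk(key)
            if value is not None:
                return value
    return None
```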

Tunable Trade-offs

LSM-tree parameters (memtable size, level sizes, compaction strategy) let you tune the balance between write amplification, read amplification, and space amplification for your workload.
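
The standard rules of thumb for leveled compaction can be sketched as a quick calculation; the formulas below are common back-of-the-envelope estimates (uniform updates, fixed fanout between levels), not exact results from the paper:

```python
import math

# Rule-of-thumb estimates for leveled compaction, under common simplifying
# assumptions: every level is a fixed multiple (fanout) of the one above.

def leveled_estimates(total_data_gb: float, memtable_gb: float, fanout: int = 10):
    levels = max(1, math.ceil(math.log(total_data_gb / memtable_gb, fanout)))
    write_amp = fanout * levels        # each byte may be rewritten ~fanout times per level
    read_amp = levels                  # a point read may consult one run per level
    space_amp = 1 + 1 / fanout         # stale data is at most ~1/fanout of the largest level
    return levels, write_amp, read_amp, space_amp

# Example: 1 TB of data, a 64 MB memtable, fanout 10.
print(leveled_estimates(total_data_gb=1024, memtable_gb=0.064, fanout=10))
```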

Deep Dive

Traditional databases use B-trees for indexing. B-trees maintain sorted data on disk, enabling efficient point lookups and range scans. But they have a fundamental problem with writes.

B-tree write path:

1. Read the page containing the key
2. Modify the page in memory
3. Write the entire page back to disk

For a single-key update, you read and write an entire page (typically 4-16 KB). Worse, pages are scattered across the disk, so each update incurs random I/O, the slowest access pattern on any storage device.
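
A sketch of that read-modify-write cycle; the 4 KB page size and the `apply_update` callback are illustrative:

```python
# B-tree read-modify-write sketch: a one-key update still costs a full page
# read and a full page write at a random offset. `f` is assumed to be a file
# object opened in "r+b" mode; apply_update mutates the page bytes in place.

PAGE_SIZE = 4096

def update_key(f, page_offset: int, apply_update) -> None:
    f.seek(page_offset)                   # random seek to the page
    page = bytearray(f.read(PAGE_SIZE))   # 1. read the whole page
    apply_update(page)                    # 2. modify it in memory
    f.seek(page_offset)
    f.write(page)                         # 3. write the whole page back
```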

B-Tree Write Amplification

The numbers are stark:

| Storage Type | Sequential Write | Random Write | Ratio |
|--------------|------------------|--------------|-------|
| HDD          | 100 MB/s         | 1 MB/s       | 100x  |
| SSD          | 500 MB/s         | 50 MB/s      | 10x   |
| NVMe         | 3 GB/s           | 500 MB/s     | 6x    |

Even on the fastest NVMe drives, sequential writes are significantly faster than random writes. On HDDs (still common in large clusters), the difference is 100x or more.

For write-heavy workloads—logging, time-series, event streaming—B-trees become a bottleneck. We need a data structure that converts random writes into sequential writes.
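
Putting the pieces together, here is a minimal sketch of the LSM write path under those constraints: each write is one sequential WAL append plus an in-memory insert, and a sorted sequential flush happens when the buffer fills. The file formats, names, and flush threshold are illustrative, not any particular engine's:

```python
import json

# Minimal LSM write-path sketch: sequential WAL append for durability,
# in-memory memtable for speed, and one big sorted sequential flush in
# place of many small random writes.

FLUSH_THRESHOLD = 10_000   # entries, chosen arbitrarily for the sketch

class TinyLSM:
    def __init__(self, wal_path="wal.log"):
        self.wal = open(wal_path, "a")   # append-only log: sequential I/O
        self.memtable = {}
        self.run_id = 0

    def put(self, key: str, value: str) -> None:
        self.wal.write(json.dumps([key, value]) + "\n")   # durability first
        self.wal.flush()
        self.memtable[key] = value                        # fast in-memory update
        if len(self.memtable) >= FLUSH_THRESHOLD:
            self._flush()

    def _flush(self) -> None:
        # One sequential write of sorted data replaces many random writes.
        path = f"sstable-{self.run_id:06d}.txt"
        with open(path, "w") as out:
            for key in sorted(self.memtable):
                out.write(f"{key}\t{self.memtable[key]}\n")
        self.run_id += 1
        self.memtable.clear()
```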

Trade-offs

| Aspect | Advantage | Disadvantage |
|--------|-----------|--------------|
| Write Performance | Sequential writes are 10-100x faster than random; batching amortizes overhead; excellent write throughput | Write amplification from compaction means each byte is written multiple times (10-30x typical) |
| Read Performance | Recent data served from memory; Bloom filters skip irrelevant files; sorted files enable efficient range scans | May need to check multiple levels; worst-case reads touch many files; read amplification increases with data size |
| Space Efficiency | Compression very effective on sorted data (2-5x typical); leveled compaction keeps space amplification ~1.1x | Size-tiered compaction can use 2x space; tombstones consume space until compacted; need temporary space during compaction |
| Predictability | Writes have consistent latency (just WAL + memtable); no page splits or fragmentation | Compaction causes latency spikes; read latency varies based on data location; background I/O competes with foreground |
| Operational Complexity | Immutable files simplify backups and replication; crash recovery is straightforward (replay WAL) | Many tuning parameters; compaction strategy selection is workload-dependent; monitoring write/read/space amplification required |