Tags: storage, databases, data-structures, rocksdb, cassandra, write-optimization, advanced

The Log-Structured Merge-Tree (LSM-Tree)

The data structure that powers Cassandra, RocksDB, LevelDB, and most modern write-heavy databases

Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil | University of Massachusetts Boston | 1996 | 35 min read

Summary

The LSM-Tree is a data structure optimized for write-heavy workloads. Instead of updating data in-place (like B-trees), it buffers writes in memory, then flushes them sequentially to disk in sorted batches. Reads merge data from multiple levels. This design trades read performance for dramatically better write throughput—often 10-100x faster than B-trees for write-intensive workloads. LSM-trees power virtually every modern NoSQL database: Cassandra, HBase, RocksDB, LevelDB, ScyllaDB, and many more.

Key Takeaways

Sequential Writes Beat Random Writes

Disk I/O speed differs dramatically between sequential and random access—10-1000x on HDDs, 10-100x even on SSDs. LSM-trees convert random writes into sequential writes by batching and sorting before flushing to disk.
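
A rough back-of-the-envelope illustration of the batching argument; the IOPS and throughput figures below are assumptions consistent with the table in the Deep Dive section, not measurements:

```python
# Back-of-the-envelope: 100,000 small updates on an HDD that sustains roughly
# 100 seek-limited random writes per second but ~100 MB/s sequentially.
# All figures are illustrative assumptions, not benchmark results.

NUM_UPDATES = 100_000
UPDATE_SIZE = 100                 # bytes per update
RANDOM_IOPS = 100                 # assumed random writes per second (seek-bound)
SEQ_MB_PER_S = 100                # assumed sequential write throughput

random_seconds = NUM_UPDATES / RANDOM_IOPS            # ~1000 s: one seek per update
batch_mb = NUM_UPDATES * UPDATE_SIZE / 1_000_000      # ~10 MB once batched and sorted
sequential_seconds = batch_mb / SEQ_MB_PER_S          # ~0.1 s as one sequential flush

print(f"one-at-a-time random writes: {random_seconds:.0f} s")
print(f"batched sequential flush:    {sequential_seconds:.1f} s")
```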

Memory as Write Buffer

Recent writes live in an in-memory structure (memtable). This absorbs write bursts, enables batching, and allows the system to sort data before writing. The memtable is typically implemented as a skip list or red-black tree.
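
A minimal memtable sketch, assuming a simple flush-on-size policy. The class name and threshold are illustrative; a real engine keeps entries sorted on insert (skip list or red-black tree) rather than sorting only at flush time:

```python
# Illustrative memtable sketch (names are hypothetical, not from the paper).
# A plain dict is used and sorted at flush time to stay dependency-free.

FLUSH_THRESHOLD_BYTES = 4 * 1024 * 1024   # flush once the buffer holds ~4 MB

class Memtable:
    def __init__(self):
        self.entries = {}        # key -> value (latest write wins)
        self.size_bytes = 0

    def put(self, key: str, value: str) -> None:
        self.entries[key] = value
        self.size_bytes += len(key) + len(value)

    def is_full(self) -> bool:
        return self.size_bytes >= FLUSH_THRESHOLD_BYTES

    def flush(self) -> list[tuple[str, str]]:
        """Return all entries in sorted key order, ready to write sequentially."""
        run = sorted(self.entries.items())
        self.entries.clear()
        self.size_bytes = 0
        return run
```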

Immutable Sorted Runs

Once written to disk, data files (SSTables) are immutable. Updates create new entries; deletes create tombstones. Immutability simplifies concurrency, enables compression, and makes backups trivial.
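
A sketch of the append-only discipline, assuming a toy tab-separated text format for the run file; the tombstone sentinel and function names are hypothetical, not from the paper:

```python
# Sketch of the immutability discipline: updates and deletes never touch an
# existing file; they only add new records, with deletes written as tombstone
# markers. The text format and names are illustrative.

TOMBSTONE = object()   # sentinel meaning "this key was deleted"

def put(memtable: dict, key: str, value: str) -> None:
    memtable[key] = value          # a newer entry shadows any older one on disk

def delete(memtable: dict, key: str) -> None:
    memtable[key] = TOMBSTONE      # the delete is itself just another write

def write_sstable(path: str, memtable: dict) -> None:
    """Write a sorted, immutable run; the file is never modified afterwards."""
    with open(path, "w") as f:
        for key in sorted(memtable):
            value = memtable[key]
            f.write(f"{key}\t{'<tombstone>' if value is TOMBSTONE else value}\n")
```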

Compaction Reclaims Space

Background compaction merges sorted runs, eliminating duplicates and tombstones. This is the key maintenance operation—it bounds space amplification and read amplification at the cost of write amplification.
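
A compaction sketch built on a k-way merge, under the simplifying assumption that tombstones may be dropped (safe only when no older level remains below); run and function names are illustrative:

```python
import heapq

# Sketch of a compaction merge: combine several sorted runs into one, keeping
# only the newest version of each key and dropping tombstones. Each run is a
# list of (key, value) pairs sorted by key; runs[0] is the newest run.

TOMBSTONE = None   # simplified delete marker

def compact(runs):
    # Tag every entry with its run index so that, for equal keys, the entry
    # from the newest run sorts first in the merge.
    tagged_runs = [
        [(key, age, value) for key, value in run]
        for age, run in enumerate(runs)
    ]
    merged, last_key = [], object()
    for key, _age, value in heapq.merge(*tagged_runs):
        if key == last_key:
            continue                     # older version of an already-emitted key
        last_key = key
        if value is not TOMBSTONE:
            merged.append((key, value))  # tombstoned keys vanish entirely
    return merged

# Example: the newest run updates "a" and deletes "c".
newest = [("a", "1"), ("c", TOMBSTONE)]
oldest = [("a", "0"), ("b", "2"), ("c", "9")]
print(compact([newest, oldest]))   # [('a', '1'), ('b', '2')]
```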

Bloom Filters Accelerate Reads

Each SSTable has a Bloom filter that answers "is this key definitely NOT here?" with zero disk I/O. This optimization is critical—without it, reads would need to check every level.
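
A sketch of the read path with per-run Bloom filters, assuming a toy filter built on SHA-256; the sizes, hash counts, and the `(bloom, read_from_disk)` pairing are illustrative choices, not any engine's API:

```python
import hashlib

# Read-path sketch: each on-disk run carries a small Bloom filter, and a
# lookup only touches a run whose filter says the key *might* be present.

class BloomFilter:
    def __init__(self, num_bits: int = 8192, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely not present"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

def lookup(key, memtable, sstables):
    """Check memory first, then only those runs whose Bloom filter matches."""
    if key in memtable:
        return memtable[key]
    for bloom, read_from_disk in sstables:        # newest run first
        if bloom.might_contain(key):              # otherwise skip the disk read
            value = read_from_disk(key)
            if value is not None:
                return value
    return None
```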

Tunable Trade-offs

LSM-tree parameters (memtable size, level sizes, compaction strategy) let you tune the balance between write amplification, read amplification, and space amplification for your workload.
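
The standard rules of thumb for leveled compaction can be sketched as a quick calculation; the formulas below are common back-of-the-envelope estimates (uniform updates, fixed fanout between levels), not exact results from the paper:

```python
import math

# Rule-of-thumb estimates for leveled compaction, under common simplifying
# assumptions: every level is a fixed multiple (fanout) of the one above.

def leveled_estimates(total_data_gb: float, memtable_gb: float, fanout: int = 10):
    levels = max(1, math.ceil(math.log(total_data_gb / memtable_gb, fanout)))
    write_amp = fanout * levels        # each byte may be rewritten ~fanout times per level
    read_amp = levels                  # a point read may consult one run per level
    space_amp = 1 + 1 / fanout         # stale data is at most ~1/fanout of the largest level
    return levels, write_amp, read_amp, space_amp

# Example: 1 TB of data, a 64 MB memtable, fanout 10.
print(leveled_estimates(total_data_gb=1024, memtable_gb=0.064, fanout=10))
```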

Deep Dive

Traditional databases use B-trees for indexing. B-trees maintain sorted data on disk, enabling efficient point lookups and range scans. But they have a fundamental problem with writes.

B-tree write path:

1. Read the page containing the key
2. Modify the page in memory
3. Write the entire page back to disk

For a single-key update, you read and write an entire page (typically 4-16 KB). Worse, pages are scattered across the disk, so each update incurs random I/O, the slowest access pattern on any storage device.
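
A sketch of that read-modify-write cycle; the 4 KB page size and the `apply_update` callback are illustrative:

```python
# B-tree read-modify-write sketch: a one-key update still costs a full page
# read and a full page write at a random offset. `f` is assumed to be a file
# object opened in "r+b" mode; apply_update mutates the page bytes in place.

PAGE_SIZE = 4096

def update_key(f, page_offset: int, apply_update) -> None:
    f.seek(page_offset)                   # random seek to the page
    page = bytearray(f.read(PAGE_SIZE))   # 1. read the whole page
    apply_update(page)                    # 2. modify it in memory
    f.seek(page_offset)
    f.write(page)                         # 3. write the whole page back
```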

B-Tree Write Amplification

The numbers are stark:

| Storage Type | Sequential Write | Random Write | Ratio |
|--------------|------------------|--------------|-------|
| HDD          | 100 MB/s         | 1 MB/s       | 100x  |
| SSD          | 500 MB/s         | 50 MB/s      | 10x   |
| NVMe         | 3 GB/s           | 500 MB/s     | 6x    |

Even on the fastest NVMe drives, sequential writes are significantly faster than random writes. On HDDs (still common in large clusters), the difference is 100x or more.

For write-heavy workloads—logging, time-series, event streaming—B-trees become a bottleneck. We need a data structure that converts random writes into sequential writes.
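
Putting the pieces together, here is a minimal sketch of the LSM write path under those constraints: each write is one sequential WAL append plus an in-memory insert, and a sorted sequential flush happens when the buffer fills. The file formats, names, and flush threshold are illustrative, not any particular engine's:

```python
import json

# Minimal LSM write-path sketch: sequential WAL append for durability,
# in-memory memtable for speed, and one big sorted sequential flush in
# place of many small random writes.

FLUSH_THRESHOLD = 10_000   # entries, chosen arbitrarily for the sketch

class TinyLSM:
    def __init__(self, wal_path="wal.log"):
        self.wal = open(wal_path, "a")   # append-only log: sequential I/O
        self.memtable = {}
        self.run_id = 0

    def put(self, key: str, value: str) -> None:
        self.wal.write(json.dumps([key, value]) + "\n")   # durability first
        self.wal.flush()
        self.memtable[key] = value                        # fast in-memory update
        if len(self.memtable) >= FLUSH_THRESHOLD:
            self._flush()

    def _flush(self) -> None:
        # One sequential write of sorted data replaces many random writes.
        path = f"sstable-{self.run_id:06d}.txt"
        with open(path, "w") as out:
            for key in sorted(self.memtable):
                out.write(f"{key}\t{self.memtable[key]}\n")
        self.run_id += 1
        self.memtable.clear()
```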

Trade-offs

| Aspect | Advantage | Disadvantage |
|--------|-----------|--------------|
| Write Performance | Sequential writes are 10-100x faster than random; batching amortizes overhead; excellent write throughput | Write amplification from compaction means each byte is written multiple times (10-30x typical) |
| Read Performance | Recent data served from memory; Bloom filters skip irrelevant files; sorted files enable efficient range scans | May need to check multiple levels; worst-case reads touch many files; read amplification increases with data size |
| Space Efficiency | Compression very effective on sorted data (2-5x typical); leveled compaction keeps space amplification ~1.1x | Size-tiered compaction can use 2x space; tombstones consume space until compacted; need temporary space during compaction |
| Predictability | Writes have consistent latency (just WAL + memtable); no page splits or fragmentation | Compaction causes latency spikes; read latency varies based on data location; background I/O competes with foreground |
| Operational Complexity | Immutable files simplify backups and replication; crash recovery is straightforward (replay WAL) | Many tuning parameters; compaction strategy selection is workload-dependent; monitoring write/read/space amplification required |