The data structure that powers Cassandra, RocksDB, LevelDB, and most modern write-heavy databases
The LSM-Tree is a data structure optimized for write-heavy workloads. Instead of updating data in-place (like B-trees), it buffers writes in memory, then flushes them sequentially to disk in sorted batches. Reads merge data from multiple levels. This design trades read performance for dramatically better write throughput—often 10-100x faster than B-trees for write-intensive workloads. LSM-trees power virtually every modern NoSQL database: Cassandra, HBase, RocksDB, LevelDB, ScyllaDB, and many more.
Disk I/O speed differs dramatically between sequential and random access—10-1000x on HDDs, 10-100x even on SSDs. LSM-trees convert random writes into sequential writes by batching and sorting before flushing to disk.
Recent writes live in an in-memory structure (memtable). This absorbs write bursts, enables batching, and allows the system to sort data before writing. The memtable is typically implemented as a skip list or red-black tree.
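A minimal sketch of that idea in Python, assuming a simplified memtable that keeps keys ordered with a bisect-maintained list (standing in for a skip list or red-black tree). The `MemTable` class and its capacity default are illustrative, not any particular engine's implementation:

```python
import bisect

class MemTable:
    """Minimal in-memory write buffer kept in sorted key order.
    Real systems also append each write to a write-ahead log (WAL)
    first so the buffer can be rebuilt after a crash."""

    def __init__(self, capacity_bytes=4 * 1024 * 1024):
        self.keys = []            # keys kept in sorted order
        self.values = {}          # key -> latest value
        self.size = 0
        self.capacity = capacity_bytes

    def put(self, key, value):
        if key in self.values:
            self.size -= len(self.values[key])   # replace old value's size
        else:
            bisect.insort(self.keys, key)        # keep keys sorted on insert
            self.size += len(key)
        self.values[key] = value
        self.size += len(value)

    def get(self, key):
        return self.values.get(key)

    def is_full(self):
        return self.size >= self.capacity

    def sorted_items(self):
        # Already sorted, so flushing to disk is one sequential pass.
        return [(k, self.values[k]) for k in self.keys]
```

Because the buffer is sorted before it ever touches disk, the flush is a single sequential write rather than many scattered page updates.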
Once written to disk, data files (SSTables) are immutable. Updates create new entries; deletes create tombstones. Immutability simplifies concurrency, enables compression, and makes backups trivial.
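To illustrate, a hedged sketch that builds on the `MemTable` above. The `flush_to_sstable` helper and the `TOMBSTONE` sentinel are hypothetical names, and the JSON-lines format is only a stand-in for a real SSTable's block layout:

```python
import json

TOMBSTONE = "__tombstone__"   # hypothetical sentinel marking a deleted key

def flush_to_sstable(sorted_items, path):
    """Write sorted key/value pairs to an immutable on-disk file.
    One sequential pass; the file is never modified afterwards."""
    with open(path, "w") as f:
        for key, value in sorted_items:
            f.write(json.dumps({"k": key, "v": value}) + "\n")

def delete(memtable, key):
    # Deletes never touch existing SSTables; they record a tombstone that
    # shadows older values until compaction removes both.
    memtable.put(key, TOMBSTONE)
```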
Background compaction merges sorted runs, eliminating duplicates and tombstones. This is the key maintenance operation—it bounds space amplification and read amplification at the cost of write amplification.
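A sketch of the k-way merge at the heart of compaction, assuming each run is a list of `(key, value)` pairs sorted by key and the runs are ordered newest-first; `compact` is an illustrative helper, not a specific engine's compaction code:

```python
import heapq

def compact(runs, tombstone="__tombstone__"):
    """Merge several sorted runs (newest first) into one sorted run,
    keeping only the newest version of each key and dropping tombstones."""
    # Tag each entry with the run's age so the newest version sorts first.
    tagged = (
        ((key, age, value) for key, value in run)
        for age, run in enumerate(runs)          # age 0 = newest run
    )
    merged = []
    last_key = None
    for key, age, value in heapq.merge(*tagged):
        if key == last_key:
            continue                             # older duplicate, skip it
        last_key = key
        if value != tombstone:                   # drop deleted keys entirely
            merged.append((key, value))
    return merged

# Example: the newest run shadows the older value of "a" and deletes "c".
runs = [[("a", "1"), ("c", "__tombstone__")],    # newest
        [("a", "0"), ("b", "9"), ("c", "7")]]    # older
assert compact(runs) == [("a", "1"), ("b", "9")]
```

The merged output is written sequentially as a new immutable file, which is where the extra write amplification comes from.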
Each SSTable has a Bloom filter that answers "is this key definitely NOT here?" with zero disk I/O. This optimization is critical—without it, reads would need to check every level.
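A minimal Bloom filter sketch; the bit-array size, hash count, and the use of MD5-derived hash positions are illustrative choices, not what any specific engine uses:

```python
import hashlib

class BloomFilter:
    """May answer "possibly present" for a key that was never added,
    but never answers "absent" for a key that was added."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # A single zero bit proves the key was never added, so the SSTable
        # holding this filter can be skipped with no disk I/O at all.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```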
LSM-tree parameters (memtable size, level sizes, compaction strategy) let you tune the balance between write amplification, read amplification, and space amplification for your workload.
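As a sketch only, two hypothetical tuning profiles; the parameter names and values are illustrative and do not correspond to any particular engine's configuration keys:

```python
# Hypothetical profiles showing which direction each knob pushes the
# write/read/space amplification trade-off.
write_heavy_profile = {
    "memtable_size_mb": 256,               # larger memtable: fewer flushes, more RAM
    "level_size_multiplier": 10,           # growth factor between levels
    "compaction_strategy": "size_tiered",  # lower write amp, higher space amp
    "bloom_bits_per_key": 10,              # roughly 1% false-positive rate
}

read_heavy_profile = {
    "memtable_size_mb": 64,
    "level_size_multiplier": 10,
    "compaction_strategy": "leveled",      # lower read/space amp, higher write amp
    "bloom_bits_per_key": 14,              # tighter filters for point lookups
}
```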
Traditional databases use B-trees for indexing. B-trees maintain sorted data on disk, enabling efficient point lookups and range scans. But they have a fundamental problem with writes.
B-tree write path:

1. Read the page containing the key
2. Modify the page in memory
3. Write the entire page back to disk
For a single key update, you read and write an entire page (typically 4-16 KB). Worse, pages are scattered across the disk, causing random I/O—the slowest operation on any storage device.
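A hedged sketch of that read-modify-write cost, assuming a Unix file descriptor and 4 KB pages; `btree_update` is an illustrative helper, not how any specific database implements its page I/O:

```python
import os

PAGE_SIZE = 4096   # typical B-tree page size

def btree_update(fd, page_offset, modify):
    """Read-modify-write of one whole page at a (usually random) offset.
    Even a one-byte change costs a full page read and a full page write."""
    page = bytearray(os.pread(fd, PAGE_SIZE, page_offset))  # random read
    modify(page)                                            # change in memory
    os.pwrite(fd, bytes(page), page_offset)                 # random write-back
```

Contrast this with the memtable path above, where the same update is an in-memory insert followed later by one sequential flush.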
The numbers are stark:
| Storage Type | Sequential Write | Random Write | Ratio |
|--------------|------------------|--------------|-------|
| HDD          | 100 MB/s         | 1 MB/s       | 100x  |
| SSD          | 500 MB/s         | 50 MB/s      | 10x   |
| NVMe         | 3 GB/s           | 500 MB/s     | 6x    |
Even on the fastest NVMe drives, sequential writes are significantly faster than random writes. On HDDs (still common in large clusters), the difference is 100x or more.
For write-heavy workloads—logging, time-series, event streaming—B-trees become a bottleneck. We need a data structure that converts random writes into sequential writes.
| Aspect | Advantage | Disadvantage |
|---|---|---|
| Write Performance | Sequential writes are 10-100x faster than random; batching amortizes overhead; excellent write throughput | Write amplification from compaction means each byte is written multiple times (10-30x typical) |
| Read Performance | Recent data served from memory; Bloom filters skip irrelevant files; sorted files enable efficient range scans | May need to check multiple levels; worst-case reads touch many files; read amplification increases with data size |
| Space Efficiency | Compression very effective on sorted data (2-5x typical); leveled compaction keeps space amplification ~1.1x | Size-tiered compaction can use 2x space; tombstones consume space until compacted; need temporary space during compaction |
| Predictability | Writes have consistent latency (just WAL + memtable); no page splits or fragmentation | Compaction causes latency spikes; read latency varies based on data location; background I/O competes with foreground |
| Operational Complexity | Immutable files simplify backups and replication; crash recovery is straightforward (replay WAL) | Many tuning parameters; compaction strategy selection is workload-dependent; monitoring write/read/space amplification required |