Open Source
10 items
10 items
The database that handles millions of writes per second for Netflix, Apple, and Discord with no single point of failure
Cassandra is a distributed NoSQL database designed for massive write throughput and linear scalability. It uses a masterless ring architecture where every node is equal, eliminating single points of failure. Data is partitioned across nodes using consistent hashing and replicated for fault tolerance. Cassandra trades consistency for availability - it offers tunable consistency levels, letting you choose between strong consistency and high availability per query.
Every node in Cassandra is identical. There is no master, no leader election, no single point of failure. Any node can accept reads and writes for any data. This simplifies operations and provides true high availability.
Cassandra lets you choose consistency per query. Write with QUORUM for durability, read with ONE for speed, or use ALL for strong consistency. This flexibility lets you optimize for your specific use case.
Writes go to an append-only commit log and memtable (in-memory). Memtables flush to immutable SSTables on disk. This write path is extremely fast - sequential I/O only, no random seeks.
Data is organized by partition key, which determines node placement. Within a partition, rows are sorted by clustering columns. This model excels at time-series data and wide rows with millions of columns.
Nodes share cluster state using gossip - each node periodically exchanges state with random peers. Within seconds, all nodes learn about membership changes, failures, and schema updates.
Background compaction merges SSTables, removing tombstones and consolidating data. This trades write amplification for read performance - fewer SSTables means fewer disk seeks per query.
Cassandra was created at Facebook in 2007 to solve their inbox search problem - they needed to handle massive write loads across multiple data centers with no downtime. The project combined ideas from two influential papers:
The result is a database optimized for:
When to use Cassandra:
When NOT to use Cassandra:
| Aspect | Advantage | Disadvantage |
|---|---|---|
| Masterless Architecture | No single point of failure, any node can serve any request, simplified operations | Coordination overhead for consistency, all nodes must agree on schema changes |
| Eventual Consistency Default | High availability, low latency writes, works across datacenters with high latency | Application must handle stale reads, conflict resolution via last-write-wins |
| Log-Structured Storage | Extremely fast writes (sequential I/O only), no read-before-write | Read amplification (check multiple SSTables), space amplification during compaction |
| Partition-Based Model | Predictable performance, data locality, efficient range queries within partition | Must know queries upfront, no ad-hoc joins, denormalization required |
| Tunable Consistency | Flexibility to choose consistency per query, optimize for specific use cases | Complexity in choosing right level, easy to misconfigure and get inconsistent reads |
| Wide-Column Model | Excellent for time-series, sparse data, dynamic columns without schema changes | Not suitable for complex relationships, aggregations require full scans |
| Multi-Datacenter Replication | Built-in geo-distribution, disaster recovery, serve users from nearest DC | Cross-DC consistency is expensive, operational complexity increases |
| JVM-Based | Rich ecosystem, good tooling, mature garbage collectors | GC pauses can cause latency spikes, requires JVM tuning expertise |