Tags: distributed-systems, caching, social-graph, facebook, eventual-consistency, data-store, advanced

TAO: Facebook's Distributed Data Store for the Social Graph

How Facebook serves billions of reads per second with sub-millisecond latency by co-designing cache and storage for graph workloads

Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, Venkat Venkataramani | Facebook | 2013 | 35 min read

Summary

TAO is Facebook's geographically distributed data store optimized for the social graph. It replaces a memcache-based caching layer with a graph-aware cache that understands objects and associations. TAO achieves 99.8% cache hit rate, serves billions of reads per second with sub-millisecond latency, and handles the extreme read-heavy workload (500:1 read-to-write ratio) that characterizes social networking. The key insight: treating the cache as a first-class citizen with graph semantics, not just a generic key-value store in front of MySQL.

Key Takeaways

Graph-Aware Data Model

TAO models data as objects (nodes) and associations (edges) rather than generic key-value pairs. This graph abstraction maps perfectly to social data—users, posts, photos, friendships, likes—and enables optimized caching and query patterns.
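As a toy sketch of this data model (an in-memory stand-in, not Facebook's implementation; the `obj_add`/`assoc_add`/`assoc_count` names follow the API described in the paper, while `TaoObject`, `Assoc`, and `GraphStore` are hypothetical):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class TaoObject:
    # A node in the graph: typed, with an id and key/value data.
    id: int
    otype: str                      # e.g. "user", "post", "photo"
    data: Dict[str, str] = field(default_factory=dict)

@dataclass
class Assoc:
    # A typed, directed edge (id1, atype, id2), timestamped for ordering.
    id1: int
    atype: str                      # e.g. "friend", "likes", "authored"
    id2: int
    time: int
    data: Dict[str, str] = field(default_factory=dict)

class GraphStore:
    """Toy in-memory store mirroring TAO's object/association split."""
    def __init__(self) -> None:
        self.objects: Dict[int, TaoObject] = {}
        self.assocs: Dict[Tuple[int, str], List[Assoc]] = {}

    def obj_add(self, obj: TaoObject) -> None:
        self.objects[obj.id] = obj

    def assoc_add(self, a: Assoc) -> None:
        # Association lists stay sorted by time, newest first,
        # matching the query pattern TAO optimizes for.
        lst = self.assocs.setdefault((a.id1, a.atype), [])
        lst.append(a)
        lst.sort(key=lambda e: e.time, reverse=True)

    def assoc_count(self, id1: int, atype: str) -> int:
        return len(self.assocs.get((id1, atype), []))

store = GraphStore()
store.obj_add(TaoObject(1, "user", {"name": "alice"}))
store.obj_add(TaoObject(2, "post", {"text": "hello"}))
store.assoc_add(Assoc(1, "authored", 2, time=100))
print(store.assoc_count(1, "authored"))  # 1
```

Keying association lists by `(id1, atype)` is what lets the cache answer "all edges of this type from this node" without touching individual edge records.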

Two-Level Caching Hierarchy

Followers (leaf caches) handle the massive read load while Leaders maintain consistency with the database. This separation allows aggressive read scaling while keeping write coordination tractable.
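The read path through the two tiers can be sketched as follows (a minimal simulation, assuming one leader per shard and misses that propagate follower → leader → database; class names are hypothetical):

```python
class Database:
    def __init__(self, rows):
        self.rows = dict(rows)
        self.reads = 0              # count how often the DB is actually hit
    def get(self, key):
        self.reads += 1
        return self.rows.get(key)

class Leader:
    # One leader per shard: the only cache tier that talks to the database.
    def __init__(self, db):
        self.db = db
        self.cache = {}
    def get(self, key):
        if key not in self.cache:
            self.cache[key] = self.db.get(key)
        return self.cache[key]

class Follower:
    # Many followers absorb the read load; their misses go to the
    # leader, never directly to the database.
    def __init__(self, leader):
        self.leader = leader
        self.cache = {}
    def get(self, key):
        if key not in self.cache:
            self.cache[key] = self.leader.get(key)
        return self.cache[key]

db = Database({"user:1": "alice"})
leader = Leader(db)
f1, f2 = Follower(leader), Follower(leader)
f1.get("user:1"); f2.get("user:1"); f1.get("user:1")
print(db.reads)  # 1 -- three reads across two followers, one DB query
```

Adding more followers multiplies read capacity without adding any load on the database, because the leader deduplicates their misses.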

Read-Your-Writes via Leaders

After a write, the client reads from the Leader cache until the Follower is updated. This provides read-your-writes consistency without strong consistency overhead across the entire system.
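A sketch of that routing decision (assumed mechanics: the leader applies the write immediately and queues an asynchronous invalidation for followers; until it lands, the writer's own reads go to the leader):

```python
class Leader:
    def __init__(self):
        self.cache = {}
        self.pending_invalidations = []        # async messages to followers
    def write(self, key, value):
        self.cache[key] = value                # leader is updated synchronously
        self.pending_invalidations.append(key) # followers learn later
    def read(self, key):
        return self.cache.get(key)

class Follower:
    def __init__(self):
        self.cache = {}
    def read(self, key):
        return self.cache.get(key)
    def apply_invalidation(self, key, leader):
        self.cache[key] = leader.read(key)

leader, follower = Leader(), Follower()
follower.cache["post:9"] = None        # follower has cached "no such post"
leader.write("post:9", "hello world")

# Stale follower vs. fresh leader: route the writer's own reads to the leader.
print(follower.read("post:9"))  # None -- invalidation not applied yet
print(leader.read("post:9"))    # 'hello world' -- read-your-writes

# Once the async invalidation lands, the follower converges.
for key in leader.pending_invalidations:
    follower.apply_invalidation(key, leader)
print(follower.read("post:9"))  # 'hello world'
```

Only the writing client pays the cost of the leader round-trip; everyone else keeps reading cheaply from followers and converges when the invalidation arrives.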

Eventual Consistency with Bounded Staleness

TAO accepts eventual consistency but bounds staleness through cache invalidation and version checks. For social workloads, reading a slightly stale like count is acceptable; reading your own deleted post is not.
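One way such bounds are enforced is version checks on cache fills, sketched below (a simplified illustration of the idea, not TAO's exact mechanism: each value carries a monotonically increasing version, and a refill is applied only if it is newer than what the cache holds, so a slow, stale fill can never clobber a fresher invalidation):

```python
class VersionedCache:
    def __init__(self):
        self.entries = {}                     # key -> (version, value)
    def apply(self, key, version, value):
        # Accept the update only if it is strictly newer than what we hold.
        cur = self.entries.get(key)
        if cur is None or version > cur[0]:
            self.entries[key] = (version, value)
    def read(self, key):
        entry = self.entries.get(key)
        return entry[1] if entry else None

cache = VersionedCache()
cache.apply("likes:42", version=7, value=100)  # fresh invalidation message
cache.apply("likes:42", version=5, value=98)   # stale refill arrives late
print(cache.read("likes:42"))  # 100 -- the out-of-order stale fill was rejected
```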

Association Lists as First-Class Citizens

TAO optimizes for the common 'get all edges from this node' query. Association lists are cached as sorted lists with cursor-based pagination, avoiding the need to fetch entire, potentially huge lists.
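Pagination over a newest-first association list can be sketched like this (the `assoc_range(pos, limit)` positional form follows the paper's API; the data here is synthetic):

```python
# An association list kept sorted newest-first: (time, target-id) pairs.
friends = [(t, f"user:{i}") for i, t in enumerate(range(1000, 0, -1))]

def assoc_range(assoc_list, pos, limit):
    """Return up to `limit` edges starting at position `pos` (TAO-style paging)."""
    return assoc_list[pos:pos + limit]

# Clients walk the list a page at a time instead of fetching all 1000 edges.
page1 = assoc_range(friends, 0, 3)   # three newest edges
page2 = assoc_range(friends, 3, 3)   # next three
print(page1[0])  # (1000, 'user:0')
```

Because a user can have thousands of friends or a post millions of likes, bounding every query with a position and limit is what keeps individual requests cheap and cacheable.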

Write-Through Caching

All writes go through the cache to the database, then invalidate/update caches. This maintains cache consistency without complex invalidation protocols and ensures the cache always reflects database state.
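A minimal write-through sketch (the class and its `write`/`read` methods are hypothetical names for the generic pattern, with a plain dict standing in for MySQL):

```python
class WriteThroughCache:
    # Writes hit the database first, then the cache, so the cache never
    # holds a value the database has not durably accepted.
    def __init__(self, db):
        self.db = db
        self.cache = {}
    def write(self, key, value):
        self.db[key] = value       # 1. synchronous write to the backing store
        self.cache[key] = value    # 2. update (not just invalidate) the cache
    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.db.get(key)
        return self.cache[key]

db = {}
c = WriteThroughCache(db)
c.write("user:1:name", "alice")
print(c.read("user:1:name"), db["user:1:name"])  # alice alice
```

Updating the cache on write (rather than deleting the entry) means the very next read is a hit, which matters enormously at a 500:1 read-to-write ratio.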

Deep Dive

Facebook's data is fundamentally a graph: users connected by friendships, posts connected to authors, photos tagged with people, comments on content. Every page load traverses this graph:

  • News Feed: Fetch posts from friends, sorted by relevance
  • Profile: Fetch user's posts, photos, friends, about info
  • Photo: Fetch photo, tags, likes, comments, who else liked it

By 2013, Facebook had:

  • 1 billion+ users
  • Billions of reads per second at peak
  • A 500:1 read-to-write ratio
  • A sub-millisecond latency requirement

Social Graph Structure

The original architecture used memcache as a generic cache in front of MySQL:

Client → Memcache → MySQL

Problems with this approach:

  1. Thundering herds: Cache miss causes thousands of concurrent MySQL queries for hot objects
  2. Inefficient invalidation: Deleting a user requires invalidating all their posts, comments, likes individually
  3. No graph semantics: Can't efficiently answer "get all friends who liked this post"
  4. Complex client logic: Application code managed cache consistency, leading to bugs
  5. Read-after-write issues: User creates post, refreshes, post not visible (stale cache)
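Problem 1 is worth seeing concretely. With generic cache-aside, every client that misses goes straight to the database, so when a hot key expires, all concurrent readers miss at once. The sketch below models this deterministically (the `snapshot` parameter stands in for what each client observed before any of them refilled the cache; all names are hypothetical):

```python
# Thundering herd under cache-aside: N concurrent readers of one hot key
# all observe a miss and all query the database.
db_queries = 0

def db_get(key):
    global db_queries
    db_queries += 1
    return f"value-of-{key}"

cache = {}

def read_cache_aside(key, snapshot):
    # `snapshot` models the cache state each client saw before any refill:
    # under real concurrency they all observe the miss simultaneously.
    if key in snapshot:
        return snapshot[key]
    value = db_get(key)
    cache[key] = value
    return value

empty_snapshot = {}
for _ in range(1000):          # 1000 concurrent readers of one hot key
    read_cache_aside("hot-post", empty_snapshot)
print(db_queries)  # 1000 -- one DB query per reader, not one per key
```

TAO's leader tier collapses this: all follower misses for a key funnel through one leader, which issues a single database query.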

Trade-offs

Graph-Specialized Design
  • Advantage: API and caching optimized for objects and associations; efficient graph traversals
  • Disadvantage: Not suitable for arbitrary data; requires mapping data to the graph model

Eventual Consistency
  • Advantage: Enables massive read scalability; local reads in all regions
  • Disadvantage: Users may see stale data; read-your-writes holds only for their own content

Two-Tier Caching
  • Advantage: Separates read scaling (Followers) from write consistency (Leaders)
  • Disadvantage: Additional complexity; two cache layers to manage and monitor

Write-Through Caching
  • Advantage: Cache always consistent with the database; no stale reads after writes
  • Disadvantage: Write latency includes the cache update; a single write path can bottleneck

Primary Region Architecture
  • Advantage: Simplifies consistency; one source of truth for writes
  • Disadvantage: Cross-region write latency; the primary region is a critical dependency