System Design Masterclass
Social Media · counters · high-throughput · caching · approximation · sharding · intermediate

Design High-Profile Likes Counter

Design a system to count millions of likes per second for celebrity posts

1M+ likes/sec on viral posts, 1B+ total likes/day · Similar to Facebook, Instagram, Twitter, TikTok, YouTube · 45 min read

Summary

Design a system that can count likes on posts from high-profile users (celebrities, influencers) who can receive millions of likes per second. The core challenge is the hot key problem - a single post receiving extreme write throughput that overwhelms any single database node. We solve this with write sharding, delayed aggregation, and approximate counting.

Key Takeaways

Core Problem

This is fundamentally a hot key problem - millions of writes to a single counter. No single machine can handle this write rate on one key.

The Hard Part

A single post_id becomes a hot key receiving 1M writes/sec. Traditional database row locks or even Redis single-key INCR cannot handle this.

Scaling Axis

Shard the counter across N sub-counters. Write to random shard, aggregate on read. Trade read consistency for write throughput.

Critical Invariant

Likes must never be lost. Approximate display count is OK, but the actual like records must be durable for unlike operations.

Performance Insight

Display count can lag by seconds and be approximate. Users cannot tell if a post has 1,234,567 or 1,234,589 likes.

Key Tradeoff

We trade read consistency (exact count) for write availability (handle millions of likes/sec). Eventually consistent counters are acceptable for social media.

Design Walkthrough

Problem Statement

The Question: Design a system that can handle counting likes on posts from celebrities and viral content, where a single post can receive millions of likes per second.

This is a classic problem at Facebook, Instagram, Twitter, and TikTok. When Taylor Swift posts, or when a major news event happens, individual posts can receive extreme engagement.

Why this is hard:
- A viral post is a hot key - extreme write concentration on one data point
- Traditional databases cannot handle 1M writes/sec to a single row
- Users expect to see their like reflected (write acknowledgment)
- The count should be reasonably accurate for display

What to say first

Before I design, let me understand the scale and consistency requirements. The key challenge here is the hot key problem - extreme write throughput to a single logical counter.

What interviewers are testing:
- Do you recognize the hot key problem?
- Can you design for extreme write throughput?
- Do you understand consistency vs availability tradeoffs?
- Can you separate the like record from the like count?

Clarifying Questions

Ask these questions to scope the problem and demonstrate senior thinking.

Question 1: Peak Write Rate

What is the peak like rate we need to handle on a single post? Are we talking thousands or millions per second?

Why this matters: Determines if simple solutions work or we need sharding
Typical answer: Up to 1M likes/sec on viral celebrity posts
Impact: Must shard the counter - no single node can handle this

Question 2: Count Accuracy

Does the displayed count need to be exact in real-time, or is approximate OK? Can it lag by a few seconds?

Why this matters: Exact real-time requires synchronous aggregation (slow)
Typical answer: Approximate is fine, can lag by seconds
Impact: Can use async aggregation and cached counts

Question 3: Unlike Support

Do users need to unlike? If so, we need to track individual likes, not just increment a counter.

Why this matters: Unlike requires knowing who liked what
Typical answer: Yes, unlike must work
Impact: Need to store individual like records, not just counters

Question 4: Read vs Write Ratio

How often is the like count read vs written? Is this read-heavy or write-heavy?

Why this matters: Determines caching strategy
Typical answer: Read-heavy overall, but viral posts are write-heavy bursts
Impact: Cache aggressively, but cache invalidation is tricky during bursts

State assumptions

I will assume: 1M likes/sec peak on hot posts, approximate counts OK (lag up to 5s), unlike must work, and the system needs to handle 1B+ likes per day across all posts.

The Hard Part

Say this out loud

The hard part here is the hot key problem. When a celebrity posts, that single post_id becomes a hot key receiving millions of writes per second. No single database node or cache key can handle this write rate.

Why traditional approaches fail:

1. Single Database Row

-- This creates a hot row with lock contention
UPDATE posts SET like_count = like_count + 1 WHERE post_id = 123;

-- At 1M writes/sec:
-- - Row-level lock contention
-- - Single CPU bottleneck
-- - ~10K writes/sec max on one row

2. Single Redis Key

INCR post:123:likes

# Redis single-key INCR: ~100K ops/sec
# Still not enough for 1M likes/sec
# Also a single point of failure

3. The Real Challenge

Even if we could handle the writes, we have another problem: the like must be recorded durably so users can unlike. A simple counter increment loses this information.

Common mistake

Candidates often propose Redis INCR as the solution. It works for normal posts but fails for viral content. Always design for the worst case (celebrity posting during a major event).

The key insight: Separate the like record (who liked what) from the like count (display number). They have different consistency and durability requirements.

Scale and Access Patterns

Let me estimate the scale and understand access patterns.

Dimension | Value | Impact
Peak likes/sec (single post) | 1,000,000 | Hot key - must shard counter
Average likes/sec (platform) | 100,000 | Manageable with good architecture
+ 5 more rows...

What to say

The access pattern is bimodal. Most posts are cold (few writes, cache friendly). But viral posts flip to write-heavy with millions of likes in minutes. We need to optimize for both cases.

Traffic Pattern Analysis:

  • 99% of posts: receive < 1,000 likes total - simple solutions work
  • 0.9% of posts: receive 1K-100K likes - need some optimization
  • 0.1% of posts: receive 100K-10M+ likes - need full hot key handling

The challenge is that we do not know in advance which posts will go viral. Taylor Swift posting is predictable, but random viral content is not.

Like records:
- 1B likes/day x 50 bytes = 50 GB/day
- 30 days retention = 1.5 TB
+ 9 more lines...

High-Level Architecture

The key insight is to separate concerns:
1. Like Service: Records the like (durable, for unlike support)
2. Counter Service: Maintains counts (can be eventually consistent)
3. Read Path: Serves cached/aggregated counts

What to say

I will separate the like record from the like count. Likes are written to a durable store for unlike support. Counts are maintained separately and can be eventually consistent.

Likes Counter Architecture

Write Path (when a user likes a post), sketched in code below:
1. Like Service receives the request
2. Write the like record to the database (sharded by user_id)
3. Publish an event to Kafka
4. Counter workers consume events and increment sharded counters (asynchronously)
5. Return success to the user immediately
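A minimal sketch of this write path, assuming a DB-API style db handle (matching the db.execute pseudocode used later), the kafka-python producer, Postgres-style upsert syntax, and an illustrative like-events topic - names and signatures are assumptions, not the article's actual implementation:

import json
import time
import uuid

from kafka import KafkaProducer  # assumption: kafka-python client

producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def handle_like(db, user_id: int, post_id: int) -> bool:
    # Steps 1-2: durable like record first; the primary key (user_id, post_id)
    # plus ON CONFLICT DO NOTHING enforces "one like per user per post".
    inserted = db.execute(
        "INSERT INTO likes (user_id, post_id, created_at) "
        "VALUES (%s, %s, NOW()) ON CONFLICT DO NOTHING",
        (user_id, post_id),
    ).rowcount == 1

    if inserted:
        # Step 3: publish a count event; counter workers pick it up asynchronously.
        producer.send("like-events", {
            "event_id": str(uuid.uuid4()),  # used later for idempotent counting
            "user_id": user_id,
            "post_id": post_id,
            "ts": time.time(),
        })

    # Step 5: acknowledge immediately - the display count catches up within seconds.
    return inserted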

Read Path (when viewing the like count), also sketched below:
1. Check CDN/edge cache (1s TTL for hot posts)
2. If miss, check the Redis aggregated counter
3. If miss, aggregate from the sharded counters
4. Cache and return
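A sketch of the read path under similar assumptions (redis-py client; the likes:agg: and per-shard key names are illustrative, extending the likes:count:{post_id} scheme shown in the data model section below):

import redis

r = redis.Redis(host="redis", port=6379)  # assumption: redis-py client
AGG_TTL_SECONDS = 5  # cached aggregate may lag by up to ~5 seconds

def get_like_count(post_id: int, num_shards: int = 100) -> int:
    # Step 2: aggregated counter cached in Redis (the CDN/edge cache sits in front of this).
    cached = r.get(f"likes:agg:{post_id}")
    if cached is not None:
        return int(cached)

    # Step 3: cache miss - sum the sub-counters for this post in one round trip.
    shard_keys = [f"likes:count:{post_id}:{i}" for i in range(num_shards)]
    total = sum(int(v) for v in r.mget(shard_keys) if v is not None)

    # Step 4: cache the aggregate briefly, then return it.
    r.set(f"likes:agg:{post_id}", total, ex=AGG_TTL_SECONDS)
    return total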

Why Kafka in the middle?

Kafka decouples like recording from counter updates. If counters are slow, likes are not lost. Kafka handles backpressure and allows replay if counters need rebuilding.
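To make the decoupling concrete, here is a minimal counter-worker loop. It assumes the kafka-python client, the hypothetical like-events topic from the write-path sketch above, and the increment_like_counter routine shown in the data model section below:

import json

from kafka import KafkaConsumer  # assumption: kafka-python client

consumer = KafkaConsumer(
    "like-events",                   # hypothetical topic name
    bootstrap_servers=["kafka:9092"],
    group_id="counter-workers",      # consumer group lets workers scale horizontally
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

def run_counter_worker():
    # If this worker falls behind, events simply accumulate in Kafka:
    # no like is lost, and the counters catch up once the backlog drains.
    for message in consumer:
        event = message.value
        increment_like_counter(event["post_id"])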

Data Model and Storage

We need two data models: one for like records (source of truth) and one for counters (derived, optimized for reads).

What to say

The like records are the source of truth, sharded by user_id for even distribution. Counters are derived data that can be rebuilt from like records if needed.

-- Sharded by user_id (even distribution, users like many posts)
CREATE TABLE likes (
    user_id     BIGINT NOT NULL,
+ 11 more lines...

Sharded Counter Design:

For hot posts, we cannot use a single counter. Instead, we use N sub-counters:

# For normal posts - single counter
likes:count:{post_id} = 12345
+ 11 more lines...
def increment_like_counter(post_id: str, num_shards: int = 100):
    # Pick a random shard to distribute writes
    shard = random.randint(0, num_shards - 1)
+ 26 more lines...

Important detail

The number of shards should be dynamic. Normal posts use 1 shard. When a post becomes hot (detected by write rate), we expand to more shards. This avoids overhead for cold posts.

Algorithm Deep Dive - Hot Key Detection

We need to dynamically detect hot posts and expand their counter shards. Here is the algorithm:

Hot Key Detection Flow

class AdaptiveCounter:
    # Thresholds for tier promotion
    WARM_THRESHOLD = 100    # likes/sec
+ 58 more lines...

Why sample at 1%?

Tracking rate for every like adds overhead. Sampling at 1% gives statistically accurate rate estimation while reducing tracking overhead by 100x. At 1M likes/sec, we still get 10K samples/sec - plenty for accurate rate estimation.
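A minimal sketch of the sampled rate estimator; the names and window size are illustrative, not the AdaptiveCounter internals elided above:

import random
import time
from collections import defaultdict

SAMPLE_RATE = 0.01     # track roughly 1 in 100 likes
WINDOW_SECONDS = 10    # sliding window used for hot-key detection

_samples = defaultdict(list)  # post_id -> timestamps of sampled likes

def estimate_like_rate(post_id: str) -> float:
    # Called on every like; returns the estimated likes/sec for this post.
    now = time.time()
    if random.random() < SAMPLE_RATE:
        _samples[post_id].append(now)

    # Keep only samples inside the window, then scale back up by 1/SAMPLE_RATE.
    window = [t for t in _samples[post_id] if t > now - WINDOW_SECONDS]
    _samples[post_id] = window
    return len(window) / SAMPLE_RATE / WINDOW_SECONDS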

Handling Unlike:

Unlike is trickier than like because we need to decrement the correct shard.

def unlike(user_id: str, post_id: str):
    # 1. Delete from source of truth (database)
    result = db.execute(
+ 28 more lines...

Consistency and Invariants

System Invariants

1. A user can only like a post once (enforced by primary key).
2. Likes must not be lost - the database is the source of truth.
3. Unlike must work - requires durable like records.
4. Display count can be approximate but must not drift unboundedly.

Consistency Model:

We use eventual consistency for the like count with bounded staleness:

Component | Consistency | Justification
Like record in DB | Strong (per user) | User sees their own like immediately
Kafka events | At-least-once | May process the same like twice; counters handle idempotency
Shard counters | Eventual | May lag by seconds during high traffic
Aggregated count | Eventual (5s max) | Cached aggregate refreshed every 5 seconds
CDN cached count | Eventual (1-60s) | Different users may see different counts

What to say

Users care about seeing their own like reflected immediately, but they do not care if the total count is 1,234,567 vs 1,234,589. We optimize for perceived consistency, not global consistency.
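One way to make perceived consistency concrete: serve the viewer's own like state from the strongly consistent store while serving the total from the cached aggregate. A sketch reusing the hypothetical db handle and the get_like_count helper from the read-path sketch above:

def get_post_like_view(db, user_id: int, post_id: int) -> dict:
    # Strong per-user read: did *I* like this post? (source-of-truth database)
    liked_by_me = db.execute(
        "SELECT 1 FROM likes WHERE user_id = %s AND post_id = %s",
        (user_id, post_id),
    ).fetchone() is not None

    # Eventually consistent read: cached/aggregated total, may lag by a few seconds.
    return {"liked_by_me": liked_by_me, "like_count": get_like_count(post_id)}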

Handling Double Counting:

Kafka is at-least-once, so we may process the same like event twice.

def process_like_event(event):
    event_id = event["event_id"]  # Unique per like
    post_id = event["post_id"]
+ 11 more lines...

Periodic Reconciliation:

To prevent counter drift, we periodically reconcile counters with the source of truth:

def reconcile_counter(post_id: str):
    """Run periodically for active posts to fix any drift"""
    
+ 18 more lines...

Failure Modes and Resilience

Proactively discuss failures

Let me walk through what happens when things fail. The key is that likes are never lost because we write to the database first.

Failure | Impact | Mitigation | Why It Works
Redis down | Counters unavailable | Fall back to DB count + cache | DB is source of truth; slower but correct
Kafka lag | Counter updates delayed | Counters lag, likes not lost | Events are durable; counters catch up
+ 4 more rows...

Graceful Degradation Strategy:

def get_like_count_with_fallback(post_id: str) -> int:
    # Level 1: CDN/Edge cache (fastest)
    # Handled by CDN, not in this code
+ 30 more lines...

Defense in depth

Six levels of fallback ensure we almost always show something. Users do not notice if the count is slightly stale, but they do notice if the page errors or shows nothing.

Evolution and Scaling

What to say

This design handles up to 1M likes/sec on a single post. Let me discuss how it evolves for even larger scale and additional features.

Evolution Path:

Stage 1: Simple Counter (up to 10K likes/sec)
- Single Redis counter per post
- Direct increment on like
- Works for 99% of posts

Stage 2: Sharded Counter (up to 1M likes/sec)
- Multiple counter shards
- Random shard selection
- Periodic aggregation

Stage 3: Hierarchical Counting (up to 10M+ likes/sec)
- Local counting at the edge
- Regional aggregation
- Global aggregation
- Multi-level caching

Hierarchical Counting at Global Scale
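A sketch of the Stage 3 idea: increments are buffered locally at each edge node, flushed periodically to a regional tier, which rolls up into a global counter. The tier wiring and flush interval here are illustrative assumptions:

import threading
from collections import defaultdict

class EdgeCounter:
    # Buffers increments in memory and flushes batched deltas upstream every
    # flush_interval seconds. Each tier (edge -> region -> global) can run the
    # same pattern, so the global counter sees one delta per flush window
    # instead of one write per like.
    def __init__(self, flush_upstream, flush_interval: float = 1.0):
        self._deltas = defaultdict(int)
        self._lock = threading.Lock()
        self._flush_upstream = flush_upstream  # e.g. the regional aggregator's add()
        self._interval = flush_interval
        self._schedule_flush()

    def add(self, post_id: str, n: int = 1):
        with self._lock:
            self._deltas[post_id] += n  # O(1) in memory, no network call per like

    def _schedule_flush(self):
        timer = threading.Timer(self._interval, self._flush)
        timer.daemon = True
        timer.start()

    def _flush(self):
        with self._lock:
            deltas, self._deltas = self._deltas, defaultdict(int)
        for post_id, n in deltas.items():
            self._flush_upstream(post_id, n)  # one batched delta per active post
        self._schedule_flush()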

Alternative Approaches:

1. Approximate Counting (HyperLogLog)

# HyperLogLog gives approximate unique count with O(1) space
# Useful if you only need approximate count, not exact
+ 10 more lines...

2. Count-Min Sketch for Rate Limiting Hot Keys

# Use Count-Min Sketch to detect hot posts without tracking each one

from collections import defaultdict
+ 23 more lines...

Alternative design

If exact unlike were not required, I would use HyperLogLog for counting and skip the database write for likes entirely. That removes the per-like durable write and scales essentially without limit, but it loses the ability to unlike or list who liked a post.

Features to add in v2:

  1. Reaction types (like, love, haha, etc.) - separate counters per type
  2. Like notifications - fan-out problem, separate system
  3. Analytics - time-series of likes for creators
  4. Fraud detection - detect bot-like patterns in like velocity

Design Trade-offs

Advantages (of the simple single-counter baseline)

  • Simple implementation
  • Exact count
  • Easy to debug

Disadvantages

  • Hot key bottleneck
  • Max ~100K writes/sec
  • Single point of failure

When to use

Normal posts with < 10K likes/sec