Design Walkthrough
Problem Statement
The Question: Design a system that can handle counting likes on posts from celebrities and viral content, where a single post can receive millions of likes per second.
This is a classic problem at Facebook, Instagram, Twitter, and TikTok. When Taylor Swift posts, or when a major news event happens, individual posts can receive extreme engagement.
Why this is hard:
- A viral post is a hot key - extreme write concentration on one data point
- Traditional databases cannot handle 1M writes/sec to a single row
- Users expect to see their like reflected (write acknowledgment)
- The count should be reasonably accurate for display
What to say first
Before I design, let me understand the scale and consistency requirements. The key challenge here is the hot key problem - extreme write throughput to a single logical counter.
What interviewers are testing:
- Do you recognize the hot key problem?
- Can you design for extreme write throughput?
- Do you understand consistency vs availability tradeoffs?
- Can you separate the like record from the like count?
Clarifying Questions
Ask these questions to scope the problem and demonstrate senior thinking.
Question 1: Peak Write Rate
What is the peak like rate we need to handle on a single post? Are we talking thousands or millions per second?
- Why this matters: Determines if simple solutions work or we need sharding
- Typical answer: Up to 1M likes/sec on viral celebrity posts
- Impact: Must shard the counter - no single node can handle this
Question 2: Count Accuracy
Does the displayed count need to be exact in real-time, or is approximate OK? Can it lag by a few seconds?
- Why this matters: Exact real-time requires synchronous aggregation (slow)
- Typical answer: Approximate is fine, can lag by seconds
- Impact: Can use async aggregation and cached counts
Question 3: Unlike Support
Do users need to unlike? If so, we need to track individual likes, not just increment a counter.
- Why this matters: Unlike requires knowing who liked what
- Typical answer: Yes, unlike must work
- Impact: Need to store individual like records, not just counters
Question 4: Read vs Write Ratio
How often is the like count read vs written? Is this read-heavy or write-heavy?
- Why this matters: Determines caching strategy
- Typical answer: Read-heavy overall, but viral posts are write-heavy bursts
- Impact: Cache aggressively, but cache invalidation is tricky during bursts
State assumptions
I will assume: 1M likes/sec peak on hot posts, approximate counts OK (lag up to 5s), unlike must work, and the system needs to handle 1B+ likes per day across all posts.
The Hard Part
Say this out loud
The hard part here is the hot key problem. When a celebrity posts, that single post_id becomes a hot key receiving millions of writes per second. No single database node or cache key can handle this write rate.
Why traditional approaches fail:
1. Single Database Row
-- This creates a hot row with lock contention
UPDATE posts SET like_count = like_count + 1 WHERE post_id = 123;
-- At 1M writes/sec:
-- - Row-level lock contention
-- - Single CPU bottleneck
-- - ~10K writes/sec max on one row

2. Single Redis Key
INCR post:123:likes
-- Redis single-key INCR: ~100K ops/sec
-- Still not enough for 1M likes/sec
-- Also a single point of failure

3. The Real Challenge
Even if we could handle the writes, we have another problem: the like must be recorded durably so users can unlike. A simple counter increment loses this information.
Common mistake
Candidates often propose Redis INCR as the solution. It works for normal posts but fails for viral content. Always design for the worst case (celebrity posting during a major event).
The key insight: Separate the like record (who liked what) from the like count (display number). They have different consistency and durability requirements.
Scale and Access Patterns
Let me estimate the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Peak likes/sec (single post) | 1,000,000 | Hot key - must shard counter |
| Average likes/sec (platform) | 100,000 | Manageable with good architecture |
What to say
The access pattern is bimodal. Most posts are cold (few writes, cache friendly). But viral posts flip to write-heavy with millions of likes in minutes. We need to optimize for both cases.
Traffic Pattern Analysis:
- 99% of posts: Receive < 1,000 likes total, simple solutions work
- 0.9% of posts: Receive 1K-100K likes, need some optimization
- 0.1% of posts: Receive 100K-10M+ likes, need full hot key handling
The challenge is that we do not know in advance which posts will go viral. Taylor Swift posting is predictable, but random viral content is not.
Like records:
- 1B likes/day x 50 bytes = 50 GB/day
- 30 days retention = 1.5 TB

High-Level Architecture
The key insight is to separate concerns:
1. Like Service: Records the like (durable, for unlike support)
2. Counter Service: Maintains counts (can be eventually consistent)
3. Read Path: Serves cached/aggregated counts
What to say
I will separate the like record from the like count. Likes are written to a durable store for unlike support. Counts are maintained separately and can be eventually consistent.
Likes Counter Architecture
Write Path (when user likes):
1. Like Service receives the request
2. Write the like record to the database (sharded by user_id)
3. Publish an event to Kafka
4. Counter workers consume events and increment sharded counters (asynchronously)
5. Return success to the user immediately (without waiting for step 4)
Read Path (when viewing like count):
1. Check CDN/edge cache (1s TTL for hot posts)
2. If miss, check Redis aggregated counter
3. If miss, aggregate from sharded counters
4. Cache and return
Why Kafka in the middle?
Kafka decouples like recording from counter updates. If counters are slow, likes are not lost. Kafka handles backpressure and allows replay if counters need rebuilding.
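To make this concrete, here is a minimal sketch of the write path. The `db`-backed `record_like` helper (sketched later under Consistency and Invariants) and the generic `kafka_producer` object are illustrative assumptions, not a specific client API.

```python
import json
import time
import uuid

def handle_like(user_id: str, post_id: str) -> bool:
    # 1. Durable like record first - source of truth, enables unlike
    record_like(user_id, post_id)

    # 2. Publish an event for the counter workers (any Kafka client works here)
    event = {
        "event_id": str(uuid.uuid4()),  # unique per like, used for idempotent counting
        "user_id": user_id,
        "post_id": post_id,
        "ts": time.time(),
    }
    kafka_producer.send("post-likes", json.dumps(event).encode("utf-8"))

    # 3. Acknowledge immediately - counters catch up asynchronously
    return True
```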
Data Model and Storage
We need two data models: one for like records (source of truth) and one for counters (derived, optimized for reads).
What to say
The like records are the source of truth, sharded by user_id for even distribution. Counters are derived data that can be rebuilt from like records if needed.
-- Sharded by user_id (even distribution, users like many posts)
CREATE TABLE likes (
    user_id BIGINT NOT NULL,
    post_id BIGINT NOT NULL,
    PRIMARY KEY (user_id, post_id)  -- one like per user per post; other columns omitted
);

Sharded Counter Design:
For hot posts, we cannot use a single counter. Instead, we use N sub-counters:
# For normal posts - single counter
likes:count:{post_id} = 12345
def increment_like_counter(post_id: str, num_shards: int = 100):
# Pick a random shard to distribute writes
    shard = random.randint(0, num_shards - 1)
    redis.incr(f"likes:count:{post_id}:{shard}")  # sharded key format is illustrative

Important detail
The number of shards should be dynamic. Normal posts use 1 shard. When a post becomes hot (detected by write rate), we expand to more shards. This avoids overhead for cold posts.
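Reading a hot post's total then means summing its shards. A minimal sketch, assuming the same illustrative key format and a single Redis MGET round trip:

```python
def get_sharded_count(post_id: str, num_shards: int) -> int:
    # Sum all shard counters in one round trip; missing shards count as zero
    keys = [f"likes:count:{post_id}:{shard}" for shard in range(num_shards)]
    return sum(int(v) for v in redis.mget(keys) if v is not None)
```

In practice this sum feeds the aggregated counter that is cached and refreshed every few seconds, rather than being computed on every read.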
Algorithm Deep Dive - Hot Key Detection
We need to dynamically detect hot posts and expand their counter shards. Here is the algorithm:
Hot Key Detection Flow
class AdaptiveCounter:
# Thresholds for tier promotion
    WARM_THRESHOLD = 100    # likes/sec - promote from 1 shard to several
    HOT_THRESHOLD = 10_000  # likes/sec - promote to the full shard count (assumed value)
    SAMPLE_RATE = 0.01      # estimate the write rate from a 1% sample

Why sample at 1%?
Tracking rate for every like adds overhead. Sampling at 1% gives statistically accurate rate estimation while reducing tracking overhead by 100x. At 1M likes/sec, we still get 10K samples/sec - plenty for accurate rate estimation.
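A sketch of how the sampled rate estimate could be maintained; the Redis key, the 10-second window, and the scaling math are assumptions layered on the 1% sampling described above:

```python
import random

SAMPLE_RATE = 0.01  # 1% of likes update the rate counter
WINDOW_S = 10       # assumed estimation window

def record_like_sample(post_id: str) -> None:
    # Only sampled likes touch the rate key; the estimate is scaled back up on read
    if random.random() < SAMPLE_RATE:
        key = f"likes:rate:{post_id}"
        if redis.incr(key) == 1:
            redis.expire(key, WINDOW_S)  # start a fresh window on the first sample

def estimated_likes_per_sec(post_id: str) -> float:
    sampled = int(redis.get(f"likes:rate:{post_id}") or 0)
    return sampled / SAMPLE_RATE / WINDOW_S
```

When the estimate crosses WARM_THRESHOLD or HOT_THRESHOLD, the adaptive counter promotes the post to more shards.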
Handling Unlike:
Unlike is trickier than like because we need to decrement the correct shard.
def unlike(user_id: str, post_id: str):
# 1. Delete from source of truth (database)
    result = db.execute("DELETE FROM likes WHERE user_id = %s AND post_id = %s",
                        (user_id, post_id))
    # 2. If a row was deleted, publish an unlike event so counter workers decrement

Consistency and Invariants
System Invariants
1. A user can only like a post once (enforced by primary key; see the sketch below).
2. Likes must not be lost - database is source of truth.
3. Unlike must work - requires durable like records.
4. Display count can be approximate but must not drift unboundedly.
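Invariant 1 can be enforced at write time by the composite primary key; a minimal sketch of the insert, assuming PostgreSQL-style ON CONFLICT and a generic `db` handle with rowcount semantics:

```python
def record_like(user_id: str, post_id: str) -> bool:
    # The (user_id, post_id) primary key rejects duplicates; ON CONFLICT makes retries a no-op
    result = db.execute(
        "INSERT INTO likes (user_id, post_id) VALUES (%s, %s) "
        "ON CONFLICT (user_id, post_id) DO NOTHING",
        (user_id, post_id),
    )
    return result.rowcount == 1  # False: the user had already liked this post
```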
Consistency Model:
We use eventual consistency for the like count with bounded staleness:
| Component | Consistency | Justification |
|---|---|---|
| Like record in DB | Strong (per user) | User sees their own like immediately |
| Kafka events | At-least-once | May process same like twice, counters handle idempotency |
| Shard counters | Eventual | May lag by seconds during high traffic |
| Aggregated count | Eventual (5s max) | Cached aggregate refreshed every 5 seconds |
| CDN cached count | Eventual (1-60s) | Different users may see different counts |
What to say
Users care about seeing their own like reflected immediately, but they do not care if the total count is 1,234,567 vs 1,234,589. We optimize for perceived consistency, not global consistency.
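One way to make that concrete is to serve the viewer's own like from the strongly consistent like records while taking the total from the approximate counter path; a sketch with assumed helper names:

```python
def like_state_for_viewer(viewer_id: str, post_id: str) -> dict:
    # Read-your-own-write: the viewer's like comes from the source of truth
    liked_by_me = db.query_one(
        "SELECT 1 FROM likes WHERE user_id = %s AND post_id = %s",
        (viewer_id, post_id),
    ) is not None
    # The displayed total comes from the cached, eventually consistent path
    return {"liked_by_me": liked_by_me, "like_count": get_like_count_with_fallback(post_id)}
```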
Handling Double Counting:
Kafka is at-least-once, so we may process the same like event twice.
def process_like_event(event):
event_id = event["event_id"] # Unique per like
    post_id = event["post_id"]
    # Dedupe: remember each event_id (key name and 24h TTL are illustrative)
    if redis.set(f"like:event:{event_id}", 1, nx=True, ex=86400):
        increment_like_counter(post_id)

Periodic Reconciliation:
To prevent counter drift, we periodically reconcile counters with the source of truth:
def reconcile_counter(post_id: str):
"""Run periodically for active posts to fix any drift"""
Failure Modes and Resilience
Proactively discuss failures
Let me walk through what happens when things fail. The key is that likes are never lost because we write to the database first.
| Failure | Impact | Mitigation | Why It Works |
|---|---|---|---|
| Redis down | Counters unavailable | Fall back to DB count + cache | DB is source of truth, slower but correct |
| Kafka lag | Counter updates delayed | Counters lag, likes not lost | Events are durable, counters catch up |
Graceful Degradation Strategy:
def get_like_count_with_fallback(post_id: str) -> int:
# Level 1: CDN/Edge cache (fastest)
    # Handled by CDN, not in this code
    # Level 2: cached aggregate; Level 3: shard sum (deeper fallbacks omitted here)
    count = redis.get(f"likes:total:{post_id}")
    return int(count) if count is not None else get_sharded_count(post_id, num_shards=100)

Defense in depth
Six levels of fallback ensure we almost always show something. Users do not notice if the count is slightly stale, but they do notice if the page errors or shows nothing.
Evolution and Scaling
What to say
This design handles up to 1M likes/sec on a single post. Let me discuss how it evolves for even larger scale and additional features.
Evolution Path:
Stage 1: Simple Counter (up to 10K likes/sec)
- Single Redis counter per post
- Direct increment on like
- Works for 99% of posts

Stage 2: Sharded Counter (up to 1M likes/sec)
- Multiple counter shards
- Random shard selection
- Periodic aggregation

Stage 3: Hierarchical Counting (up to 10M+ likes/sec)
- Local counting at edge
- Regional aggregation
- Global aggregation
- Multi-level caching
Hierarchical Counting at Global Scale
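As a rough sketch of the Stage 3 idea (names, intervals, and the upstream call are assumptions): each edge node batches likes in memory and flushes per-post deltas upstream once a second, so the regional and global tiers see a handful of updates per post per second instead of millions of increments.

```python
import threading
import time
from collections import defaultdict

class EdgeCounter:
    """Accumulate like deltas locally and flush them upstream periodically."""

    def __init__(self, flush_interval_s: float = 1.0):
        self.deltas = defaultdict(int)
        self.lock = threading.Lock()
        self.flush_interval_s = flush_interval_s

    def record_like(self, post_id: str) -> None:
        with self.lock:
            self.deltas[post_id] += 1  # cheap in-memory increment at the edge

    def run_flusher(self) -> None:
        while True:
            time.sleep(self.flush_interval_s)
            with self.lock:
                batch, self.deltas = dict(self.deltas), defaultdict(int)
            for post_id, delta in batch.items():
                regional_aggregator.add(post_id, delta)  # assumed upstream API
```

Regional aggregators apply the same pattern against the global tier, which the multi-level caching on the read side then mirrors.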
Alternative Approaches:
1. Approximate Counting (HyperLogLog)
# HyperLogLog gives approximate unique count with O(1) space
# Useful if you only need approximate count, not exact
redis.pfadd(f"likes:hll:{post_id}", user_id)    # record a liker (key name illustrative)
approx = redis.pfcount(f"likes:hll:{post_id}")  # ~0.8% standard error
2. Count-Min Sketch for Rate Limiting Hot Keys
# Use Count-Min Sketch to detect hot posts without tracking each one
from collections import defaultdict

Alternative design
If exact unlike support were not required, I would use HyperLogLog for counting and skip the database write for likes entirely. This scales much further at far lower cost, but loses the ability to unlike or to list who liked a post.
Features to add in v2:
1. Reaction types (like, love, haha, etc.) - separate counters per type
2. Like notifications - fan-out problem, separate system
3. Analytics - time-series of likes for creators
4. Fraud detection - detect bot-like patterns in like velocity