Design Walkthrough
Problem Statement
The Question: Design a weather application that serves millions of queries daily with current conditions and forecasts.
A weather application must provide:

- Current conditions: Temperature, humidity, wind, precipitation
- Forecasts: Hourly (48h), daily (10-14 days), minute-by-minute (next hour)
- Alerts: Severe weather warnings by location
- Historical data: Past conditions for analytics
- Location-based lookup: By coordinates, city name, or ZIP code
What to say first
Before I design this, let me clarify the requirements. I want to understand the scale, data sources, freshness requirements, and whether we are building data collection infrastructure or just the serving layer.
Hidden requirements interviewers are testing:

- Do you recognize this as a caching problem?
- Can you design for read-heavy workloads (99.9% reads)?
- Do you understand geospatial data access patterns?
- Can you handle external data dependencies gracefully?
Clarifying Questions
Ask these questions to scope the problem correctly. Weather apps can range from simple API wrappers to complex forecasting systems.
Question 1: Data Source
Are we collecting weather data ourselves (sensors, satellites) or aggregating from existing providers like NOAA, OpenWeatherMap?
- Why this matters: Building data collection is a completely different system.
- Typical answer: Aggregate from existing providers (NOAA, private APIs).
- Architecture impact: Focus on caching, aggregation, and serving - not data collection.
Question 2: Scale
What is our expected query volume? How many unique locations do we need to support?
- Why this matters: Determines caching strategy granularity.
- Typical answer: 100M queries/day, global coverage (millions of lat/long combinations).
- Architecture impact: Cannot cache every possible coordinate - need spatial bucketing.
Question 3: Freshness
How fresh must the weather data be? Is 15-minute-old data acceptable?
- Why this matters: Directly impacts cache TTLs and hit rates.
- Typical answer: Current conditions - 15 minutes acceptable; forecasts - 1 hour acceptable.
- Architecture impact: Can use aggressive caching with short TTLs.
Question 4: Features
What features beyond basic weather? Alerts, historical, radar maps, air quality?
- Why this matters: Each feature has different data characteristics.
- Typical answer: Start with current conditions + forecast + alerts.
- Architecture impact: Alerts need push capability, not just request/response.
Stating assumptions
I will assume: aggregating from external providers, 100M queries/day, 15-minute freshness for current weather, global coverage, current conditions + 7-day forecast + severe weather alerts.
The Hard Part
Say this out loud
The hard part here is achieving high cache hit rates while maintaining acceptable data freshness, especially when users query by arbitrary coordinates that do not map cleanly to cached data.
Why this is genuinely hard:
1. Infinite Location Space: Users query by lat/long with arbitrary precision. You cannot cache every possible coordinate (40.7128, -74.0060 vs 40.7129, -74.0061).
2. Freshness vs Efficiency: Weather data updates hourly from providers. Fetching on every request wastes money and adds latency, but caching too aggressively shows stale data.
3. Provider Dependencies: External weather APIs have rate limits, costs, and can fail. You need resilience against upstream failures.
4. Personalization Kills Caching: Features like "feels like" temperature based on user preferences, or hyperlocal minute-by-minute forecasts, reduce cache effectiveness.
Common mistake
Candidates often design for the write path (how data gets into the system) when 99.9% of traffic is reads. Focus on the read path and caching strategy first.
The key insight:
Weather does not change at coordinate-level granularity. The weather at (40.7128, -74.0060) is identical to (40.7130, -74.0062) - the two points are roughly 30 meters apart. We can bucket coordinates into grid cells and cache per cell.
| Precision | Grid Size | Use Case |
|---|---|---|
| 1 decimal | ~11 km | Country-level |
| 2 decimals | ~1.1 km | City-level (good default) |
| 3 decimals | ~110 m | Neighborhood-level |
| 4 decimals | ~11 m | Street-level (overkill) |

Scale and Access Patterns
Let me estimate the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Daily queries | 100,000,000 | ~1,200 QPS average, 10K QPS peak |
| Unique locations | ~1M grid cells at 2 decimal precision | Manageable cache size |
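A quick back-of-envelope check of those figures; the 10x peak factor and the ~10 KB per-cell payload are assumptions, not requirements:

```python
# Back-of-envelope scale check; peak factor and per-cell payload size are assumptions
daily_queries = 100_000_000
avg_qps = daily_queries / 86_400          # ~1,157 QPS average
peak_qps = avg_qps * 10                   # assume ~10x diurnal peak -> the 10K+ figure above

grid_cells = 1_000_000                    # 2-decimal cells that actually receive traffic
bytes_per_cell = 10 * 1024                # assume ~10 KB of current + forecast JSON per cell
working_set_gb = grid_cells * bytes_per_cell / 1024**3

print(f"{avg_qps:,.0f} QPS avg, {peak_qps:,.0f} QPS peak, ~{working_set_gb:.1f} GB working set")
# -> 1,157 QPS avg, 11,574 QPS peak, ~9.5 GB working set
```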
What to say
This is an extremely read-heavy system with predictable update patterns. The entire working dataset (10GB) fits in memory, and 80% of traffic hits popular locations. This screams multi-layer caching.
Access Pattern Analysis:
- Temporal locality: Same location queried many times per hour (people checking weather repeatedly)
- Spatial locality: Popular cities get 1000x more queries than rural areas
- Predictable updates: Weather data updates on schedule, not randomly
- Burst patterns: Morning commute, evening, and pre-weekend spikes
Cache hit rate analysis:
- ~1M unique grid cells globally, so the hot working set fits in memory
- 80% of traffic concentrates on popular locations, so a modest cache captures most requests
- CDN and Redis layers together can absorb roughly 95% of queries (detailed in the caching section)

High-Level Architecture
The architecture is designed around multi-layer caching with geographic distribution.
What to say
I will design a multi-layer caching system. CDN at the edge handles most requests, application cache handles the rest, and we only hit upstream providers on cache misses or scheduled refreshes.
Weather Application Architecture
Component Responsibilities:
1. CDN Edge Layer
   - First line of defense for caching
   - Cache by grid cell (rounded coordinates)
   - TTL: 10-15 minutes for current conditions, 1 hour for forecasts
   - Handles 80%+ of requests without hitting origin

2. Application Layer (Weather Service)
   - Coordinate normalization (snap to grid)
   - Cache lookup and population
   - Provider failover logic
   - Response formatting

3. Redis Cache
   - Application-level cache for CDN misses
   - Faster than the database for hot data
   - Pub/sub for cache invalidation

4. PostgreSQL + PostGIS
   - Geospatial queries for location lookup
   - City/ZIP to coordinate mapping
   - Stores canonical weather data

5. Background Data Fetcher
   - Proactively fetches weather for popular locations
   - Runs on a schedule (every 15 minutes for top locations)
   - Handles provider rate limits and failover
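To make the read path concrete, here is a minimal sketch of what the weather service does on a CDN miss. It assumes a module-level async Redis client (redis) and the coords_to_cell_id helper defined later in the data-model section; query_db and fetch_from_provider are hypothetical stand-ins for the PostgreSQL and provider layers:

```python
import json

async def handle_current_weather(lat: float, lon: float) -> dict:
    # 1. Normalize arbitrary coordinates onto the cache grid
    cell_id = coords_to_cell_id(lat, lon)

    # 2. Application cache (Redis): absorbs most CDN misses
    cached = await redis.get(f"weather:current:{cell_id}")
    if cached is not None:
        return json.loads(cached)

    # 3. Database: canonical copy kept warm by the background fetcher
    row = await query_db(cell_id)                     # hypothetical PostgreSQL helper
    if row is not None:
        await redis.set(f"weather:current:{cell_id}", json.dumps(row), ex=900)
        return row

    # 4. Last resort: synchronous fetch from an upstream provider
    data = await fetch_from_provider(cell_id)         # hypothetical provider client
    await redis.set(f"weather:current:{cell_id}", json.dumps(data), ex=900)
    return data
```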
Real-world reference
Apple Weather uses a similar architecture with aggressive edge caching. Dark Sky (now Apple) was known for hyperlocal forecasts using a dense grid of cached predictions.
Data Model and Storage
The data model must support efficient geospatial lookups and time-series weather data.
What to say
I will use a grid-based storage model where each grid cell has associated weather data. This normalizes the infinite coordinate space into a finite set of cacheable cells.
```sql
-- Grid cells for spatial bucketing
CREATE TABLE grid_cells (
    cell_id    VARCHAR(20) PRIMARY KEY,   -- e.g., "40.71_-74.01"
    center_lat DECIMAL(5, 2) NOT NULL,    -- illustrative columns; full schema abbreviated
    center_lon DECIMAL(5, 2) NOT NULL
);
```

Grid Cell ID Generation:
```python
def coords_to_cell_id(lat: float, lon: float, precision: int = 2) -> str:
    """
    Convert coordinates to a grid cell ID, e.g. (40.7128, -74.0060) -> "40.71_-74.01".
    """
    return f"{lat:.{precision}f}_{lon:.{precision}f}"
```

Cache Key Design:
Key patterns:
```
weather:current:{cell_id}
```

Important detail
Store fetched_at timestamp with cached data. This allows serving slightly stale data with a warning if upstream providers fail, rather than returning errors.
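A minimal sketch of that idea, assuming the same module-level async Redis client; the 2-hour cutoff mirrors the staleness invariant discussed under consistency, and the 15-minute threshold is the normal freshness target:

```python
import json, time

async def cache_weather(cell_id: str, data: dict) -> None:
    # Persist the payload together with when it was fetched, not just a TTL
    envelope = {"data": data, "fetched_at": time.time()}
    await redis.set(f"weather:current:{cell_id}", json.dumps(envelope))

async def read_cached_weather(cell_id: str, max_stale: float = 2 * 3600):
    raw = await redis.get(f"weather:current:{cell_id}")
    if raw is None:
        return None
    envelope = json.loads(raw)
    age = time.time() - envelope["fetched_at"]
    if age > max_stale:
        return None                            # too old even for degraded serving
    return envelope["data"], age > 15 * 60     # (data, is_stale beyond normal freshness)
```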
API Design
The API must be simple, cacheable, and support multiple lookup methods.
```
GET /v1/weather/current
    ?lat=40.7128&lon=-74.0060     # By coordinates
    ?city=New+York,US             # By city name
```

Example response (abridged):

```json
{
    "location": { "lat": 40.71, "lon": -74.01 },
    ...
}
```

Cache-friendly design
Notice the API returns the grid cell coordinates (40.71, -74.01), not the exact input (40.7128, -74.0060). This makes responses identical for nearby coordinates, improving cache hit rates.
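For example, with the coords_to_cell_id helper from the data-model section, two requests a block apart snap to the same cell and therefore share one cached response:

```python
>>> coords_to_cell_id(40.7128, -74.0060)   # exact user coordinates
'40.71_-74.01'
>>> coords_to_cell_id(40.7133, -74.0052)   # another device a block away
'40.71_-74.01'
```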
Caching Strategy Deep Dive
Caching is the most critical aspect of this system. Let me detail the multi-layer approach.
Cache Layers
| Layer | Hit Rate | Latency | TTL | Capacity |
|---|---|---|---|---|
| CDN Edge | ~80% | 5-20ms | 10-15 min | Unlimited (distributed) |
| Redis | ~15% | 1-5ms | 15-60 min | 10GB (hot data) |
| Database | ~4% | 10-50ms | N/A | Full dataset |
| Upstream Fetch | ~1% | 200-500ms | N/A | Rate limited |
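As a sanity check, a weighted average using rough midpoints of those latency ranges (the midpoints are assumptions) shows why the layering pays off:

```python
# (layer, hit rate, assumed typical latency in ms) using rough midpoints of the ranges above
layers = [
    ("CDN edge", 0.80, 12),
    ("Redis", 0.15, 3),
    ("Database", 0.04, 30),
    ("Upstream fetch", 0.01, 350),
]
expected_ms = sum(rate * latency_ms for _, rate, latency_ms in layers)
print(f"Expected latency ~= {expected_ms:.1f} ms")   # ~14.8 ms, dominated by the CDN tier
```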
Cache Warming Strategy:
Do not wait for cache misses - proactively warm the cache for popular locations.
```python
class CacheWarmer:
    def __init__(self):
        # Top 10K locations by query volume, populated from query analytics
        self.popular_cells: list[str] = []
```

Stale-While-Revalidate Pattern:
```python
async def get_weather(cell_id: str) -> WeatherData:
    # Check cache; serve cached data immediately, even if slightly stale
    cached = await redis.get(f"weather:current:{cell_id}")
    if cached is not None:
        return WeatherData.from_json(cached)
    return await refresh_cell(cell_id)   # cold miss: fetch upstream (refresh_cell is an illustrative helper)
```

CDN configuration
Use Cache-Control: public, max-age=900, stale-while-revalidate=1800. This tells CDN to serve cached response while fetching fresh data in background if cache is 15-30 min old.
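The same idea applies at the Redis layer, and it completes the stale-while-revalidate sketch above: serve whatever is cached immediately and refresh in the background once the entry crosses a soft TTL. This assumes the fetched_at envelope from earlier and a hypothetical fetch_from_provider helper:

```python
import asyncio, json, time

SOFT_TTL = 900  # 15 minutes: past this, serve stale and refresh in the background

async def get_weather_swr(cell_id: str) -> dict:
    raw = await redis.get(f"weather:current:{cell_id}")
    if raw is not None:
        envelope = json.loads(raw)
        if time.time() - envelope["fetched_at"] > SOFT_TTL:
            # Stale: trigger a refresh but do not make this request wait for it
            asyncio.create_task(refresh_cell(cell_id))
        return envelope["data"]
    return await refresh_cell(cell_id)                 # cold miss: fetch synchronously

async def refresh_cell(cell_id: str) -> dict:
    data = await fetch_from_provider(cell_id)          # hypothetical provider client
    envelope = {"data": data, "fetched_at": time.time()}
    await redis.set(f"weather:current:{cell_id}", json.dumps(envelope))
    return data
```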
Consistency and Invariants
System Invariants
Weather data must never be more than 2 hours stale under normal operation. During provider outages, serve stale data with clear indication rather than errors.
Consistency Model: Eventual Consistency
This system is a perfect fit for eventual consistency because:

- Weather changes gradually (no instant state changes)
- Slightly stale data (15 minutes old) is still useful
- Strong consistency would destroy cache hit rates
- No financial or safety-critical decisions depend on it (alerts are handled separately)
Data Freshness Guarantees:
| Data Type | Max Staleness | Update Trigger | Failure Behavior |
|---|---|---|---|
| Current conditions | 15 minutes | Scheduled fetch every 15 min | Serve stale with warning |
| Hourly forecast | 1 hour | Scheduled fetch every hour | Serve stale with warning |
| Daily forecast | 4 hours | Scheduled fetch every 4 hours | Serve stale with warning |
| Severe alerts | 5 minutes | Push from providers + polling | Fail open (show alert if uncertain) |
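These guarantees translate directly into per-type refresh intervals for the background fetcher; the values below simply mirror the table:

```python
# Refresh intervals (seconds) for the background fetcher, mirroring the freshness table
REFRESH_INTERVAL = {
    "current": 15 * 60,
    "hourly_forecast": 60 * 60,
    "daily_forecast": 4 * 60 * 60,
    "alerts": 5 * 60,   # polled this often in addition to provider push
}
```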
Alerts are different
Severe weather alerts have stricter requirements. It is better to show a false alert than miss a real tornado warning. Use fail-open for alerts - if uncertain, display the alert.
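A minimal sketch of the fail-open rule, assuming the same async Redis client and a hypothetical fetch_alerts call against the alert feed:

```python
import json

async def get_alerts(cell_id: str) -> list[dict]:
    try:
        alerts = await fetch_alerts(cell_id)              # hypothetical alert-feed call
        await redis.set(f"weather:alerts:{cell_id}", json.dumps(alerts), ex=300)
        return alerts
    except Exception:
        # Fail open: if the feed is unreachable, show the last known alerts rather than none
        cached = await redis.get(f"weather:alerts:{cell_id}")
        return json.loads(cached) if cached else []
```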
Multi-Provider Consistency:
When aggregating from multiple providers, they may disagree. Handle this explicitly:
```python
async def get_aggregated_weather(cell_id: str) -> WeatherData:
    # Fetch from multiple providers concurrently; PROVIDERS and merge_results are illustrative
    results = await asyncio.gather(
        *(provider.get_current(cell_id) for provider in PROVIDERS), return_exceptions=True
    )
    return merge_results([r for r in results if not isinstance(r, Exception)])  # e.g. median of numeric fields
```

Failure Modes and Resilience
Proactively discuss failures
Let me walk through failure scenarios. The key principle is graceful degradation - always serve something useful, even if not perfect.
| Failure | Impact | Mitigation | Why It Works |
|---|---|---|---|
| Primary provider down | Cannot fetch fresh data | Failover to backup providers | Multiple providers with same coverage |
| All providers down | No fresh data available | Serve stale cached data with warning | 2-hour-old weather still useful |
Provider Failover:
```python
class WeatherProviderClient:
    def __init__(self, providers):
        # Provider clients in priority order, e.g. [NOAAClient(), OpenWeatherMapClient()]
        self.providers = providers
```

Graceful Degradation Response:
```json
{
    "location": { ... },
    "current": { ... },
    "fetched_at": "...",
    "warning": "Data may be up to 2 hours old due to a provider outage"
}
```

User experience during degradation
Apps should display a subtle indicator when showing stale data rather than blocking the UI: something like a yellow dot next to the time, or an "Updated 1 hour ago" label. Users would rather see old weather than an error.
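Filling in the failover loop from the WeatherProviderClient sketched earlier: a rough version assuming per-provider client objects that expose a fetch_current coroutine, with an illustrative exception type.

```python
import logging

class AllProvidersUnavailable(Exception):
    """Raised when every upstream provider fails; callers fall back to stale cached data."""

class WeatherProviderClient:
    def __init__(self, providers):
        self.providers = providers                      # ordered by priority

    async def fetch_current(self, cell_id: str) -> dict:
        last_error = None
        for provider in self.providers:
            try:
                return await provider.fetch_current(cell_id)
            except Exception as exc:                    # timeout, rate limit, 5xx, ...
                logging.warning("provider %s failed for %s: %s", provider, cell_id, exc)
                last_error = exc
        raise AllProvidersUnavailable(cell_id) from last_error
```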
Evolution and Scaling
What to say
This design handles 100M queries/day comfortably. Let me discuss how it evolves for 10x scale and additional features like hyperlocal forecasts.
Evolution Path:
Stage 1: Basic Weather App (Current Design)
- 100M queries/day
- 2-decimal precision grid (~1 km cells)
- Single-region deployment
- 3-4 upstream providers

Stage 2: Global Scale (1B queries/day)
- Multi-region deployment with geo-routing
- Regional weather data caches
- Cross-region cache replication for popular locations
- Provider load balancing by region

Stage 3: Hyperlocal Weather
- 3-decimal precision grid (~100 m cells)
- Integration with local weather stations
- User-reported conditions (crowdsourcing)
- ML-based nowcasting (next 2 hours)
Multi-Region Architecture
Additional Features to Discuss:
| Feature | Architecture Impact | Complexity |
|---|---|---|
| Push notifications for alerts | Add message queue + push service (FCM/APNS) | Medium |
| Historical weather data | Add TimescaleDB for time-series, separate API | Medium |
Alternative approach
If we needed sub-minute freshness (like for aviation weather), I would switch to a push model with WebSockets. Subscribe clients to location channels and push updates when data changes. This inverts the architecture from pull-based to push-based.
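A rough sketch of that inversion, using the third-party websockets package and keeping channel bookkeeping in memory for brevity:

```python
import asyncio, json
from collections import defaultdict

import websockets  # third-party; any WebSocket server works the same way

subscribers: dict[str, set] = defaultdict(set)    # cell_id -> connected clients

async def handle_client(ws):
    # Each client subscribes to one grid cell, e.g. {"cell_id": "40.71_-74.01"}
    cell_id = json.loads(await ws.recv())["cell_id"]
    subscribers[cell_id].add(ws)
    try:
        await ws.wait_closed()                    # keep the connection open
    finally:
        subscribers[cell_id].discard(ws)

async def push_update(cell_id: str, weather: dict) -> None:
    # Called by the data fetcher whenever fresh data arrives for a cell
    payload = json.dumps({"cell_id": cell_id, "weather": weather})
    await asyncio.gather(*(ws.send(payload) for ws in subscribers[cell_id]),
                         return_exceptions=True)

async def main() -> None:
    async with websockets.serve(handle_client, "0.0.0.0", 8765):
        await asyncio.Future()                    # serve until cancelled
```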
Cost Optimization at Scale:
1. Tiered caching by popularity: Hot locations in an expensive fast cache, cold locations in cheaper slow storage
2. Provider negotiation: At scale, negotiate bulk pricing or direct data feeds
3. Predictive prefetching: Fetch data before users request it, based on time-of-day patterns
4. Compression: Weather data compresses well (80%+), reducing storage and transfer costs