Design Walkthrough
Problem Statement
The Question: Design a weather application that serves millions of queries daily with current conditions and forecasts.
A weather application must provide:

- Current conditions: Temperature, humidity, wind, precipitation
- Forecasts: Hourly (48h), daily (10-14 days), minute-by-minute (next hour)
- Alerts: Severe weather warnings by location
- Historical data: Past conditions for analytics
- Location-based lookup: By coordinates, city name, or ZIP code
What to say first
Before I design this, let me clarify the requirements. I want to understand the scale, data sources, freshness requirements, and whether we are building data collection infrastructure or just the serving layer.
Hidden requirements interviewers are testing:

- Do you recognize this as a caching problem?
- Can you design for read-heavy workloads (99.9% reads)?
- Do you understand geospatial data access patterns?
- Can you handle external data dependencies gracefully?
Clarifying Questions
Ask these questions to scope the problem correctly. Weather apps can range from simple API wrappers to complex forecasting systems.
Question 1: Data Source
Are we collecting weather data ourselves (sensors, satellites) or aggregating from existing providers like NOAA, OpenWeatherMap?
- Why this matters: Building data collection is a completely different system.
- Typical answer: Aggregate from existing providers (NOAA, private APIs).
- Architecture impact: Focus on caching, aggregation, and serving - not data collection.
Question 2: Scale
What is our expected query volume? How many unique locations do we need to support?
- Why this matters: Determines caching strategy granularity.
- Typical answer: 100M queries/day, global coverage (millions of lat/long combinations).
- Architecture impact: Cannot cache every possible coordinate - need spatial bucketing.
Question 3: Freshness
How fresh must the weather data be? Is 15-minute-old data acceptable?
- Why this matters: Directly impacts cache TTLs and hit rates.
- Typical answer: Current conditions - 15 minutes acceptable; forecasts - 1 hour acceptable.
- Architecture impact: Can use aggressive caching with short TTLs.
Question 4: Features
What features beyond basic weather? Alerts, historical, radar maps, air quality?
- Why this matters: Each feature has different data characteristics.
- Typical answer: Start with current conditions + forecast + alerts.
- Architecture impact: Alerts need push capability, not just request/response.
Stating assumptions
I will assume: aggregating from external providers, 100M queries/day, 15-minute freshness for current weather, global coverage, current conditions + 7-day forecast + severe weather alerts.
The Hard Part
Say this out loud
The hard part here is achieving high cache hit rates while maintaining acceptable data freshness, especially when users query by arbitrary coordinates that do not map cleanly to cached data.
Why this is genuinely hard:
1. Infinite Location Space: Users query by lat/long with arbitrary precision. You cannot cache every possible coordinate (40.7128, -74.0060 vs 40.7129, -74.0061).
2. Freshness vs Efficiency: Weather data updates hourly from providers. Fetching on every request wastes money and adds latency, but caching too aggressively shows stale data.
3. Provider Dependencies: External weather APIs have rate limits, costs, and can fail. You need resilience against upstream failures.
4. Personalization Kills Caching: Features like "feels like" temperature based on user preferences, or hyperlocal minute-by-minute forecasts, reduce cache effectiveness.
Common mistake
Candidates often design for the write path (how data gets into the system) when 99.9% of traffic is reads. Focus on the read path and caching strategy first.
The key insight:
Weather does not change at coordinate-level granularity. The weather at (40.7128, -74.0060) is identical to (40.7130, -74.0062) - the two points are roughly 30 meters apart. We can bucket coordinates into grid cells and cache per cell.
| Precision | Grid Size | Use Case |
|---|---|---|
| 1 decimal | ~11 km | Country-level |
| 2 decimals | ~1.1 km | City-level (good default) |
| 3 decimals | ~110 m | Neighborhood-level |
| 4 decimals | ~11 m | Street-level (overkill) |

Scale and Access Patterns
Let me estimate the scale and understand access patterns.
| Dimension | Value | Impact |
|---|---|---|
| Daily queries | 100,000,000 | ~1,200 QPS average, 10K QPS peak |
| Unique locations | ~1M grid cells at 2 decimal precision | Manageable cache size |
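A quick back-of-envelope check of those figures; the 10x peak factor and the ~10 KB per-cell payload are assumptions, not requirements:

```python
# Back-of-envelope scale check; peak factor and per-cell payload size are assumptions
daily_queries = 100_000_000
avg_qps = daily_queries / 86_400          # ~1,157 QPS average
peak_qps = avg_qps * 10                   # assume ~10x diurnal peak -> the 10K+ figure above

grid_cells = 1_000_000                    # 2-decimal cells that actually receive traffic
bytes_per_cell = 10 * 1024                # assume ~10 KB of current + forecast JSON per cell
working_set_gb = grid_cells * bytes_per_cell / 1024**3

print(f"{avg_qps:,.0f} QPS avg, {peak_qps:,.0f} QPS peak, ~{working_set_gb:.1f} GB working set")
# -> 1,157 QPS avg, 11,574 QPS peak, ~9.5 GB working set
```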
What to say
This is an extremely read-heavy system with predictable update patterns. The entire working dataset (10GB) fits in memory, and 80% of traffic hits popular locations. This screams multi-layer caching.
Access Pattern Analysis:
- Temporal locality: Same location queried many times per hour (people checking weather repeatedly)
- Spatial locality: Popular cities get 1000x more queries than rural areas
- Predictable updates: Weather data updates on schedule, not randomly
- Burst patterns: Morning commute, evening, and pre-weekend spikes
Cache hit rate analysis:
- ~1M unique grid cells globally, so the hot working set fits in memory
- 80% of traffic concentrates on popular locations, so a modest cache captures most requests
- CDN and Redis layers together can absorb roughly 95% of queries (detailed in the caching section)

High-Level Architecture
The architecture is designed around multi-layer caching with geographic distribution.
What to say
I will design a multi-layer caching system. CDN at the edge handles most requests, application cache handles the rest, and we only hit upstream providers on cache misses or scheduled refreshes.
Weather Application Architecture
Component Responsibilities:
1. CDN Edge Layer
   - First line of defense for caching
   - Cache by grid cell (rounded coordinates)
   - TTL: 10-15 minutes for current conditions, 1 hour for forecasts
   - Handles 80%+ of requests without hitting origin

2. Application Layer (Weather Service)
   - Coordinate normalization (snap to grid)
   - Cache lookup and population
   - Provider failover logic
   - Response formatting

3. Redis Cache
   - Application-level cache for CDN misses
   - Faster than the database for hot data
   - Pub/sub for cache invalidation

4. PostgreSQL + PostGIS
   - Geospatial queries for location lookup
   - City/ZIP to coordinate mapping
   - Stores canonical weather data

5. Background Data Fetcher
   - Proactively fetches weather for popular locations
   - Runs on a schedule (every 15 minutes for top locations)
   - Handles provider rate limits and failover
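To make the read path concrete, here is a minimal sketch of what the weather service does on a CDN miss. It assumes a module-level async Redis client (redis) and the coords_to_cell_id helper defined later in the data-model section; query_db and fetch_from_provider are hypothetical stand-ins for the PostgreSQL and provider layers:

```python
import json

async def handle_current_weather(lat: float, lon: float) -> dict:
    # 1. Normalize arbitrary coordinates onto the cache grid
    cell_id = coords_to_cell_id(lat, lon)

    # 2. Application cache (Redis): absorbs most CDN misses
    cached = await redis.get(f"weather:current:{cell_id}")
    if cached is not None:
        return json.loads(cached)

    # 3. Database: canonical copy kept warm by the background fetcher
    row = await query_db(cell_id)                     # hypothetical PostgreSQL helper
    if row is not None:
        await redis.set(f"weather:current:{cell_id}", json.dumps(row), ex=900)
        return row

    # 4. Last resort: synchronous fetch from an upstream provider
    data = await fetch_from_provider(cell_id)         # hypothetical provider client
    await redis.set(f"weather:current:{cell_id}", json.dumps(data), ex=900)
    return data
```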
Real-world reference
Apple Weather uses a similar architecture with aggressive edge caching. Dark Sky (now Apple) was known for hyperlocal forecasts using a dense grid of cached predictions.
Data Model and Storage
The data model must support efficient geospatial lookups and time-series weather data.
What to say
I will use a grid-based storage model where each grid cell has associated weather data. This normalizes the infinite coordinate space into a finite set of cacheable cells.
```sql
-- Grid cells for spatial bucketing
CREATE TABLE grid_cells (
    cell_id    VARCHAR(20) PRIMARY KEY,   -- e.g., "40.71_-74.01"
    center_lat DECIMAL(5, 2) NOT NULL,    -- illustrative columns; full schema abbreviated
    center_lon DECIMAL(5, 2) NOT NULL
);
```

Grid Cell ID Generation:
```python
def coords_to_cell_id(lat: float, lon: float, precision: int = 2) -> str:
    """
    Convert coordinates to a grid cell ID, e.g. (40.7128, -74.0060) -> "40.71_-74.01".
    """
    return f"{lat:.{precision}f}_{lon:.{precision}f}"
```

Cache Key Design:
Key patterns:
```
weather:current:{cell_id}
```

Important detail
Store fetched_at timestamp with cached data. This allows serving slightly stale data with a warning if upstream providers fail, rather than returning errors.
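A minimal sketch of that idea, assuming the same module-level async Redis client; the 2-hour cutoff mirrors the staleness invariant discussed under consistency, and the 15-minute threshold is the normal freshness target:

```python
import json, time

async def cache_weather(cell_id: str, data: dict) -> None:
    # Persist the payload together with when it was fetched, not just a TTL
    envelope = {"data": data, "fetched_at": time.time()}
    await redis.set(f"weather:current:{cell_id}", json.dumps(envelope))

async def read_cached_weather(cell_id: str, max_stale: float = 2 * 3600):
    raw = await redis.get(f"weather:current:{cell_id}")
    if raw is None:
        return None
    envelope = json.loads(raw)
    age = time.time() - envelope["fetched_at"]
    if age > max_stale:
        return None                            # too old even for degraded serving
    return envelope["data"], age > 15 * 60     # (data, is_stale beyond normal freshness)
```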
API Design
The API must be simple, cacheable, and support multiple lookup methods.
```
GET /v1/weather/current
    ?lat=40.7128&lon=-74.0060     # By coordinates
    ?city=New+York,US             # By city name
```

Example response (abridged):

```json
{
    "location": { "lat": 40.71, "lon": -74.01 },
    ...
}
```

Cache-friendly design
Notice the API returns the grid cell coordinates (40.71, -74.01), not the exact input (40.7128, -74.0060). This makes responses identical for nearby coordinates, improving cache hit rates.
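For example, with the coords_to_cell_id helper from the data-model section, two requests a block apart snap to the same cell and therefore share one cached response:

```python
>>> coords_to_cell_id(40.7128, -74.0060)   # exact user coordinates
'40.71_-74.01'
>>> coords_to_cell_id(40.7133, -74.0052)   # another device a block away
'40.71_-74.01'
```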
Caching Strategy Deep Dive
Caching is the most critical aspect of this system. Let me detail the multi-layer approach.
Cache Layers
| Layer | Hit Rate | Latency | TTL | Capacity |
|---|---|---|---|---|
| CDN Edge | ~80% | 5-20ms | 10-15 min | Unlimited (distributed) |
| Redis | ~15% | 1-5ms | 15-60 min | 10GB (hot data) |
| Database | ~4% | 10-50ms | N/A | Full dataset |
| Upstream Fetch | ~1% | 200-500ms | N/A | Rate limited |
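As a sanity check, a weighted average using rough midpoints of those latency ranges (the midpoints are assumptions) shows why the layering pays off:

```python
# (layer, hit rate, assumed typical latency in ms) using rough midpoints of the ranges above
layers = [
    ("CDN edge", 0.80, 12),
    ("Redis", 0.15, 3),
    ("Database", 0.04, 30),
    ("Upstream fetch", 0.01, 350),
]
expected_ms = sum(rate * latency_ms for _, rate, latency_ms in layers)
print(f"Expected latency ~= {expected_ms:.1f} ms")   # ~14.8 ms, dominated by the CDN tier
```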
Cache Warming Strategy:
Do not wait for cache misses - proactively warm the cache for popular locations.
```python
class CacheWarmer:
    def __init__(self):
        # Top 10K locations by query volume, populated from query analytics
        self.popular_cells: list[str] = []
```

Stale-While-Revalidate Pattern:
```python
async def get_weather(cell_id: str) -> WeatherData:
    # Check cache; serve cached data immediately, even if slightly stale
    cached = await redis.get(f"weather:current:{cell_id}")
    if cached is not None:
        return WeatherData.from_json(cached)
    return await refresh_cell(cell_id)   # cold miss: fetch upstream (refresh_cell is an illustrative helper)
```

CDN configuration
Use Cache-Control: public, max-age=900, stale-while-revalidate=1800. This tells CDN to serve cached response while fetching fresh data in background if cache is 15-30 min old.
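The same idea applies at the Redis layer, and it completes the stale-while-revalidate sketch above: serve whatever is cached immediately and refresh in the background once the entry crosses a soft TTL. This assumes the fetched_at envelope from earlier and a hypothetical fetch_from_provider helper:

```python
import asyncio, json, time

SOFT_TTL = 900  # 15 minutes: past this, serve stale and refresh in the background

async def get_weather_swr(cell_id: str) -> dict:
    raw = await redis.get(f"weather:current:{cell_id}")
    if raw is not None:
        envelope = json.loads(raw)
        if time.time() - envelope["fetched_at"] > SOFT_TTL:
            # Stale: trigger a refresh but do not make this request wait for it
            asyncio.create_task(refresh_cell(cell_id))
        return envelope["data"]
    return await refresh_cell(cell_id)                 # cold miss: fetch synchronously

async def refresh_cell(cell_id: str) -> dict:
    data = await fetch_from_provider(cell_id)          # hypothetical provider client
    envelope = {"data": data, "fetched_at": time.time()}
    await redis.set(f"weather:current:{cell_id}", json.dumps(envelope))
    return data
```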
Consistency and Invariants
System Invariants
Weather data must never be more than 2 hours stale under normal operation. During provider outages, serve stale data with clear indication rather than errors.
Consistency Model: Eventual Consistency
This system is a perfect fit for eventual consistency because:

- Weather changes gradually (no instant state changes)
- Slightly stale data (15 minutes old) is still useful
- Strong consistency would destroy cache hit rates
- No financial or safety-critical decisions depend on it (alerts are handled separately)
Data Freshness Guarantees:
| Data Type | Max Staleness | Update Trigger | Failure Behavior |
|---|---|---|---|
| Current conditions | 15 minutes | Scheduled fetch every 15 min | Serve stale with warning |
| Hourly forecast | 1 hour | Scheduled fetch every hour | Serve stale with warning |
| Daily forecast | 4 hours | Scheduled fetch every 4 hours | Serve stale with warning |
| Severe alerts | 5 minutes | Push from providers + polling | Fail open (show alert if uncertain) |
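These guarantees translate directly into per-type refresh intervals for the background fetcher; the values below simply mirror the table:

```python
# Refresh intervals (seconds) for the background fetcher, mirroring the freshness table
REFRESH_INTERVAL = {
    "current": 15 * 60,
    "hourly_forecast": 60 * 60,
    "daily_forecast": 4 * 60 * 60,
    "alerts": 5 * 60,   # polled this often in addition to provider push
}
```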
Alerts are different
Severe weather alerts have stricter requirements. It is better to show a false alert than miss a real tornado warning. Use fail-open for alerts - if uncertain, display the alert.
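A minimal sketch of the fail-open rule, assuming the same async Redis client and a hypothetical fetch_alerts call against the alert feed:

```python
import json

async def get_alerts(cell_id: str) -> list[dict]:
    try:
        alerts = await fetch_alerts(cell_id)              # hypothetical alert-feed call
        await redis.set(f"weather:alerts:{cell_id}", json.dumps(alerts), ex=300)
        return alerts
    except Exception:
        # Fail open: if the feed is unreachable, show the last known alerts rather than none
        cached = await redis.get(f"weather:alerts:{cell_id}")
        return json.loads(cached) if cached else []
```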
Multi-Provider Consistency:
When aggregating from multiple providers, they may disagree. Handle this explicitly:
```python
async def get_aggregated_weather(cell_id: str) -> WeatherData:
    # Fetch from multiple providers concurrently; PROVIDERS and merge_results are illustrative
    results = await asyncio.gather(
        *(provider.get_current(cell_id) for provider in PROVIDERS), return_exceptions=True
    )
    return merge_results([r for r in results if not isinstance(r, Exception)])  # e.g. median of numeric fields
```

Failure Modes and Resilience
Proactively discuss failures
Let me walk through failure scenarios. The key principle is graceful degradation - always serve something useful, even if not perfect.
| Failure | Impact | Mitigation | Why It Works |
|---|---|---|---|
| Primary provider down | Cannot fetch fresh data | Failover to backup providers | Multiple providers with same coverage |
| All providers down | No fresh data available | Serve stale cached data with warning | 2-hour-old weather still useful |
Provider Failover:
```python
class WeatherProviderClient:
    def __init__(self, providers):
        # Provider clients in priority order, e.g. [NOAAClient(), OpenWeatherMapClient()]
        self.providers = providers
```

Graceful Degradation Response:
```json
{
    "location": { ... },
    "current": { ... },
    "fetched_at": "...",
    "warning": "Data may be up to 2 hours old due to a provider outage"
}
```

User experience during degradation
Apps should display a subtle indicator when showing stale data rather than blocking the UI: something like a yellow dot next to the time, or an "Updated 1 hour ago" label. Users would rather see old weather than an error.
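Filling in the failover loop from the WeatherProviderClient sketched earlier: a rough version assuming per-provider client objects that expose a fetch_current coroutine, with an illustrative exception type.

```python
import logging

class AllProvidersUnavailable(Exception):
    """Raised when every upstream provider fails; callers fall back to stale cached data."""

class WeatherProviderClient:
    def __init__(self, providers):
        self.providers = providers                      # ordered by priority

    async def fetch_current(self, cell_id: str) -> dict:
        last_error = None
        for provider in self.providers:
            try:
                return await provider.fetch_current(cell_id)
            except Exception as exc:                    # timeout, rate limit, 5xx, ...
                logging.warning("provider %s failed for %s: %s", provider, cell_id, exc)
                last_error = exc
        raise AllProvidersUnavailable(cell_id) from last_error
```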
Evolution and Scaling
What to say
This design handles 100M queries/day comfortably. Let me discuss how it evolves for 10x scale and additional features like hyperlocal forecasts.
Evolution Path:
Stage 1: Basic Weather App (Current Design)
- 100M queries/day
- 2-decimal precision grid (~1 km cells)
- Single-region deployment
- 3-4 upstream providers

Stage 2: Global Scale (1B queries/day)
- Multi-region deployment with geo-routing
- Regional weather data caches
- Cross-region cache replication for popular locations
- Provider load balancing by region

Stage 3: Hyperlocal Weather
- 3-decimal precision grid (~100 m cells)
- Integration with local weather stations
- User-reported conditions (crowdsourcing)
- ML-based nowcasting (next 2 hours)
Multi-Region Architecture
Additional Features to Discuss:
| Feature | Architecture Impact | Complexity |
|---|---|---|
| Push notifications for alerts | Add message queue + push service (FCM/APNS) | Medium |
| Historical weather data | Add TimescaleDB for time-series, separate API | Medium |
Alternative approach
If we needed sub-minute freshness (like for aviation weather), I would switch to a push model with WebSockets. Subscribe clients to location channels and push updates when data changes. This inverts the architecture from pull-based to push-based.
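A rough sketch of that inversion, using the third-party websockets package and keeping channel bookkeeping in memory for brevity:

```python
import asyncio, json
from collections import defaultdict

import websockets  # third-party; any WebSocket server works the same way

subscribers: dict[str, set] = defaultdict(set)    # cell_id -> connected clients

async def handle_client(ws):
    # Each client subscribes to one grid cell, e.g. {"cell_id": "40.71_-74.01"}
    cell_id = json.loads(await ws.recv())["cell_id"]
    subscribers[cell_id].add(ws)
    try:
        await ws.wait_closed()                    # keep the connection open
    finally:
        subscribers[cell_id].discard(ws)

async def push_update(cell_id: str, weather: dict) -> None:
    # Called by the data fetcher whenever fresh data arrives for a cell
    payload = json.dumps({"cell_id": cell_id, "weather": weather})
    await asyncio.gather(*(ws.send(payload) for ws in subscribers[cell_id]),
                         return_exceptions=True)

async def main() -> None:
    async with websockets.serve(handle_client, "0.0.0.0", 8765):
        await asyncio.Future()                    # serve until cancelled
```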
Cost Optimization at Scale:
1. Tiered caching by popularity: Hot locations in an expensive fast cache, cold locations in cheaper slow storage
2. Provider negotiation: At scale, negotiate bulk pricing or direct data feeds
3. Predictive prefetching: Fetch data before users request it, based on time-of-day patterns
4. Compression: Weather data compresses well (80%+), reducing storage and transfer costs