Design Walkthrough
Problem Statement
The Question: Design a hotel booking system that handles millions of daily searches and bookings while ensuring no room is ever double-booked.
A hotel booking system must support: - Property Search: Find available hotels by location, dates, price, amenities - Availability Check: Real-time room availability for specific dates - Booking Flow: Reserve rooms with payment processing - Inventory Management: Hotels update room availability and pricing - Booking Management: Users view, modify, or cancel reservations
What to say first
Before I design, let me clarify the requirements. The key challenge here is preventing double-bookings in a distributed system while maintaining fast search performance. I will focus on the booking flow and inventory management.
Hidden requirements interviewers test: - Do you understand the double-booking problem? - Can you handle the read-heavy workload efficiently? - Do you know when to use strong vs eventual consistency? - Can you design for both search optimization and booking correctness?
Clarifying Questions
Ask these questions to shape your architecture. Each answer has significant design implications.
Question 1: Scale
What is the expected scale? How many properties, searches per second, and bookings per day?
Why this matters: Determines sharding strategy and caching needs. Typical answer: 1M+ properties, 10K searches/sec, 1M bookings/day Architecture impact: Need distributed search, aggressive caching, sharded booking database
Question 2: Consistency requirements
Is it acceptable to show slightly stale availability in search results? What about during the actual booking flow?
Why this matters: Determines consistency model. Typical answer: Stale search OK (within minutes), booking must be strongly consistent Architecture impact: Eventual consistency for reads, strong consistency for writes
Question 3: Booking hold time
Should we hold inventory while user completes payment? For how long?
Why this matters: Determines locking strategy. Typical answer: 10-15 minute hold during checkout Architecture impact: Need TTL-based locks or reservation state machine
Question 4: Pricing complexity
Is pricing static or dynamic? Does it vary by date, demand, or user segment?
Why this matters: Dynamic pricing adds complexity. Typical answer: Dynamic by date and demand, possibly personalized Architecture impact: Need pricing service, price caching strategy
Stating assumptions
I will assume: 1M properties, 10K searches/sec, 1M bookings/day, eventual consistency OK for search, strong consistency required for booking, 15-minute hold during checkout.
The Hard Part
Say this out loud
The hard part here is preventing double-bookings in a distributed system. Two users searching simultaneously might both see a room as available, but only one booking can succeed.
Why this is genuinely hard:
- 1.Race Conditions: User A and User B both see Room 101 available for Dec 25. Both click book. Both reach payment. Who gets the room?
- 2.Distributed State: Inventory data may be cached across multiple servers. How do you ensure all servers agree on availability?
- 3.Date Range Overlaps: Booking Dec 20-25 conflicts with Dec 23-28. Detecting overlaps efficiently across millions of existing bookings is non-trivial.
- 4.Hold and Release: User starts checkout but abandons. The held inventory must be released, but what if they come back?
Common mistake
Candidates often propose checking availability then booking as two separate steps without locking. This creates a race condition window where double-booking can occur.
The fundamental tension:
High Availability for Search -> Cache availability -> Stale data possible
Strong Consistency for Booking -> Lock inventory -> Potential contentionThe solution is to separate the read path (search) from the write path (booking): - Reads: Eventually consistent, heavily cached, optimized for speed - Writes: Strongly consistent, serialized per property, optimized for correctness
Double Booking Race Condition
Scale and Access Patterns
Let me estimate the scale and understand access patterns to drive design decisions.
| Dimension | Value | Impact |
|---|---|---|
| Properties | 1,000,000 | Need efficient search indexing |
| Rooms per property | 50 average | 50M total room-nights to manage |
What to say
At 10K searches/sec but only 12 bookings/sec, this is extremely read-heavy. We should optimize aggressively for search performance while ensuring booking correctness.
Storage estimation:
- 1M properties x 10KB metadata = 10 GB
- 50M rooms x 1KB = 50 GBAccess Pattern Analysis:
- Search: Geo-based, date-filtered, sorted by price/rating. Highly cacheable. - Availability: Property + date range lookup. Moderate cacheability (changes on booking). - Booking: Write operation, must be serialized per room-date. - Hot spots: Popular destinations, holiday dates, last-minute bookings.
High-Level Architecture
Let me design an architecture that separates read and write paths for optimal performance and correctness.
What to say
I will separate the read path (search, availability) from the write path (booking). Reads can be eventually consistent and heavily cached. Writes must be strongly consistent and serialized per property.
Hotel Booking System Architecture
Component Responsibilities:
- 1.Search Service: Handles property discovery queries - Uses Elasticsearch for full-text and geo search - Heavily cached, eventually consistent - Filters by location, dates, price, amenities
- 2.Availability Service: Real-time room availability - Reads from cache first, falls back to replica - Invalidated on booking events - Returns available rooms and pricing
- 3.Booking Service: Handles reservation creation - Acquires distributed lock per room-date - Validates availability in primary DB - Creates booking atomically - Publishes booking event
- 4.Payment Service: Processes payments - Called after inventory hold - On failure, releases hold - Supports refunds for cancellations
Real-world reference
Airbnb uses a similar architecture with separate search (Elasticsearch) and booking (MySQL) paths. They use Redis for availability caching and distributed locking.
Data Model and Storage
The data model must support efficient querying for both search and booking while preventing double-bookings.
What to say
The key insight is that availability is the intersection of what the hotel offers and what is not already booked. I will model this with separate tables for inventory and bookings.
-- Properties (hotels, vacation rentals)
CREATE TABLE properties (
id UUID PRIMARY KEY,Availability Calculation:
The key query is: "How many rooms of type X are available on dates Y to Z?"
-- Check availability for a room type and date range
WITH date_range AS (
SELECT generate_series(Important optimization
The availability query above is expensive for real-time use. In production, we pre-compute and cache available_count in the room_pricing table, updating it asynchronously when bookings change.
Booking Flow Deep Dive
The booking flow is the critical path that must prevent double-bookings. Let me walk through it step by step.
Booking Flow State Machine
Step-by-step booking with locking:
class BookingService:
def __init__(self, db, redis, payment_service):
self.db = dbCritical: Always read from primary
During the booking flow, always read availability from the PRIMARY database, not replicas or cache. Stale data here leads to double-bookings.
Alternative: Optimistic Locking
Instead of distributed locks, use database-level optimistic locking:
-- Add version column to room_pricing
ALTER TABLE room_pricing ADD COLUMN version INT DEFAULT 0;
Search and Availability Optimization
Search is 100x more frequent than booking. We must optimize aggressively without sacrificing booking correctness.
What to say
I will use Elasticsearch for property search and Redis for availability caching. The cache is eventually consistent but fast. Actual booking always validates against the primary database.
{
"mappings": {
"properties": {{
"query": {
"bool": {Availability Caching Strategy:
class AvailabilityCache:
def __init__(self, redis, db):
self.redis = redisCache invalidation strategy
On booking confirmation, publish event to message queue. Availability service subscribes and invalidates relevant cache entries. This decouples booking from cache updates.
Consistency and Invariants
System Invariants
The number of confirmed + pending bookings for a room type on any given date must never exceed the total inventory. Violating this means double-booking.
Consistency Model by Operation:
| Operation | Consistency | Why |
|---|---|---|
| Property search | Eventually consistent | Stale results OK, user will recheck |
| Availability display | Eventually consistent | Shows approximate, validates on book |
| Booking creation | Strongly consistent | Must prevent double-booking |
| Booking confirmation | Strongly consistent | Payment tied to inventory |
| Cancellation | Strongly consistent | Must release inventory correctly |
What to say
I accept that search results may be slightly stale - a user might click on a property that just got fully booked. But the actual booking flow validates against the primary database, so we never actually double-book.
Handling Conflicts:
When two users try to book the last room simultaneously:
Time User A User B System
---- ------ ------ ------
T1 Click Book Click Book Both see availableDatabase-Level Protection:
As a final safety net, add a database constraint:
-- Trigger to prevent overbooking
CREATE OR REPLACE FUNCTION check_overbooking()
RETURNS TRIGGER AS $$Failure Modes and Resilience
Proactively discuss failures
Let me walk through failure scenarios. This is critical for a booking system where failures can mean lost revenue or angry customers.
| Failure | Impact | Mitigation | Why It Works |
|---|---|---|---|
| Primary DB down | Cannot create bookings | Failover to standby, reject new bookings until healthy | Preserve consistency over availability |
| Redis (cache) down | Slow availability checks | Fall back to DB replicas, increased latency | Graceful degradation |
Handling Payment Failures:
async def confirm_booking_with_retry(self, booking_id: str,
payment_token: str,
max_retries: int = 3):Idempotency is critical
Always use idempotency keys for payment operations. Network failures might cause retries, and we must not charge the customer twice for the same booking.
Expired Booking Cleanup:
# Run every minute
async def cleanup_expired_bookings():
# Find pending bookings older than 15 minutesEvolution and Scaling
What to say
This design handles up to 1M bookings/day with a single primary database. Let me discuss how it evolves for 10x scale and additional features.
Evolution Path:
Stage 1: Single Region (current design) - Single PostgreSQL primary + replicas - Redis cluster for caching and locks - Elasticsearch for search - Handles ~1M bookings/day
Stage 2: Sharded by Property (10x scale) - Shard bookings table by property_id - Each shard handles bookings for subset of properties - No cross-shard transactions needed (bookings are per-property)
Stage 3: Multi-Region (global scale) - Regional deployments with local databases - Properties assigned to regions - Cross-region booking requires redirect to owning region
Sharded Architecture
Additional Features to Consider:
- 1.Dynamic Pricing: Add pricing service that adjusts based on demand, seasonality, competitor prices
- 2.Waitlist: When fully booked, allow users to join waitlist. Notify on cancellation.
- 3.Overbooking Strategy: Hotels sometimes overbook by 5-10% assuming some cancellations. Need policy for handling when everyone shows up.
- 4.Multi-room Bookings: Booking 3 rooms simultaneously. Need atomic reservation across rooms.
- 5.Channel Management: Sync inventory across Booking.com, Expedia, direct website. Real-time inventory sync critical.
Alternative approach
If we needed sub-second booking globally, I would consider event sourcing: every booking is an event, current availability is derived from event log. This enables geographic distribution with eventual convergence.
| Scale | Architecture | Tradeoff |
|---|---|---|
| <1M bookings/day | Single primary DB | Simple, strongly consistent |
| 1-10M bookings/day | Sharded by property | More complex routing, still consistent per shard |
| 10M+ bookings/day | Multi-region with property affinity | Latency for cross-region, complex sync |
| Global real-time | Event sourcing + CRDT | Eventually consistent, complex implementation |