
Apache Hadoop: Distributed Storage and Processing

The framework that started the big data revolution - store and process petabytes across thousands of commodity machines

Java | 14,500 stars | Updated January 2024 | 50 min read
View on GitHub

Summary

Hadoop solves the problem of storing and processing datasets too large for a single machine. It consists of three core components: HDFS (a distributed file system that splits files into blocks and replicates them across nodes), MapReduce (a programming model that processes data in parallel by moving computation to where data lives), and YARN (a resource manager that schedules jobs across the cluster). The key insight is that commodity hardware fails frequently, so the system must handle failures automatically through replication and task re-execution.
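To make the storage side concrete, here is a minimal HDFS client sketch in Java. It assumes the Hadoop client libraries are on the classpath and a reachable NameNode at a hypothetical address (hdfs://namenode:9000); it writes a file once and reads it back through the FileSystem API.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS would normally point at the cluster's NameNode;
        // hdfs://namenode:9000 is a hypothetical address.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/hello.txt");

        // Write once: the file is created, streamed, and closed.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many: any client can stream the blocks back.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}
```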

Key Takeaways

Data Locality Principle

Moving computation to data is cheaper than moving data to computation. With petabyte-scale datasets, network bandwidth is the bottleneck. Hadoop schedules tasks on nodes that already have the data blocks locally.
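The locality information HDFS exposes is visible to any client. A small sketch, assuming an existing HDFS file path passed as a command-line argument, lists which DataNodes hold each block; this is the same host list the scheduler consults when placing map tasks near the data.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path(args[0]); // e.g. /data/events.log (hypothetical)

        FileStatus status = fs.getFileStatus(path);
        // Each BlockLocation names the DataNodes holding one block's replicas.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```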

Horizontal Scaling with Commodity Hardware

Instead of buying expensive specialized hardware, Hadoop scales by adding more cheap commodity machines. A 1000-node cluster of commodity servers is more cost-effective and fault-tolerant than a single supercomputer.

Fault Tolerance Through Replication

HDFS replicates each block (default 3 copies) across different nodes and racks. When a node fails, the system automatically re-replicates affected blocks to maintain the replication factor.
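The replication factor is set cluster-wide via dfs.replication (default 3) and can be overridden per file. A short sketch, using a hypothetical archive path, lowers the factor for cold data through the FileSystem API; the hdfs dfs -setrep shell command does the same thing.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path coldData = new Path("/archive/2023/logs.gz"); // hypothetical path

        // Drop the replication factor for data that can tolerate more risk,
        // trading some durability for a third less raw storage.
        fs.setReplication(coldData, (short) 2);

        short current = fs.getFileStatus(coldData).getReplication();
        System.out.println("replication factor: " + current);
    }
}
```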

Write-Once, Read-Many Model

HDFS is optimized for batch processing workloads where files are written once and read many times. This simplifies consistency - no need for complex distributed locking or MVCC.

MapReduce Simplifies Distributed Programming

Developers write just two functions: map (transform input records) and reduce (aggregate intermediate results). The framework handles distribution, fault tolerance, and shuffling automatically.
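A sketch of those two functions for the classic word-count job, written against the org.apache.hadoop.mapreduce API; the class names are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (word, 1) for every token in an input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce: sum the counts the framework has already grouped by word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```

Everything else (splitting the input, shuffling intermediate pairs to reducers, retrying failed tasks) is handled by the framework.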

YARN Separates Resource Management from Processing

YARN decouples cluster resource management from the MapReduce programming model. This allows multiple processing frameworks (Spark, Flink, Tez) to share the same cluster resources.
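A driver sketch that submits the word-count job above through YARN; input and output paths come from the command line, and setting mapreduce.framework.name to yarn is what routes the job to the ResourceManager instead of the local runner.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hand the job to YARN rather than running it in-process.
        conf.set("mapreduce.framework.name", "yarn");

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // YARN's ResourceManager allocates containers for map and reduce tasks.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```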

Deep Dive

In the early 2000s, Google faced a problem: they needed to process the entire web - billions of pages - to build their search index. No single machine could store or process that much data. Their solution, described in the Google File System (2003) and MapReduce (2004) papers, inspired Doug Cutting and Mike Cafarella to create Hadoop in 2006.

The core problem Hadoop solves:

  1. Storage: How do you store petabytes of data reliably when individual disks fail regularly?
  2. Processing: How do you process that data in reasonable time when a single machine would take years?
  3. Cost: How do you do this without buying million-dollar specialized hardware?

Hadoop's answer: Use thousands of cheap commodity machines, distribute data across them, replicate for fault tolerance, and move computation to where data lives.
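A back-of-envelope sketch of what that means for storage, assuming the default 128 MiB block size (dfs.blocksize) and a replication factor of 3:

```java
public class BlockMath {
    public static void main(String[] args) {
        long fileBytes = 1L << 40;             // a 1 TiB file (illustrative)
        long blockBytes = 128L * 1024 * 1024;  // default 128 MiB block size
        int replication = 3;                   // default dfs.replication

        long blocks = (fileBytes + blockBytes - 1) / blockBytes;
        long replicas = blocks * replication;
        double rawTiB = (double) fileBytes * replication / (1L << 40);

        System.out.printf("blocks=%d, block replicas=%d, raw storage=%.1f TiB%n",
                blocks, replicas, rawTiB);
        // -> blocks=8192, block replicas=24576, raw storage=3.0 TiB
    }
}
```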

Hadoop Ecosystem Overview

Who uses Hadoop?

  • Yahoo: Original production user, 40,000+ node clusters
  • Facebook: Stores 300+ petabytes in HDFS
  • LinkedIn: Powers "People You May Know" and job recommendations
  • Netflix: Log processing, analytics pipelines
  • Twitter: Stores and processes all tweets

While newer systems like Spark have largely replaced MapReduce for processing, HDFS remains the dominant distributed storage layer for big data workloads.

Trade-offs

Horizontal scalability
  Advantage: Linear scaling to thousands of nodes and petabytes of data on commodity hardware
  Disadvantage: Complex operations and tuning required; not cost-effective for small datasets

Fault tolerance through replication
  Advantage: Automatic recovery from node failures; data survives even rack failures
  Disadvantage: 3x storage overhead; write amplification; network bandwidth consumed by replication

Data locality optimization
  Advantage: Moves computation to data, minimizing network transfer for large datasets
  Disadvantage: Requires co-located storage and compute; less flexible than disaggregated architectures

Write-once semantics
  Advantage: Simple consistency model; no distributed locking needed; append-only is efficient
  Disadvantage: No in-place updates; expensive for small modifications; not suitable for OLTP

MapReduce programming model
  Advantage: Simple abstraction; fault tolerant; easy to parallelize many algorithms
  Disadvantage: High latency (minutes); poor for iterative algorithms; limited to batch processing

Single NameNode metadata
  Advantage: Simple architecture; fast metadata operations; strong consistency
  Disadvantage: Memory limited by a single machine; HA adds complexity; federation is awkward

On-premises deployment
  Advantage: Full control; data sovereignty; predictable costs for large steady workloads
  Disadvantage: High upfront cost; operational burden; harder to scale elastically