Open Source
The framework that started the big data revolution: store and process petabytes across thousands of commodity machines
Hadoop solves the problem of storing and processing datasets too large for a single machine. It consists of three core components: HDFS (a distributed file system that splits files into blocks and replicates them across nodes), MapReduce (a programming model that processes data in parallel by moving computation to where data lives), and YARN (a resource manager that schedules jobs across the cluster). The key insight is that commodity hardware fails frequently, so the system must handle failures automatically through replication and task re-execution.
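To make the division of labor concrete, here is a minimal sketch of talking to HDFS through its Java client API (`org.apache.hadoop.fs`). The NameNode URI and file path are placeholders; on a real cluster you would take the filesystem URI from the cluster's `fs.defaultFS` setting.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in practice this comes from the cluster config.
        FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf);

        Path path = new Path("/user/demo/hello.txt");

        // Write once: HDFS files are created (and optionally appended to), not updated in place.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many times: the client streams blocks from whichever DataNodes hold them.
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buf = new byte[64];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```

Note that the NameNode only hands out metadata and block locations; the file bytes themselves flow directly between the client and the DataNodes.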
Moving computation to data is cheaper than moving data to computation. With petabyte-scale datasets, network bandwidth is the bottleneck. Hadoop schedules tasks on nodes that already have the data blocks locally.
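As a rough illustration of how that scheduling hint is surfaced, the sketch below (assuming a `Job` whose input paths have already been configured) asks an input format for its splits and prints the hosts holding each split's blocks; the scheduler uses these hints to prefer node-local, then rack-local, task placement.

```java
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

import java.util.Arrays;
import java.util.List;

public class LocalityInspector {
    // Assumes the job already has input paths set via FileInputFormat.addInputPath.
    public static void printSplitLocations(Job job) throws Exception {
        TextInputFormat format = new TextInputFormat();
        List<InputSplit> splits = format.getSplits(job);
        for (InputSplit split : splits) {
            // getLocations() reports the DataNodes that hold this split's block replicas.
            System.out.println(split + " -> " + Arrays.toString(split.getLocations()));
        }
    }
}
```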
Instead of buying expensive specialized hardware, Hadoop scales by adding more cheap commodity machines. A 1000-node cluster of commodity servers is more cost-effective and fault-tolerant than a single supercomputer.
HDFS replicates each block (default 3 copies) across different nodes and racks. When a node fails, the system automatically re-replicates affected blocks to maintain the replication factor.
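A hedged sketch of how the replication factor is controlled: cluster-wide through the standard `dfs.replication` property, or per file through the `FileSystem` API. The path and the factor of 5 are illustrative, and `fs.defaultFS` is assumed to point at the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");  // default replication factor for newly created files

        FileSystem fs = FileSystem.get(conf);
        // Raise replication for a frequently read file so more DataNodes can serve it.
        fs.setReplication(new Path("/data/popular-dataset.csv"), (short) 5);
        fs.close();
    }
}
```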
HDFS is optimized for batch processing workloads where files are written once and read many times. This simplifies consistency: there is no need for complex distributed locking or MVCC.
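From the client side, the model looks like the small sketch below: an existing file can be appended to (where the underlying filesystem permits appends), but there is no API for rewriting bytes in the middle of a file. The log path is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;

public class AppendOnlyExample {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at an HDFS cluster that supports appends.
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataOutputStream out = fs.append(new Path("/logs/events.log"))) {
            out.write("new event record\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```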
Developers write just two functions: map (transform input records) and reduce (aggregate intermediate results). The framework handles distribution, fault tolerance, and shuffling automatically.
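The canonical example is word count. The sketch below shows only the two user-supplied functions in the `org.apache.hadoop.mapreduce` API; class names are illustrative, and a matching driver appears after the YARN paragraph below.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // map: (byte offset, line of text) -> (word, 1) for every word in the line
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce: (word, [1, 1, ...]) -> (word, total); the framework has already shuffled
    // and grouped all counts for a given word onto one reducer before this runs.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```

Everything between `map` and `reduce` (partitioning map output by key, shuffling it across the network, sorting and grouping it per reducer, and re-running failed tasks) is handled by the framework.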
YARN decouples cluster resource management from the MapReduce programming model. This allows multiple processing frameworks (Spark, Flink, Tez) to share the same cluster resources.
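For completeness, here is a hedged sketch of a driver that submits the word-count classes above to a YARN-managed cluster. `mapreduce.framework.name` and `yarn.resourcemanager.hostname` are standard Hadoop configuration keys; the hostname and the input/output paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");                  // run on YARN, not locally
        conf.set("yarn.resourcemanager.hostname", "resourcemanager");  // placeholder host

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.SumReducer.class);  // pre-aggregate on the map side
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar and launched with `hadoop jar`, the job has its map and reduce tasks placed by YARN into containers on whichever nodes have capacity, preferring data-local ones.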
In the early 2000s, Google faced a problem: it needed to process the entire web, billions of pages, to build its search index. No single machine could store or process that much data. Its solution, described in the Google File System (2003) and MapReduce (2004) papers, inspired Doug Cutting and Mike Cafarella to create Hadoop in 2006.
The core problem Hadoop solves: datasets have grown too large for any single machine to store or process, and scaling up a single server quickly hits hard limits on disk, memory, and cost.
Hadoop's answer: Use thousands of cheap commodity machines, distribute data across them, replicate for fault tolerance, and move computation to where data lives.
Who uses Hadoop? Early adopters such as Yahoo and Facebook ran clusters of thousands of nodes, and Hadoop went on to underpin data platforms across much of the industry.
While newer systems like Spark have largely replaced MapReduce for processing, HDFS remains the dominant distributed storage layer for big data workloads.
| Aspect | Advantage | Disadvantage |
|---|---|---|
| Horizontal scalability | Linear scaling to thousands of nodes and petabytes of data on commodity hardware | Complex operations and tuning required; not cost-effective for small datasets |
| Fault tolerance through replication | Automatic recovery from node failures; data survives even rack failures | 3x storage overhead; write amplification; network bandwidth for replication |
| Data locality optimization | Moves computation to data, minimizing network transfer for large datasets | Requires co-located storage and compute; less flexible than disaggregated architectures |
| Write-once semantics | Simple consistency model; no distributed locking needed; append-only is efficient | No in-place updates; expensive for small modifications; not suitable for OLTP |
| MapReduce programming model | Simple abstraction; fault tolerant; easy to parallelize many algorithms | High latency (minutes); poor for iterative algorithms; limited to batch processing |
| Single NameNode metadata | Simple architecture; fast metadata operations; strong consistency | Memory limited by single machine; HA adds complexity; federation is awkward |
| On-premises deployment | Full control; data sovereignty; predictable costs for large steady workloads | High upfront cost; operational burden; harder to scale elastically |