
Apache Hadoop: Distributed Storage and Processing

The framework that started the big data revolution - store and process petabytes across thousands of commodity machines

Java | 14,500 stars | Updated January 2024 | 50 min read
View on GitHub

Summary

Hadoop solves the problem of storing and processing datasets too large for a single machine. It consists of three core components: HDFS (a distributed file system that splits files into blocks and replicates them across nodes), MapReduce (a programming model that processes data in parallel by moving computation to where data lives), and YARN (a resource manager that schedules jobs across the cluster). The key insight is that commodity hardware fails frequently, so the system must handle failures automatically through replication and task re-execution.
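To make the storage side concrete, here is a minimal HDFS client sketch in Java. It assumes the Hadoop client libraries are on the classpath and a reachable NameNode at a hypothetical address (hdfs://namenode:9000); it writes a file once and reads it back through the FileSystem API.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS would normally point at the cluster's NameNode;
        // hdfs://namenode:9000 is a hypothetical address.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/hello.txt");

        // Write once: the file is created, streamed, and closed.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many: any client can stream the blocks back.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}
```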

Key Takeaways

Data Locality Principle

Moving computation to data is cheaper than moving data to computation. With petabyte-scale datasets, network bandwidth is the bottleneck. Hadoop schedules tasks on nodes that already have the data blocks locally.
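The locality information HDFS exposes is visible to any client. A small sketch, assuming an existing HDFS file path passed as a command-line argument, lists which DataNodes hold each block; this is the same host list the scheduler consults when placing map tasks near the data.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path(args[0]); // e.g. /data/events.log (hypothetical)

        FileStatus status = fs.getFileStatus(path);
        // Each BlockLocation names the DataNodes holding one block's replicas.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```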

Horizontal Scaling with Commodity Hardware

Instead of buying expensive specialized hardware, Hadoop scales by adding more cheap commodity machines. A 1000-node cluster of commodity servers is more cost-effective and fault-tolerant than a single supercomputer.

Fault Tolerance Through Replication

HDFS replicates each block (default 3 copies) across different nodes and racks. When a node fails, the system automatically re-replicates affected blocks to maintain the replication factor.
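The replication factor is set cluster-wide via dfs.replication (default 3) and can be overridden per file. A short sketch, using a hypothetical archive path, lowers the factor for cold data through the FileSystem API; the hdfs dfs -setrep shell command does the same thing.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path coldData = new Path("/archive/2023/logs.gz"); // hypothetical path

        // Drop the replication factor for data that can tolerate more risk,
        // trading some durability for a third less raw storage.
        fs.setReplication(coldData, (short) 2);

        short current = fs.getFileStatus(coldData).getReplication();
        System.out.println("replication factor: " + current);
    }
}
```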

Write-Once, Read-Many Model

HDFS is optimized for batch processing workloads where files are written once and read many times. This simplifies consistency - no need for complex distributed locking or MVCC.

MapReduce Simplifies Distributed Programming

Developers write just two functions: map (transform input records) and reduce (aggregate intermediate results). The framework handles distribution, fault tolerance, and shuffling automatically.
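A sketch of those two functions for the classic word-count job, written against the org.apache.hadoop.mapreduce API; the class names are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (word, 1) for every token in an input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce: sum the counts the framework has already grouped by word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```

Everything else (splitting the input, shuffling intermediate pairs to reducers, retrying failed tasks) is handled by the framework.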

YARN Separates Resource Management from Processing

YARN decouples cluster resource management from the MapReduce programming model. This allows multiple processing frameworks (Spark, Flink, Tez) to share the same cluster resources.
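A driver sketch that submits the word-count job above through YARN; input and output paths come from the command line, and setting mapreduce.framework.name to yarn is what routes the job to the ResourceManager instead of the local runner.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hand the job to YARN rather than running it in-process.
        conf.set("mapreduce.framework.name", "yarn");

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // YARN's ResourceManager allocates containers for map and reduce tasks.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```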

Deep Dive

In the early 2000s, Google faced a problem: they needed to process the entire web - billions of pages - to build their search index. No single machine could store or process that much data. Their solution, described in the Google File System (2003) and MapReduce (2004) papers, inspired Doug Cutting and Mike Cafarella to create Hadoop in 2006.

The core problem Hadoop solves:

  1. Storage: How do you store petabytes of data reliably when individual disks fail regularly?
  2. Processing: How do you process that data in reasonable time when a single machine would take years?
  3. Cost: How do you do this without buying million-dollar specialized hardware?

Hadoop's answer: Use thousands of cheap commodity machines, distribute data across them, replicate for fault tolerance, and move computation to where data lives.
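A back-of-envelope sketch of what that means for storage, assuming the default 128 MiB block size (dfs.blocksize) and a replication factor of 3:

```java
public class BlockMath {
    public static void main(String[] args) {
        long fileBytes = 1L << 40;             // a 1 TiB file (illustrative)
        long blockBytes = 128L * 1024 * 1024;  // default 128 MiB block size
        int replication = 3;                   // default dfs.replication

        long blocks = (fileBytes + blockBytes - 1) / blockBytes;
        long replicas = blocks * replication;
        double rawTiB = (double) fileBytes * replication / (1L << 40);

        System.out.printf("blocks=%d, block replicas=%d, raw storage=%.1f TiB%n",
                blocks, replicas, rawTiB);
        // -> blocks=8192, block replicas=24576, raw storage=3.0 TiB
    }
}
```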

Hadoop Ecosystem Overview

Who uses Hadoop?

  • Yahoo: Original production user, 40,000+ node clusters
  • Facebook: Stores 300+ petabytes in HDFS
  • LinkedIn: Powers "People You May Know" and job recommendations
  • Netflix: Log processing, analytics pipelines
  • Twitter: Stores and processes all tweets

While newer systems like Spark have largely replaced MapReduce for processing, HDFS remains the dominant distributed storage layer for big data workloads.

Trade-offs

Horizontal scalability
  Advantage: Linear scaling to thousands of nodes and petabytes of data on commodity hardware
  Disadvantage: Complex operations and tuning required; not cost-effective for small datasets

Fault tolerance through replication
  Advantage: Automatic recovery from node failures; data survives even rack failures
  Disadvantage: 3x storage overhead; write amplification; network bandwidth consumed by replication

Data locality optimization
  Advantage: Moves computation to data, minimizing network transfer for large datasets
  Disadvantage: Requires co-located storage and compute; less flexible than disaggregated architectures

Write-once semantics
  Advantage: Simple consistency model; no distributed locking needed; append-only is efficient
  Disadvantage: No in-place updates; expensive for small modifications; not suitable for OLTP

MapReduce programming model
  Advantage: Simple abstraction; fault tolerant; easy to parallelize many algorithms
  Disadvantage: High latency (minutes); poor for iterative algorithms; limited to batch processing

Single NameNode metadata
  Advantage: Simple architecture; fast metadata operations; strong consistency
  Disadvantage: Memory limited by a single machine; HA adds complexity; federation is awkward

On-premises deployment
  Advantage: Full control; data sovereignty; predictable costs for large steady workloads
  Disadvantage: High upfront cost; operational burden; harder to scale elastically