Patterns
Processing large volumes of data in batches
Batch processing executes operations on large datasets in scheduled, bounded jobs rather than in real time. It optimizes for throughput over latency: millions of records are processed efficiently by grouping operations, amortizing I/O, and parallelizing work. MapReduce pioneered distributed batch processing; Hadoop brought it to open source, and systems like Apache Spark and AWS Batch now handle petabyte-scale workloads. The pattern is essential for ETL pipelines, data analytics, report generation, and any workload where processing can be deferred.
Batch processing optimizes for total records processed, not individual response time. Processing a billion records in an hour is a success; no single record needs a 1 ms turnaround.
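A minimal sketch of this trade-off: records are grouped into large chunks so per-operation overhead (network round trips, commits) is paid once per chunk instead of once per record. The `transform` and `bulk_write` functions here are hypothetical stand-ins for real per-record work and a real bulk I/O call.

```python
from itertools import islice
from typing import Iterable, Iterator

CHUNK_SIZE = 10_000  # sized for throughput; per-record latency is irrelevant

def chunked(rows: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield successive fixed-size chunks from an iterable of records."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

def transform(row: dict) -> dict:
    """Stand-in for real per-record CPU work."""
    return {**row, "processed": True}

def bulk_write(rows: list[dict]) -> None:
    """Stand-in for one bulk I/O call, e.g. a multi-row INSERT."""
    pass

def run_batch(rows: Iterable[dict]) -> int:
    """Process every record; success is measured in total records processed."""
    total = 0
    for chunk in chunked(rows, CHUNK_SIZE):
        bulk_write([transform(r) for r in chunk])  # one write per 10k records
        total += len(chunk)
    return total
```

The chunk size is the tuning knob: larger chunks amortize more overhead at the cost of memory and coarser failure units.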
Jobs have a defined start and end. They run on a schedule (nightly, hourly) or on an explicit trigger. Unlike streaming, the input is known and finite before the job starts.
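One common way to make the input bounded, sketched below under the assumption that data lands in date-partitioned files (the `data/` layout is illustrative): the scheduler fixes the run date, so the job reads exactly one finite partition and then terminates.

```python
import csv
from datetime import date, timedelta
from pathlib import Path

def run_job(run_date: date) -> None:
    """Process exactly one day's partition, then exit."""
    source = Path("data") / f"{run_date.isoformat()}.csv"
    with source.open(newline="") as f:
        rows = list(csv.DictReader(f))  # entire input is known up front
    print(f"{run_date}: processed {len(rows)} records")

if __name__ == "__main__":
    # Nightly schedule, e.g. a cron entry such as:
    #   0 2 * * * python batch_job.py
    # processes yesterday's (complete, closed) partition.
    run_job(date.today() - timedelta(days=1))
```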
Split the data into partitions, process the partitions in parallel, and merge the results; this is what lets the pattern scale horizontally. The MapReduce paradigm formalizes it: map (transform) → shuffle (group by key) → reduce (aggregate).
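A minimal map → shuffle → reduce sketch in plain Python, using a local process pool in place of a cluster; the structure mirrors what Spark or Hadoop does across machines, shown here as a word count.

```python
from collections import Counter, defaultdict
from multiprocessing import Pool

def map_partition(lines: list[str]) -> Counter:
    """Map: transform one partition into (word, count) pairs."""
    return Counter(word for line in lines for word in line.split())

def reduce_counts(mapped: list[Counter]) -> dict[str, int]:
    """Shuffle + reduce: group counts by key, then aggregate per key."""
    totals: defaultdict[str, int] = defaultdict(int)
    for counter in mapped:
        for word, n in counter.items():
            totals[word] += n  # all counts for a key merge into one bucket
    return dict(totals)

if __name__ == "__main__":
    lines = ["a b a", "b c", "a c c"] * 1000
    partitions = [lines[i::4] for i in range(4)]      # split into 4 partitions
    with Pool(4) as pool:
        mapped = pool.map(map_partition, partitions)  # map in parallel
    print(reduce_counts(mapped))                      # merge results
```

On a real cluster the shuffle moves data between machines so that every record with the same key reaches the same reducer; here the merge happens in one process, but the map/shuffle/reduce boundaries are the same.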