MapReduce

Master the distributed computing framework that enables processing of massive datasets across clusters

MapReduce Job Performance Calculator

Example configuration: 1000 GB input, 100 mappers, 20 reducers, 128 MB block size, 50 nodes, 1000 Mbps network, 3x replication.

Job Performance

  • Map Phase: 426.7 min
  • Shuffle & Sort: 41 min
  • Reduce Phase: 85.3 min
  • Total Time: 553 min
  • Throughput: 109 GB/hr

Resource Utilization

  • CPU Utilization: 95%
  • Network Utilization: 48%
  • Storage Required: 4000 GB
  • Data per Mapper: 10240 MB

Efficiency & Reliability

  • Parallel Efficiency: 80%
  • Fault Tolerance: High
  • Scalability: Optimal
  • Total Blocks: 8000

Cost Breakdown

  • Compute Cost: $6635.52
  • Storage Cost: $200
  • Network Cost: $6
  • Total Cost: $6841.52
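
Several of the headline figures above follow from simple arithmetic on the inputs. A rough back-of-envelope sketch in Python (assumed formulas, not necessarily how the calculator computes them) reproduces a few:

# Back-of-envelope check of the figures above (assumed formulas)
data_gb = 1000          # input size
mappers = 100
block_mb = 128          # HDFS block size
total_minutes = 553     # map + shuffle + reduce time from above

total_blocks = data_gb * 1024 // block_mb              # 8000 blocks
data_per_mapper_mb = data_gb * 1024 / mappers          # 10240 MB per mapper
throughput_gb_per_hr = data_gb / (total_minutes / 60)  # ~108.5 GB/hr, shown as 109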

MapReduce Programming Model

MapReduce is a programming model for processing large datasets across distributed clusters. It automatically handles parallelization, fault tolerance, and load balancing, allowing developers to focus on business logic rather than distributed systems complexity.

Automatic Parallelization

Automatically distributes computation across hundreds or thousands of machines without manual coordination.

Fault Tolerance

Built-in handling of machine failures with automatic task restart and data replication.

Scalable Storage

Integrates with distributed file systems like HDFS to handle petabyte-scale datasets.

MapReduce Execution Phases

1. Input Split & Map Phase

Input data is split into chunks and processed in parallel by mapper functions that emit key-value pairs.

Input Split: Data divided into HDFS blocks
Record Reader: Converts input to key-value pairs
Mapper Function: User-defined processing logic
Output: Intermediate key-value pairs
# Word count mapper example
def mapper(line):
    # Emit a (word, 1) pair for every word in the input line
    for word in line.split():
        yield (word, 1)

# Input: "hello world hello"
# Output: (hello, 1), (world, 1), (hello, 1)

2. Shuffle & Sort Phase

Intermediate data is grouped by key, sorted, and distributed to reducer nodes across the network.

Partitioning: Keys assigned to reducers
Sorting: Data sorted by key within partitions
Grouping: Values with same key grouped together
Network Transfer: Data moved to reducer nodes
Data flow: each mapper's output is partitioned by key hash and transferred across the network to the reducers responsible for those keys (e.g., Mapper 1 and Mapper 2 each send partitions to both Reducer 1 and Reducer 2).
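
A minimal sketch of the shuffle in plain Python, assuming the default hash partitioner: mapper output is partitioned by key hash, then sorted and grouped per reducer.

from collections import defaultdict

def shuffle(mapper_output, num_reducers):
    # Partitioning: assign each key to a reducer by hashing the key
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapper_output:
        partitions[hash(key) % num_reducers][key].append(value)
    # Sorting & grouping: each reducer sees its keys in sorted order
    # with all values for a key grouped together
    return [sorted(p.items()) for p in partitions]

# shuffle([("hello", 1), ("world", 1), ("hello", 1)], 2)
# -> e.g. [[("hello", [1, 1])], [("world", [1])]]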

3. Reduce Phase

Reducer functions process grouped values for each key to produce final aggregated results.

Input: Key and list of associated values
Processing: User-defined aggregation logic
Output: Final results written to storage
Format: Configurable output format
# Word count reducer example
def reducer(key, values):
    # Sum all counts emitted for this key
    yield (key, sum(values))

# Input: hello, [1, 1, 1]
# Output: (hello, 3)
# Final: hello 3, world 1

Advanced MapReduce Features

Combiners for Optimization

Combiners perform local aggregation on mapper output before the shuffle phase, reducing network traffic.

Benefits:
  • Reduces network bandwidth usage
  • Decreases shuffle phase time
  • Lower reducer input size
  • Improved overall job performance
Without Combiner
Mapper → (hello,1), (hello,1), (hello,1)
Network → 3 records transferred
With Combiner
Mapper → Combiner → (hello,3)
Network → 1 record transferred
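
For word count, the combiner applies the same aggregation as the reducer, but locally on each mapper node before anything crosses the network. A minimal sketch:

from collections import defaultdict

def combiner(mapper_output):
    # Local aggregation: collapse repeated keys on the mapper node
    local_counts = defaultdict(int)
    for word, count in mapper_output:
        local_counts[word] += count
    return list(local_counts.items())

# combiner([("hello", 1), ("hello", 1), ("hello", 1)]) -> [("hello", 3)]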

Fault Tolerance Mechanisms

Task Failure Handling

Failed tasks are automatically restarted on different nodes up to a configured retry limit.
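
Conceptually, the retry logic looks like the sketch below; the function and attempt limit are illustrative, not Hadoop's actual API, and the real limit is configurable per job.

def run_with_retries(task, max_attempts=4):
    # Re-run a failed task up to max_attempts times, as the framework
    # would when rescheduling it on another node
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as err:
            print(f"Attempt {attempt} failed: {err}")
    raise RuntimeError("Task failed after all retry attempts")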

Data Replication

HDFS replicates data blocks across multiple nodes to prevent data loss from hardware failures.

Speculative Execution

Slow-running tasks are speculatively executed on other nodes to prevent stragglers.

Custom Partitioning & Input Formats

Custom Partitioners

Control how intermediate keys are distributed to reducers for load balancing.

# Partition by country so all records for a country reach the same reducer
def custom_partitioner(key, num_reducers):
    return hash(key.country) % num_reducers

Input Formats

Support for various data formats and custom input splitting strategies.

TextInputFormat: Line-based text files (record semantics sketched below)
SequenceFileInputFormat: Binary format
DBInputFormat: Database records
CustomInputFormat: User-defined formats
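
As an illustration of TextInputFormat's record semantics, each record handed to the mapper is a (byte offset, line) pair. The sketch below mimics that behavior in Python; it is not Hadoop's implementation.

def text_records(path):
    # Yield (byte offset, line) pairs, like TextInputFormat's record reader
    offset = 0
    with open(path, "rb") as f:
        for raw in f:
            yield offset, raw.decode("utf-8").rstrip("\n")
            offset += len(raw)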

Hadoop Ecosystem & Tools

Core Hadoop Components

HDFS (Hadoop Distributed File System)

Distributed storage layer with automatic replication and fault tolerance

YARN (Yet Another Resource Negotiator)

Resource management and job scheduling for the cluster

MapReduce Framework

Programming model and runtime for distributed data processing

Hadoop Common

Common utilities and libraries used by other Hadoop modules

Ecosystem Tools & Frameworks

🐷 Apache Pig

High-level platform for analyzing large datasets

🐝 Apache Hive

SQL-like interface for querying large datasets

🗂️ HBase

NoSQL database built on top of HDFS

📊 Apache Sqoop

Data transfer between Hadoop and relational databases

🌊 Apache Flume

Service for collecting and moving large amounts of data

🔍 Apache Mahout

Machine learning library for scalable algorithms

Evolution Beyond MapReduce

Limitations of MapReduce

  • High latency due to disk I/O between phases
  • Inefficient for iterative algorithms
  • No support for real-time processing
  • Complex multi-stage job management

Modern Alternatives

  • Apache Spark: In-memory processing (see the PySpark sketch below)
  • Apache Flink: Stream processing
  • Apache Storm: Real-time computation
  • Presto/Trino: Interactive analytics
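
For contrast, the word count job from earlier can be expressed in a few lines of PySpark, with intermediate data kept in memory rather than written to disk between phases (input and output paths are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")
counts = (sc.textFile("hdfs:///input/text")           # read input from HDFS
            .flatMap(lambda line: line.split())       # map: split lines into words
            .map(lambda word: (word, 1))               # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))          # reduce: sum counts per word
counts.saveAsTextFile("hdfs:///output/wordcount")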

Real-World MapReduce Applications

Google

PageRank: Web page ranking algorithm
Index Building: Search index construction
Log Analysis: Petabyte-scale log processing
Machine Learning: Large-scale model training

Facebook

Data Warehousing: ETL for analytics
Ad Targeting: User behavior analysis
Graph Processing: Social graph analytics
A/B Testing: Experiment result analysis

Yahoo!

Web Crawling: Internet content indexing
Spam Detection: Email and content filtering
Recommendation: Content recommendation systems
Analytics: User behavior insights

Netflix

Viewing Analytics: Content performance analysis
Recommendation: Personalized content suggestions
Quality Control: Video encoding optimization
A/B Testing: Interface and algorithm testing

LinkedIn

People Recommendations: Social connection suggestions
Job Matching: Career opportunity matching
Content Ranking: Feed personalization
Skill Inference: Professional skill analysis

eBay

Search Optimization: Product search improvement
Fraud Detection: Transaction security analysis
Pricing Analysis: Market pricing insights
Seller Analytics: Merchant performance metrics

MapReduce Best Practices

🎯 Design Optimization

  • Use combiners to reduce network traffic
  • Optimize input split size (64-256 MB)
  • Avoid small files that create many mappers
  • Balance mapper and reducer counts

📊 Performance Tuning

  • Configure appropriate heap sizes
  • Use compression for intermediate data (see the example properties below)
  • Optimize sort buffer sizes
  • Enable speculative execution carefully
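
Several of these knobs map to standard Hadoop job properties. The values below are illustrative, not recommendations for any specific cluster:

# Illustrative Hadoop job properties for the tuning points above
tuning_conf = {
    "mapreduce.map.java.opts": "-Xmx2048m",             # mapper JVM heap size
    "mapreduce.map.output.compress": "true",             # compress intermediate map output
    "mapreduce.map.output.compress.codec":
        "org.apache.hadoop.io.compress.SnappyCodec",
    "mapreduce.task.io.sort.mb": "256",                   # sort buffer size in MB
    "mapreduce.map.speculative": "true",                  # speculative execution for maps
}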

⚡ Data Management

  • Use appropriate file formats (Parquet, ORC)
  • Implement proper data partitioning
  • Consider data locality for performance
  • Clean up intermediate and temporary data

🛡️ Reliability & Monitoring

  • Monitor job progress and resource usage
  • Set appropriate retry policies
  • Implement proper error handling
  • Use checkpoint and recovery mechanisms

🔧 Development Practices

  • Test with small datasets first
  • Use unit tests for mapper and reducer logic
  • Implement proper logging and debugging
  • Document job configurations and parameters

📈 Scalability Planning

  • Plan cluster capacity for peak loads
  • Consider data growth and retention policies
  • Implement proper resource allocation
  • Design for horizontal scaling
