MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner. MapReduce, Hadoop Distributed File System (HDFS™) and YARN form the core of Apache™ Hadoop®.
What MapReduce Does
MapReduce is useful for many applications, such as web access log stats, machine learning and statistical machine translation.
MapReduce’s key benefits are:
- Simplicity: Developers can write applications in their language of choice, such as Java, C++ or Python. MapReduce jobs are easy to run.
- Scalability: MapReduce can process petabytes of data (stored in HDFS) on one cluster.
- Speed: Parallel processing means that Hadoop can take problems that used to take days to solve and solve them in hours or minutes.
- Built-in recovery: MapReduce takes care of failures. If a machine with one copy of the data is unavailable, another machine has a copy of the same key/value pair, which can be used to solve the same sub-task. The JobTracker keeps track of it all.
- Minimal data motion: MapReduce moves compute processes to the data on HDFS and not the other way around. Processing tasks can occur on the physical node where the data resides. This significantly reduces the network I/O patterns and keeps most of the I/O on the local disk or within the same rack.
- Freedom to focus on the business logic: Because MapReduce takes care of the mundane details around deployment, resource management, monitoring and scheduling, the user is free to focus on answering questions about her business.
How MapReduce Works
A MapReduce job splits a large data set into independent chunks and organizes them into key, value pairs for parallel processing. This parallel processing improves the speed and reliability of the cluster, returning solutions more quickly and with more reliability.
The Map function divides the input into ranges by the InputFormat and creates a map task for each range in the input. The JobTracker distributes those tasks to the worker nodes. The output of each map task is partitioned into a group of key-value pairs for each reduce.
The Reduce function then collects the various results and combines them to answer the larger problem the master node was trying to solve. Each reduce pulls the relevant partition from the machines where the maps executed, then writes its output back into HDFS. Thus, the reduce is able to collect the data from all of the maps for the keys it is responsible for and combine them to solve the problem.