Ben's Journal: Really Understanding the MapReduce Model

Wednesday, October 07, 2009

Really Understanding the MapReduce Model

If you're a programmer, you really should understand the MapReduce model. On some level, it's a simple model, but it's also remarkably powerful. As the abstract of Google's original MapReduce paper puts it:

Programs written in [MapReduce] functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

Thankfully, ProgrammingPraxis has taken up the challenge of coding a tiny MapReduce implementation and has made the code available here. It's a wonderful read, and it helps turn an abstract concept into something concrete (with useful examples, too).
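To give a feel for how small the core idea is, here is a rough sketch of a single-process MapReduce in Python. This is not the ProgrammingPraxis code; the names map_reduce, word_count_mapper, and word_count_reducer are my own placeholders, and the word-count job is just the customary toy example.

```python
from collections import defaultdict


def map_reduce(mapper, reducer, items):
    """Apply mapper to each item, group the emitted values by key,
    then call reducer once per key. Everything runs in one process."""
    groups = defaultdict(list)
    for item in items:
        # A mapper may emit any number of (key, value) pairs per item.
        for key, value in mapper(item):
            groups[key].append(value)
    # Sorting the keys only makes the output order predictable.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def word_count_mapper(line):
    # Emit (word, 1) for every word on the line.
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(word, counts):
    # Collapse all the 1s emitted for a word into a total.
    return sum(counts)


lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = map_reduce(word_count_mapper, word_count_reducer, lines)
print(counts)
```

The mapper and reducer know nothing about how the work is scheduled, which is what lets a real run-time system spread the same two functions across many machines.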

It would be a fun little challenge to write a MapReduce implementation that also made use of message passing and threads so that it could be distributed among machines.
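As a first step toward that challenge, here is a hedged sketch of what the threads-plus-message-passing version might look like on a single machine. It assumes Python's standard queue.Queue as the message channel; threaded_map_reduce and the helper names are hypothetical, and a truly distributed version would have to swap the queues for network messaging and deal with machine failures, which this sketch does not attempt.

```python
import queue
import threading
from collections import defaultdict


def threaded_map_reduce(mapper, reducer, items, num_workers=3):
    """Map phase runs on worker threads that talk to the coordinator only
    through queues. No error handling: a mapper that raises would leave
    the coordinator waiting forever."""
    tasks = queue.Queue()    # coordinator -> workers: input items
    results = queue.Queue()  # workers -> coordinator: (key, value) pairs
    done = object()          # private sentinel meaning "no more messages"

    def worker():
        while True:
            item = tasks.get()
            if item is done:
                results.put(done)  # tell the coordinator this worker is finished
                return
            for pair in mapper(item):
                results.put(pair)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(done)  # one sentinel per worker

    # Coordinator: collect pairs until every worker has reported in.
    groups = defaultdict(list)
    finished = 0
    while finished < num_workers:
        message = results.get()
        if message is done:
            finished += 1
        else:
            key, value = message
            groups[key].append(value)
    for thread in threads:
        thread.join()
    return {key: reducer(key, values) for key, values in groups.items()}


def emit_words(line):
    # Mapper: one (word, 1) pair per word.
    return [(word, 1) for word in line.lower().split()]


def total(word, ones):
    # Reducer: addition is order-independent, so thread timing does not matter.
    return sum(ones)


threaded_counts = threaded_map_reduce(emit_words, total, ["a b a", "b c", "a"])
```

Because the workers share nothing except the two queues, the mapper and reducer stay exactly as they would be in a single-threaded version; only the plumbing changes.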
