Wednesday, October 07, 2009

Really Understanding the MapReduce Model

If you're a programmer, you really should understand the MapReduce model. On some level, it's a simple model, but it's also remarkably powerful. As the abstract of the original Google paper by Dean and Ghemawat puts it:

Programs written in [MapReduce] functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

Thankfully, ProgrammingPraxis has taken up the challenge of coding a tiny MapReduce implementation and has made the code available here. It's a wonderful read, and it helps turn an abstract concept into something concrete (with useful examples, too).
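
To give a feel for how small the core idea is, here is a minimal single-process sketch in Python. It is not the ProgrammingPraxis code; the names map_reduce, word_count_mapper, and word_count_reducer are made up for illustration, and the sketch assumes everything fits in memory on one machine. The mapper turns each input record into (key, value) pairs, the framework groups the values by key, and the reducer collapses each group into a single result.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group the emitted values by key,
    then call reducer once per key. (Hypothetical helper, in-memory only.)"""
    groups = defaultdict(list)
    # Map phase: each record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: one call per distinct key.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line):
    # Emit (word, 1) for every word on the line.
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(word, counts):
    # Add up all the 1s emitted for this word.
    return sum(counts)


lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = map_reduce(lines, word_count_mapper, word_count_reducer)
print(counts["the"], counts["fox"])
```

Because the mapper looks at one record at a time and the reducer at one key at a time, neither needs to know how the work is split up, which is what lets a real run-time system farm it out to many machines.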

It would be a fun little challenge to write a MapReduce implementation that also made use of message passing and threads so that it could be distributed among machines.
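
As a first step toward that challenge, here is a rough sketch that runs the map phase on worker threads and passes messages through queues, all on one machine. The function name threaded_map_reduce and the DONE sentinel are assumptions of this sketch, not part of any existing implementation. It has no failure handling (a mapper that raises would stall it), and going across machines would mean swapping the in-process queues for sockets or a message broker.

```python
import queue
import threading
from collections import defaultdict


def threaded_map_reduce(records, mapper, reducer, num_workers=4):
    """Map on worker threads, group and reduce on the calling thread.
    (Hypothetical sketch: single machine, no error handling.)"""
    tasks = queue.Queue()    # messages to workers: records to map
    results = queue.Queue()  # messages from workers: lists of (key, value)
    DONE = object()          # sentinel telling a worker to stop

    def worker():
        while True:
            record = tasks.get()
            if record is DONE:
                results.put(DONE)  # tell the collector this worker is finished
                return
            results.put(list(mapper(record)))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for record in records:
        tasks.put(record)
    for _ in threads:
        tasks.put(DONE)  # one stop message per worker

    # Collect messages until every worker has reported that it is done.
    groups = defaultdict(list)
    finished = 0
    while finished < num_workers:
        message = results.get()
        if message is DONE:
            finished += 1
            continue
        for key, value in message:
            groups[key].append(value)
    for t in threads:
        t.join()

    return {key: reducer(key, values) for key, values in groups.items()}


def parity_mapper(n):
    # Emit the square of n under the key "even" or "odd".
    yield ("even" if n % 2 == 0 else "odd"), n * n


def sum_reducer(key, values):
    return sum(values)


totals = threaded_map_reduce(range(1, 11), parity_mapper, sum_reducer)
print(totals["odd"], totals["even"])
```

The workers share nothing except the two queues, so the same shape should carry over to separate processes or machines once the queues are replaced by a network transport.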
