1. So! What is MapReduce?
MapReduce is a two-step mechanism for processing large-scale, distributed data. The ‘map’ step visits each piece of data and applies programmer-defined rules to produce intermediate results; the ‘reduce’ step then collects those intermediate results and processes them to produce the final result.
2. So! Why do we need MapReduce?
Because the data Google handles is large and spread across many machines. The conventional approach of loading all the necessary data into memory before processing starts simply does not work at that scale.
3. So! Give me an example of how MapReduce works.
Say you are counting how many times a given word appears in millions of web pages. The ‘map’ step goes through these pages and emits a signal whenever it finds that word. The ‘reduce’ step then collects the lists of signals and adds them up into a single number.
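The word-counting example can be sketched in a few lines of Python. This is only a single-machine illustration of the idea, not Google's implementation: the names `map_page`, `reduce_signals`, and `count_word` are made up for this sketch, the "signal" is assumed to be a `(word, 1)` pair, and words are assumed to be separated by whitespace with no punctuation handling.

```python
from collections import defaultdict


def map_page(page_text, target):
    """'map' step: emit a (target, 1) signal for each occurrence in one page.

    Assumes `target` is lowercase and words are whitespace-separated.
    """
    for word in page_text.lower().split():
        if word == target:
            yield (target, 1)


def reduce_signals(key, signals):
    """'reduce' step: add up all the signals collected for one key."""
    return key, sum(signals)


def count_word(pages, target):
    # Run 'map' on each page independently. On a real cluster, different
    # machines could handle different pages; here it is a plain loop.
    grouped = defaultdict(list)
    for page in pages:
        for key, signal in map_page(page, target):
            grouped[key].append(signal)

    # Make sure the key exists so a word with no matches is reported as 0.
    grouped.setdefault(target, [])

    # Run 'reduce' once per key over the list of signals gathered for it.
    return dict(reduce_signals(key, signals) for key, signals in grouped.items())


pages = [
    "the quick brown fox",
    "the lazy dog and the cat",
    "no match here",
]
print(count_word(pages, "the"))
```

Note that `map_page` only ever looks at one page at a time, which is what lets the work be spread across machines; only the small lists of signals need to be brought together for the ‘reduce’ step.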