13. How It Works

13.1 MapReduce Job Planner

The Hadoop MapReduce Job Planner is an internal feature of Cascading.

When a collection of functions, splits, and joins is tied together into a "pipe assembly", the FlowConnector object is used to create a new Flow instance against input and output data paths. This Flow represents a single Cascading job.

Internally, the FlowConnector employs an intelligent planner to convert the pipe assembly to a graph of dependent MapReduce jobs that can be executed on a Hadoop cluster.

All of this happens behind the scenes, as does the scheduling of the individual MapReduce jobs and the cleanup of the intermediate data sets that bind the jobs together.
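
As an illustration, here is a minimal sketch, assuming the Cascading 2.x Hadoop APIs; the class name and the input and output paths are hypothetical placeholders:

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class FlowExample
      {
      public static void main( String[] args )
        {
        // hypothetical input and output data paths on HDFS
        Tap source = new Hfs( new TextLine(), "input/logs" );
        Tap sink = new Hfs( new TextLine(), "output/results" );

        // a trivial pipe assembly; real assemblies chain
        // functions, splits, and joins
        Pipe assembly = new Pipe( "example" );
        assembly = new GroupBy( assembly, new Fields( "line" ) );

        // the planner inside the FlowConnector converts the assembly
        // into a graph of dependent MapReduce jobs
        Flow flow = new HadoopFlowConnector( new Properties() ).connect( source, sink, assembly );

        // schedules the underlying MapReduce jobs on the cluster
        // and blocks until the last one finishes
        flow.complete();
        }
      }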

The diagram above shows how a typical Flow is partitioned into MapReduce jobs. Adjacent jobs are delimited by a temporary file that serves as the sink for the upstream job and the source for the next.

To create a visualization of how your Flows are partitioned, call the Flow#writeDOT() method. This writes a DOT file to the specified path; the file can then be rendered with a graphics package such as Graphviz or OmniGraffle.
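
For example, the following fragment (the flow variable and file name are hypothetical) writes the plan out before executing it:

    // write a visualization of the planned MapReduce job graph
    flow.writeDOT( "flow.dot" );

    flow.complete();

The resulting file can be rendered from the command line with Graphviz, for example: dot -Tpng flow.dot -o flow.png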
