4.5 Debugging

Debugging and testing in Cascading local mode, unlike Cascading Hadoop mode, is trivial because all the work and processing happens in the local JVM and in local memory. This dramatically simplifies the use of an IDE and debugger. Thus the very first recommendation for debugging Cascading applications on Hadoop is to first write tests that run in Cascading local mode.

Along with the use of an IDE debugger, Cascading provides two tools to help sort out runtime issues. The first is the Debug filter.

It is a best practice to sprinkle Debug operators (see Debug Function) throughout the pipe assembly and rely on the planner to remove them at runtime by setting a DebugLevel. Debug can only print to the local console via standard output or standard error, which makes it less useful on Hadoop, since Operations do not execute locally but on the cluster side. Debug can optionally print the current field names, and a prefix can be set to help distinguish between instances of the Debug operation.
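For example, a Debug operation can be wired into an assembly and then planned out of the production Flow by setting the DebugLevel to NONE. The following is only a minimal sketch: it assumes the Cascading 2.x package layout and the static FlowConnectorProps.setDebugLevel() helper, and the assembly name and the "after-parse" prefix are placeholders.

  import java.util.Properties;

  import cascading.flow.FlowConnector;
  import cascading.flow.FlowConnectorProps;
  import cascading.flow.hadoop.HadoopFlowConnector;
  import cascading.operation.Debug;
  import cascading.operation.DebugLevel;
  import cascading.pipe.Each;
  import cascading.pipe.Pipe;

  Pipe assembly = new Pipe( "assembly" );

  // ... parsing, filtering, etc. ...

  // print each tuple (and the field names) with the prefix "after-parse"
  // so multiple Debug instances can be told apart in the console output
  assembly = new Each( assembly, DebugLevel.VERBOSE, new Debug( "after-parse", true ) );

  // NONE tells the planner to remove all Debug operations from the Flow;
  // use DebugLevel.VERBOSE during development to keep them in place
  Properties properties = new Properties();
  FlowConnectorProps.setDebugLevel( properties, DebugLevel.NONE );

  FlowConnector flowConnector = new HadoopFlowConnector( properties );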

Additionally, the actual execution plan for a given Flow can be written out (and visualized) via the Flow.writeDOT() method. DOT files are simply text representations of graph data and can be read by tools like Graphviz and OmniGraffle.

In Cascading local mode, these execution plans are exactly as the pipe assemblies were coded, except that sub-assemblies are unwound and the field names across the Flow are resolved by the local mode planner. That is, Fields.ALL and other wildcards are converted to the actual field names or ordinals.

In Hadoop mode, using the HadoopFlowConnector, the DOT files also contain the intermediate Tap instances created to join MapReduce jobs together. Thus the branches between Tap instances are effectively MapReduce jobs. Use the Flow.writeStepsDOT() method to write out all the MapReduce jobs that will be scheduled.
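For example, both views of the plan can be written out once the Flow has been connected, before it is started. This is a minimal sketch; the source, sink, and assembly variables and the output file names are placeholders, and properties carries any FlowConnector settings from the earlier sketch.

  Flow flow = new HadoopFlowConnector( properties ).connect( source, sink, assembly );

  // the resolved pipe assembly, including any intermediate Taps the
  // planner inserted to join MapReduce jobs together
  flow.writeDOT( "flow.dot" );

  // one branch per MapReduce job (step) that will be scheduled
  flow.writeStepsDOT( "steps.dot" );

  flow.complete();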

This information can also be misleading as to what is actually happening per Map or Reduce task on the cluster side. For a more detailed view of the data pipeline actually executing on a given Map or Reduce task, set the "cascading.stream.dotfile.path" property on the FlowConnector. This writes, cluster side, a DOT representation of the pipeline path each Map or Reduce task is handling, which is a function of which file(s) the task is reading and processing and, when multiple files are involved, which files are being read into which HashJoin instances. It is recommended to use a relative path like stepPlan/.
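For example, the property is set like any other FlowConnector property before the Flow is connected; the relative directory name below simply follows the recommendation above.

  Properties properties = new Properties();

  // each Map or Reduce task writes a DOT file describing the pipeline
  // path it is actually executing into this directory, cluster side
  properties.setProperty( "cascading.stream.dotfile.path", "stepPlan/" );

  FlowConnector flowConnector = new HadoopFlowConnector( properties );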

If the connect() method on the current FlowConnector fails, the resulting PlannerException has a writeDOT() method that shows the progress of the current planner.
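For example, the partial plan can be written out when planning fails. A minimal sketch; the file name is arbitrary, and PlannerException is assumed to be unchecked so it can simply be rethrown after the plan is dumped.

  try
    {
    Flow flow = new HadoopFlowConnector( properties ).connect( source, sink, assembly );
    flow.complete();
    }
  catch( PlannerException exception )
    {
    // writes out however much of the plan the planner had resolved so far
    exception.writeDOT( "failedPlan.dot" );
    throw exception;
    }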

If Cascading is failing with an unknown internal runtime exception during Map or Reduce task startup, setting the "cascading.stream.error.dotfile" property will tell Cascading where to write a DOT representation of the pipeline it was attempting to build, if any. This file will allow the Cascading community to better identify and resolve issues.
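For example, following the same pattern as the other properties; the file name below is only a placeholder.

  // if pipeline assembly fails cluster side, write the pipeline Cascading
  // was attempting to build to this file so it can accompany a bug report
  properties.setProperty( "cascading.stream.error.dotfile", "streamError.dot" );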
