Debugging and testing in Cascading local mode, unlike Cascading Hadoop mode, is trivial: all the work and processing happens in the local JVM and in local memory, which dramatically simplifies the use of an IDE and debugger. Thus the first recommendation for debugging Cascading applications on Hadoop is to write tests that run in Cascading local mode.
Along with an IDE debugger, Cascading provides two tools to help sort out runtime issues. The first is the Debug operation. It is a best practice to sprinkle Debug operators (see Debug Function) throughout the pipe assembly and rely on the planner to remove them at runtime by setting a DebugLevel. Debug can only print to the local console via std out or std err, which makes it harder to use on Hadoop, as Operations do not execute locally but on the cluster side. Debug can optionally print the current field names, and a prefix can be set to help distinguish between instances of the Debug operation.
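As an illustrative sketch (the class and prefix names here are invented, and depending on the Cascading version the properties helper may be FlowConnector.setDebugLevel() rather than FlowConnectorProps), a Debug can be planned into an assembly and later stripped like this:

```java
import java.util.Properties;

import cascading.flow.FlowConnectorProps;
import cascading.operation.Debug;
import cascading.operation.DebugLevel;
import cascading.pipe.Each;
import cascading.pipe.Pipe;

public class DebugSketch
  {
  public static Pipe addDebug( Pipe assembly )
    {
    // "after-parse" is an arbitrary prefix to tell this Debug instance
    // apart from others; 'true' also prints the current field names
    Debug debug = new Debug( "after-parse", true );

    // tagging the Each with a DebugLevel lets the planner remove it
    return new Each( assembly, DebugLevel.VERBOSE, debug );
    }

  public static void stripDebugForProduction( Properties properties )
    {
    // with DebugLevel.NONE set, the planner removes all planner-level
    // Debug operations before the Flow runs
    FlowConnectorProps.setDebugLevel( properties, DebugLevel.NONE );
    }
  }
```

Remember that on Hadoop the Debug output lands in the task logs on the cluster, not on the local console.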
Additionally, the actual execution plan for a given Flow can be written out (and visualized) via the Flow.writeDOT() method. DOT files are simply text representations of graph data and can be read by tools like Graphviz and OmniGraffle.
In Cascading local mode, these execution plans are exactly as the pipe assemblies were coded, except that sub-assemblies are unwound and the field names across the Flow are resolved by the local mode planner. That is, Fields.ALL and other wildcards are converted to the actual field names or ordinals.
In the case of Hadoop mode, using the HadoopFlowConnector, the DOT files also contain the intermediate Tap instances created to join MapReduce jobs together. Thus the branches between Tap instances are effectively MapReduce jobs. See the Flow.writeStepsDOT() method to write out all the MapReduce jobs that will be scheduled.
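A minimal sketch of writing out both views, assuming the 2.x Flow API (the file names here are arbitrary):

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;

public class PlanDotSketch
  {
  public static void writePlans( Properties properties, FlowDef flowDef )
    {
    Flow flow = new HadoopFlowConnector( properties ).connect( flowDef );

    // the logical pipe assembly, including the intermediate Tap
    // instances inserted by the Hadoop planner
    flow.writeDOT( "logical-plan.dot" );

    // one node per MapReduce job that will be submitted to the cluster
    flow.writeStepsDOT( "steps-plan.dot" );
    }
  }
```

Both files can then be rendered locally with Graphviz, for example via `dot -Tpng logical-plan.dot -o logical-plan.png`.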
This information can also be misleading as to what is actually happening per Map or Reduce task on the cluster side. For a more detailed view of the data pipeline actually executing on a given Map or Reduce task, set the "cascading.stream.dotfile.path" property on the FlowConnector. This will write, cluster side, a DOT representation of the data pipeline path the current Map or Reduce task is handling, which is a function of which file(s) the Map or Reduce task is reading and processing and, if multiple files, which files are being read into which pipeline branches.
It is recommended to use a relative path for this property. If the connect() method on the current FlowConnector fails, the resulting PlannerException has a writeDOT() method that shows the progress of the current planner.
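A sketch of capturing the planner's partial result on failure (the PlannerException package shown is the 2.x location, cascading.flow.planner; in earlier releases it lived directly under cascading.flow, and the output file name is arbitrary):

```java
import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.flow.FlowDef;
import cascading.flow.planner.PlannerException;

public class PlannerFailureSketch
  {
  public static Flow connectOrDump( FlowConnector connector, FlowDef flowDef )
    {
    try
      {
      return connector.connect( flowDef );
      }
    catch( PlannerException exception )
      {
      // dump the partial element graph the planner had built so far,
      // then rethrow so the failure is still reported
      exception.writeDOT( "failed-plan.dot" );
      throw exception;
      }
    }
  }
```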
If Cascading is failing with an unknown internal runtime exception during Map or Reduce task startup, setting the "cascading.stream.error.dotfile" property will tell Cascading where to write a DOT representation of the pipeline it was attempting to build, if any. This file will allow the Cascading community to better identify and resolve issues.
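Both cluster-side DOT properties are plain FlowConnector properties; a configuration sketch (the "dot/" path and file name are illustrative choices, not defaults):

```java
import java.util.Properties;

import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector;

public class StreamDotProperties
  {
  public static FlowConnector makeConnector()
    {
    Properties properties = new Properties();

    // per-task pipeline DOT files, written cluster side under a
    // relative path ("dot/" here is an illustrative value)
    properties.setProperty( "cascading.stream.dotfile.path", "dot/" );

    // where to write the pipeline DOT if task startup fails with an
    // internal runtime exception
    properties.setProperty( "cascading.stream.error.dotfile",
      "dot/stream-error.dot" );

    return new HadoopFlowConnector( properties );
    }
  }
```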
Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.