Debugging and testing in Cascading local mode, unlike Cascading Hadoop mode, is trivial, since all the work and processing happens in the local JVM and in local memory. This dramatically simplifies the use of an IDE and debugger. Thus the very first recommendation for debugging Cascading applications on Hadoop is to first write tests that run in Cascading local mode.
Along with the use of an IDE debugger, Cascading provides two tools to help sort out runtime issues. The first is the Debug filter. It is a best practice to sprinkle Debug operations (see Debug Function) throughout the pipe assembly and rely on the planner to remove them at runtime by setting a DebugLevel. Debug can only print to the local console via stdout or stderr, which makes it harder to use on Hadoop, since Operations do not execute locally but on the cluster side. Debug can optionally print the current field names, and a prefix can be set to help distinguish between instances of the Debug operation.
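As a sketch of this pattern (the class, prefix, and property-setter shown here are illustrative, and the exact location of the debug-level setter may vary between Cascading releases), a Debug operation can be wrapped in an Each pipe with a planner-removable DebugLevel:

```java
import java.util.Properties;

import cascading.flow.FlowConnectorProps;
import cascading.operation.Debug;
import cascading.operation.DebugLevel;
import cascading.pipe.Each;
import cascading.pipe.Pipe;

public class DebugExample
  {
  public static Pipe addDebug( Pipe pipe )
    {
    // print every tuple plus the current field names, tagged "after-parse";
    // the DebugLevel argument lets the planner strip this Each at plan time
    return new Each( pipe, DebugLevel.VERBOSE, new Debug( "after-parse", true ) );
    }

  public static Properties planningProperties()
    {
    Properties properties = new Properties();

    // DebugLevel.NONE tells the planner to remove all level-tagged Debug
    // operations; DebugLevel.VERBOSE keeps them in the plan
    FlowConnectorProps.setDebugLevel( properties, DebugLevel.NONE );

    return properties;
    }
  }
```

With this arrangement the same pipe assembly can run verbosely during development and silently in production, with no code changes beyond the properties passed to the FlowConnector.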
Additionally, the actual execution plan for a given Flow can be written out (and visualized) via the Flow.writeDOT() method. DOT files are simply text representations of graph data and can be read by tools like Graphviz and OmniGraffle.
In Cascading local mode, these execution plans are exactly as the pipe assemblies were coded, except that sub-assemblies are unwound and the field names across the Flow are resolved by the local mode planner. That is, Fields.ALL and other wildcards are converted to the actual field names or ordinals.
In the case of Hadoop mode, using the HadoopFlowConnector, the DOT files also contain the intermediate Tap instances created to join MapReduce jobs together. Thus the branches between Tap instances are effectively MapReduce jobs. See the Flow.writeStepsDOT() method to write out all the MapReduce jobs that will be scheduled.
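Both plan views might be written out right after connecting the Flow, along these lines (the file paths and wrapper class are illustrative, not prescribed by Cascading):

```java
import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.pipe.Pipe;
import cascading.tap.Tap;

public class PlanDotExample
  {
  public static Flow writePlans( FlowConnector flowConnector, Tap source, Tap sink, Pipe tail )
    {
    Flow flow = flowConnector.connect( source, sink, tail );

    // the fully resolved pipe assembly; in Hadoop mode this also shows the
    // intermediate Tap instances inserted between MapReduce jobs
    flow.writeDOT( "plans/flow.dot" );

    // one node per MapReduce job the Flow will schedule
    flow.writeStepsDOT( "plans/steps.dot" );

    return flow;
    }
  }
```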
This information can also be misleading as to what is actually happening per Map or Reduce task on the cluster side. For a more detailed view of the data pipeline actually executing on a given Map or Reduce task, set the "cascading.stream.dotfile.path" property on the FlowConnector. This will write, cluster side, a DOT representation of the data pipeline path the current Map or Reduce task is handling, which is a function of which file(s) the Map or Reduce task is reading and processing, and, if multiple files, which files are being read into which HashJoin instances. It is recommended to use a relative path like stepPlan/.
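Setting that property is a one-liner on the Properties object handed to the connector; a minimal sketch (the helper class name is illustrative):

```java
import java.util.Properties;

import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector;

public class StreamDotExample
  {
  public static FlowConnector connectorWithStreamDot()
    {
    Properties properties = new Properties();

    // relative path: each Map or Reduce task writes a DOT file describing
    // the pipeline path it is executing under this directory
    properties.setProperty( "cascading.stream.dotfile.path", "stepPlan/" );

    return new HadoopFlowConnector( properties );
    }
  }
```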
If the connect() method on the current FlowConnector fails, the resulting PlannerException has a writeDOT() method that shows the progress of the current planner.
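In practice this means wrapping the connect() call; the sketch below assumes the PlannerException lives in the cascading.flow.planner package (its location may differ in older Cascading releases), and the output path is illustrative:

```java
import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.flow.planner.PlannerException;
import cascading.pipe.Pipe;
import cascading.tap.Tap;

public class PlannerFailureExample
  {
  public static Flow tryConnect( FlowConnector flowConnector, Tap source, Tap sink, Pipe tail )
    {
    try
      {
      return flowConnector.connect( source, sink, tail );
      }
    catch( PlannerException exception )
      {
      // capture how far the planner got before it failed, for visualization
      exception.writeDOT( "plans/failed.dot" );
      throw exception;
      }
    }
  }
```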
If Cascading is failing with an unknown internal runtime exception during Map or Reduce task startup, setting the "cascading.stream.error.dotfile" property will tell Cascading where to write a DOT representation of the pipeline it was attempting to build, if any. This file will allow the Cascading community to better identify and resolve issues.
Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.