Cascading has a simple class, Cascade, that executes a collection of Cascading Flows on a target cluster in dependency order.
Consider the following example.
Flow 1 reads input file A and outputs B.
Flow 2 expects input B and outputs C and D.
Flow 3 expects input C and outputs E.
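In code, the three Flows in this example might be declared as in the following sketch. This is a minimal, untested sketch assuming Cascading 2.x package names on Hadoop; the file paths are placeholders, and each pipe assembly is a plain copy standing in for real operations.

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class ExampleFlows
  {
  public static void main( String[] args )
    {
    // placeholder taps for files A through E
    Tap tapA = new Hfs( new TextLine(), "path/to/A" );
    Tap tapB = new Hfs( new TextLine(), "path/to/B" );
    Tap tapC = new Hfs( new TextLine(), "path/to/C" );
    Tap tapD = new Hfs( new TextLine(), "path/to/D" );
    Tap tapE = new Hfs( new TextLine(), "path/to/E" );

    HadoopFlowConnector flowConnector = new HadoopFlowConnector();

    // Flow 1: reads A, writes B
    Flow flow1 = flowConnector.connect( "flow1", tapA, tapB, new Pipe( "flow1" ) );

    // Flow 2: reads B, writes C and D (two branches off one head pipe)
    Pipe head2 = new Pipe( "flow2" );
    Pipe toC = new Pipe( "toC", head2 );
    Pipe toD = new Pipe( "toD", head2 );
    Flow flow2 = flowConnector.connect( FlowDef.flowDef()
      .setName( "flow2" )
      .addSource( head2, tapB )
      .addTailSink( toC, tapC )
      .addTailSink( toD, tapD ) );

    // Flow 3: reads C, writes E
    Flow flow3 = flowConnector.connect( "flow3", tapC, tapE, new Pipe( "flow3" ) );
    }
  }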
A Cascade is constructed through the CascadeConnector class, which builds an internal graph that makes each Flow a "vertex" and each file an "edge". A topological walk on this graph will touch each vertex in order of its dependencies. When a vertex has all its incoming edges (i.e., files) available, it is scheduled on the cluster.
In the example above, Flow 1 goes first, Flow 2 goes second, and Flow 3 is last.
If two or more Flows are independent of one another, they are scheduled concurrently.
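Given the three Flows above, connecting and running them as a Cascade might look like the following sketch, again assuming Cascading 2.x. The Flows can be passed to the CascadeConnector in any order.

import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;
import cascading.flow.Flow;

public class RunCascade
  {
  public static void run( Flow flow1, Flow flow2, Flow flow3 )
    {
    // the order the Flows are handed to the connector does not matter;
    // dependencies are inferred from each Flow's source and sink Taps
    CascadeConnector connector = new CascadeConnector();
    Cascade cascade = connector.connect( flow3, flow1, flow2 );

    // executes every Flow in dependency order: Flow 1, then Flow 2, then Flow 3
    cascade.complete();
    }
  }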
By default, if any of a Flow's outputs are newer than its inputs, the Flow is skipped. The assumption is that the Flow was executed recently and its output is not stale, so there is no reason to re-execute it and consume cluster resources or add time to the job. This is similar to the behavior of a compiler, which will not recompile a source file that has not changed since the last build.
This is very handy when you have a large set of jobs, with varying interdependencies between them, that must be executed as a logical unit. Just pass them to the CascadeConnector and let it sort them all out.