Cascading has a simple class, Cascade, that takes a collection of Cascading Flows and executes them on the target cluster in dependency order.
Consider the following example.
Flow 'first' reads input file A and outputs B.
Flow 'second' expects input B and outputs C and D.
Flow 'third' expects input C and outputs E.
A Cascade is constructed through the CascadeConnector class, which builds an internal graph that makes each Flow a 'vertex' and each file an 'edge'. A topological walk on this graph touches each vertex in order of its dependencies. When a vertex has all its incoming edges (files) available, it is scheduled on the cluster.
In the example above, 'first' goes first, 'second' goes second, and 'third' is last.
If two or more Flows are independent of one another, they will be scheduled concurrently.
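The dependency walk described above can be sketched as a topological sort over the flow graph. The sketch below is conceptual Python, not the Cascading API; the flow names and file letters mirror the example, and flows grouped in the same batch are independent and could be scheduled concurrently:

```python
# Each Flow is declared by the files it reads and writes,
# mirroring the example: first A->B, second B->C,D, third C->E.
flows = {
    "first":  {"inputs": {"A"}, "outputs": {"B"}},
    "second": {"inputs": {"B"}, "outputs": {"C", "D"}},
    "third":  {"inputs": {"C"}, "outputs": {"E"}},
}

def schedule(flows, available):
    """Yield batches of flow names; each batch contains flows whose
    incoming edges (files) all exist, so they can run concurrently."""
    pending = dict(flows)
    available = set(available)
    while pending:
        batch = [name for name, f in pending.items()
                 if f["inputs"] <= available]
        if not batch:
            raise ValueError("unsatisfiable dependencies")
        for name in batch:
            available |= pending.pop(name)["outputs"]
        yield batch

print(list(schedule(flows, {"A"})))  # [['first'], ['second'], ['third']]
```

With only file A present, the walk runs 'first', then 'second', then 'third', exactly as in the example; had two flows depended only on A, they would land in the same batch.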
By default, if all outputs from a Flow are newer than its inputs, the Flow is skipped. The assumption is that the Flow was executed recently, since its output isn't stale, so there is no reason to re-execute it and use up resources or add time to the job. This is similar to the behaviour a compiler exhibits when a source file hasn't changed since the last compile.
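The skip rule amounts to a build-tool-style timestamp check. A minimal sketch of that check, in plain Python rather than the Cascading API (the helper name is illustrative):

```python
import os

def is_stale(inputs, outputs):
    """Return True if the flow should run: an output file is missing,
    or some input is newer than the oldest output."""
    if not all(os.path.exists(p) for p in outputs):
        return True
    newest_input = max(os.path.getmtime(p) for p in inputs)
    oldest_output = min(os.path.getmtime(p) for p in outputs)
    return newest_input > oldest_output
```

A Flow whose outputs all exist and are newer than every input would be skipped, just as make skips an up-to-date target.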
This is very handy if you have a large number of jobs that should be executed as a logical unit with varying dependencies between them. Just pass them to the CascadeConnector, and let it sort them all out.
Copyright © 2007-2008 Concurrent, Inc. All Rights Reserved.