13.2 The Cascade Topological Scheduler

13.2 The Cascade Topological Scheduler
Prev	13. How It Works

Cascading has a simple class, Cascade , that executes a collection of Cascading Flows on a target cluster in dependency order.

Consider the following example.

Flow 1 reads input file A and outputs B.
Flow 2 expects input B and outputs C and D.
Flow 3 expects input C and outputs E.

A Cascade is constructed through the CascadeConnector class, by building an internal graph that makes each Flow a "vertex", and each file an "edge". A topological walk on this graph will touch each vertex in order of its dependencies. When a vertex has all its incoming edges (i.e., files) available, it is scheduled on the cluster.

In the example above, Flow 1 goes first, Flow 2 goes second, and Flow 3 is last.

If two or more Flows are independent of one another, they are scheduled concurrently.

And by default, if any outputs from a Flow are newer than the inputs, the Flow is skipped. The assumption is that the Flow was executed recently, since the output isn't stale. So there is no reason to re-execute it and use up resources or add time to the job. This is similar behavior a compiler would exhibit if a source file wasn't updated before a recompile.

This is very handy if you have a large set of jobs, with varying interdependencies between them, that needs to be executed as a logical unit. Just pass them to the CascadeConnector and let it sort them all out.