10.2 The Cascade Topological Scheduler

10.2 The Cascade Topological Scheduler
Prev	10. How It Works

Cascading has a simple class, Cascade , that will take a collection of Cascading Flows and execute them on the target cluster in dependency order.

Consider the following example.

Flow 'first' reads input file A and outputs B.
Flow 'second' expects input B and outputs C and D.
Flow 'third' expects input C and outputs E.

A Cascade is constructed through the CascadeConnector class, by building an internal graph that makes each Flow a 'vertex', and each file an 'edge'. A topological walk on this graph will touch each vertex in order of its dependencies. When a vertex has all it's incoming edges (files) available, it will be scheduled on the cluster.

In the example above, 'first' goes first, 'second' goes second, and 'third' is last.

If two or more Flows are independent of one another, they will be scheduled concurrently.

And by default, if any outputs from a Flow are newer than the inputs, the Flow is skipped. The assumption is that the Flow was executed recently, since the output isn't stale. So there is no reason to re-execute it and use up resources or add time to the job. This is similar behaviour a compiler would exhibit if a source file wasn't updated before a recompile.

This is very handy if you have a large number of jobs that should be executed as a logical unit with varying dependencies between them. Just pass them to the CascadeConnector, and let it sort them all out.