Cascading 3.0 User Guide - Cascades

Cascades

cascade

A Cascade allows multiple Flow instances to be executed as a single logical unit. If there are dependencies between the Flows, they are executed in the correct order.

Further, Cascade instances act like a compiler build file: a Cascade only executes Flows that have stale sinks (i.e., output data that is older than the input data). For more about flows and sinks, see Skipping Flows.

Creating a Cascade

Example 1. Creating a new Cascade
CascadeConnector connector = new CascadeConnector();
Cascade cascade = connector.connect( flowFirst, flowSecond, flowThird );

When passing Flows to the CascadeConnector, order is not important. The CascadeConnector automatically identifies the dependencies between the given Flows and creates a scheduler that starts each Flow as its data sources become available. If two or more Flow instances have no interdependencies, they are submitted together so that they can execute in parallel.

If an instance of cascading.flow.FlowSkipStrategy is given to a Cascade instance (via the Cascade.setFlowSkipStrategy() method), it is checked for every Flow instance managed by that Cascade, and all skip strategies on those Flow instances are ignored.

The Cascade Topological Scheduler

Cascading has a simple class, Cascade, that executes a collection of Cascading Flows on a target cluster in dependency order.

The CascadeConnector class constructs a Cascade by building a virtual, internal graph that renders each Flow as a "vertex" and renders each file as an "edge." As a Cascade executes, the processes trace the topology of the graph by plotting each vertex in order of dependencies. When all incoming edges (i.e., files) of a vertex are available, it is scheduled on the cluster.

Consider the following example.

  • Flow 1 reads input file A and outputs B.

  • Flow 2 expects input B and outputs C and D.

  • Flow 3 expects input C and outputs E.

In the example above, Flow 1 goes first, Flow 2 goes second, and Flow 3 is last.

If two or more Flows are independent of one another, they are scheduled concurrently.

By default, if any outputs from a Flow are newer than the inputs, the Flow is skipped. The assumption is that the Flow was executed recently, since the output is not stale. So there is no reason to re-execute it and use up resources or add time to the application. A compiler behaves analogously when a source file is not updated before a recompile.

The Cascade topological scheduler is particularly helpful when you have a large set of jobs, with varying interdependencies, that must be executed as a logical unit. You can just pass the jobs to the CascadeConnector, which can determine the sequence of flows to uphold the dependency order.