Cascading 3.1 User Guide - The Cascading Process Planner

1. Introduction: 1.1. What Is Cascading?

1.2. Another Perspective

1.3. Why Use Cascading?

1.4. The Cascading Philosophy

1.5. Who Are the Users?
2. Diving into the APIs: 2.1. Anatomy of a Word-Count Application

2.2. Fluid: An Alternative Fluent API
3. Cascading Basic Concepts: 3.1. Terminology

3.2. Pipe Assemblies

3.3. Pipes

3.4. Platforms

3.5. Sourcing and Sinking Data

3.6. Sink Modes

3.7. Flows
4. Tuple Fields: 4.1. Field Sets

4.2. Field Algebra

4.3. Field Typing

4.4. Type Coercion
5. Pipe Assemblies: 5.1. Each and Every Pipes

5.2. Merge

5.3. GroupBy

5.4. CoGroup

5.5. HashJoin
6. Flows: 6.1. Creating Flows from Pipe Assemblies

6.2. Configuring Flows

6.3. Skipping Flows

6.4. Creating Custom Flows

6.5. Process Levels in the Flow Hierarchy

6.6. Runtime Metrics
7. Cascades: 7.1. Creating a Cascade

7.2. The Cascade Topological Scheduler
8. Configuring: 8.1. Introduction

8.2. Creating Properties

8.3. Passing Properties
9. Local Platform: 9.1. Building an Application

9.2. Executing an Application

9.3. Source and Sink Taps

9.4. Troubleshooting and Debugging
10. The Apache Hadoop Platforms: 10.1. What is Apache Hadoop?

10.2. Hadoop 1 MapReduce vs. Hadoop 2 MapReduce

10.3. Hadoop 2 MapReduce vs Hadoop 2 Tez

10.4. Configuring Applications

10.5. Building an Application

10.6. Executing an Application

10.7. Troubleshooting and Debugging

10.8. Source and Sink Taps

10.9. Custom Taps and Schemes

10.10. Partial Aggregation instead of Combiners

10.11. Custom Types and Serialization
11. Apache Hadoop MapReduce Platform: 11.1. Configuring Applications

11.2. Creating Flows from a JobConf

11.3. Building
12. Apache Tez Platform: 12.1. Configuring Applications

12.2. Building
13. Using and Developing Operations: 13.1. Introduction

13.2. Functions

13.3. Filters

13.4. Aggregators

13.5. Buffers

13.6. Operation and BaseOperation
14. Custom Taps and Schemes: 14.1. Introduction

14.2. Custom Taps

14.3. Custom Schemes

14.4. Taps with File and Nonfile Resources

14.5. Tap Life-Cycle Methods
15. Advanced Processing: 15.1. SubAssemblies

15.2. Stream Assertions

15.3. Failure Traps

15.4. Checkpointing

15.5. Restarting a Checkpointed Flow

15.6. Flow and Cascade Event Handling

15.7. PartitionTaps

15.8. Partial Aggregation instead of Combiners
16. Built-In Operations: 16.1. Identity Function

16.2. Debug Function

16.3. Sample and Limit Functions

16.4. Insert Function

16.5. Text Functions

16.6. Regular Expression Operations

16.7. Java Expression Operations

16.8. XML Operations

16.9. Assertions

16.10. Logical Filter Operators

16.11. Buffers
17. Built-in SubAssemblies: 17.1. Optimized Aggregations

17.2. Stream Shaping
18. Cascading Best Practices: 18.1. Unit Testing

18.2. Flow Granularity

18.3. SubAssemblies, not Factories

18.4. Logical Responsibilities for SubAssemblies

18.5. Java Operators in Field Names

18.6. Debugging Planner Failures

18.7. Optimizing Joins

18.8. Debugging Streams

18.9. Handling Good and Bad Data

18.10. Maintaining State in Operations

18.11. Fields Constants

18.12. Checking the Source Code
19. Extending Cascading: 19.1. Scripting

19.2. Custom Types and Serialization

19.3. Custom Comparators and Hashing
20. Cookbook: Code Examples of Cascading Idioms: 20.1. Tuples and Fields

20.2. Stream Shaping

20.3. Common Operations

20.4. Stream Ordering

20.5. API Usage
21. The Cascading Process Planner: 21.1. FlowConnector

21.2. RuleRegistrySet

21.3. RuleRegistry

21.4. Debugging RuleRegistrySets

The Cascading Process Planner

For an introduction to the thinking and design behind the query planner, see the The Cascading 3.0 Query Planner blog posting.

FlowConnector

All FlowConnector subclasses provide a custom RuleRegistrySet. The component layers of RuleRegistrySet can be summarized as:

RuleRegistrySet consists of one or more RuleRegistry instances
RuleRegistry instances consist of Rule instances

A RuleRegistrySet is mutable. Consequently, the default registry can be retrieved via FlowConnector.getRuleRegistrySet() and updated.

The various FlowConnector subclasses are covered in Platforms.

RuleRegistrySet

The RuleRegistrySet manages one or more RuleRegistry instances.

During planning, all registered RuleRegistry instances are used independently to create candidate execution plans. If more than one RuleRegistry is available, the instances are executed in parallel. However, a total plan timeout can be provided in the case a planner executes for too long.

Either the first or "best" plan is utilized when the planner completes, depending on the value of RuleRegistrySet.setSelect(). The value can be one of the following values:

Select.FIRST: The first RuleRegistry instance to complete is used. All remaining planners are cancelled.
Select.COMPARED: The best RuleRegistry instance is used, where "best" is a function of the registred planComparator. The planComparator is set by calling RuleRegistrySet.setPlanComparator().

If not set, the default behavior is to choose the plan with the fewest FlowStep instances. If the number of FlowStep instances among plans is equal, the plan with the fewest FlowNode instances is chosen. Otherwise, the order in which the RuleRegistry instances were registered prevails.

Relying on the fewest FlowStep and FlowNode instances in the resulting plan is a simplified version of cost. If competing plans can be measured differently, supply a custom Comparator to perform the comparison.

Support for multiple registries is important. As Cascading advances, primitives at the Pipe level are added that offer new optimizations or capabilities.

These additions have consequences when related to the set of other pipe primitives being utilized. Each new primitive adds to the time complexity of finding a suitable plan. Frequently two or more primitives working together add yet another topological complexity (some combinations could cause runtime or planner failures if not accounted for).

The end result is that a one-size-fits-all rule set, in order to be safe, makes conservative decisions so that plan time and the resulting execution are reasonable instead of optimal.

By allowing multiple rule registries to be declared in the RuleRegistrySet, any given registry can decide if it is applicable, and if so, will apply itself. If not applicable, it will leave the competition.

For example, the Cascading Apache Tez planner provides one rule registry that supports HashJoin pipes, and one that does not.

Under very complex scenarios, when HashJoins are in play, the planner can execute must faster if it makes some compromises to limit the search space the planner must navigate to find a functional plan. When there are no HashJoins, it isn’t worth the time to find and apply the compromises.

RuleRegistry

The RuleRegistry consists of a set of rules. Rules in short navigate and mutate what is known as an element graph, which is simply a directed acyclic graph (DAG) of Pipe and Tap instances connected by edges (Scope instances).

The input of a rule is an element graph, and the output is a PlannerException, a modified element graph, or a subgraph of the given element graph.

Types of Rules

The following are the main types of rules:

RuleAssert: This rule simply provides an assertion mechanism about the structure of the pipe assembly. If a Pipe or Tap is misplaced or unsupported, it can in effect cause a syntax error. The error message includes the name and location of the problematic element.
RuleTransformer: This rule provides a means to modify or mutate an element graph by either adding or removing elements.

In addition to mutating an element graph, a transformer can annotate the graph with metadata that can inject logic to downstream rules or Cascading itself during runtime. Identifying the accumulated and streamed sides of a HashJoin greatly simplifies the construction of the actual execution logic.
RulePartitioner: This rule provides a means to break a large graph into smaller graphs. This is where a pipe assembly along with attached sources and sinks are broken down into FlowStep, FlowNode, and possibly FlowPipeline instances.

A partitioner can partition the current top-level element graph or repartition an element subgraph into new subgraphs.

See the section on Flow Process Hierarchy for details on the roles these classes play.

Phases for Rules

Rules are applied at different phases. The following is a list of PlanPhase types:

PreBalanceAssembly: The PreBalanceAssembly phase is where most assertions are applied.
BalanceAssembly: The BalanceAssembly phase is where, for example with MapReduce, intermediate Tap instances are injected into the element graph.
PostBalanceAssembly: The PostBalanceAssembly phase provides a means to perform any cleanup after the balancing phase.
PreResolveAssembly: The PreResolveAssembly phase is where any no op Pipe instances or unnecessary Debug or Assertion filters should be removed (depending on the configured DebugLevel or AssertionLevel, respectively.)
ResolveAssembly: The ResolveAssembly is where Cascading performs all field resolution by inspecting the source and sink Taps and any Operations that produce or require fields. The general purpose of this phase is to ensure that all field-level dependencies are satisfied.

No rules are applied in this phase.
PostResolveAssembly: The PostResolveAssembly phase is where any logical optimizations could be applied based on now fully resolved field names in the element graph prior to any subgraph partitioning.
PartitionSteps: The PartitionSteps phase is where the element graphs that represent the work that would be contained in individual FlowStep instances are found.

All MapReduce jobs are separated into individual element subgraphs, each bounded by source and sink Taps, or intermediate, temporary source and sink Taps.
PostSteps: The PostSteps phase provides a means to clean up post-step partitioning.
PartitionNodes: The PartitionNodes phase is where the element graphs that represent the work that would be contained in individual FlowNode instances are found.
PostNodes: The PostNodes phase provides a means to clean up post-node partitioning. Frequently malformed remainder subgraphs can result in rules that were already applied. This phase is the best opportunity to remove these graphs.
PartitionPipelines: The PartitionPipelines phase is where the element graphs that represent the work that would be contained in individual FlowPipeline instances are found.

In the case of MapReduce, a mapper function can have multiple discrete input paths that correspond to different computation paths. Consider joining two files: one pipeline to process the left side and another to process the right side. The side to process within the FlowNode is determined at runtime when the child mapper JVM is instantiated and handed an input split, a portion of one of the input files, which in turn corresponds to one of the pipelines.
PostPipelines: The PostPipelines phase provides a means to clean up post-pipeline partitioning. Frequently malformed remainder subgraphs can result in rules that were applied. This phase is the best opportunity to remove these graphs.

Debugging RuleRegistrySets

You can debug planner rules by enabling the various trace mechanisms outlined in the section on Debugging Planner Failures.