Cascading 3.3 User Guide - Local Platform

Local Platform

Building an Application

Cascading local mode has no special build requirements beyond those of any Java application that is executed from the command line. However, two top-level dependencies should be added to the build file:

cascading-core-3.x.y.jar

This JAR contains the Cascading Core class files.

cascading-local-3.x.y.jar

This JAR contains the Cascading local-mode class files.
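As an illustration, the two dependencies above might be declared in a Maven build file as follows. This is a sketch only: the groupId shown is an assumption and should be checked against the repository that hosts the Cascading artifacts, and 3.x.y is a placeholder for the release actually in use.

```xml
<!-- Sketch only: the groupId is an assumption; replace 3.x.y with the real version. -->
<dependencies>
  <dependency>
    <groupId>cascading</groupId>
    <artifactId>cascading-core</artifactId>
    <version>3.x.y</version>
  </dependency>
  <dependency>
    <groupId>cascading</groupId>
    <artifactId>cascading-local</artifactId>
    <version>3.x.y</version>
  </dependency>
</dependencies>
```

The repository that serves these artifacts must also be configured in the build; that part is omitted here.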

Executing an Application

After completing a build of the application’s "main" class, the application can be run like any other Java-based command-line application.

Source and Sink Taps

Taps

Cascading local mode provides only one platform-specific Tap type:

FileTap

The cascading.tap.local.FileTap tap is used with the Cascading local platform to access files on the local filesystem.
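A minimal sketch of FileTap in use, assuming the cascading-core and cascading-local JARs are on the classpath. It copies a text file line by line with the local-mode TextLine scheme. The class name LocalCopy, the pipe name, and the flow name are hypothetical and chosen only for illustration.

```java
import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.local.LocalFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.local.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.local.FileTap;
import cascading.tuple.Fields;

// Hypothetical example class; not part of Cascading.
public class LocalCopy {

  /** Copies a text file line by line on the Cascading local platform. */
  public static void copy(String inputPath, String outputPath) {
    // Source: each line of the input file becomes a tuple with one field, "line".
    Tap source = new FileTap(new TextLine(new Fields("line")), inputPath);

    // Sink: writes every incoming field; an existing output file is replaced.
    Tap sink = new FileTap(new TextLine(), outputPath, SinkMode.REPLACE);

    // A pipe with no operations passes tuples through unchanged.
    Pipe pipe = new Pipe("copy");

    FlowDef flowDef = FlowDef.flowDef()
      .setName("local-copy")
      .addSource(pipe, source)
      .addTailSink(pipe, sink);

    // Plan the flow for the local platform, then run it in this JVM.
    Flow flow = new LocalFlowConnector().connect(flowDef);
    flow.complete();
  }

  public static void main(String[] args) {
    copy(args[0], args[1]);
  }
}
```

Because the whole flow runs in the current JVM, a method like copy() can be called directly from a unit test against temporary files.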

Troubleshooting and Debugging

IDE debugging and testing in Cascading local mode, unlike on other platforms, is straightforward because all processing happens in the local JVM and in local memory. Therefore, the first recommendation for debugging a Cascading application on any platform is to write tests that run in Cascading local mode.

Because Cascading local mode runs entirely in memory, large data sets may cause a java.lang.OutOfMemoryError. When working with larger inputs, adjust the JVM memory settings (for example, the maximum heap size set with -Xmx) accordingly.
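Before running a large local-mode job, it can help to confirm how much heap the JVM is actually allowed to use. The following sketch uses only the standard Java Runtime API. The class name HeapCheck is hypothetical.

```java
// Hypothetical helper; uses only the standard Java Runtime API.
public class HeapCheck {

  /** Returns the maximum amount of memory the JVM will attempt to use, in megabytes. */
  public static long maxHeapMegabytes() {
    return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
  }

  public static void main(String[] args) {
    System.out.println("Max heap (MB): " + maxHeapMegabytes());
  }
}
```

Running it with different settings, for example java -Xmx4g HeapCheck, shows whether the memory option reached the JVM that will execute the flow.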

In addition to using an IDE debugger, you can use two Cascading features to help sort out runtime issues.

One feature is the Debug filter. A best practice is to sprinkle Debug operators (see Debug Function) throughout the pipe assembly and rely on the planner to remove them when the Flow is planned, by setting a DebugLevel.

Debug can only print to the local console via standard output or standard error. This print limitation makes it harder to use Debug on distributed platforms, as operations do not execute locally but on the cluster side. Debug provides the option to print the current field names, and a prefix can be set to help distinguish between instances of the Debug operation.
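The Debug usage described above might look like the following sketch, assuming the Cascading JARs are on the classpath. The class name DebugSketch, its method names, and the prefix "after-parse" are hypothetical.

```java
import cascading.flow.FlowDef;
import cascading.operation.Debug;
import cascading.operation.DebugLevel;
import cascading.pipe.Each;
import cascading.pipe.Pipe;

// Hypothetical example class; not part of Cascading.
public class DebugSketch {

  /** Appends a Debug operation that prints each tuple, with a prefix and the field names. */
  public static Pipe withDebug(Pipe pipe) {
    // "after-parse" distinguishes this Debug instance in the console output;
    // true asks Debug to print the current field names as well.
    // DebugLevel.VERBOSE marks this Debug so the planner can keep or remove it by level.
    return new Each(pipe, DebugLevel.VERBOSE, new Debug("after-parse", true));
  }

  /** Asks the planner to remove all Debug operations from the planned flow. */
  public static FlowDef withoutDebugging(FlowDef flowDef) {
    return flowDef.setDebugLevel(DebugLevel.NONE);
  }
}
```

With this arrangement the Debug calls can stay in the assembly code, and whether they run is decided by the DebugLevel set on the FlowDef.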

Additionally, the actual execution plan for a Flow can be written (and visualized) via the Flow.writeDOT() method. DOT files are simply text representations of graph data and can be read by tools like Graphviz and OmniGraffle.

In Cascading local mode, these execution plans are exactly as the pipe assemblies were coded, except that subassemblies are unwound and the field names across the Flow are resolved by the planner. In other words, Fields.ALL and other wildcards are converted to the actual field names or ordinals.

If the connect() method on the current FlowConnector fails, the resulting PlannerException has a writeDOT() method that writes out the plan as far as the planner had progressed.
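Both writeDOT() methods can be combined in one helper, sketched below on the assumption that the Cascading JARs are on the classpath. The class name PlanDump and the method name connectAndDump are hypothetical.

```java
import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.local.LocalFlowConnector;
import cascading.flow.planner.PlannerException;

// Hypothetical example class; not part of Cascading.
public class PlanDump {

  /**
   * Plans the flow on the local platform and writes a DOT file either way:
   * the resolved plan on success, or the planner's partial result on failure.
   */
  public static Flow connectAndDump(FlowDef flowDef, String dotPath) {
    try {
      Flow flow = new LocalFlowConnector().connect(flowDef);
      flow.writeDOT(dotPath); // the execution plan, with field names resolved
      return flow;
    } catch (PlannerException exception) {
      exception.writeDOT(dotPath); // what the planner had produced before failing
      throw exception;
    }
  }
}
```

The resulting file can then be opened in a DOT viewer such as Graphviz.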

For planner-related errors that appear during runtime when executing a Flow, see the chapter on the Cascading Process Planner.