Cascading 2.1 User Guide

Concurrent, Inc.

January 2013


Table of Contents

1. About Cascading
1.1. What is Cascading?
1.2. Usage Scenarios
Why use Cascading?
Who are the users?
1.3. What is Apache Hadoop?
2. Diving In
3. Data Processing
3.1. Terminology
3.2. Pipe Assemblies
Pipe Assembly Workflow
Common Stream Patterns
Data Processing
3.3. Pipes
Types of Pipes
The Each and Every Pipes
Merge
GroupBy
CoGroup
HashJoin
Setting Custom Pipe Properties
3.4. Platforms
3.5. Source and Sink Taps
Schemes
Taps
3.6. Sink Modes
3.7. Fields Sets
3.8. Flows
Creating Flows from Pipe Assemblies
Configuring Flows
Skipping Flows
Creating Flows from a JobConf
Creating Custom Flows
3.9. Cascades
4. Executing Processes on Hadoop
4.1. Introduction
4.2. Building
4.3. Configuring
4.4. Executing
5. Using and Developing Operations
5.1. Introduction
5.2. Functions
5.3. Filter
5.4. Aggregator
5.5. Buffer
5.6. Operation and BaseOperation
6. Custom Taps and Schemes
6.1. Introduction
6.2. Custom Taps
6.3. Custom Schemes
7. Advanced Processing
7.1. SubAssemblies
7.2. Stream Assertions
7.3. Failure Traps
7.4. Checkpointing
7.5. Restarting a Checkpointed Flow
7.6. Flow and Cascade Event Handling
7.7. Template Taps
7.8. Partial Aggregation instead of Combiners
8. Built-in Operations
8.1. Identity Function
8.2. Debug Function
8.3. Sample and Limit Functions
8.4. Insert Function
8.5. Text Functions
8.6. Regular Expression Operations
8.7. Java Expression Operations
8.8. XML Operations
8.9. Assertions
8.10. Logical Filter Operators
9. Built-in Assemblies
9.1. AggregateBy
AverageBy
CountBy
SumBy
FirstBy
9.2. Coerce
9.3. Discard
9.4. Rename
9.5. Retain
9.6. Unique
10. Best Practices
10.1. Unit Testing
10.2. Flow Granularity
10.3. SubAssemblies, not Factories
10.4. Logical Responsibilities for SubAssemblies
10.5. Java Operators in Field Names
10.6. Debugging Planner Failures
10.7. Optimizing Joins
10.8. Debugging Streams
10.9. Handling Good and Bad Data
10.10. Maintaining State in Operations
10.11. Custom Types
10.12. Fields Constants
10.13. Checking the Source Code
11. Extending Cascading
11.1. Scripting
11.2. Custom Types and Serialization
11.3. Custom Comparators and Hashing
12. Cookbook
12.1. Tuples and Fields
12.2. Stream Shaping
12.3. Common Operations
12.4. Stream Ordering
12.5. API Usage
13. How It Works
13.1. MapReduce Job Planner
13.2. The Cascade Topological Scheduler

Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.