Cascading 2.5 User Guide

Concurrent, Inc.

November 2013


Table of Contents

1. About Cascading
1.1. What is Cascading?
1.2. Usage Scenarios
Why use Cascading?
Who are the users?
1.3. What is Apache Hadoop?
1.4. Hadoop 1 vs Hadoop 2
2. Diving In
3. Data Processing
3.1. Terminology
3.2. Pipe Assemblies
Pipe Assembly Workflow
Common Stream Patterns
Data Processing
3.3. Pipes
Types of Pipes
The Each and Every Pipes
Merge
GroupBy
CoGroup
HashJoin
Setting Custom Pipe Properties
3.4. Platforms
3.5. Source and Sink Taps
Schemes
Taps
3.6. Sink Modes
3.7. Fields Sets
3.8. Flows
Creating Flows from Pipe Assemblies
Configuring Flows
Skipping Flows
Creating Flows from a JobConf
Creating Custom Flows
3.9. Cascades
4. Executing Processes on Hadoop
4.1. Introduction
4.2. Building
4.3. Configuring
4.4. Executing
4.5. Debugging
5. Using and Developing Operations
5.1. Introduction
5.2. Functions
5.3. Filter
5.4. Aggregator
5.5. Buffer
5.6. Operation and BaseOperation
6. Custom Taps and Schemes
6.1. Introduction
6.2. Custom Taps
6.3. Custom Schemes
7. Field Typing and Type Coercion
7.1. Field Typing
7.2. Type Coercion
8. Advanced Processing
8.1. SubAssemblies
8.2. Stream Assertions
8.3. Failure Traps
8.4. Checkpointing
8.5. Restarting a Checkpointed Flow
8.6. Flow and Cascade Event Handling
8.7. PartitionTaps
8.8. Partial Aggregation instead of Combiners
9. Built-In Operations
9.1. Identity Function
9.2. Debug Function
9.3. Sample and Limit Functions
9.4. Insert Function
9.5. Text Functions
9.6. Regular Expression Operations
9.7. Java Expression Operations
9.8. XML Operations
9.9. Assertions
9.10. Logical Filter Operators
9.11. Buffers
10. Built-In Assemblies
10.1. AggregateBy
AverageBy
CountBy
SumBy
FirstBy
10.2. Coerce
10.3. Discard
10.4. Rename
10.5. Retain
10.6. Unique
11. Best Practices
11.1. Unit Testing
11.2. Flow Granularity
11.3. SubAssemblies, not Factories
11.4. Logical Responsibilities for SubAssemblies
11.5. Java Operators in Field Names
11.6. Debugging Planner Failures
11.7. Optimizing Joins
11.8. Debugging Streams
11.9. Handling Good and Bad Data
11.10. Maintaining State in Operations
11.11. Custom Types
11.12. Fields Constants
11.13. Checking the Source Code
12. Extending Cascading
12.1. Scripting
12.2. Custom Types and Serialization
12.3. Custom Comparators and Hashing
13. Cookbook
13.1. Tuples and Fields
13.2. Stream Shaping
13.3. Common Operations
13.4. Stream Ordering
13.5. API Usage
14. How It Works
14.1. MapReduce Job Planner
14.2. The Cascade Topological Scheduler

Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.