Cascading - User Guide

Concurrent, Inc.

V 1.2

December, 2010


Table of Contents

1. Cascading
1.1. What is Cascading?
1.2. Who should use Cascading?
1.3. What is Apache Hadoop?
2. Diving In
3. Data Processing
3.1. Introduction
3.2. Pipe Assemblies
Assembling Pipe Assemblies
Each and Every Pipes
GroupBy and CoGroup Pipes
Sorting
3.3. Source and Sink Taps
3.4. Field Algebra
3.5. Flows
Creating Flows from Pipe Assemblies
Configuring Flows
Skipping Flows
Creating Flows from a JobConf
Creating Custom Flows
3.6. Cascades
4. Executing Processes
4.1. Introduction
4.2. Building
4.3. Configuring
4.4. Executing
5. Using and Developing Operations
5.1. Introduction
5.2. Functions
5.3. Filter
5.4. Aggregator
5.5. Buffer
5.6. Operation and BaseOperation
6. Advanced Processing
6.1. SubAssemblies
6.2. Stream Assertions
6.3. Failure Traps
6.4. Event Handling
6.5. Template Taps
6.6. Scripting
6.7. Custom Taps and Schemes
6.8. Custom Types and Serialization
6.9. Partial Aggregation instead of Combiners
7. Built-In Operations
7.1. Identity Function
7.2. Debug Function
7.3. Sample and Limit Functions
7.4. Insert Function
7.5. Text Functions
7.6. Regular Expression Operations
7.7. Java Expression Operations
7.8. XML Operations
7.9. Assertions
7.10. Logical Filter Operators
8. Best Practices
8.1. Unit Testing
8.2. Flow Granularity
8.3. SubAssemblies, not Factories
8.4. Give SubAssemblies Logical Responsibilities
8.5. Java Operators in Field Names
8.6. Debugging Planner Failures
8.7. Optimizing Joins
8.8. Debugging Streams
8.9. Handling Good and Bad Data
8.10. Maintaining State in Operations
8.11. Custom Types
8.12. Fields Constants
8.13. Look at the Source Code
9. CookBook
9.1. Tuples and Fields
9.2. Stream Shaping
9.3. Common Operations
9.4. Stream Ordering
9.5. API Usage
10. How It Works
10.1. MapReduce Job Planner
10.2. The Cascade Topological Scheduler

Copyright © 2007-2008 Concurrent, Inc. All Rights Reserved.