5. Using and Developing Operations

5.1 Introduction

To use Cascading, it is not strictly necessary to create custom Operations. The Cascading library includes a number of Operations that can be combined into very robust applications. Just as you can chain sed, grep, sort, uniq, awk, etc. in Unix, you can chain existing Cascading Operations. That said, developing custom Operations is very simple in Cascading.

There are four kinds of Operations: Function, Filter, Aggregator, and Buffer.

All Operations operate on an input argument Tuple and all Operations other than Filter may return zero or more Tuple object results. That is, a Function can parse a string and return a new Tuple for every value parsed out (one Tuple for each 'word'), or it may create a single Tuple with every parsed value as an element in the Tuple object (one Tuple with "first-name" and "last-name" fields).
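The two Function styles described above can be sketched in plain Java. This is only an illustrative model: the class FunctionSketch and its methods are hypothetical, and a List of Strings stands in for a Tuple. It does not use the Cascading API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the two Function styles; NOT the Cascading API.
// A List<String> stands in for a Tuple (an assumption made for illustration).
class FunctionSketch {

    // Style 1: emit one single-element tuple per word.
    // Blank input yields zero result tuples.
    static List<List<String>> tuplePerWord(String line) {
        List<List<String>> results = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                results.add(Arrays.asList(word));
            }
        }
        return results;
    }

    // Style 2: emit exactly one tuple holding every parsed value as an element.
    static List<List<String>> singleTuple(String line) {
        List<List<String>> results = new ArrayList<>();
        results.add(Arrays.asList(line.trim().split("\\s+")));
        return results;
    }
}
```

For the input "John Smith", the first style produces two one-element tuples, while the second produces a single two-element tuple that could be declared with the fields "first-name" and "last-name".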

In practice, a Function that returns no results is a Filter, but the Filter type has been optimized and can be combined with "logical" filter Operations like Not, And, Or, etc.
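The idea of combining filters logically can be modeled with java.util.function.Predicate. This is a sketch by analogy only: FilterSketch and its constants are hypothetical names, and it assumes a filter answers the question "should this value be removed?". Cascading's own Not, And, and Or Operations are not used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Plain-Java analogy for logical filter composition; NOT the Cascading API.
// Assumption: a filter returns true when the value should be removed.
class FilterSketch {

    // Keep only the values for which the "remove?" predicate answers false.
    static List<String> apply(Predicate<String> remove, List<String> values) {
        List<String> kept = new ArrayList<>();
        for (String value : values) {
            if (!remove.test(value)) {
                kept.add(value);
            }
        }
        return kept;
    }

    static final Predicate<String> IS_EMPTY = String::isEmpty;
    static final Predicate<String> IS_COMMENT = s -> s.startsWith("#");

    // "Or": remove a value when either filter would remove it.
    static final Predicate<String> EMPTY_OR_COMMENT = IS_EMPTY.or(IS_COMMENT);

    // "Not": invert a filter, so everything that is not a comment is removed.
    static final Predicate<String> NOT_COMMENT = IS_COMMENT.negate();
}
```

Unlike a Function, such a filter never produces new values; it only decides whether the current one passes through.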

At runtime, Operations actually receive arguments as an instance of the TupleEntry object. The TupleEntry object holds both an instance of Fields and the current Tuple whose values those Fields name.
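The pairing of field names with the current values can be pictured with a small plain-Java model. EntrySketch below is a hypothetical class written for illustration. It is not Cascading's TupleEntry, and its method names are assumptions.

```java
import java.util.List;

// Plain-Java model of "field names plus current values"; NOT Cascading's TupleEntry.
class EntrySketch {
    private final List<String> fields; // field names, in position order
    private final List<Object> tuple;  // current values, in the same order

    EntrySketch(List<String> fields, List<Object> tuple) {
        if (fields.size() != tuple.size()) {
            throw new IllegalArgumentException("fields and tuple differ in size");
        }
        this.fields = fields;
        this.tuple = tuple;
    }

    // Look a value up by field name.
    Object get(String fieldName) {
        int pos = fields.indexOf(fieldName);
        if (pos < 0) {
            throw new IllegalArgumentException("unknown field: " + fieldName);
        }
        return tuple.get(pos);
    }

    // Look a value up by position.
    Object get(int pos) {
        return tuple.get(pos);
    }
}
```

Holding the names alongside the values lets an Operation address its arguments by name rather than relying on position alone.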

All Operations, other than Filter, must declare result Fields. For example, if a Function was written to parse words out of a String and return a new Tuple for each word, this Function must declare that it intends to return a Tuple with one field named "word". If the Function mistakenly returns a Tuple with more values than the single "word", the process will fail. Operations that do return arbitrary numbers of values in a result Tuple may declare Fields.UNKNOWN.
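The rule that a result must match its declared fields can be sketched as a simple size check. The class DeclaredFieldsSketch, its UNKNOWN sentinel, and the check method are hypothetical stand-ins written for illustration. How Cascading itself performs this validation is not shown in this section.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of checking a result against declared fields; NOT the Cascading API.
class DeclaredFieldsSketch {

    // Hypothetical sentinel playing the role of Fields.UNKNOWN:
    // the operation may return any number of values.
    static final List<String> UNKNOWN = new ArrayList<>();

    // Fail when the result holds a different number of values than declared.
    static void check(List<String> declared, List<?> result) {
        if (declared == UNKNOWN) {
            return; // arbitrary result sizes are allowed
        }
        if (result.size() != declared.size()) {
            throw new IllegalStateException(
                "declared " + declared + " but result has " + result.size() + " values");
        }
    }
}
```

With declared fields of just "word", a one-value result passes, a two-value result fails, and any result passes when UNKNOWN is declared.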

The Cascading planner always attempts to "fail fast" where possible by checking the field name dependencies between Pipes and Operations, but there are some cases the planner cannot account for.

All Operations must be wrapped by either an Each or an Every pipe instance. The pipe is responsible for passing in an argument Tuple and accepting the result Tuple.

Operations are "safe" by default. A safe Operation can be executed multiple times on the same Tuple; that is, it has no side effects and is idempotent. If an Operation is not idempotent, its isSafe() method must return false. This value influences how the Cascading planner renders the Flow under certain circumstances.
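The difference between a safe and an unsafe Operation can be illustrated with a plain-Java model. OperationSketch, UpperCase, and AppendSequence are hypothetical types written for this sketch. Only the method name isSafe() comes from the text above, and the real Cascading interfaces are not used.

```java
import java.util.Locale;
import java.util.concurrent.atomic.AtomicLong;

// Plain-Java model of the "safe by default" convention; NOT the Cascading API.
interface OperationSketch {
    String apply(String value);

    // Safe unless an implementation says otherwise.
    default boolean isSafe() {
        return true;
    }
}

// Safe: the same input always gives the same output, with no side effects.
class UpperCase implements OperationSketch {
    public String apply(String value) {
        return value.toUpperCase(Locale.ROOT);
    }
}

// Not safe: each call advances a counter, so re-running it on the same
// value produces a different result.
class AppendSequence implements OperationSketch {
    private final AtomicLong counter = new AtomicLong();

    public String apply(String value) {
        return value + "-" + counter.incrementAndGet();
    }

    @Override
    public boolean isSafe() {
        return false;
    }
}
```

Applying UpperCase twice to the same value gives the same answer both times, so the default of true is appropriate; AppendSequence gives a different answer on each call and therefore overrides isSafe() to return false.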

Copyright © 2007-2008 Concurrent, Inc. All Rights Reserved.