5. Using and Developing Operations

5.1 Introduction

So far we've talked about setting up sources and sinks, shaping the data streams, referencing the data fields, and so on. Within this Pipe framework, Operations are used to act upon the data - e.g., alter it, filter it, analyze it, or transform it. You can use the standard Operations in the Cascading library to create powerful and robust applications by combining them in chains (much like Unix operations such as sed, grep, sort, uniq, and awk). And if you want to go further, it's also very simple to develop custom Operations in Cascading.

There are four kinds of Operations: Function, Filter, Aggregator, and Buffer.

Operations typically require an input argument Tuple to act on. And all Operations can return zero or more Tuple object results - except Filter, which simply returns a Boolean indicating whether to discard the current Tuple. A Function, for instance, can parse a string passed by an argument Tuple and return a new Tuple for every value parsed (i.e., one Tuple for each "word"), or it may create a single Tuple with every parsed value included as an element in one Tuple object (e.g., one Tuple with "first-name" and "last-name" fields).

In theory, a Function can be used as a Filter by not emitting a Tuple result. However, the Filter type is optimized for filtering, and can be combined with logical Operations such as Not, And, Or, etc.

During runtime, Operations actually receive arguments as one or more instances of the TupleEntry object. The TupleEntry object holds the current Tuple and a Fields object that defines field names for positions within the Tuple.

Except for Filter, all Operations must declare result Fields, and if the actual output does not match the declaration, the process will fail. For example, consider a Function written to parse words out of a String and return a new Tuple for each word. If it declares that its intended output is a Tuple with a single field named "word", and then returns more values in the Tuple beyond that single "word", processing will halt. However, Operations designed to return arbitrary numbers of values in a result Tuple may declare Fields.UNKNOWN.

The Cascading planner always attempts to "fail fast" where possible by checking the field name dependencies between Pipes and Operations, but there may be some cases the planner can't account for.

All Operations must be wrapped by either an Each or an Every pipe instance. The pipe is responsible for passing in an argument Tuple and accepting the resulting output Tuple.

Operations by default are assumed by the Cascading planner to be "safe". A safe Operation is idempotent; it can safely execute multiple times on the exact same record or Tuple; it has no side-effects. If a custom Operation is not idempotent, the method isSafe() must return false. This value influences how the Cascading planner renders the Flow under certain circumstances.

Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.