To use Cascading, it is not strictly necessary to create custom Operations. The Cascading library includes a number of Operations that can be combined into very robust applications. In the same way you can chain sed, grep, sort, uniq, awk, etc. in Unix, you can chain existing Cascading Operations. Developing custom Operations, however, is very simple in Cascading.
There are four kinds of Operations: Function, Filter, Aggregator, and Buffer.
All Operations operate on an input argument Tuple, and all Operations other than Filter may return zero or more Tuple object results. That is, a Function can parse a string and return a new Tuple for every value parsed out (one Tuple for each "word"), or it may create a single Tuple with every parsed value as an element in the Tuple object (one Tuple with "first-name" and "last-name" fields).
In practice, a Function that returns no results is a Filter, but the Filter type has been optimized and can be combined with "logical" filter Operations like Not, And, Or, etc.
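The idea of combining filters with logical operators can be sketched in plain Java. This is only a model, not Cascading code: java.util.function.Predicate stands in for a Filter, the class and constant names are hypothetical, and the sketch assumes that a filter answers true when a value should be removed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Plain-Java model only: Predicate stands in for a Filter.
// Assumption of this sketch: a filter answers true when the value should be removed.
class FilterCompositionSketch {
    static final Predicate<String> IS_BLANK = s -> s.trim().isEmpty();
    static final Predicate<String> IS_COMMENT = s -> s.startsWith("#");

    // "Or": remove a value when either filter would remove it.
    static final Predicate<String> BLANK_OR_COMMENT = IS_BLANK.or(IS_COMMENT);

    // "Not": invert a filter, so everything except comments is removed.
    static final Predicate<String> NOT_COMMENT = IS_COMMENT.negate();

    // Keeps every value the given filter does not remove.
    static List<String> apply(List<String> input, Predicate<String> remove) {
        List<String> kept = new ArrayList<>();
        for (String value : input) {
            if (!remove.test(value)) {
                kept.add(value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "", "# c", "b");
        System.out.println(apply(lines, BLANK_OR_COMMENT)); // [a, b]
        System.out.println(apply(lines, NOT_COMMENT));      // [# c]
    }
}
```

Because each combinator yields another filter, combined filters can themselves be combined further, which is the property the "logical" filter Operations rely on.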
During runtime, Operations actually receive arguments as an instance of the TupleEntry object. The TupleEntry object holds both an instance of Fields and the current Tuple that the Fields object defines fields for.
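The pairing of field names with the current values can be pictured with a small plain-Java model. The class below is hypothetical and is not the Cascading TupleEntry; it only illustrates how a list of names and a list of values at the same positions allow lookup by name or by position.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java model only: not the Cascading TupleEntry class.
class TupleEntrySketch {
    private final List<String> fields; // field names, in position order
    private final List<?> tuple;       // current values, at the same positions

    TupleEntrySketch(List<String> fields, List<?> tuple) {
        if (fields.size() != tuple.size()) {
            throw new IllegalArgumentException("fields and tuple differ in size");
        }
        this.fields = fields;
        this.tuple = tuple;
    }

    // Looks a value up by the name its field declares.
    Object get(String fieldName) {
        int position = fields.indexOf(fieldName);
        if (position < 0) {
            throw new IllegalArgumentException("unknown field: " + fieldName);
        }
        return tuple.get(position);
    }

    // Looks a value up by position.
    Object get(int position) {
        return tuple.get(position);
    }

    public static void main(String[] args) {
        TupleEntrySketch entry = new TupleEntrySketch(
            Arrays.asList("first-name", "last-name"),
            Arrays.asList("Ada", "Lovelace"));
        System.out.println(entry.get("last-name")); // Lovelace
        System.out.println(entry.get(0));           // Ada
    }
}
```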
All Operations, other than Filter, must declare result Fields. For example, if a Function was written to parse words out of a String and return a new Tuple for each word, this Function must declare that it intends to return a Tuple with one field named "word". If the Function mistakenly returns more values in the Tuple than the single "word" value, the process will fail. Operations that do return an arbitrary number of values in a result Tuple may declare Fields.UNKNOWN.
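The relationship between declared fields and returned values can be illustrated with a plain-Java sketch. Nothing here is Cascading API: the class, the UNKNOWN sentinel, and the check method are hypothetical stand-ins, and the sketch rejects any size mismatch, which is a simplification of whatever checks Cascading itself performs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java model only: none of these names are Cascading classes.
class DeclaredFieldsSketch {
    // Sentinel modelling "unknown" result fields: any number of values is accepted.
    static final List<String> UNKNOWN =
        Collections.unmodifiableList(new ArrayList<String>());

    // The word-parsing function declares a single result field.
    static final List<String> WORD_FIELDS = Collections.singletonList("word");

    // Emits one single-value result per word found in the line.
    static List<List<String>> splitWords(String line) {
        List<List<String>> results = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                results.add(Collections.singletonList(word));
            }
        }
        return results;
    }

    // Fails when a result carries a different number of values than declared.
    static void check(List<String> declared, List<?> result) {
        if (declared != UNKNOWN && declared.size() != result.size()) {
            throw new IllegalStateException(
                "declared " + declared + " but got " + result);
        }
    }

    public static void main(String[] args) {
        for (List<String> result : splitWords("the quick fox")) {
            check(WORD_FIELDS, result); // passes: one value per result
            System.out.println(result);
        }
        check(UNKNOWN, Arrays.asList("a", "b", "c")); // passes: size is not checked
        try {
            check(WORD_FIELDS, Arrays.asList("fox", "extra")); // two values, one declared
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```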
The Cascading planner always attempts to "fail fast" where possible by checking the field name dependencies between Pipes and Operations, but there are some cases the planner can't account for.
All Operations must be wrapped by either an Each or an Every pipe instance. The pipe is responsible for passing in an argument Tuple and accepting the result Tuple.
Operations, by default, are "safe". A safe Operation can be executed multiple times on the same Tuple; that is, it has no side effects and is idempotent. If an Operation is not idempotent, its isSafe() method must return false. This value influences how the Cascading planner renders the Flow under certain circumstances.
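The difference between a safe and an unsafe operation can be seen in a small plain-Java sketch. The Op interface and both implementations are hypothetical and do not use Cascading classes; only the isSafe() name is borrowed from the text above.

```java
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;

// Plain-Java model only: Op is a hypothetical stand-in for an Operation.
class SafeOperationSketch {
    interface Op {
        String apply(String value);
        boolean isSafe();
    }

    // No side effects: running it twice on the same value gives the same result.
    static final Op UPPER = new Op() {
        public String apply(String value) { return value.toUpperCase(Locale.ROOT); }
        public boolean isSafe() { return true; }
    };

    // Keeps a running counter, so repeated execution changes the result.
    static Op numbering() {
        final AtomicInteger counter = new AtomicInteger();
        return new Op() {
            public String apply(String value) {
                return counter.incrementAndGet() + ":" + value;
            }
            public boolean isSafe() { return false; }
        };
    }

    public static void main(String[] args) {
        System.out.println(UPPER.apply("a").equals(UPPER.apply("a")));   // true
        Op unsafe = numbering();
        System.out.println(unsafe.apply("a").equals(unsafe.apply("a"))); // false
    }
}
```

An operation like the second one would need to report false from isSafe(), since executing it again on the same input is observable in its output.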
Copyright © 2007-2008 Concurrent, Inc. All Rights Reserved.