So far we've talked about setting up sources and sinks, shaping the data streams, referencing the data fields, and so on. Within this Pipe framework, Operations are used to act upon the data - e.g., alter it, filter it, analyze it, or transform it. You can use the standard Operations in the Cascading library to create powerful and robust applications by combining them in chains (much like Unix operations such as sed, grep, sort, uniq, and awk). And if you want to go further, it's also very simple to develop custom Operations in Cascading.
There are four kinds of Operations:
Operations typically require an input argument Tuple to act on.
And all Operations can return zero or more Tuple object results - except
Filter, which simply returns a Boolean indicating
whether to discard the current Tuple. A
for instance, can parse a string passed by an argument Tuple and return
a new Tuple for every value parsed (i.e., one Tuple for each "word"), or
it may create a single Tuple with every parsed value included as an
element in one Tuple object (e.g., one Tuple with "first-name" and
In theory, a
Function can be used as a
Filter by not emitting a Tuple result. However,
Filter type is optimized for filtering, and
can be combined with logical Operations such as
During runtime, Operations actually receive arguments as one or
more instances of the
TupleEntry object. The
TupleEntry object holds the current
Tuple and a
Fields object that defines field names for
positions within the Tuple.
Filter, all Operations must
declare result Fields, and if the actual output does not match the
declaration, the process will fail. For example, consider a
Function written to parse words out of a String
and return a new Tuple for each word. If it declares that its intended
output is a Tuple with a single field named "word", and then returns
more values in the Tuple beyond that single "word", processing will
halt. However, Operations designed to return arbitrary numbers of values
in a result Tuple may declare
The Cascading planner always attempts to "fail fast" where possible by checking the field name dependencies between Pipes and Operations, but there may be some cases the planner can't account for.
All Operations must be wrapped by either an
Each or an
instance. The pipe is responsible for passing in an argument Tuple and
accepting the resulting output Tuple.
Operations by default are assumed by the Cascading planner to be
"safe". A safe Operation is idempotent; it can safely execute multiple
times on the exact same record or Tuple; it has no side-effects. If a
custom Operation is not idempotent, the method
false. This value influences how the Cascading
planner renders the Flow under certain circumstances.
Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.