All custom Scheme classes must subclass the cascading.scheme.Scheme abstract class and implement the required methods.
A Scheme is ultimately responsible for sourcing and sinking Tuples of data. Consequently, it must know what Fields it presents during sourcing and what Fields it accepts during sinking. The constructors on the base Scheme type must therefore be given the source and sink Fields.
A Scheme is allowed to source different Fields than it sinks. The TextLine Scheme does just this. (The TextDelimited Scheme, on the other hand, forces the source and sink Fields to be the same.)
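As a rough illustration of a Scheme that sources different Fields than it sinks, the sketch below uses plain Java stand-ins rather than the Cascading API. The class names SketchScheme and LineSketchScheme, and the field names "offset" and "line", are assumptions made for this example only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a scheme base type; this is NOT cascading.scheme.Scheme.
abstract class SketchScheme {
  private final List<String> sourceFields;
  private final List<String> sinkFields;

  // The base constructor receives both field declarations up front.
  protected SketchScheme(List<String> sourceFields, List<String> sinkFields) {
    this.sourceFields = Collections.unmodifiableList(sourceFields);
    this.sinkFields = Collections.unmodifiableList(sinkFields);
  }

  List<String> getSourceFields() { return sourceFields; }

  List<String> getSinkFields() { return sinkFields; }
}

// Assumed line-oriented scheme: presents two fields when sourcing, accepts one when sinking.
class LineSketchScheme extends SketchScheme {
  LineSketchScheme() {
    super(Arrays.asList("offset", "line"), Arrays.asList("line"));
  }
}
```

A scheme that forces both sides to match would simply pass the same list to both constructor arguments.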
The retrieveSourceFields() and retrieveSinkFields() methods allow a custom Scheme to fetch its source and sink Fields immediately before the planner is invoked - for example, from the header of a file, as is the case with TextDelimited. In addition, the presentSourceFields() and presentSinkFields() methods notify the Scheme of the Fields that the planner expects the Scheme to handle - for example, to write the field names as a header, as is the case with TextDelimited.
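The header case can be sketched in plain Java. This is not how TextDelimited is implemented; HeaderFields, fromHeader() and toHeader() are hypothetical names, and the sketch assumes a single-character-style literal delimiter with no quoting or escaping.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the Cascading API.
final class HeaderFields {
  private HeaderFields() {}

  // Reads only the first line and splits it on the delimiter to obtain field names.
  static List<String> fromHeader(Reader input, String delimiter) {
    try {
      BufferedReader reader = new BufferedReader(input);
      String header = reader.readLine();
      if (header == null) {
        throw new IllegalArgumentException("input is empty, no header line to read");
      }
      // Pattern.quote treats the delimiter literally; -1 keeps trailing empty names.
      return Arrays.asList(header.split(Pattern.quote(delimiter), -1));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Mirror image for sinking: joins the expected field names into a header line.
  static String toHeader(List<String> fieldNames, String delimiter) {
    return String.join(delimiter, fieldNames);
  }
}
```

In this sketch, fromHeader() plays the role of fetching Fields before planning, and toHeader() plays the role of writing out the Fields the Scheme has been told to handle.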
Every Scheme is given the opportunity to set any custom properties the underlying platform requires, via the methods sourceConfInit() (for a Tuple source tap) and sinkConfInit() (for a Tuple sink tap). These methods may be called more than once with new configuration objects, and should be idempotent.
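One way to read the idempotency requirement is sketched below. java.util.Properties stands in for a platform configuration object, and the class name ConfInitSketch and every property key are assumptions invented for the example, not real Cascading or Hadoop settings.

```java
import java.util.Properties;

// Hypothetical sketch; Properties stands in for a platform configuration object.
final class ConfInitSketch {
  private ConfInitSketch() {}

  // Sets absolute values only, so repeated calls leave the same final state.
  static void sourceConfInit(Properties conf) {
    conf.setProperty("sketch.input.format", "text");   // assumed key
    conf.setProperty("sketch.input.charset", "UTF-8"); // assumed key
  }

  // Counter-example: appending makes the result depend on how often it is called.
  static void nonIdempotentConfInit(Properties conf) {
    String current = conf.getProperty("sketch.input.paths", "");
    conf.setProperty("sketch.input.paths", current + ",extra");
  }
}
```

The first method also keeps no state of its own, so it behaves the same whether it receives a fresh configuration object or one it has already seen.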
On the Hadoop platform, these methods should be used to configure the appropriate org.apache.hadoop.mapred.InputFormat and org.apache.hadoop.mapred.OutputFormat.
A Scheme is always sourced via the source() method, and is always sunk to via the sink() method.
Before any source() or sink() call, the sourcePrepare() and sinkPrepare() methods are called. After all values have been read or written, the sourceCleanup() and sinkCleanup() methods are called.
The *Prepare() methods allow a Scheme to initialize any state necessary - for example, to create a new java.util.regex.Matcher instance for use against all record reads. Conversely, the *Cleanup() methods allow for cleaning up any resources.
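The Matcher example can be sketched with plain Java stand-ins. The method names mirror the lifecycle methods named above, but the signatures are simplified assumptions and do not match the real Scheme methods; RegexSourceSketch, isPrepared() and readAll() are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a scheme's source side; not the Cascading API.
final class RegexSourceSketch {
  private final Pattern pattern = Pattern.compile("(\\w+)=(\\d+)");
  private Matcher matcher; // per-run state, created in prepare and dropped in cleanup

  void sourcePrepare() {
    matcher = pattern.matcher("");
  }

  // Parses one record, reusing the Matcher via reset(); returns null when it does not match.
  String[] source(String record) {
    if (matcher == null) {
      throw new IllegalStateException("sourcePrepare() was not called");
    }
    matcher.reset(record);
    if (!matcher.matches()) {
      return null;
    }
    return new String[] { matcher.group(1), matcher.group(2) };
  }

  void sourceCleanup() {
    matcher = null;
  }

  boolean isPrepared() {
    return matcher != null;
  }

  // Drives the lifecycle: prepare once, source every record, always clean up.
  static List<String[]> readAll(List<String> records) {
    RegexSourceSketch scheme = new RegexSourceSketch();
    List<String[]> parsed = new ArrayList<>();
    scheme.sourcePrepare();
    try {
      for (String record : records) {
        String[] values = scheme.source(record);
        if (values != null) {
          parsed.add(values);
        }
      }
    } finally {
      scheme.sourceCleanup();
    }
    return parsed;
  }
}
```

Creating the Matcher once and calling reset() per record avoids allocating a new Matcher for every read, which is the kind of state the prepare step is suited to hold.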
These methods are always called in the same process space as their associated source() and sink() calls. In the case of the Hadoop platform, this will likely be on the cluster side, unlike calls to *ConfInit(), which will likely be on the client side.
Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.