6.3 Custom Schemes

All custom Scheme classes must subclass the cascading.scheme.Scheme abstract class and implement the required methods.

A Scheme is ultimately responsible for sourcing and sinking Tuples of data. Consequently it must know what Fields it presents during sourcing, and what Fields it accepts during sinking. Thus the source and sink Fields must be passed to the constructors of the base Scheme type.

A Scheme is allowed to source different Fields than it sinks. The TextLine Scheme does just this. (The TextDelimited Scheme, on the other hand, forces the source and sink Fields to be the same.)
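
The relationship between source and sink Fields can be sketched in plain Java. This is a simplified model, not the Cascading API: SketchScheme, LineSketchScheme and DelimitedSketchScheme are hypothetical stand-ins, Fields are modelled as lists of names, and the names "offset" and "line" are chosen only for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a Scheme base type; NOT the Cascading API.
// The constructor is the single place where both field sets are declared.
abstract class SketchScheme {
    private final List<String> sourceFields;
    private final List<String> sinkFields;

    protected SketchScheme(List<String> sourceFields, List<String> sinkFields) {
        this.sourceFields = Collections.unmodifiableList(sourceFields);
        this.sinkFields = Collections.unmodifiableList(sinkFields);
    }

    List<String> getSourceFields() { return sourceFields; }
    List<String> getSinkFields() { return sinkFields; }
}

// A line-oriented scheme that sources more fields than it sinks
// (the field names are assumptions made for this sketch).
final class LineSketchScheme extends SketchScheme {
    LineSketchScheme() {
        super(Arrays.asList("offset", "line"), Arrays.asList("line"));
    }
}

// A delimited scheme that forces source and sink fields to be the same.
final class DelimitedSketchScheme extends SketchScheme {
    DelimitedSketchScheme(List<String> fields) {
        super(fields, fields);
    }
}
```

Passing both field sets through the base constructor means a subclass cannot be created without declaring what it sources and what it sinks.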

The retrieveSourceFields() and retrieveSinkFields() methods allow a custom Scheme to fetch its source and sink Fields immediately before the planner is invoked - for example, from the header of a file, as is the case with TextDelimited. In the other direction, the presentSourceFields() and presentSinkFields() methods notify the Scheme of the Fields that the planner expects the Scheme to handle - for example, so that it can write the field names as a header, as is the case with TextDelimited.
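
The header round trip described above can be illustrated with a small plain-Java sketch. It does not use the Cascading API or reproduce how TextDelimited is implemented: HeaderFieldsSketch and both of its methods are hypothetical, and the input is assumed to be a simple delimited text stream with no quoting or escaping.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; NOT part of Cascading.
final class HeaderFieldsSketch {

    // "Retrieve" direction: derive field names from the first line of the input.
    static List<String> retrieveFieldNames(Reader input, String delimiter) {
        try {
            String header = new BufferedReader(input).readLine();
            if (header == null) {
                throw new IllegalStateException("no header line found");
            }
            // Pattern.quote treats the delimiter literally; -1 keeps trailing empty names.
            return Arrays.asList(header.split(Pattern.quote(delimiter), -1));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // "Present" direction: given the field names the planner resolved,
    // build the header line to write before any records.
    static String presentHeader(List<String> fieldNames, String delimiter) {
        return String.join(delimiter, fieldNames);
    }
}
```

For example, calling retrieveFieldNames on input whose first line is "id,name,age" with "," as the delimiter yields the three names id, name and age.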

Every Scheme is given the opportunity to set any custom properties the underlying platform requires, via the methods sourceConfInit() (for a Tuple source tap) and sinkConfInit() (for a Tuple sink tap). These methods may be called more than once with new configuration objects, and should be idempotent: repeated calls must leave a configuration in the same state as a single call.

On the Hadoop platform, these methods should be used to configure the appropriate org.apache.hadoop.mapred.InputFormat and org.apache.hadoop.mapred.OutputFormat.
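
The idempotency requirement can be sketched as follows. This is a plain-Java model that uses java.util.Properties as a stand-in for the platform configuration object. ConfInitSketch, the property keys and the class names stored in them are all hypothetical, and the real sourceConfInit() signature is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical model of a *ConfInit() method; NOT the Cascading API.
final class ConfInitSketch {
    // Made-up keys, used only for this sketch.
    static final String INPUT_FORMAT_KEY = "sketch.input.format.class";
    static final String SERIALIZATIONS_KEY = "sketch.serializations";

    // Safe to call any number of times on the same or a fresh configuration.
    static void sourceConfInit(Properties conf) {
        // Overwriting a single-valued key is naturally idempotent.
        conf.setProperty(INPUT_FORMAT_KEY, "example.SketchInputFormat");
        // Appending to a list-valued key is not, so guard against duplicates.
        addOnce(conf, SERIALIZATIONS_KEY, "example.SketchSerialization");
    }

    // Adds value to a comma-separated list property only if it is absent.
    static void addOnce(Properties conf, String key, String value) {
        String current = conf.getProperty(key, "");
        List<String> entries = current.isEmpty()
            ? new ArrayList<>()
            : new ArrayList<>(Arrays.asList(current.split(",")));
        if (!entries.contains(value)) {
            entries.add(value);
        }
        conf.setProperty(key, String.join(",", entries));
    }
}
```

The list-valued property is the case most likely to break idempotency, because a naive append grows the value on every call.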

A Scheme is always sourced via the source() method, and is always sunk to via the sink() method.

Prior to a source() or sink() call, the sourcePrepare() and sinkPrepare() methods are called. After all values have been read or written, the sourceCleanup() and sinkCleanup() methods are called.

The *Prepare() methods allow a Scheme to initialize any state necessary - for example, to create a new java.util.regex.Matcher instance for use against all record reads. Conversely, the *Cleanup() methods allow a Scheme to release any resources it holds.
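
The prepare, source and cleanup sequence, including the reusable Matcher mentioned above, might look like the following sketch. It is a plain-Java model, not the Cascading API: RegexSourceSketch and its method signatures are hypothetical, records are modelled as strings from an Iterator, and source() is assumed to return false once the input is exhausted.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical model of the source-side lifecycle; NOT the Cascading API.
final class RegexSourceSketch {
    private final Pattern pattern;
    private Matcher matcher; // state created in sourcePrepare()

    RegexSourceSketch(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    // Called once before the first source() call: build reusable state.
    void sourcePrepare() {
        matcher = pattern.matcher("");
    }

    // Reads the next matching record into tupleOut; false means no more input.
    boolean source(Iterator<String> input, List<String> tupleOut) {
        while (input.hasNext()) {
            matcher.reset(input.next()); // reuse the one Matcher for every record
            if (!matcher.matches()) {
                continue; // skip records that do not match the pattern
            }
            tupleOut.clear();
            for (int i = 1; i <= matcher.groupCount(); i++) {
                tupleOut.add(matcher.group(i));
            }
            return true;
        }
        return false;
    }

    // Called once after all records have been read: drop the state.
    void sourceCleanup() {
        matcher = null;
    }

    // Drives the lifecycle in the order described above.
    static List<List<String>> readAll(String regex, List<String> lines) {
        RegexSourceSketch scheme = new RegexSourceSketch(regex);
        List<List<String>> tuples = new ArrayList<>();
        scheme.sourcePrepare();
        try {
            Iterator<String> it = lines.iterator();
            List<String> tuple = new ArrayList<>();
            while (scheme.source(it, tuple)) {
                tuples.add(new ArrayList<>(tuple)); // copy, since tuple is reused
            }
        } finally {
            scheme.sourceCleanup();
        }
        return tuples;
    }
}
```

Resetting a single Matcher avoids allocating a new one per record, which is the kind of per-task state the *Prepare() methods exist to set up.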

These methods are always called in the same process space as their associated source() and sink() calls. In the case of the Hadoop platform, this will likely be on the cluster side, unlike calls to *ConfInit() which will likely be on the client side.

Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.