6.7 Custom Taps and Schemes

Cascading was designed to be easily configured and enhanced by developers. Besides allowing for custom Operations, developers can provide custom Tap and Scheme types so applications can connect to systems external to Hadoop.

A Tap represents something "physical", like a file or a database table. Consequently, Tap implementations are responsible for life-cycle concerns around the resource they represent, such as testing for its existence or deleting it.
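
To make those life-cycle responsibilities concrete, below is a minimal sketch of the existence and deletion checks a custom Tap would typically delegate to the Hadoop FileSystem API. The class and method names are purely illustrative, not the actual Tap abstract methods (which vary by release); only the FileSystem calls themselves are standard Hadoop.

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    // Illustrative only: the FileSystem calls a custom Tap would make to
    // manage its resource's life cycle. Class and method names are hypothetical.
    public class ResourceLifecycle {
        private final Path path;

        public ResourceLifecycle(String stringPath) {
            this.path = new Path(stringPath);
        }

        /** Test whether the underlying resource currently exists. */
        public boolean resourceExists(JobConf conf) throws IOException {
            FileSystem fileSystem = path.getFileSystem(conf);
            return fileSystem.exists(path);
        }

        /** Delete the underlying resource, recursively if it is a directory. */
        public boolean deleteResource(JobConf conf) throws IOException {
            FileSystem fileSystem = path.getFileSystem(conf);
            return fileSystem.delete(path, true);
        }
    }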

A Scheme represents a format or representation, like a text format for a file, or columns in a table. Schemes are responsible for converting the native format of the resource managed by the Tap to and from a cascading.tuple.Tuple instance.
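
To illustrate that conversion, here is a small sketch that parses a tab-delimited line of text into a Tuple and flattens it back again. The helper class and method names are hypothetical; a real Scheme performs this work inside its read and write hooks rather than in static helpers.

    import org.apache.hadoop.io.Text;

    import cascading.tuple.Tuple;

    // Hypothetical helpers showing the kind of conversion a Scheme is
    // responsible for: native record format <-> cascading.tuple.Tuple.
    public class TabDelimitedConversion {
        /** Parse one native record (a tab-delimited line) into a Tuple. */
        public static Tuple toTuple(Text line) {
            Tuple tuple = new Tuple();

            for (String field : line.toString().split("\t", -1))
                tuple.add(field);

            return tuple;
        }

        /** Flatten a Tuple back into the native record format. */
        public static Text fromTuple(Tuple tuple) {
            StringBuilder buffer = new StringBuilder();

            for (int i = 0; i < tuple.size(); i++) {
                if (i != 0)
                    buffer.append('\t');

                buffer.append(tuple.get(i));
            }

            return new Text(buffer.toString());
        }
    }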

Unfortunately, creating custom Taps and Schemes can be an involved process and requires some knowledge of Hadoop and the Hadoop FileSystem API. Most commonly, the cascading.tap.Hfs class can be subclassed when a new file system must be supported, assuming that passing a fully qualified URL to the Hfs constructor isn't sufficient (the Hfs tap looks up a file system based on the URL scheme via the Hadoop FileSystem API).
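
For example, when the target file system is already registered with Hadoop, no Tap subclass is needed at all; a fully qualified URL is enough. The sketch below assumes an S3 location reachable through a URL scheme Hadoop already recognizes (the scheme, bucket, and path are placeholders):

    import cascading.scheme.TextLine;
    import cascading.tap.Hfs;
    import cascading.tap.Tap;

    public class QualifiedUrlExample {
        public static void main(String[] args) {
            // Hfs resolves the file system from the URL scheme via the Hadoop
            // FileSystem API, so no custom Tap is required in this case.
            // "s3n://example-bucket/input/logs/" is a placeholder URL.
            Tap source = new Hfs(new TextLine(), "s3n://example-bucket/input/logs/");

            System.out.println(source);
        }
    }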

Delegating to the Hadoop FileSystem API is not a strict requirement, but the developer will need to implement a Hadoop org.apache.hadoop.mapred.InputFormat and/or org.apache.hadoop.mapred.OutputFormat so that Hadoop knows how to split and handle the incoming/outgoing data. The custom Scheme is responsible for setting the InputFormat and OutputFormat on the JobConf via the sinkInit and sourceInit methods.
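
The sketch below shows the kind of JobConf calls those init methods typically make. Only the format registration is shown, not a full Scheme subclass, and TextInputFormat/TextOutputFormat merely stand in for whatever formats a custom Scheme would target.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    // Hypothetical helpers: the JobConf calls a custom Scheme would make from
    // its sourceInit and sinkInit methods to register input/output formats.
    public class FormatRegistration {
        /** Typical work done when the Scheme's Tap is used as a source. */
        public static void registerSource(JobConf conf) {
            conf.setInputFormat(TextInputFormat.class);
        }

        /** Typical work done when the Scheme's Tap is used as a sink. */
        public static void registerSink(JobConf conf) {
            conf.setOutputFormat(TextOutputFormat.class);
        }
    }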

For examples of how to implement a custom Tap and Scheme, see the samples on the Cascading Modules page.
