6. Custom Taps and Schemes

6.1 Introduction

Cascading is designed to be easily configured and enhanced by developers. In addition to creating custom Operations, developers can create custom Tap and Scheme classes that let applications connect to external systems or read and write data in proprietary formats.

A Tap represents something physical, like a file or a database table. Accordingly, Tap implementations are responsible for life-cycle issues around the resource they represent, such as testing whether the resource exists or deleting it (for example, dropping a remote SQL table).
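The life-cycle responsibilities described above can be sketched with a small stand-alone class. This is a minimal illustration, not the cascading.tap.Tap API: the class name LocalFileTapSketch and its method names are assumptions chosen for this example, and a local file stands in for the physical resource.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for a tap backed by a local file.
// It does NOT extend or implement any Cascading class.
class LocalFileTapSketch {
    private final Path path;

    LocalFileTapSketch(Path path) {
        this.path = path;
    }

    // Life-cycle test: does the underlying resource exist?
    boolean resourceExists() {
        return Files.exists(path);
    }

    // Life-cycle action: create the resource.
    // Returns false if the resource was already present.
    boolean createResource() throws IOException {
        if (Files.exists(path)) {
            return false;
        }
        Files.createFile(path);
        return true;
    }

    // Life-cycle action: delete the resource.
    // Returns true only if something was actually deleted.
    boolean deleteResource() throws IOException {
        return Files.deleteIfExists(path);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("tap-sketch");
        LocalFileTapSketch tap = new LocalFileTapSketch(dir.resolve("part-00000"));

        System.out.println("exists before create: " + tap.resourceExists());
        tap.createResource();
        System.out.println("exists after create:  " + tap.resourceExists());
        tap.deleteResource();
        System.out.println("exists after delete:  " + tap.resourceExists());

        Files.delete(dir); // clean up the temporary directory
    }
}
```

A tap for a remote system would perform the same three checks against that system instead, for example by issuing a DROP TABLE statement in its delete method.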

A Scheme represents a format or representation, such as a text format for a file or the columns in a table. Schemes convert between the source data's native format and cascading.tuple.Tuple instances.
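The two-way conversion a scheme performs can be sketched as follows. This is a simplified model under stated assumptions, not the cascading.scheme.Scheme API: DelimitedSchemeSketch is a hypothetical class, a delimited text line stands in for the native format, and a List<String> plays the role of cascading.tuple.Tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a scheme that handles delimited text.
// A List<String> plays the role of a tuple in this sketch.
class DelimitedSchemeSketch {
    private final String delimiter;

    DelimitedSchemeSketch(String delimiter) {
        this.delimiter = delimiter;
    }

    // Source direction: native text line -> tuple-like list of field values.
    List<String> source(String line) {
        // Pattern.quote treats the delimiter literally;
        // the -1 limit keeps trailing empty fields.
        String[] fields = line.split(Pattern.quote(delimiter), -1);
        return new ArrayList<>(Arrays.asList(fields));
    }

    // Sink direction: tuple-like list of field values -> native text line.
    String sink(List<String> tuple) {
        return String.join(delimiter, tuple);
    }

    public static void main(String[] args) {
        DelimitedSchemeSketch scheme = new DelimitedSchemeSketch("\t");
        List<String> tuple = scheme.source("2012-01-01\tGET\t/index.html");
        System.out.println(tuple);
        System.out.println(scheme.sink(tuple));
    }
}
```

The point of the sketch is the symmetry: whatever the source direction parses, the sink direction should be able to write back in the same native form.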

Creating custom taps and schemes can be an involved process. When using Cascading in Hadoop mode, it requires some knowledge of Hadoop and the Hadoop FileSystem API. If a flow needs to support a new file system, passing a fully-qualified URL to the Hfs constructor may be sufficient: the Hfs tap looks up a file system based on the URL scheme via the Hadoop FileSystem API. If that is not sufficient, support for the new file system is commonly added by subclassing the cascading.tap.Hfs class.
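The idea of selecting a file system from the URL scheme can be modelled in a few lines. The sketch below is an illustration only and does not call Hadoop: FileSystemLookupSketch and its registry are hypothetical, and the implementation class names are stored as plain strings purely as example values.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of scheme-based file system lookup.
// This is NOT the Hadoop FileSystem API; it only mirrors the idea.
class FileSystemLookupSketch {
    // Maps a URL scheme such as "hdfs" to an implementation name.
    private final Map<String, String> registry = new HashMap<>();

    void register(String scheme, String implementationName) {
        registry.put(scheme, implementationName);
    }

    // Resolves a fully-qualified URL to the registered implementation name.
    String resolve(String url) {
        String scheme = URI.create(url).getScheme();
        if (scheme == null) {
            throw new IllegalArgumentException("not a fully-qualified URL: " + url);
        }
        String implementation = registry.get(scheme);
        if (implementation == null) {
            throw new IllegalArgumentException("no file system registered for scheme: " + scheme);
        }
        return implementation;
    }

    public static void main(String[] args) {
        FileSystemLookupSketch lookup = new FileSystemLookupSketch();
        // Example values only; stored as strings, never loaded.
        lookup.register("hdfs", "org.apache.hadoop.hdfs.DistributedFileSystem");
        lookup.register("file", "org.apache.hadoop.fs.LocalFileSystem");

        System.out.println(lookup.resolve("hdfs://namenode:8020/data/input"));
        System.out.println(lookup.resolve("file:///tmp/input"));
    }
}
```

Under this model, a path without a scheme cannot be resolved, which is why the URL passed to the tap needs to be fully qualified.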

Delegating to the Hadoop FileSystem API is not a strict requirement. However, a developer who does not use it must implement the Hadoop org.apache.hadoop.mapred.InputFormat and/or org.apache.hadoop.mapred.OutputFormat interfaces so that Hadoop knows how to split and handle the incoming and outgoing data. The custom Scheme is responsible for setting the InputFormat and OutputFormat on the JobConf, via the sourceConfInit and sinkConfInit methods, respectively.
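The division of labour between the two init methods can be sketched without Hadoop on the classpath. In the model below, a Map<String, String> stands in for the JobConf, FormatConfSketch is a hypothetical class, the method signatures are simplified relative to the real Scheme methods, and the com.example format names are placeholders. The two property keys are the ones the old mapred API uses for these settings; a real scheme would more likely call JobConf.setInputFormat and JobConf.setOutputFormat than write the keys directly.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a custom scheme's configuration hooks.
// A Map<String, String> plays the role of Hadoop's JobConf here.
class FormatConfSketch {
    static final String INPUT_FORMAT_KEY = "mapred.input.format.class";
    static final String OUTPUT_FORMAT_KEY = "mapred.output.format.class";

    private final String inputFormatClass;
    private final String outputFormatClass;

    FormatConfSketch(String inputFormatClass, String outputFormatClass) {
        this.inputFormatClass = inputFormatClass;
        this.outputFormatClass = outputFormatClass;
    }

    // Called when the scheme is used as a source: registers the InputFormat.
    void sourceConfInit(Map<String, String> conf) {
        conf.put(INPUT_FORMAT_KEY, inputFormatClass);
    }

    // Called when the scheme is used as a sink: registers the OutputFormat.
    void sinkConfInit(Map<String, String> conf) {
        conf.put(OUTPUT_FORMAT_KEY, outputFormatClass);
    }

    public static void main(String[] args) {
        // Placeholder class names, assumed for illustration only.
        FormatConfSketch scheme = new FormatConfSketch(
            "com.example.MyInputFormat", "com.example.MyOutputFormat");

        Map<String, String> sourceConf = new HashMap<>();
        scheme.sourceConfInit(sourceConf);
        System.out.println(sourceConf);

        Map<String, String> sinkConf = new HashMap<>();
        scheme.sinkConfInit(sinkConf);
        System.out.println(sinkConf);
    }
}
```

Keeping the two hooks separate means a scheme used only as a source never touches the output settings, and the reverse holds for a sink.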

For examples of how to implement a custom tap and scheme, see the Cascading Modules page.

Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.