Cascading is designed to be easily configured and enhanced by
developers. In addition to creating custom Operations, developers can
create custom Tap and Scheme classes that let applications connect to
external systems or read and write data in proprietary formats.
A Tap represents something physical, like a file or a database table. Accordingly, Tap implementations are responsible for life-cycle issues around the resource they represent, such as testing for resource existence or deleting the resource (for example, dropping a remote SQL table).
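The life-cycle duties described above can be sketched without any Cascading dependency. The interface and class below (ResourceLifecycle, LocalFileResource) are hypothetical stand-ins invented for illustration, not the Cascading Tap API; a real custom tap would extend Cascading's Tap class and follow its actual method signatures.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical, simplified view of a tap's life-cycle duties.
// This is NOT the Cascading Tap class; it only illustrates the idea.
interface ResourceLifecycle {
  boolean resourceExists();

  boolean createResource();

  boolean deleteResource();
}

// A stand-in "tap" whose physical resource is a single local file.
final class LocalFileResource implements ResourceLifecycle {
  private final Path path;

  LocalFileResource(Path path) {
    this.path = path;
  }

  @Override
  public boolean resourceExists() {
    return Files.exists(path);
  }

  @Override
  public boolean createResource() {
    if (Files.exists(path)) {
      return false; // already present, nothing was created
    }
    try {
      Files.createFile(path);
      return true;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  @Override
  public boolean deleteResource() {
    try {
      // True only if a file was actually removed.
      return Files.deleteIfExists(path);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

For a database-backed resource, the same three methods would instead check for, create, and drop a table.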
A Scheme represents a format or representation, such as a text
format for a file or the columns in a table. Schemes are used to
convert between the source data's native format and a
cascading.tuple.Tuple.
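As a rough illustration of that conversion role, the sketch below turns one delimited text line into a list of field values and back again. DelimitedLineScheme and its source and sink methods are hypothetical names chosen for this example, and a plain List of strings stands in for a tuple; none of this is the Cascading Scheme API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a scheme: converts between a native
// format (one delimited text line) and a tuple-like list of fields.
final class DelimitedLineScheme {
  private final String delimiter;

  DelimitedLineScheme(String delimiter) {
    this.delimiter = delimiter;
  }

  // Source side: native line -> field values.
  List<String> source(String line) {
    // Pattern.quote treats the delimiter literally;
    // the -1 limit keeps trailing empty fields.
    String[] parts = line.split(Pattern.quote(delimiter), -1);
    return new ArrayList<>(Arrays.asList(parts));
  }

  // Sink side: field values -> native line.
  String sink(List<String> fields) {
    return String.join(delimiter, fields);
  }
}
```

This sketch does no escaping, so a field value that itself contains the delimiter would not survive a round trip; a production format would need quoting or escaping rules.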
Creating custom taps and schemes can be an involved process. When
using the Cascading Hadoop mode, it requires some knowledge of Hadoop
and the Hadoop FileSystem API. If a flow needs to support a new file
system, passing a fully-qualified URL to the Hfs
constructor may be sufficient: the
Hfs tap will
look up a file system based on the URL scheme via the Hadoop FileSystem
API. If not, a new system is commonly constructed by subclassing the
Hfs class.
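The idea of selecting a file system from the URL scheme can be sketched as a simple registry lookup. FileSystemRegistry and the implementation names registered with it are hypothetical and exist only for this example; Hadoop's own FileSystem API performs the real resolution and works differently in detail.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical registry that maps a URL scheme to the name of a
// file system implementation. Not Hadoop's FileSystem class.
final class FileSystemRegistry {
  private final Map<String, String> implementations = new HashMap<>();

  void register(String scheme, String implementationName) {
    implementations.put(scheme.toLowerCase(Locale.ROOT), implementationName);
  }

  // Resolves the implementation for a fully-qualified URL.
  String resolve(String url) {
    String scheme = URI.create(url).getScheme();
    if (scheme == null) {
      throw new IllegalArgumentException("not a fully-qualified URL: " + url);
    }
    String implementation = implementations.get(scheme.toLowerCase(Locale.ROOT));
    if (implementation == null) {
      throw new IllegalArgumentException("no file system registered for scheme: " + scheme);
    }
    return implementation;
  }
}
```

With "hdfs" registered, resolving a URL such as hdfs://namenode:8020/logs returns the registered implementation name, while a bare path such as /logs is rejected because it carries no scheme.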
Delegating to the Hadoop FileSystem API is not a strict
requirement. But if it is not used, the developer must implement the Hadoop
org.apache.hadoop.mapred.InputFormat and/or
org.apache.hadoop.mapred.OutputFormat classes so
that Hadoop knows how to split and handle the incoming/outgoing data.
The custom Scheme is responsible for setting the
InputFormat and OutputFormat on the
JobConf, via its source and sink initialization methods.
For examples of how to implement a custom tap and scheme, see the Cascading Modules page.
Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.