Cascading is designed to be easily configured and enhanced by developers. In addition to creating custom Operations, developers can create custom Tap and Scheme classes that let applications connect to external systems or read and write data in proprietary formats.
A Tap represents something physical, like a file or a database table. Accordingly, Tap implementations are responsible for life-cycle issues around the resource they represent, such as testing for resource existence or performing resource deletion (for example, dropping a remote SQL table).
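As a sketch of where that life-cycle logic lives, the outline below assumes the Cascading 2.x Hadoop APIs; the class name DbTableTap and all method bodies are hypothetical placeholders, not Cascading or Hadoop APIs:

```java
import java.io.IOException;

import cascading.flow.FlowProcess;
import cascading.scheme.Scheme;
import cascading.tap.Tap;
import cascading.tuple.TupleEntryCollector;
import cascading.tuple.TupleEntryIterator;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;

// Hypothetical Tap for a remote database table (sketch only).
public class DbTableTap extends Tap<JobConf, RecordReader, OutputCollector>
  {
  private final String tableName;

  public DbTableTap( Scheme<JobConf, RecordReader, OutputCollector, ?, ?> scheme, String tableName )
    {
    super( scheme );
    this.tableName = tableName;
    }

  @Override
  public String getIdentifier()
    {
    return "db://" + tableName;
    }

  @Override
  public boolean resourceExists( JobConf conf ) throws IOException
    {
    return true; // placeholder: e.g. query the database catalog for the table
    }

  @Override
  public boolean createResource( JobConf conf ) throws IOException
    {
    return true; // placeholder: e.g. issue CREATE TABLE
    }

  @Override
  public boolean deleteResource( JobConf conf ) throws IOException
    {
    return true; // placeholder: e.g. issue DROP TABLE against the remote database
    }

  @Override
  public long getModifiedTime( JobConf conf ) throws IOException
    {
    return System.currentTimeMillis(); // placeholder
    }

  @Override
  public TupleEntryIterator openForRead( FlowProcess<JobConf> flowProcess, RecordReader input ) throws IOException
    {
    // typically wraps the input with a scheme-driven tuple iterator; elided here
    throw new UnsupportedOperationException( "elided in sketch" );
    }

  @Override
  public TupleEntryCollector openForWrite( FlowProcess<JobConf> flowProcess, OutputCollector output ) throws IOException
    {
    // typically wraps the output with a scheme-driven tuple collector; elided here
    throw new UnsupportedOperationException( "elided in sketch" );
    }
  }
```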
A Scheme represents a format or representation, such as a text format for a file or the columns in a table. Schemes are used to convert between the source data's native format and a cascading.tuple.Tuple instance.
Creating custom taps and schemes can be an involved process. When using the Cascading Hadoop mode, it requires some knowledge of Hadoop and the Hadoop FileSystem API. If a flow needs to support a new file system, passing a fully-qualified URL to the Hfs constructor may be sufficient, since the Hfs tap will look up a file system based on the URL scheme via the Hadoop FileSystem API. If not, a new tap is commonly constructed by subclassing the cascading.tap.Hfs class.
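The URL-scheme lookup above keys off the scheme component of the URL. A minimal illustration in plain Java of extracting that component (the class and method names here are hypothetical, not Cascading APIs):

```java
import java.net.URI;

public class SchemeLookup
  {
  // Return the URL scheme that a filesystem registry would dispatch on.
  static String filesystemScheme( String url )
    {
    return URI.create( url ).getScheme();
    }

  public static void main( String[] args )
    {
    System.out.println( filesystemScheme( "s3n://bucket/logs/part-00000" ) ); // prints "s3n"
    System.out.println( filesystemScheme( "hdfs://namenode:8020/data" ) );    // prints "hdfs"
    }
  }
```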
Delegating to the Hadoop FileSystem API is not a strict requirement. But if it is not used, the developer must implement Hadoop org.apache.hadoop.mapred.InputFormat and/or org.apache.hadoop.mapred.OutputFormat classes so that Hadoop knows how to split and handle the incoming and outgoing data.
The custom Scheme is responsible for setting the InputFormat and OutputFormat on the JobConf, via the sinkConfInit and sourceConfInit methods.
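As a sketch, assuming the Cascading 2.x Hadoop APIs and the stock Hadoop text formats, a custom Scheme might set those formats as follows (the class name LinesScheme is hypothetical, and the source/sink bodies are placeholders):

```java
import java.io.IOException;

import cascading.flow.FlowProcess;
import cascading.scheme.Scheme;
import cascading.scheme.SinkCall;
import cascading.scheme.SourceCall;
import cascading.tap.Tap;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

// Hypothetical Scheme that reads and writes lines of text (sketch only).
public class LinesScheme extends Scheme<JobConf, RecordReader, OutputCollector, Object[], Object[]>
  {
  @Override
  public void sourceConfInit( FlowProcess<JobConf> flowProcess,
    Tap<JobConf, RecordReader, OutputCollector> tap, JobConf conf )
    {
    // tell Hadoop how to split and read the incoming data
    conf.setInputFormat( TextInputFormat.class );
    }

  @Override
  public void sinkConfInit( FlowProcess<JobConf> flowProcess,
    Tap<JobConf, RecordReader, OutputCollector> tap, JobConf conf )
    {
    // tell Hadoop how to write the outgoing data
    conf.setOutputFormat( TextOutputFormat.class );
    }

  @Override
  public boolean source( FlowProcess<JobConf> flowProcess,
    SourceCall<Object[], RecordReader> sourceCall ) throws IOException
    {
    // would convert one input record into a Tuple; returning false signals end of input
    return false; // placeholder
    }

  @Override
  public void sink( FlowProcess<JobConf> flowProcess,
    SinkCall<Object[], OutputCollector> sinkCall ) throws IOException
    {
    // would convert the outgoing Tuple into a record for the OutputCollector
    }
  }
```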
For examples of how to implement a custom tap and scheme, see the Cascading Modules page.
Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.