Cascading was designed to be easily configured and enhanced by developers. Besides allowing for custom Operations, developers can provide custom Tap and Scheme types so applications can connect to systems external to Hadoop.
A Tap represents something "physical", like a file or a database table. Consequently, Tap implementations are responsible for life-cycle concerns around the resource they represent, such as testing for its existence or deleting it.
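For example, code that wants to replace a Tap's output can first test for and delete the existing resource. The minimal sketch below is written against the Cascading 1.x API; the pathExists and deletePath method names and their JobConf argument are assumptions based on that API and may vary between releases, and the path is hypothetical.

    import cascading.scheme.TextLine;
    import cascading.tap.Hfs;
    import cascading.tap.Tap;
    import org.apache.hadoop.mapred.JobConf;

    public class TapLifecycleExample {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf();

            // an Hfs Tap pointing at a concrete resource (the path is hypothetical)
            Tap logs = new Hfs(new TextLine(), "output/logs");

            // life-cycle responsibilities of a Tap: testing for existence and deleting
            if (logs.pathExists(conf))
                logs.deletePath(conf);
        }
    }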
A Scheme represents a format or representation, like a text format for a file, or columns in a table. Schemes are responsible for converting the Tap-managed resource's proprietary format to and from a cascading.tuple.Tuple instance.
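As a rough illustration of that conversion (the tab-delimited record below is hypothetical), a line of text from a file maps onto a Tuple, and a Tuple maps back onto a line of text:

    import cascading.tuple.Tuple;

    public class TupleConversionExample {
        public static void main(String[] args) {
            // a raw record as it might appear in the underlying resource
            String rawLine = "2008-06-01\t/index.html\t200";

            // a source Scheme parses the native representation into a Tuple...
            Tuple tuple = new Tuple();

            for (String field : rawLine.split("\t"))
                tuple.add(field);

            // ...and a sink Scheme renders the Tuple back into the native form
            String outLine = tuple.getString(0) + "\t" + tuple.getString(1) + "\t" + tuple.getString(2);

            System.out.println(outLine);
        }
    }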
Unfortunately, creating custom Taps and Schemes can be an involved process and requires some knowledge of Hadoop and the Hadoop FileSystem API. Most commonly, the cascading.tap.Hfs class can be subclassed if a new file system is to be supported, assuming passing a fully qualified URL to the Hfs constructor isn't sufficient (the Hfs tap will look up a file system based on the URL scheme via the Hadoop FileSystem API).
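For example (the bucket and paths below are hypothetical), the same Hfs class can address the default file system with a relative path, or a remote store such as Amazon S3 when given a fully qualified URL:

    import cascading.scheme.TextLine;
    import cascading.tap.Hfs;
    import cascading.tap.Tap;

    public class HfsUrlExample {
        public static void main(String[] args) {
            // a relative path resolves against the default file system (typically HDFS)
            Tap defaultFs = new Hfs(new TextLine(), "input/records.txt");

            // a fully qualified URL makes Hfs look up the file system by its scheme,
            // here Amazon S3 via Hadoop's s3n FileSystem implementation
            Tap remoteFs = new Hfs(new TextLine(), "s3n://some-bucket/input/records.txt");
        }
    }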
Delegating to the Hadoop FileSystem API is not a strict requirement, but the developer will need to implement a Hadoop org.apache.hadoop.mapred.InputFormat and/or org.apache.hadoop.mapred.OutputFormat so that Hadoop knows how to split and handle the incoming/outgoing data. The custom Scheme is responsible for setting the InputFormat and OutputFormat on the JobConf via the sinkInit and sourceInit methods.
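The skeleton below outlines such a Scheme for a hypothetical tab-delimited text format. The constructor and the sourceInit, sinkInit, source, and sink signatures approximate the Cascading 1.x Scheme API and may differ slightly between releases; a real implementation would also respect the declared source and sink Fields.

    import java.io.IOException;

    import cascading.scheme.Scheme;
    import cascading.tap.Tap;
    import cascading.tuple.Fields;
    import cascading.tuple.Tuple;
    import cascading.tuple.TupleEntry;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class TabDelimitedScheme extends Scheme {
        public TabDelimitedScheme(Fields fields) {
            super(fields, fields);
        }

        @Override
        public void sourceInit(Tap tap, JobConf conf) throws IOException {
            // tell Hadoop how to split and read the incoming data
            conf.setInputFormat(TextInputFormat.class);
        }

        @Override
        public void sinkInit(Tap tap, JobConf conf) throws IOException {
            // tell Hadoop how to write the outgoing key/value pairs
            conf.setOutputFormat(TextOutputFormat.class);
        }

        @Override
        public Tuple source(Object key, Object value) {
            // convert the resource's native record (a line of text) into a Tuple
            Tuple tuple = new Tuple();

            for (String field : value.toString().split("\t"))
                tuple.add(field);

            return tuple;
        }

        @Override
        public void sink(TupleEntry tupleEntry, OutputCollector outputCollector) throws IOException {
            // convert the Tuple back into the resource's native, tab-delimited form
            Tuple tuple = tupleEntry.getTuple();
            StringBuilder line = new StringBuilder();

            for (int i = 0; i < tuple.size(); i++) {
                if (i != 0)
                    line.append("\t");

                line.append(tuple.getString(i));
            }

            outputCollector.collect(null, new Text(line.toString()));
        }
    }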
For examples of how to implement a custom Tap and Scheme, see the Cascading Modules page.