All custom Tap classes must subclass the
cascading.tap.Tap abstract class and implement
the required methods. The method
getIdentifier() must return a
String that uniquely identifies the resource the
Tap instance is managing. Any two Tap instances with the same
fully-qualified identifier value will be considered equal.
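The identifier-based equality described above can be sketched in plain Java. This is only an illustration under stated assumptions: IdentifiedResource and IdentifierDemo are hypothetical stand-ins, not Cascading classes, and the identifier strings are made up.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in for a Tap; this is NOT cascading.tap.Tap.
class IdentifiedResource {
    private final String identifier;

    IdentifiedResource(String identifier) {
        this.identifier = Objects.requireNonNull(identifier);
    }

    // Mirrors the role of getIdentifier(): one String per managed resource.
    String getIdentifier() {
        return identifier;
    }

    // Two instances are equal when their identifiers are equal.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof IdentifiedResource)) return false;
        return identifier.equals(((IdentifiedResource) other).identifier);
    }

    @Override
    public int hashCode() {
        return identifier.hashCode();
    }
}

class IdentifierDemo {
    // Counts how many distinct resources a list of identifiers refers to.
    static int distinctCount(String... identifiers) {
        Set<IdentifiedResource> seen = new HashSet<>();
        for (String id : identifiers) {
            seen.add(new IdentifiedResource(id));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Made-up identifiers, for illustration only.
        System.out.println(distinctCount("store/a", "store/a", "store/b"));
    }
}
```

The practical consequence is that an identifier should be fully qualified: two instances that point at different resources but return the same String would be treated as the same Tap.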
Every Tap is given an opportunity to set any custom properties
the underlying platform requires, via the methods
sourceConfInit() (for a Tuple source tap) and
sinkConfInit() (for a Tuple sink tap). These
two methods may be called more than once with new configuration objects,
and should be idempotent.
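One way to keep such a method idempotent is to only overwrite keys, never append to them. The sketch below assumes a java.util.Properties object as the configuration; ConfInitSketch and the property key are hypothetical, and the method only imitates the role of sourceConfInit() rather than its real signature.

```java
import java.util.Properties;

// Hypothetical stand-in for a Tap; this is NOT cascading.tap.Tap.
class ConfInitSketch {
    // Hypothetical property key, chosen only for this illustration.
    static final String PATH_KEY = "example.source.path";

    private final String identifier;

    ConfInitSketch(String identifier) {
        this.identifier = identifier;
    }

    // Imitates the role of sourceConfInit(): sets values by overwrite, so
    // repeated calls on the same or on a fresh configuration object always
    // leave it in the same state.
    void sourceConfInit(Properties conf) {
        conf.setProperty(PATH_KEY, identifier);
    }
}
```

A non-idempotent variant would, for example, append the path to a comma-separated list on every call, so that a second call leaves a duplicate entry behind.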
Data is always read from a Tap through the
TupleEntryIterator returned by
openForRead(); that is,
openForRead() is always called in the same
process that will read the data. It is up to the Tap to return a
TupleEntryIterator that will iterate across the
resource, returning a
TupleEntry instance (and
Tuple instance) for each "record" in the resource.
TupleEntryIterator.close() is always
called when no more entries will be read. For more on this topic, see
TupleEntrySchemeIterator in the Javadoc.
On some platforms, openForRead() is
called with a pre-instantiated Input type. Typically this Input type
should be used instead of instantiating a new instance of the
appropriate type.
In the case of the Hadoop platform, a
RecordReader is created by Hadoop and passed to
the Tap. This
RecordReader is already configured
to read data from the current InputSplit.
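The read path described above (use the platform-supplied input when there is one, iterate over the records, then close) can be sketched with plain Java I/O. All names here are hypothetical: LineRecordIterator stands in for a TupleEntryIterator, a java.io.Reader stands in for the Input type, and each line of text plays the part of a record.

```java
import java.io.BufferedReader;
import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a TupleEntryIterator: one String per record.
class LineRecordIterator implements Iterator<String>, Closeable {
    private final BufferedReader reader;
    private String next;

    LineRecordIterator(Reader input) {
        this.reader = new BufferedReader(input);
        this.next = readLine();
    }

    private String readLine() {
        try {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) throw new NoSuchElementException();
        String current = next;
        next = readLine();
        return current;
    }

    // Called once no more records will be read.
    @Override
    public void close() {
        try {
            reader.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class ReadSketch {
    // Imitates openForRead(): prefer the pre-instantiated input and only
    // create a new one (here, from an in-memory String) when none is given.
    static LineRecordIterator openForRead(Reader provided, String resource) {
        Reader input = provided != null ? provided : new StringReader(resource);
        return new LineRecordIterator(input);
    }

    // Reads every record, closing the iterator when done.
    static List<String> readAll(Reader provided, String resource) {
        List<String> records = new ArrayList<>();
        try (LineRecordIterator it = openForRead(provided, resource)) {
            while (it.hasNext()) {
                records.add(it.next());
            }
        }
        return records;
    }
}
```

Passing null as the first argument of readAll() models a platform that supplies no input, in which case the sketch opens the resource itself.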
Similarly, data is always written to a Tap through the
TupleEntryCollector returned by
openForWrite(). Here again,
openForWrite() is always called in the process
in which data will be written. It is up to the Tap to return a
TupleEntryCollector that will accept and store
any number of
Tuple instances for each record that is processed
or created by a given Flow.
TupleEntryCollector.close() is always called
when no more entries will be written. See
TupleEntrySchemeCollector in the Javadoc.
Again, on some platforms, openForWrite()
will be called with a pre-instantiated Output type. Typically this
Output type should be used instead of instantiating a new instance of
the appropriate type.
In the case of the Hadoop platform, an
OutputCollector is created by Hadoop and passed
to the Tap. This
OutputCollector is already
configured to write data to the current resource.
TupleEntrySchemeCollector should be used to hold
any state or resources necessary to communicate with any remote
services. For example, when connecting to a SQL database, any JDBC
drivers should be created in the constructor and cleaned up in
TupleEntrySchemeCollector.close().
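The resource lifecycle for a collector might look roughly like the following sketch, which acquires its connection in the constructor and releases it when closed. FakeConnection and RecordCollector are hypothetical stand-ins (for a remote connection such as JDBC and for a TupleEntrySchemeCollector subclass, respectively); no real database or Cascading API is used.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a remote connection (e.g. a JDBC connection).
class FakeConnection {
    final List<String> stored = new ArrayList<>();
    boolean open = true;

    void send(String record) {
        if (!open) throw new IllegalStateException("connection closed");
        stored.add(record);
    }

    void disconnect() {
        open = false;
    }
}

// Hypothetical stand-in for a collector; NOT TupleEntrySchemeCollector.
class RecordCollector implements Closeable {
    private final FakeConnection connection;

    // The resource is acquired when the collector is constructed...
    RecordCollector() {
        this.connection = new FakeConnection();
    }

    void add(String record) {
        connection.send(record);
    }

    FakeConnection connection() {
        return connection;
    }

    // ...and released when no more entries will be written.
    @Override
    public void close() {
        connection.disconnect();
    }
}
```

Keeping the connection inside the collector, rather than in the Tap itself, ties its lifetime to a single write session in the process that actually writes the data.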
Note that the Tap is not responsible for reading or writing data
to the Input or Output type. This is delegated to the
Scheme passed to the constructor of the
Tap. Consequently, the
Scheme is responsible for configuring the Input
and Output types it will be reading and writing.
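The division of labor between the two can be illustrated with a small plain-Java sketch: one object knows where records live, the other knows how a record is encoded. RecordScheme, CommaScheme, and InMemoryTap are hypothetical names invented for this example and do not reflect the actual Scheme or Tap signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Scheme: knows the record format only.
interface RecordScheme {
    String[] parse(String rawRecord);

    String format(String[] fields);
}

// One possible format: comma-separated fields.
class CommaScheme implements RecordScheme {
    @Override
    public String[] parse(String rawRecord) {
        return rawRecord.split(",", -1);
    }

    @Override
    public String format(String[] fields) {
        return String.join(",", fields);
    }
}

// Hypothetical stand-in for a Tap: owns the resource (here, a list in
// memory) and delegates all encoding and decoding to the scheme it was
// constructed with.
class InMemoryTap {
    private final RecordScheme scheme;
    private final List<String> resource = new ArrayList<>();

    InMemoryTap(RecordScheme scheme) {
        this.scheme = scheme;
    }

    void write(String... fields) {
        resource.add(scheme.format(fields));
    }

    List<String[]> readAll() {
        List<String[]> records = new ArrayList<>();
        for (String raw : resource) {
            records.add(scheme.parse(raw));
        }
        return records;
    }

    List<String> rawRecords() {
        return resource;
    }
}
```

Swapping CommaScheme for another RecordScheme changes the stored format without touching InMemoryTap, which is the point of the separation.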
Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.