All custom Tap classes must subclass the cascading.tap.Tap abstract class and implement the required methods. The getIdentifier() method must return a String that uniquely identifies the resource the Tap instance is managing. Any two Tap instances with the same fully-qualified identifier value will be considered equal.

Every Tap is given an opportunity to set any custom properties the underlying platform requires, via the methods sourceConfInit() (for a Tuple source tap) and sinkConfInit() (for a Tuple sink tap). These two methods may be called more than once with new configuration objects, and should be idempotent.
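As a starting point, the following is a minimal sketch of a custom Tap for the Hadoop platform. The class name, the "custom://" identifier, and the property names are hypothetical; the generic parameters Tap<JobConf, RecordReader, OutputCollector> and the method signatures follow the Cascading 2.x Javadoc and should be verified against the version in use.

    import java.io.IOException;

    import cascading.flow.FlowProcess;
    import cascading.scheme.Scheme;
    import cascading.tap.Tap;
    import cascading.tuple.TupleEntryCollector;
    import cascading.tuple.TupleEntryIterator;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.RecordReader;

    public class CustomTap extends Tap<JobConf, RecordReader, OutputCollector> {
        private final String path; // hypothetical resource location

        public CustomTap(Scheme<JobConf, RecordReader, OutputCollector, ?, ?> scheme, String path) {
            super(scheme); // the Scheme performs the actual reading and writing
            this.path = path;
        }

        @Override
        public String getIdentifier() {
            // must uniquely identify the managed resource; two Tap instances
            // with the same fully-qualified identifier are considered equal
            return "custom://" + path;
        }

        @Override
        public void sourceConfInit(FlowProcess<JobConf> flowProcess, JobConf conf) {
            // set any properties the platform needs to read the resource;
            // may be called more than once, so keep this idempotent
            conf.set("custom.input.path", path); // hypothetical property
            super.sourceConfInit(flowProcess, conf); // lets the Scheme configure itself too
        }

        @Override
        public void sinkConfInit(FlowProcess<JobConf> flowProcess, JobConf conf) {
            // same contract on the sink side
            conf.set("custom.output.path", path); // hypothetical property
            super.sinkConfInit(flowProcess, conf);
        }

        @Override
        public TupleEntryIterator openForRead(FlowProcess<JobConf> flowProcess, RecordReader input) throws IOException {
            throw new UnsupportedOperationException("sketched later in this section");
        }

        @Override
        public TupleEntryCollector openForWrite(FlowProcess<JobConf> flowProcess, OutputCollector output) throws IOException {
            throw new UnsupportedOperationException("sketched later in this section");
        }

        // resource management stubs required by the abstract Tap class
        @Override
        public boolean createResource(JobConf conf) throws IOException { return true; }

        @Override
        public boolean deleteResource(JobConf conf) throws IOException { return true; }

        @Override
        public boolean resourceExists(JobConf conf) throws IOException { return true; }

        @Override
        public long getModifiedTime(JobConf conf) throws IOException { return System.currentTimeMillis(); }
    }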
A Tap is always sourced from the openForRead() method via a TupleEntryIterator - i.e., openForRead() is always called in the same process that will read the data. It is up to the Tap to return a TupleEntryIterator that will iterate across the resource, returning a TupleEntry instance (and Tuple instance) for each "record" in the resource. TupleEntryIterator.close() is always called when no more entries will be read. For more on this topic, see TupleEntrySchemeIterator in the Javadoc.
On some platforms, openForRead() is called with a pre-instantiated Input type. Typically this Input type should be used instead of instantiating a new instance of the appropriate type.

In the case of the Hadoop platform, a RecordReader is created by Hadoop and passed to the Tap. This RecordReader is already configured to read data from the current InputSplit.
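For example, continuing the hypothetical CustomTap above, openForRead() on the Hadoop platform might simply wrap the supplied RecordReader in a TupleEntrySchemeIterator. The constructor arguments shown (FlowProcess, Scheme, Input) follow the Cascading 2.x Javadoc and should be confirmed for the version in use.

    // requires: import cascading.tuple.TupleEntrySchemeIterator;
    @Override
    public TupleEntryIterator openForRead(FlowProcess<JobConf> flowProcess, RecordReader input) throws IOException {
        // the RecordReader handed in by Hadoop is already bound to the current
        // InputSplit, so wrap it rather than instantiating a new reader; if
        // input were null, the Tap would need to create its own (not shown)
        return new TupleEntrySchemeIterator(flowProcess, getScheme(), input);
    }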
Similarly, a Tap is always used to sink data from the openForWrite() method via the TupleEntryCollector. Here again, openForWrite() is always called in the process in which data will be written. It is up to the Tap to return a TupleEntryCollector that will accept and store any number of TupleEntry or Tuple instances for each record that is processed or created by a given Flow. TupleEntryCollector.close() is always called when no more entries will be written. See TupleEntrySchemeCollector in the Javadoc.
Again, on some platforms, openForWrite() will be called with a pre-instantiated Output type. Typically this Output type should be used instead of instantiating a new instance of the appropriate type.

In the case of the Hadoop platform, an OutputCollector is created by Hadoop and passed to the Tap. This OutputCollector is already configured to write data to the current resource.
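Mirroring the read side, a sketch of openForWrite() for the hypothetical CustomTap might wrap the supplied OutputCollector in a TupleEntrySchemeCollector. The constructor arguments shown (FlowProcess, Scheme, Output) are an assumption based on the Cascading 2.x Javadoc and should be verified against the version in use.

    // requires: import cascading.tuple.TupleEntrySchemeCollector;
    @Override
    public TupleEntryCollector openForWrite(FlowProcess<JobConf> flowProcess, OutputCollector output) throws IOException {
        // the OutputCollector handed in by Hadoop is already configured to
        // write to the current resource, so wrap it rather than creating one
        return new TupleEntrySchemeCollector(flowProcess, getScheme(), output);
    }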
Both the TupleEntrySchemeIterator and TupleEntrySchemeCollector should be used to hold any state or resources necessary to communicate with any remote services. For example, when connecting to a SQL database, any JDBC drivers should be created in the constructor and cleaned up on close().
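The sketch below illustrates only that lifecycle, independent of the Cascading type signatures: remote resources are acquired when the object is constructed and released in close(). In a real Tap this state would live in a TupleEntrySchemeIterator (or TupleEntrySchemeCollector) subclass; the class name, JDBC URL, and query shown here are hypothetical.

    import java.io.Closeable;
    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    // Not a Cascading class; shows only the acquire-in-constructor,
    // release-in-close() pattern described above.
    public class JdbcRecordSource implements Closeable {
        private final Connection connection;
        private final Statement statement;
        private final ResultSet resultSet;

        public JdbcRecordSource(String jdbcUrl, String query) throws SQLException {
            // acquire all remote resources up front, when the source is opened
            connection = DriverManager.getConnection(jdbcUrl);
            statement = connection.createStatement();
            resultSet = statement.executeQuery(query);
        }

        public ResultSet getResultSet() {
            return resultSet; // records would be pulled from here while iterating
        }

        @Override
        public void close() throws IOException {
            // release everything when no more entries will be read
            try {
                resultSet.close();
                statement.close();
                connection.close();
            } catch (SQLException exception) {
                throw new IOException("could not close JDBC resources", exception);
            }
        }
    }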
Note that the Tap is not responsible for reading or writing data to the Input or Output type. This is delegated to the Scheme passed to the constructor of the Tap. Consequently, the Scheme is responsible for configuring the Input and Output types it will be reading and writing.
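To make that division of labor concrete, the skeleton below shows the Scheme methods a Hadoop-platform Tap delegates to. The class is hypothetical and the record parsing is omitted; the method signatures, and the SourceCall/SinkCall types that carry the Input and Output objects, follow the Cascading 2.x Javadoc and should be checked against the version in use.

    import java.io.IOException;

    import cascading.flow.FlowProcess;
    import cascading.scheme.Scheme;
    import cascading.scheme.SinkCall;
    import cascading.scheme.SourceCall;
    import cascading.tap.Tap;
    import cascading.tuple.Fields;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.RecordReader;

    public class CustomScheme extends Scheme<JobConf, RecordReader, OutputCollector, Object[], Object[]> {

        public CustomScheme(Fields sourceFields, Fields sinkFields) {
            super(sourceFields, sinkFields);
        }

        @Override
        public void sourceConfInit(FlowProcess<JobConf> flowProcess,
                                   Tap<JobConf, RecordReader, OutputCollector> tap, JobConf conf) {
            // configure the Input type here, e.g. the InputFormat to use (omitted)
        }

        @Override
        public void sinkConfInit(FlowProcess<JobConf> flowProcess,
                                 Tap<JobConf, RecordReader, OutputCollector> tap, JobConf conf) {
            // configure the Output type here, e.g. the OutputFormat to use (omitted)
        }

        @Override
        public boolean source(FlowProcess<JobConf> flowProcess,
                              SourceCall<Object[], RecordReader> sourceCall) throws IOException {
            // read the next record from sourceCall.getInput() and populate
            // sourceCall.getIncomingEntry(); return false when no records remain
            return false; // record parsing omitted in this sketch
        }

        @Override
        public void sink(FlowProcess<JobConf> flowProcess,
                         SinkCall<Object[], OutputCollector> sinkCall) throws IOException {
            // write sinkCall.getOutgoingEntry() to sinkCall.getOutput() (omitted)
        }
    }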
Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.