6.2 Custom Taps

All custom Tap classes must subclass the cascading.tap.Tap abstract class and implement the required methods. The method getIdentifier() must return a String that uniquely identifies the resource the Tap instance is managing. Any two Tap instances with the same fully-qualified identifier value will be considered equal.

Every Tap is presented an opportunity to set any custom properties the underlying platform requires, via the methods sourceConfInit() (for a Tuple source tap) and sinkConfInit() (for a Tuple sink tap). These two methods may be called more than once with new configuration objects, and should be idempotent.

A Tap is always sourced from the openForRead() method via a TupleEntryIterator - i.e., openForRead() is always called in the same process that will read the data. It is up to the Tap to return a TupleEntryIterator that will iterate across the resource, returning a TupleEntry instance (and Tuple instance) for each "record" in the resource. TupleEntryIterator.close() is always called when no more entries will be read. For more on this topic, see TupleEntrySchemeIterator in the Javadoc.

On some platforms, openForRead() is called with a pre-instantiated Input type. Typically this Input type should be used instead of instantiating a new instance of the appropriate type.

In the case of the Hadoop platform, a RecordReader is created by Hadoop and passed to the Tap. This RecordReader is already configured to read data from the current InputSplit.

Similiarly, a Tap is always used to sink data from the openForWrite() method via the TupleEntryCollector. Here again, openForWrite() is always called in the process in which data will be written. It is up to the Tap to return a TupleEntryCollector that will accept and store any number of TupleEntry or Tuple instances for each record that is processed or created by a given Flow. TupleEntryCollector.close() is always called when no more entries will be written. See TupleEntrySchemeCollector in the Javadoc.

Again, on some platforms, openForWrite() will be called with a pre-instantiated Output type. Typically this Output type should be used instead of instantiating a new instance of the appropriate type.

In the case of the Hadoop platform, an OutputCollector is created by Hadoop and passed to the Tap. This OutputCollector is already configured to to write data to the current resource.

Both the TupleEntrySchemeIterator and TupleEntrySchemeCollector should be used to hold any state or resources necessary to communicate with any remote services. For example, when connecting to a SQL database, any JDBC drivers should be created on the constructor and cleaned up on close().

Note that the Tap is not responsible for reading or writing data to the Input or Output type. This is delegated to the Scheme passed on the constructor of the Tap. Consequently, the Scheme is responsible for configuring the Input and Output types it will be reading and writing.

Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.