cascading.tap
Class Tap<Config,Input,Output>

java.lang.Object
  extended by cascading.tap.Tap<Config,Input,Output>
All Implemented Interfaces:
FlowElement, Serializable
Direct Known Subclasses:
FileTap, Hfs, SinkTap, SourceTap

public abstract class Tap<Config,Input,Output>
extends Object
implements FlowElement, Serializable

A Tap represents the physical data source or sink in a connected Flow.

That is, a source Tap is the head end of a connected Pipe and Tuple stream, and a sink Tap is the tail end. Kinds of Tap types are used to manage files from a local disk, distributed disk, remote storage like Amazon S3, or via FTP. It simply abstracts out the complexity of connecting to these types of data sources.

A Tap takes a Scheme instance, which is used to identify the type of resource (text file, binary file, etc). A Tap is responsible for how the resource is reached.

By default when planning a Flow, Tap equality is a function of the getIdentifier() and getScheme() values. That is, two Tap instances are the same Tap instance if they sink/source the same resource and sink/source the same fields.

Some more advanced taps, like a database tap, may need to extend equality to include any filtering, like the where clause in a SQL statement so two taps reading from the same SQL table aren't considered equal.

Taps are also used to determine dependencies between two or more Flow instances when used with a Cascade. In that case the getFullIdentifier(Object) value is used and the Scheme is ignored.

See Also:
Serialized Form

Constructor Summary
protected Tap()
           
protected Tap(Scheme<Config,Input,Output,?,?> scheme)
           
protected Tap(Scheme<Config,Input,Output,?,?> scheme, SinkMode sinkMode)
           
 
Method Summary
 boolean commitResource(Config conf)
          Method commitResource allows the underlying resource to be notified when all write processing is successful so that any additional cleanup or processing may be completed.
abstract  boolean createResource(Config conf)
          Method createResource creates the underlying resource.
abstract  boolean deleteResource(Config conf)
          Method deleteResource deletes the resource represented by this instance.
 boolean equals(Object object)
           
 void flowConfInit(Flow<Config> flow)
          Method flowInit allows this Tap instance to initialize itself in context of the given Flow instance.
 ConfigDef getConfigDef()
          Returns a ConfigDef instance that allows for local properties to be set and made available via a resulting FlowProcess instance when the tap is invoked.
 String getFullIdentifier(Config conf)
          Method getFullIdentifier returns a fully qualified resource identifier.
abstract  String getIdentifier()
          Method getIdentifier returns a String representing the resource this Tap instance represents.
abstract  long getModifiedTime(Config conf)
          Method getModifiedTime returns the date this resource was last modified.
 Scheme<Config,Input,Output,?,?> getScheme()
          Method getScheme returns the scheme of this Tap object.
 Fields getSinkFields()
          Method getSinkFields returns the sinkFields of this Tap object.
 SinkMode getSinkMode()
          Method getSinkMode returns the SinkMode }of this Tap object.
 Fields getSourceFields()
          Method getSourceFields returns the sourceFields of this Tap object.
 ConfigDef getStepConfigDef()
          Returns a ConfigDef instance that allows for process level properties to be set and made available via a resulting FlowProcess instance when the tap is invoked.
 String getTrace()
          Method getTrace return the trace of this object.
 boolean hasConfigDef()
          Returns true if there are properties in the configDef instance.
 int hashCode()
           
 boolean hasStepConfigDef()
          Returns true if there are properties in the processConfigDef instance.
static String id(Tap tap)
          Creates and returns a unique ID for the given Tap, this value is cached and may be used to uniquely identify the Tap instance in properties files etc.
 boolean isEquivalentTo(FlowElement element)
           
 boolean isKeep()
          Method isKeep indicates whether the resource represented by this instance should be kept if it already exists when the Flow is started.
 boolean isReplace()
          Method isReplace indicates whether the resource represented by this instance should be deleted if it already exists when the Flow is started.
 boolean isSink()
          Method isSink returns true if this Tap instance can be used as a sink.
 boolean isSource()
          Method isSource returns true if this Tap instance can be used as a source.
 boolean isTemporary()
          Method isTemporary returns true if this Tap is temporary (used for intermediate results).
 boolean isUpdate()
          Method isUpdate indicates whether the resource represented by this instance should be updated if it already exists.
 TupleEntryIterator openForRead(FlowProcess<Config> flowProcess)
          Method openForRead opens the resource represented by this Tap instance for reading.
abstract  TupleEntryIterator openForRead(FlowProcess<Config> flowProcess, Input input)
          Method openForRead opens the resource represented by this Tap instance for reading.
 TupleEntryCollector openForWrite(FlowProcess<Config> flowProcess)
          Method openForWrite opens the resource represented by this Tap instance for writing.
abstract  TupleEntryCollector openForWrite(FlowProcess<Config> flowProcess, Output output)
          Method openForWrite opens the resource represented by this Tap instance for writing.
 cascading.flow.planner.Scope outgoingScopeFor(Set<cascading.flow.planner.Scope> incomingScopes)
          Method outgoingScopeFor returns the Scope this FlowElement hands off to the next FlowElement.
 void presentSinkFields(FlowProcess<Config> flowProcess, Fields fields)
           
 void presentSourceFields(FlowProcess<Config> flowProcess, Fields fields)
           
 Fields resolveIncomingOperationArgumentFields(cascading.flow.planner.Scope incomingScope)
          Method resolveIncomingOperationArgumentFields returns the Fields outgoing from the previous FlowElement that are consumable by this FlowElement when preparing Operation arguments.
 Fields resolveIncomingOperationPassThroughFields(cascading.flow.planner.Scope incomingScope)
          Method resolveIncomingOperationPassThroughFields returns the Fields outgoing from the previous FlowElement that are consumable by this FlowElement when preparing the Pipe outgoing tuple.
abstract  boolean resourceExists(Config conf)
          Method resourceExists returns true if the path represented by this instance exists.
 Fields retrieveSinkFields(FlowProcess<Config> flowProcess)
          A hook for allowing a Scheme to lazily retrieve its sink fields.
 Fields retrieveSourceFields(FlowProcess<Config> flowProcess)
          A hook for allowing a Scheme to lazily retrieve its source fields.
 boolean rollbackResource(Config conf)
          Method rollbackResource allows the underlying resource to be notified when any write processing has failed or was stopped so that any cleanup may be started.
protected  void setScheme(Scheme<Config,Input,Output,?,?> scheme)
           
 void sinkConfInit(FlowProcess<Config> flowProcess, Config conf)
          Method sinkConfInit initializes this instance as a sink.
 void sourceConfInit(FlowProcess<Config> flowProcess, Config conf)
          Method sourceConfInit initializes this instance as a source.
static Tap[] taps(Tap... taps)
          Convenience function to make an array of Tap instances.
 String toString()
           
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

Tap

protected Tap()

Tap

protected Tap(Scheme<Config,Input,Output,?,?> scheme)

Tap

protected Tap(Scheme<Config,Input,Output,?,?> scheme,
              SinkMode sinkMode)
Method Detail

taps

public static Tap[] taps(Tap... taps)
Convenience function to make an array of Tap instances.

Parameters:
taps - of type Tap
Returns:
Tap array

id

public static String id(Tap tap)
Creates and returns a unique ID for the given Tap, this value is cached and may be used to uniquely identify the Tap instance in properties files etc.

This value is generally reproducible assuming the Tap identifier and the Scheme source and sink Fields remain consistent.

Parameters:
tap - of type Tap
Returns:
of type String

setScheme

protected void setScheme(Scheme<Config,Input,Output,?,?> scheme)

getScheme

public Scheme<Config,Input,Output,?,?> getScheme()
Method getScheme returns the scheme of this Tap object.

Returns:
the scheme (type Scheme) of this Tap object.

getTrace

public String getTrace()
Method getTrace return the trace of this object.

Returns:
String

flowConfInit

public void flowConfInit(Flow<Config> flow)
Method flowInit allows this Tap instance to initialize itself in context of the given Flow instance. This method is guaranteed to be called before the Flow is started and the FlowListener.onStarting(cascading.flow.Flow) event is fired.

This method will be called once per Flow, and before #sourceConfInit(FlowProcess, Config) and #sinkConfInit(FlowProcess, Config) methods.

Parameters:
flow - of type Flow

sourceConfInit

public void sourceConfInit(FlowProcess<Config> flowProcess,
                           Config conf)
Method sourceConfInit initializes this instance as a source.

This method maybe called more than once if this Tap instance is used outside the scope of a Flow instance or if it participates in multiple times in a given Flow or across different Flows in a Cascade.

In the context of a Flow, it will be called after FlowListener.onStarting(cascading.flow.Flow)

Note that no resources or services should be modified by this method.

Parameters:
flowProcess - of type FlowProcess
conf - of type Config

sinkConfInit

public void sinkConfInit(FlowProcess<Config> flowProcess,
                         Config conf)
Method sinkConfInit initializes this instance as a sink.

This method maybe called more than once if this Tap instance is used outside the scope of a Flow instance or if it participates in multiple times in a given Flow or across different Flows in a Cascade.

Note this method will be called in context of this Tap being used as a traditional 'sink' and as a 'trap'.

In the context of a Flow, it will be called after FlowListener.onStarting(cascading.flow.Flow)

Note that no resources or services should be modified by this method. If this Tap instance returns true for isReplace(), then deleteResource(Object) will be called by the parent Flow.

Parameters:
flowProcess - of type FlowProcess
conf - of type Config

getIdentifier

public abstract String getIdentifier()
Method getIdentifier returns a String representing the resource this Tap instance represents.

Often, if the tap accesses a filesystem, the identifier is nothing more than the path to the file or directory. In other cases it may be a an URL or URI representing a connection string or remote resource.

Any two Tap instances having the same value for the identifier are considered equal.

Returns:
String

getSourceFields

public Fields getSourceFields()
Method getSourceFields returns the sourceFields of this Tap object.

Returns:
the sourceFields (type Fields) of this Tap object.

getSinkFields

public Fields getSinkFields()
Method getSinkFields returns the sinkFields of this Tap object.

Returns:
the sinkFields (type Fields) of this Tap object.

openForRead

public abstract TupleEntryIterator openForRead(FlowProcess<Config> flowProcess,
                                               Input input)
                                        throws IOException
Method openForRead opens the resource represented by this Tap instance for reading.

input value may be null, if so, sub-classes must inquire with the underlying Scheme via Scheme.sourceConfInit(cascading.flow.FlowProcess, Tap, Object) to get the proper input type and instantiate it before calling super.openForRead().

Note the returned iterator will return the same instance of TupleEntry on every call, thus a copy must be made of either the TupleEntry or the underlying Tuple instance if they are to be stored in a Collection.

Parameters:
flowProcess - of type FlowProcess
input - of type Input
Returns:
TupleEntryIterator
Throws:
IOException - when the resource cannot be opened

openForRead

public TupleEntryIterator openForRead(FlowProcess<Config> flowProcess)
                               throws IOException
Method openForRead opens the resource represented by this Tap instance for reading.

Note the returned iterator will return the same instance of TupleEntry on every call, thus a copy must be made of either the TupleEntry or the underlying Tuple instance if they are to be stored in a Collection.

Parameters:
flowProcess - of type FlowProcess
Returns:
TupleEntryIterator
Throws:
IOException - when the resource cannot be opened

openForWrite

public abstract TupleEntryCollector openForWrite(FlowProcess<Config> flowProcess,
                                                 Output output)
                                          throws IOException
Method openForWrite opens the resource represented by this Tap instance for writing.

This method is used internally and does not honor the SinkMode setting. If SinkMode is SinkMode.REPLACE, this call may fail. See openForWrite(cascading.flow.FlowProcess).

output value may be null, if so, sub-classes must inquire with the underlying Scheme via Scheme.sinkConfInit(cascading.flow.FlowProcess, Tap, Object) to get the proper output type and instantiate it before calling super.openForWrite().

Parameters:
flowProcess - of type FlowProcess
output - of type Output
Returns:
TupleEntryCollector
Throws:
IOException - when the resource cannot be opened

openForWrite

public TupleEntryCollector openForWrite(FlowProcess<Config> flowProcess)
                                 throws IOException
Method openForWrite opens the resource represented by this Tap instance for writing.

This method is for user application use and does honor the SinkMode.REPLACE settings. That is, if SinkMode is set to SinkMode.REPLACE the underlying resource will be deleted.

Note if SinkMode.UPDATE is set, the resource will not be deleted.

Parameters:
flowProcess - of type FlowProcess
Returns:
TupleEntryCollector
Throws:
IOException - when the resource cannot be opened

outgoingScopeFor

public cascading.flow.planner.Scope outgoingScopeFor(Set<cascading.flow.planner.Scope> incomingScopes)
Description copied from interface: FlowElement
Method outgoingScopeFor returns the Scope this FlowElement hands off to the next FlowElement.

Specified by:
outgoingScopeFor in interface FlowElement
Parameters:
incomingScopes - of type Set
Returns:
Scope

retrieveSourceFields

public Fields retrieveSourceFields(FlowProcess<Config> flowProcess)
A hook for allowing a Scheme to lazily retrieve its source fields.

Parameters:
flowProcess - of type FlowProcess
Returns:
the found Fields

presentSourceFields

public void presentSourceFields(FlowProcess<Config> flowProcess,
                                Fields fields)

retrieveSinkFields

public Fields retrieveSinkFields(FlowProcess<Config> flowProcess)
A hook for allowing a Scheme to lazily retrieve its sink fields.

Parameters:
flowProcess - of type FlowProcess
Returns:
the found Fields

presentSinkFields

public void presentSinkFields(FlowProcess<Config> flowProcess,
                              Fields fields)

resolveIncomingOperationArgumentFields

public Fields resolveIncomingOperationArgumentFields(cascading.flow.planner.Scope incomingScope)
Description copied from interface: FlowElement
Method resolveIncomingOperationArgumentFields returns the Fields outgoing from the previous FlowElement that are consumable by this FlowElement when preparing Operation arguments.

Specified by:
resolveIncomingOperationArgumentFields in interface FlowElement
Parameters:
incomingScope - of type Scope
Returns:
Fields

resolveIncomingOperationPassThroughFields

public Fields resolveIncomingOperationPassThroughFields(cascading.flow.planner.Scope incomingScope)
Description copied from interface: FlowElement
Method resolveIncomingOperationPassThroughFields returns the Fields outgoing from the previous FlowElement that are consumable by this FlowElement when preparing the Pipe outgoing tuple.

Specified by:
resolveIncomingOperationPassThroughFields in interface FlowElement
Parameters:
incomingScope - of type Scope
Returns:
Fields

getFullIdentifier

public String getFullIdentifier(Config conf)
Method getFullIdentifier returns a fully qualified resource identifier.

Parameters:
conf - of type Config
Returns:
String

createResource

public abstract boolean createResource(Config conf)
                                throws IOException
Method createResource creates the underlying resource.

Parameters:
conf - of type Config
Returns:
boolean
Throws:
IOException - when there is an error making directories

deleteResource

public abstract boolean deleteResource(Config conf)
                                throws IOException
Method deleteResource deletes the resource represented by this instance.

Parameters:
conf - of type Config
Returns:
boolean
Throws:
IOException - when the resource cannot be deleted

commitResource

public boolean commitResource(Config conf)
                       throws IOException
Method commitResource allows the underlying resource to be notified when all write processing is successful so that any additional cleanup or processing may be completed.

See rollbackResource(Object) to handle cleanup in the face of failures.

This method is invoked once "client side" and not in the cluster, if any.

If other sink Tap instance in a given Flow fail on commitResource after called on this instance, rollbackResource will not be called.

This is an experimental API and subject to refinement!!

Parameters:
conf - of type Config
Returns:
returns true if successful
Throws:
IOException

rollbackResource

public boolean rollbackResource(Config conf)
                         throws IOException
Method rollbackResource allows the underlying resource to be notified when any write processing has failed or was stopped so that any cleanup may be started.

See commitResource(Object) to handle cleanup when the write has successfully completed.

This method is invoked once "client side" and not in the cluster, if any.

This is an experimental API and subject to refinement!!

Parameters:
conf - of type Config
Returns:
returns true if successful
Throws:
IOException

resourceExists

public abstract boolean resourceExists(Config conf)
                                throws IOException
Method resourceExists returns true if the path represented by this instance exists.

Parameters:
conf - of type Config
Returns:
true if the underlying resource already exists
Throws:
IOException - when the status cannot be determined

getModifiedTime

public abstract long getModifiedTime(Config conf)
                              throws IOException
Method getModifiedTime returns the date this resource was last modified.

Parameters:
conf - of type Config
Returns:
The date this resource was last modified.
Throws:
IOException

getSinkMode

public SinkMode getSinkMode()
Method getSinkMode returns the SinkMode }of this Tap object.

Returns:
the sinkMode (type SinkMode) of this Tap object.

isKeep

public boolean isKeep()
Method isKeep indicates whether the resource represented by this instance should be kept if it already exists when the Flow is started.

Returns:
boolean

isReplace

public boolean isReplace()
Method isReplace indicates whether the resource represented by this instance should be deleted if it already exists when the Flow is started.

Returns:
boolean

isUpdate

public boolean isUpdate()
Method isUpdate indicates whether the resource represented by this instance should be updated if it already exists. Otherwise a new resource will be created, via createResource(Object), when the Flow is started.

Returns:
boolean

isSink

public boolean isSink()
Method isSink returns true if this Tap instance can be used as a sink.

Returns:
boolean

isSource

public boolean isSource()
Method isSource returns true if this Tap instance can be used as a source.

Returns:
boolean

isTemporary

public boolean isTemporary()
Method isTemporary returns true if this Tap is temporary (used for intermediate results).

Returns:
the temporary (type boolean) of this Tap object.

getConfigDef

public ConfigDef getConfigDef()
Returns a ConfigDef instance that allows for local properties to be set and made available via a resulting FlowProcess instance when the tap is invoked.

Any properties set on the configDef will not show up in any Flow or FlowStep process level configuration, but will override any of those values as seen by the current Tap instance method call where a FlowProcess is provided except for the sourceConfInit(cascading.flow.FlowProcess, Object) and sinkConfInit(cascading.flow.FlowProcess, Object) methods.

That is, the *confInit methods are called before any ConfigDef is applied, so any values placed into a ConfigDef instance will not be visible to them.

Specified by:
getConfigDef in interface FlowElement
Returns:
an instance of ConfigDef

hasConfigDef

public boolean hasConfigDef()
Returns true if there are properties in the configDef instance.

Specified by:
hasConfigDef in interface FlowElement
Returns:
true if there are configDef properties

getStepConfigDef

public ConfigDef getStepConfigDef()
Returns a ConfigDef instance that allows for process level properties to be set and made available via a resulting FlowProcess instance when the tap is invoked.

Any properties set on the stepConfigDef will not show up in any Flow configuration, but will show up in the current process FlowStep (in Hadoop the MapReduce jobconf). Any value set in the stepConfigDef will be overridden by the tap local #getConfigDef instance.

Use this method to tweak properties in the process step this tap instance is planned into.

Note the *confInit methods are called before any ConfigDef is applied, so any values placed into a ConfigDef instance will not be visible to them.

Specified by:
getStepConfigDef in interface FlowElement
Returns:
an instance of ConfigDef

hasStepConfigDef

public boolean hasStepConfigDef()
Returns true if there are properties in the processConfigDef instance.

Specified by:
hasStepConfigDef in interface FlowElement
Returns:
true if there are processConfigDef properties

isEquivalentTo

public boolean isEquivalentTo(FlowElement element)
Specified by:
isEquivalentTo in interface FlowElement

equals

public boolean equals(Object object)
Overrides:
equals in class Object

hashCode

public int hashCode()
Overrides:
hashCode in class Object

toString

public String toString()
Overrides:
toString in class Object


Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.