cascading.tap.hadoop
Class PartitionTap

java.lang.Object
  extended by cascading.tap.Tap<Config,Input,Output>
      extended by cascading.tap.partition.BasePartitionTap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
          extended by cascading.tap.hadoop.PartitionTap
All Implemented Interfaces:
FlowElement, Serializable

public class PartitionTap
extends BasePartitionTap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>

Class PartitionTap can be used to write tuple streams out to files and sub-directories based on the values in the current Tuple instance.

The constructor takes a Hfs Tap and a Partition implementation. This allows Tuple values at given positions to be used as directory names during write operations, and directory names as data during read operations.

The key value here is that there is no need to duplicate data values in the directory names and inside the data files.

So only values declared in the parent Tap will be read or written to the underlying file system files. But fields declared by the Partition will only be read or written to the directory names. That is, the PartitionTap instance will sink or source the partition fields, plus the parent Tap fields. The partition fields and parent Tap fields do not need to have common field names.

Note that Hadoop can only sink to directories, and all files in those directories are "part-xxxxx" files.

openWritesThreshold limits the number of open files to be output to. This value defaults to 300 files. Each time the threshold is exceeded, 10% of the least recently used open files will be closed.

PartitionTap will populate a given partition without regard to case of the values being used. Thus the resulting paths 2012/June/ and 2012/june/ will likely result in two open files into the same location. Forcing the case to be consistent with a custom Partition implementation or an upstream Function is recommended, see ExpressionFunction.

Though Hadoop has no mechanism to prevent simultaneous writes to a directory from multiple jobs, it doesn't mean its safe to do so. Same is true with the PartitionTap. Interleaving writes to a common parent (root) directory across multiple flows will very likely lead to data loss.

See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class cascading.tap.partition.BasePartitionTap
BasePartitionTap.Counters, BasePartitionTap.PartitionCollector, BasePartitionTap.PartitionScheme<Config,Input,Output>
 
Field Summary
 
Fields inherited from class cascading.tap.partition.BasePartitionTap
keepParentOnDelete, OPEN_WRITES_THRESHOLD_DEFAULT, openWritesThreshold, parent, partition
 
Constructor Summary
PartitionTap(Hfs parent, Partition partition)
          Constructor PartitionTap creates a new PartitionTap instance using the given parent Hfs Tap as the base path and default Scheme, and the partition.
PartitionTap(Hfs parent, Partition partition, int openWritesThreshold)
          Constructor PartitionTap creates a new PartitionTap instance using the given parent Hfs Tap as the base path and default Scheme, and the partition.
PartitionTap(Hfs parent, Partition partition, SinkMode sinkMode)
          Constructor PartitionTap creates a new PartitionTap instance using the given parent Hfs Tap as the base path and default Scheme, and the partition.
PartitionTap(Hfs parent, Partition partition, SinkMode sinkMode, boolean keepParentOnDelete)
          Constructor PartitionTap creates a new PartitionTap instance using the given parent Hfs Tap as the base path and default Scheme, and the partition.
PartitionTap(Hfs parent, Partition partition, SinkMode sinkMode, boolean keepParentOnDelete, int openWritesThreshold)
          Constructor PartitionTap creates a new PartitionTap instance using the given parent Hfs Tap as the base path and default Scheme, and the partition.
 
Method Summary
protected  TupleEntrySchemeCollector createTupleEntrySchemeCollector(FlowProcess<org.apache.hadoop.mapred.JobConf> flowProcess, Tap parent, String path, long sequence)
           
protected  TupleEntrySchemeIterator createTupleEntrySchemeIterator(FlowProcess<org.apache.hadoop.mapred.JobConf> flowProcess, Tap parent, String path, org.apache.hadoop.mapred.RecordReader recordReader)
           
protected  String getCurrentIdentifier(FlowProcess<org.apache.hadoop.mapred.JobConf> flowProcess)
           
 void sourceConfInit(FlowProcess<org.apache.hadoop.mapred.JobConf> flowProcess, org.apache.hadoop.mapred.JobConf conf)
          Method sourceConfInit initializes this instance as a source.
 
Methods inherited from class cascading.tap.partition.BasePartitionTap
commitResource, createResource, deleteResource, equals, getChildPartitionIdentifiers, getIdentifier, getModifiedTime, getOpenWritesThreshold, getParent, getPartition, hashCode, openForRead, openForWrite, resourceExists, rollbackResource, toString
 
Methods inherited from class cascading.tap.Tap
createResource, deleteResource, flowConfInit, getConfigDef, getFullIdentifier, getFullIdentifier, getModifiedTime, getScheme, getSinkFields, getSinkMode, getSourceFields, getStepConfigDef, getTrace, hasConfigDef, hasStepConfigDef, id, isEquivalentTo, isKeep, isReplace, isSink, isSource, isTemporary, isUpdate, openForRead, openForWrite, outgoingScopeFor, presentSinkFields, presentSourceFields, resolveIncomingOperationArgumentFields, resolveIncomingOperationPassThroughFields, resourceExists, retrieveSinkFields, retrieveSourceFields, setScheme, sinkConfInit, taps
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

PartitionTap

@ConstructorProperties(value={"parent","partition"})
public PartitionTap(Hfs parent,
                                               Partition partition)
Constructor PartitionTap creates a new PartitionTap instance using the given parent Hfs Tap as the base path and default Scheme, and the partition.

Parameters:
parent - of type Tap
partition - of type String

PartitionTap

@ConstructorProperties(value={"parent","partition","openWritesThreshold"})
public PartitionTap(Hfs parent,
                                               Partition partition,
                                               int openWritesThreshold)
Constructor PartitionTap creates a new PartitionTap instance using the given parent Hfs Tap as the base path and default Scheme, and the partition.

openWritesThreshold limits the number of open files to be output to.

Parameters:
parent - of type Hfs
partition - of type String
openWritesThreshold - of type int

PartitionTap

@ConstructorProperties(value={"parent","partition","sinkMode"})
public PartitionTap(Hfs parent,
                                               Partition partition,
                                               SinkMode sinkMode)
Constructor PartitionTap creates a new PartitionTap instance using the given parent Hfs Tap as the base path and default Scheme, and the partition.

Parameters:
parent - of type Tap
partition - of type String
sinkMode - of type SinkMode

PartitionTap

@ConstructorProperties(value={"parent","partition","sinkMode","keepParentOnDelete"})
public PartitionTap(Hfs parent,
                                               Partition partition,
                                               SinkMode sinkMode,
                                               boolean keepParentOnDelete)
Constructor PartitionTap creates a new PartitionTap instance using the given parent Hfs Tap as the base path and default Scheme, and the partition.

keepParentOnDelete, when set to true, prevents the parent Tap from being deleted when BasePartitionTap.deleteResource(Object) is called, typically an issue when used inside a Cascade.

Parameters:
parent - of type Tap
partition - of type String
sinkMode - of type SinkMode
keepParentOnDelete - of type boolean

PartitionTap

@ConstructorProperties(value={"parent","partition","sinkMode","keepParentOnDelete","openWritesThreshold"})
public PartitionTap(Hfs parent,
                                               Partition partition,
                                               SinkMode sinkMode,
                                               boolean keepParentOnDelete,
                                               int openWritesThreshold)
Constructor PartitionTap creates a new PartitionTap instance using the given parent Hfs Tap as the base path and default Scheme, and the partition.

keepParentOnDelete, when set to true, prevents the parent Tap from being deleted when BasePartitionTap.deleteResource(Object) is called, typically an issue when used inside a Cascade.

openWritesThreshold limits the number of open files to be output to.

Parameters:
parent - of type Tap
partition - of type String
sinkMode - of type SinkMode
keepParentOnDelete - of type boolean
openWritesThreshold - of type int
Method Detail

createTupleEntrySchemeCollector

protected TupleEntrySchemeCollector createTupleEntrySchemeCollector(FlowProcess<org.apache.hadoop.mapred.JobConf> flowProcess,
                                                                    Tap parent,
                                                                    String path,
                                                                    long sequence)
                                                             throws IOException
Specified by:
createTupleEntrySchemeCollector in class BasePartitionTap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
Throws:
IOException

createTupleEntrySchemeIterator

protected TupleEntrySchemeIterator createTupleEntrySchemeIterator(FlowProcess<org.apache.hadoop.mapred.JobConf> flowProcess,
                                                                  Tap parent,
                                                                  String path,
                                                                  org.apache.hadoop.mapred.RecordReader recordReader)
                                                           throws IOException
Specified by:
createTupleEntrySchemeIterator in class BasePartitionTap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
Throws:
IOException

getCurrentIdentifier

protected String getCurrentIdentifier(FlowProcess<org.apache.hadoop.mapred.JobConf> flowProcess)
Specified by:
getCurrentIdentifier in class BasePartitionTap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>

sourceConfInit

public void sourceConfInit(FlowProcess<org.apache.hadoop.mapred.JobConf> flowProcess,
                           org.apache.hadoop.mapred.JobConf conf)
Description copied from class: Tap
Method sourceConfInit initializes this instance as a source.

This method maybe called more than once if this Tap instance is used outside the scope of a Flow instance or if it participates in multiple times in a given Flow or across different Flows in a Cascade.

In the context of a Flow, it will be called after FlowListener.onStarting(cascading.flow.Flow)

Note that no resources or services should be modified by this method.

Overrides:
sourceConfInit in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
Parameters:
flowProcess - of type FlowProcess
conf - of type Config


Copyright © 2007-2013 Concurrent, Inc. All Rights Reserved.