cascading.tap.hadoop
Class Hfs

java.lang.Object
  extended by cascading.tap.Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
      extended by cascading.tap.hadoop.Hfs
All Implemented Interfaces:
FlowElement, Serializable
Direct Known Subclasses:
Dfs, Lfs

public class Hfs
extends Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>

Class Hfs is the base class for all Hadoop file system access. Hfs may only be used with the HadoopFlowConnector when creating Hadoop executable Flow instances.

Optionally use Dfs or Lfs for resources specific to Hadoop Distributed file system or the Local file system, respectively.

Use the Hfs class if the 'kind' of resource is unknown at design time. To use it, prefix a scheme to the 'stringPath': hdfs://... denotes Dfs, and file://... denotes Lfs.
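
The prefix rule can be sketched with the JDK alone. The class below is a hypothetical stand-in, not Cascading's implementation: the names PathKindSketch and kindOf are invented for illustration, and the "default" result for a path without a scheme is an assumption (the text above does not say how such paths are resolved).

```java
import java.net.URI;

// Illustrative stand-in only; this is not how Hfs is implemented.
class PathKindSketch {
    // Maps the scheme prefix of a path to the tap kind named in the text above.
    static String kindOf(String stringPath) {
        String scheme = URI.create(stringPath).getScheme();
        if (scheme == null) {
            // Assumption: with no prefix, the configured default file system applies.
            return "default";
        }
        switch (scheme) {
            case "hdfs":
                return "Dfs";
            case "file":
                return "Lfs";
            default:
                return "other";
        }
    }

    public static void main(String[] args) {
        System.out.println(kindOf("hdfs://namenode:8020/logs/2012")); // Dfs
        System.out.println(kindOf("file:///tmp/logs"));               // Lfs
        System.out.println(kindOf("/user/alice/logs"));               // default
    }
}
```

The host name, port, and paths are placeholders. Note that URI.create rejects strings containing spaces or glob braces, so this sketch only covers plain paths.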

Call setTemporaryDirectory(java.util.Map, String) to use a temporary file directory path other than the current Hadoop default path.

By default, Cascading on Hadoop assumes that any source or sink Tap using the file:// URI scheme intends to read files from the local client filesystem where the Hadoop job jar is started (for example, when using the Lfs Tap), and so forces any MapReduce jobs reading or writing file:// resources to run in Hadoop "local mode" so that the files can be read.

To change this behavior, call setLocalModeScheme(java.util.Map, String) to set a different scheme value, or set it to "none" to disable the behavior entirely, for the case where the file to be read is available on every Hadoop processing node at the exact same path.
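
The local mode rule described above can be modeled in a few lines of plain Java. This is a sketch under assumptions, not Cascading's code: LocalModeSketch and runsInLocalMode are hypothetical names, the comparison is assumed to be an exact match on the URI scheme, and a path without a scheme is assumed never to trigger local mode.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

// Illustrative model of the rule in the text; not taken from Cascading.
class LocalModeSketch {
    // Returns true if any tap path uses the configured local mode scheme.
    // The special value "none" switches the check off entirely.
    static boolean runsInLocalMode(String localModeScheme, List<String> tapPaths) {
        if ("none".equals(localModeScheme)) {
            return false;
        }
        for (String path : tapPaths) {
            String scheme = URI.create(path).getScheme();
            if (localModeScheme.equals(scheme)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> taps = Arrays.asList("hdfs://namenode:8020/in", "file:///tmp/out");
        System.out.println(runsInLocalMode("file", taps)); // true
        System.out.println(runsInLocalMode("none", taps)); // false
    }
}
```

In real use the scheme value is not passed around like this; it is set once through setLocalModeScheme(java.util.Map, String) on a properties map.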

See Also:
Serialized Form

Field Summary
static String LOCAL_MODE_SCHEME
          Field LOCAL_MODE_SCHEME
protected  String stringPath
          Field stringPath
static String TEMPORARY_DIRECTORY
          Field TEMPORARY_DIRECTORY
 
Constructor Summary
protected Hfs()
           
  Hfs(Fields fields, String stringPath)
          Deprecated. 
  Hfs(Fields fields, String stringPath, boolean replace)
          Deprecated. 
  Hfs(Fields fields, String stringPath, SinkMode sinkMode)
          Deprecated. 
protected Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme)
           
  Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme, String stringPath)
          Constructor Hfs creates a new Hfs instance.
  Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme, String stringPath, boolean replace)
          Deprecated. 
  Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme, String stringPath, SinkMode sinkMode)
          Constructor Hfs creates a new Hfs instance.
 
Method Summary
 boolean createResource(org.apache.hadoop.mapred.JobConf conf)
          Method createResource creates the underlying resource.
 boolean deleteResource(org.apache.hadoop.mapred.JobConf conf)
          Method deleteResource deletes the resource represented by this instance.
 long getBlockSize(org.apache.hadoop.mapred.JobConf conf)
          Method getBlockSize returns the blocksize specified by the underlying file system for this resource.
 String[] getChildIdentifiers(org.apache.hadoop.mapred.JobConf conf)
          Method getChildIdentifiers returns an array of child identifiers if this resource is a directory.
protected  org.apache.hadoop.fs.FileSystem getDefaultFileSystem(org.apache.hadoop.mapred.JobConf jobConf)
           
 URI getDefaultFileSystemURIScheme(org.apache.hadoop.mapred.JobConf jobConf)
          Method getDefaultFileSystemURIScheme returns the URI scheme for the default Hadoop FileSystem.
protected  org.apache.hadoop.fs.FileSystem getFileSystem(org.apache.hadoop.mapred.JobConf jobConf)
           
 String getFullIdentifier(org.apache.hadoop.mapred.JobConf conf)
          Method getFullIdentifier returns a fully qualified resource identifier.
 String getIdentifier()
          Method getIdentifier returns a String representing the resource this Tap instance represents.
protected static String getLocalModeScheme(org.apache.hadoop.mapred.JobConf conf, String defaultValue)
           
 long getModifiedTime(org.apache.hadoop.mapred.JobConf conf)
          Method getModifiedTime returns the date this resource was last modified.
 org.apache.hadoop.fs.Path getPath()
           
 int getReplication(org.apache.hadoop.mapred.JobConf conf)
          Method getReplication returns the replication specified by the underlying file system for this resource.
 long getSize(org.apache.hadoop.mapred.JobConf conf)
          Method getSize returns the size of the file referenced by this tap.
static String getTemporaryDirectory(Map<Object,Object> properties)
          Method getTemporaryDirectory returns the configured temporary directory from the given properties object.
static org.apache.hadoop.fs.Path getTempPath(org.apache.hadoop.mapred.JobConf conf)
           
 URI getURIScheme(org.apache.hadoop.mapred.JobConf jobConf)
           
 boolean isDirectory(org.apache.hadoop.mapred.JobConf conf)
          Method isDirectory returns true if the underlying resource represents a directory or folder instead of an individual file.
protected  String makeTemporaryPathDirString(String name)
           
protected  URI makeURIScheme(org.apache.hadoop.mapred.JobConf jobConf)
           
 TupleEntryIterator openForRead(FlowProcess<org.apache.hadoop.mapred.JobConf> flowProcess, org.apache.hadoop.mapred.RecordReader input)
          Method openForRead opens the resource represented by this Tap instance for reading.
 TupleEntryCollector openForWrite(FlowProcess<org.apache.hadoop.mapred.JobConf> flowProcess, org.apache.hadoop.mapred.OutputCollector output)
          Method openForWrite opens the resource represented by this Tap instance for writing.
 boolean resourceExists(org.apache.hadoop.mapred.JobConf conf)
          Method resourceExists returns true if the path represented by this instance exists.
static void setLocalModeScheme(Map<Object,Object> properties, String scheme)
          Method setLocalModeScheme provides a means to change the scheme value used to detect when a MapReduce job should be run in Hadoop local mode.
protected  void setStringPath(String stringPath)
           
static void setTemporaryDirectory(Map<Object,Object> properties, String tempDir)
          Method setTemporaryDirectory sets the temporary directory on the given properties object.
protected  void setUriScheme(URI uriScheme)
           
 void sinkConfInit(FlowProcess<org.apache.hadoop.mapred.JobConf> process, org.apache.hadoop.mapred.JobConf conf)
          Method sinkConfInit initializes this instance as a sink.
 void sourceConfInit(FlowProcess<org.apache.hadoop.mapred.JobConf> process, org.apache.hadoop.mapred.JobConf conf)
          Method sourceConfInit initializes this instance as a source.
 
Methods inherited from class cascading.tap.Tap
commitResource, equals, flowConfInit, getConfigDef, getScheme, getSinkFields, getSinkMode, getSourceFields, getStepConfigDef, getTrace, hasConfigDef, hashCode, hasStepConfigDef, id, isEquivalentTo, isKeep, isReplace, isSink, isSource, isTemporary, isUpdate, openForRead, openForWrite, outgoingScopeFor, presentSinkFields, presentSourceFields, resolveIncomingOperationArgumentFields, resolveIncomingOperationPassThroughFields, retrieveSinkFields, retrieveSourceFields, rollbackResource, setScheme, taps, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

TEMPORARY_DIRECTORY

public static final String TEMPORARY_DIRECTORY
Field TEMPORARY_DIRECTORY

See Also:
Constant Field Values

LOCAL_MODE_SCHEME

public static final String LOCAL_MODE_SCHEME
Field LOCAL_MODE_SCHEME

See Also:
Constant Field Values

stringPath

protected String stringPath
Field stringPath

Constructor Detail

Hfs

protected Hfs()

Hfs

@ConstructorProperties(value="scheme")
protected Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme)

Hfs

@Deprecated
@ConstructorProperties(value={"fields","stringPath"})
public Hfs(Fields fields,
           String stringPath)
Deprecated. 

Constructor Hfs creates a new Hfs instance.

Parameters:
fields - of type Fields
stringPath - of type String

Hfs

@Deprecated
@ConstructorProperties(value={"fields","stringPath","replace"})
public Hfs(Fields fields,
           String stringPath,
           boolean replace)
Deprecated. 

Constructor Hfs creates a new Hfs instance.

Parameters:
fields - of type Fields
stringPath - of type String
replace - of type boolean

Hfs

@Deprecated
@ConstructorProperties(value={"fields","stringPath","sinkMode"})
public Hfs(Fields fields,
           String stringPath,
           SinkMode sinkMode)
Deprecated. 

Constructor Hfs creates a new Hfs instance.

Parameters:
fields - of type Fields
stringPath - of type String
sinkMode - of type SinkMode

Hfs

@ConstructorProperties(value={"scheme","stringPath"})
public Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme,
           String stringPath)
Constructor Hfs creates a new Hfs instance.

Parameters:
scheme - of type Scheme
stringPath - of type String

Hfs

@Deprecated
@ConstructorProperties(value={"scheme","stringPath","replace"})
public Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme,
           String stringPath,
           boolean replace)
Deprecated. 

Constructor Hfs creates a new Hfs instance.

Parameters:
scheme - of type Scheme
stringPath - of type String
replace - of type boolean

Hfs

@ConstructorProperties(value={"scheme","stringPath","sinkMode"})
public Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme,
           String stringPath,
           SinkMode sinkMode)
Constructor Hfs creates a new Hfs instance.

Parameters:
scheme - of type Scheme
stringPath - of type String
sinkMode - of type SinkMode
Method Detail

setTemporaryDirectory

public static void setTemporaryDirectory(Map<Object,Object> properties,
                                         String tempDir)
Method setTemporaryDirectory sets the temporary directory on the given properties object.

Parameters:
properties - of type Map
tempDir - of type String

getTemporaryDirectory

public static String getTemporaryDirectory(Map<Object,Object> properties)
Method getTemporaryDirectory returns the configured temporary directory from the given properties object.

Parameters:
properties - of type Map
Returns:
a String or null if not set

setLocalModeScheme

public static void setLocalModeScheme(Map<Object,Object> properties,
                                      String scheme)
Method setLocalModeScheme provides a means to change the scheme value used to detect when a MapReduce job should be run in Hadoop local mode. By default the value is "file"; set it to "none" to disable this behavior entirely.

Parameters:
properties - of type Map
scheme - a String

getLocalModeScheme

protected static String getLocalModeScheme(org.apache.hadoop.mapred.JobConf conf,
                                           String defaultValue)

setStringPath

protected void setStringPath(String stringPath)

setUriScheme

protected void setUriScheme(URI uriScheme)

getURIScheme

public URI getURIScheme(org.apache.hadoop.mapred.JobConf jobConf)

makeURIScheme

protected URI makeURIScheme(org.apache.hadoop.mapred.JobConf jobConf)

getDefaultFileSystemURIScheme

public URI getDefaultFileSystemURIScheme(org.apache.hadoop.mapred.JobConf jobConf)
Method getDefaultFileSystemURIScheme returns the URI scheme for the default Hadoop FileSystem.

Parameters:
jobConf - of type JobConf
Returns:
URI

getDefaultFileSystem

protected org.apache.hadoop.fs.FileSystem getDefaultFileSystem(org.apache.hadoop.mapred.JobConf jobConf)

getFileSystem

protected org.apache.hadoop.fs.FileSystem getFileSystem(org.apache.hadoop.mapred.JobConf jobConf)

getIdentifier

public String getIdentifier()
Description copied from class: Tap
Method getIdentifier returns a String representing the resource this Tap instance represents.

Often, if the tap accesses a filesystem, the identifier is nothing more than the path to the file or directory. In other cases it may be a URL or URI representing a connection string or remote resource.

Any two Tap instances having the same value for the identifier are considered equal.

Specified by:
getIdentifier in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
Returns:
String

getPath

public org.apache.hadoop.fs.Path getPath()

getFullIdentifier

public String getFullIdentifier(org.apache.hadoop.mapred.JobConf conf)
Description copied from class: Tap
Method getFullIdentifier returns a fully qualified resource identifier.

Overrides:
getFullIdentifier in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
Parameters:
conf - of type Config
Returns:
String

sourceConfInit

public void sourceConfInit(FlowProcess<org.apache.hadoop.mapred.JobConf> process,
                           org.apache.hadoop.mapred.JobConf conf)
Description copied from class: Tap
Method sourceConfInit initializes this instance as a source.

This method may be called more than once if this Tap instance is used outside the scope of a Flow instance, or if it participates multiple times in a given Flow or across different Flows in a Cascade.

In the context of a Flow, it will be called after FlowListener.onStarting(cascading.flow.Flow).

Note that no resources or services should be modified by this method.

Overrides:
sourceConfInit in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
Parameters:
process - of type FlowProcess
conf - of type Config

sinkConfInit

public void sinkConfInit(FlowProcess<org.apache.hadoop.mapred.JobConf> process,
                         org.apache.hadoop.mapred.JobConf conf)
Description copied from class: Tap
Method sinkConfInit initializes this instance as a sink.

This method may be called more than once if this Tap instance is used outside the scope of a Flow instance, or if it participates multiple times in a given Flow or across different Flows in a Cascade.

Note this method will be called in the context of this Tap being used as a traditional 'sink' and as a 'trap'.

In the context of a Flow, it will be called after FlowListener.onStarting(cascading.flow.Flow).

Note that no resources or services should be modified by this method. If this Tap instance returns true for Tap.isReplace(), then Tap.deleteResource(Object) will be called by the parent Flow.

Overrides:
sinkConfInit in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
Parameters:
process - of type FlowProcess
conf - of type Config

openForRead

public TupleEntryIterator openForRead(FlowProcess<org.apache.hadoop.mapred.JobConf> flowProcess,
                                      org.apache.hadoop.mapred.RecordReader input)
                               throws IOException
Description copied from class: Tap
Method openForRead opens the resource represented by this Tap instance for reading.

The input value may be null. If so, subclasses must query the underlying Scheme via Scheme.sourceConfInit(cascading.flow.FlowProcess, Tap, Object) to get the proper input type and instantiate it before calling super.openForRead().

Note the returned iterator will return the same instance of TupleEntry on every call, thus a copy must be made of either the TupleEntry or the underlying Tuple instance if they are to be stored in a Collection.
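
The copy requirement can be demonstrated without Cascading on the classpath. In the sketch below, Entry is a hypothetical stand-in for TupleEntry and reusing() is a hypothetical iterator that hands back one shared instance on every call, as described above. None of these names come from Cascading.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative stand-ins only; Entry plays the role of TupleEntry.
class ReusedEntrySketch {
    static final class Entry {
        String value;
        Entry(String value) { this.value = value; }
        Entry copy() { return new Entry(value); }
    }

    // An iterator that overwrites and returns the same Entry on every next().
    static Iterator<Entry> reusing(List<String> lines) {
        final Iterator<String> source = lines.iterator();
        final Entry shared = new Entry(null);
        return new Iterator<Entry>() {
            @Override public boolean hasNext() { return source.hasNext(); }
            @Override public Entry next() { shared.value = source.next(); return shared; }
            @Override public void remove() { throw new UnsupportedOperationException(); }
        };
    }

    // Stores every entry in a list, optionally copying first, then reads the values back.
    static List<String> collect(List<String> lines, boolean copyFirst) {
        List<Entry> kept = new ArrayList<Entry>();
        Iterator<Entry> iterator = reusing(lines);
        while (iterator.hasNext()) {
            Entry entry = iterator.next();
            kept.add(copyFirst ? entry.copy() : entry);
        }
        List<String> values = new ArrayList<String>();
        for (Entry entry : kept) {
            values.add(entry.value);
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "b", "c");
        System.out.println(collect(lines, false)); // [c, c, c]
        System.out.println(collect(lines, true));  // [a, b, c]
    }
}
```

Without the copy, every stored reference points at the one shared object, so the collection ends up holding only the last value read.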

Specified by:
openForRead in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
Parameters:
flowProcess - of type FlowProcess
input - of type Input
Returns:
TupleEntryIterator
Throws:
IOException - when the resource cannot be opened

openForWrite

public TupleEntryCollector openForWrite(FlowProcess<org.apache.hadoop.mapred.JobConf> flowProcess,
                                        org.apache.hadoop.mapred.OutputCollector output)
                                 throws IOException
Description copied from class: Tap
Method openForWrite opens the resource represented by this Tap instance for writing.

This method is used internally and does not honor the SinkMode setting. If SinkMode is SinkMode.REPLACE, this call may fail. See Tap.openForWrite(cascading.flow.FlowProcess).

The output value may be null. If so, subclasses must query the underlying Scheme via Scheme.sinkConfInit(cascading.flow.FlowProcess, Tap, Object) to get the proper output type and instantiate it before calling super.openForWrite().

Specified by:
openForWrite in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
Parameters:
flowProcess - of type FlowProcess
output - of type Output
Returns:
TupleEntryCollector
Throws:
IOException - when the resource cannot be opened

createResource

public boolean createResource(org.apache.hadoop.mapred.JobConf conf)
                       throws IOException
Description copied from class: Tap
Method createResource creates the underlying resource.

Specified by:
createResource in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
Parameters:
conf - of type Config
Returns:
boolean
Throws:
IOException - when there is an error making directories

deleteResource

public boolean deleteResource(org.apache.hadoop.mapred.JobConf conf)
                       throws IOException
Description copied from class: Tap
Method deleteResource deletes the resource represented by this instance.

Specified by:
deleteResource in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
Parameters:
conf - of type Config
Returns:
boolean
Throws:
IOException - when the resource cannot be deleted

resourceExists

public boolean resourceExists(org.apache.hadoop.mapred.JobConf conf)
                       throws IOException
Description copied from class: Tap
Method resourceExists returns true if the path represented by this instance exists.

Specified by:
resourceExists in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
Parameters:
conf - of type Config
Returns:
true if the underlying resource already exists
Throws:
IOException - when the status cannot be determined

isDirectory

public boolean isDirectory(org.apache.hadoop.mapred.JobConf conf)
                    throws IOException
Method isDirectory returns true if the underlying resource represents a directory or folder instead of an individual file.

Parameters:
conf - of type JobConf
Returns:
boolean
Throws:
IOException - when

getSize

public long getSize(org.apache.hadoop.mapred.JobConf conf)
             throws IOException
Method getSize returns the size of the file referenced by this tap.

Parameters:
conf - of type JobConf
Returns:
The size of the file referenced by this tap.
Throws:
IOException

getBlockSize

public long getBlockSize(org.apache.hadoop.mapred.JobConf conf)
                  throws IOException
Method getBlockSize returns the blocksize specified by the underlying file system for this resource.

Parameters:
conf - of type JobConf
Returns:
long
Throws:
IOException - when

getReplication

public int getReplication(org.apache.hadoop.mapred.JobConf conf)
                   throws IOException
Method getReplication returns the replication specified by the underlying file system for this resource.

Parameters:
conf - of type JobConf
Returns:
int
Throws:
IOException - when

getChildIdentifiers

public String[] getChildIdentifiers(org.apache.hadoop.mapred.JobConf conf)
                             throws IOException
Method getChildIdentifiers returns an array of child identifiers if this resource is a directory.

This method will skip Hadoop log directories (_log).

Parameters:
conf - of type JobConf
Returns:
String[]
Throws:
IOException - when

getModifiedTime

public long getModifiedTime(org.apache.hadoop.mapred.JobConf conf)
                     throws IOException
Description copied from class: Tap
Method getModifiedTime returns the date this resource was last modified.

Specified by:
getModifiedTime in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
Parameters:
conf - of type Config
Returns:
The date this resource was last modified.
Throws:
IOException

getTempPath

public static org.apache.hadoop.fs.Path getTempPath(org.apache.hadoop.mapred.JobConf conf)

makeTemporaryPathDirString

protected String makeTemporaryPathDirString(String name)


Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.