|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Object cascading.tap.Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector> cascading.tap.hadoop.Hfs
public class Hfs
Class Hfs is the base class for all Hadoop file system access. Hfs may only be used with the
HadoopFlowConnector
when creating Hadoop executable Flow
instances.
GlobHfs
and less so by Hfs.
Hfs will accept /*
(wildcard) paths, but not all convenience methods like
getSize(org.apache.hadoop.mapred.JobConf)
will behave properly or reliably. Nor can the Hfs instance
with a wildcard path be used as a sink to write data.
In those cases use GlobHfs since it is a sub-class of MultiSourceTap
.
Optionally use Dfs
or Lfs
for resources specific to Hadoop Distributed file system or
the Local file system, respectively. Using Hfs is the best practice when possible, Lfs and Dfs are conveniences.
Use the Hfs class if the 'kind' of resource is unknown at design time. To use, prefix a scheme to the 'stringPath'. Where
hdfs://...
will denote Dfs, and file://...
will denote Lfs.
Call setTemporaryDirectory(java.util.Map, String)
to use a different temporary file directory path
other than the current Hadoop default path.
By default Cascading on Hadoop will assume any source or sink Tap using the file://
URI scheme
intends to read files from the local client filesystem (for example when using the Lfs
Tap) where the Hadoop
job jar is started, Tap so will force any MapReduce jobs reading or writing to file://
resources to run in
Hadoop "standalone mode" so that the file can be read.
To change this behavior, HfsProps.setLocalModeScheme(java.util.Map, String)
to set a different scheme value,
or to "none" to disable entirely for the case the file to be read is available on every Hadoop processing node
in the exact same path.
Hfs can optionally combine multiple small files (or a series of small "blocks") into larger "splits". This reduces
the number of resulting map tasks created by Hadoop and can improve application performance.
This is enabled by calling HfsProps.setUseCombinedInput(boolean)
to true
. By default, merging
or combining splits into large ones is disabled.
Field Summary | |
---|---|
protected String |
stringPath
Field stringPath |
static String |
TEMPORARY_DIRECTORY
Deprecated. see HfsProps.TEMPORARY_DIRECTORY |
Constructor Summary | |
---|---|
protected |
Hfs()
|
|
Hfs(Fields fields,
String stringPath)
Deprecated. |
|
Hfs(Fields fields,
String stringPath,
boolean replace)
Deprecated. |
|
Hfs(Fields fields,
String stringPath,
SinkMode sinkMode)
Deprecated. |
protected |
Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme)
|
|
Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme,
String stringPath)
Constructor Hfs creates a new Hfs instance. |
|
Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme,
String stringPath,
boolean replace)
Deprecated. |
|
Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme,
String stringPath,
SinkMode sinkMode)
Constructor Hfs creates a new Hfs instance. |
Method Summary | |
---|---|
protected void |
applySourceConfInitIdentifiers(FlowProcess<org.apache.hadoop.mapred.JobConf> process,
org.apache.hadoop.mapred.JobConf conf,
String... fullIdentifiers)
|
boolean |
createResource(org.apache.hadoop.mapred.JobConf conf)
Method createResource creates the underlying resource. |
boolean |
deleteResource(org.apache.hadoop.mapred.JobConf conf)
Method deleteResource deletes the resource represented by this instance. |
long |
getBlockSize(org.apache.hadoop.mapred.JobConf conf)
Method getBlockSize returns the blocksize specified by the underlying file system for this resource. |
String[] |
getChildIdentifiers(org.apache.hadoop.mapred.JobConf conf)
Method getChildIdentifiers returns an array of child identifiers if this resource is a directory. |
protected org.apache.hadoop.fs.FileSystem |
getDefaultFileSystem(org.apache.hadoop.mapred.JobConf jobConf)
|
URI |
getDefaultFileSystemURIScheme(org.apache.hadoop.mapred.JobConf jobConf)
Method getDefaultFileSystemURIScheme returns the URI scheme for the default Hadoop FileSystem. |
protected org.apache.hadoop.fs.FileSystem |
getFileSystem(org.apache.hadoop.mapred.JobConf jobConf)
|
String |
getFullIdentifier(org.apache.hadoop.mapred.JobConf conf)
Method getFullIdentifier returns a fully qualified resource identifier. |
String |
getIdentifier()
Method getIdentifier returns a String representing the resource this Tap instance represents. |
protected static String |
getLocalModeScheme(org.apache.hadoop.mapred.JobConf conf,
String defaultValue)
|
long |
getModifiedTime(org.apache.hadoop.mapred.JobConf conf)
Method getModifiedTime returns the date this resource was last modified. |
org.apache.hadoop.fs.Path |
getPath()
|
int |
getReplication(org.apache.hadoop.mapred.JobConf conf)
Method getReplication returns the replication specified by the underlying file system for
this resource. |
long |
getSize(org.apache.hadoop.mapred.JobConf conf)
Method getSize returns the size of the file referenced by this tap. |
static String |
getTemporaryDirectory(Map<Object,Object> properties)
Deprecated. see HfsProps |
static org.apache.hadoop.fs.Path |
getTempPath(org.apache.hadoop.mapred.JobConf conf)
|
URI |
getURIScheme(org.apache.hadoop.mapred.JobConf jobConf)
|
protected static boolean |
getUseCombinedInput(org.apache.hadoop.mapred.JobConf conf)
|
boolean |
isDirectory(org.apache.hadoop.mapred.JobConf conf)
Method isDirectory returns true if the underlying resource represents a directory or folder instead of an individual file. |
protected String |
makeTemporaryPathDirString(String name)
|
protected URI |
makeURIScheme(org.apache.hadoop.mapred.JobConf jobConf)
|
TupleEntryIterator |
openForRead(FlowProcess<org.apache.hadoop.mapred.JobConf> flowProcess,
org.apache.hadoop.mapred.RecordReader input)
Method openForRead opens the resource represented by this Tap instance for reading. |
TupleEntryCollector |
openForWrite(FlowProcess<org.apache.hadoop.mapred.JobConf> flowProcess,
org.apache.hadoop.mapred.OutputCollector output)
Method openForWrite opens the resource represented by this Tap instance for writing. |
boolean |
resourceExists(org.apache.hadoop.mapred.JobConf conf)
Method resourceExists returns true if the path represented by this instance exists. |
protected void |
setStringPath(String stringPath)
|
static void |
setTemporaryDirectory(Map<Object,Object> properties,
String tempDir)
Deprecated. see HfsProps |
protected void |
setUriScheme(URI uriScheme)
|
void |
sinkConfInit(FlowProcess<org.apache.hadoop.mapred.JobConf> process,
org.apache.hadoop.mapred.JobConf conf)
Method sinkConfInit initializes this instance as a sink. |
void |
sourceConfInit(FlowProcess<org.apache.hadoop.mapred.JobConf> process,
org.apache.hadoop.mapred.JobConf conf)
Method sourceConfInit initializes this instance as a source. |
protected void |
sourceConfInitAddInputPath(org.apache.hadoop.mapred.JobConf conf,
org.apache.hadoop.fs.Path qualifiedPath)
|
protected void |
sourceConfInitComplete(FlowProcess<org.apache.hadoop.mapred.JobConf> process,
org.apache.hadoop.mapred.JobConf conf)
|
protected static void |
verifyNoDuplicates(org.apache.hadoop.mapred.JobConf conf)
|
Methods inherited from class java.lang.Object |
---|
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
@Deprecated public static final String TEMPORARY_DIRECTORY
HfsProps.TEMPORARY_DIRECTORY
protected String stringPath
Constructor Detail |
---|
protected Hfs()
@ConstructorProperties(value="scheme") protected Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme)
@Deprecated @ConstructorProperties(value={"fields","stringPath"}) public Hfs(Fields fields, String stringPath)
fields
- of type FieldsstringPath
- of type String@Deprecated @ConstructorProperties(value={"fields","stringPath","replace"}) public Hfs(Fields fields, String stringPath, boolean replace)
fields
- of type FieldsstringPath
- of type Stringreplace
- of type boolean@Deprecated @ConstructorProperties(value={"fields","stringPath","sinkMode"}) public Hfs(Fields fields, String stringPath, SinkMode sinkMode)
fields
- of type FieldsstringPath
- of type StringsinkMode
- of type SinkMode@ConstructorProperties(value={"scheme","stringPath"}) public Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme, String stringPath)
scheme
- of type SchemestringPath
- of type String@Deprecated @ConstructorProperties(value={"scheme","stringPath","replace"}) public Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme, String stringPath, boolean replace)
scheme
- of type SchemestringPath
- of type Stringreplace
- of type boolean@ConstructorProperties(value={"scheme","stringPath","sinkMode"}) public Hfs(Scheme<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,?,?> scheme, String stringPath, SinkMode sinkMode)
scheme
- of type SchemestringPath
- of type StringsinkMode
- of type SinkModeMethod Detail |
---|
@Deprecated public static void setTemporaryDirectory(Map<Object,Object> properties, String tempDir)
HfsProps
properties
- of type Map@Deprecated public static String getTemporaryDirectory(Map<Object,Object> properties)
HfsProps
properties
- of type Mapprotected static String getLocalModeScheme(org.apache.hadoop.mapred.JobConf conf, String defaultValue)
protected static boolean getUseCombinedInput(org.apache.hadoop.mapred.JobConf conf)
protected void setStringPath(String stringPath)
protected void setUriScheme(URI uriScheme)
public URI getURIScheme(org.apache.hadoop.mapred.JobConf jobConf)
protected URI makeURIScheme(org.apache.hadoop.mapred.JobConf jobConf)
public URI getDefaultFileSystemURIScheme(org.apache.hadoop.mapred.JobConf jobConf)
jobConf
- of type JobConf
protected org.apache.hadoop.fs.FileSystem getDefaultFileSystem(org.apache.hadoop.mapred.JobConf jobConf)
protected org.apache.hadoop.fs.FileSystem getFileSystem(org.apache.hadoop.mapred.JobConf jobConf)
public String getIdentifier()
Tap
getIdentifier
in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
public org.apache.hadoop.fs.Path getPath()
public String getFullIdentifier(org.apache.hadoop.mapred.JobConf conf)
Tap
getFullIdentifier
in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
conf
- of type Config
public void sourceConfInit(FlowProcess<org.apache.hadoop.mapred.JobConf> process, org.apache.hadoop.mapred.JobConf conf)
Tap
Flow
instance or if it participates in multiple times in a given Flow or across different Flows in
a Cascade
.
In the context of a Flow, it will be called after
FlowListener.onStarting(cascading.flow.Flow)
Note that no resources or services should be modified by this method.
sourceConfInit
in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
process
- of type FlowProcessconf
- of type Configprotected static void verifyNoDuplicates(org.apache.hadoop.mapred.JobConf conf)
protected void applySourceConfInitIdentifiers(FlowProcess<org.apache.hadoop.mapred.JobConf> process, org.apache.hadoop.mapred.JobConf conf, String... fullIdentifiers)
protected void sourceConfInitAddInputPath(org.apache.hadoop.mapred.JobConf conf, org.apache.hadoop.fs.Path qualifiedPath)
protected void sourceConfInitComplete(FlowProcess<org.apache.hadoop.mapred.JobConf> process, org.apache.hadoop.mapred.JobConf conf)
public void sinkConfInit(FlowProcess<org.apache.hadoop.mapred.JobConf> process, org.apache.hadoop.mapred.JobConf conf)
Tap
Flow
instance or if it participates in multiple times in a given Flow or across different Flows in
a Cascade
.
Note this method will be called in context of this Tap being used as a traditional 'sink' and as a 'trap'.
In the context of a Flow, it will be called after
FlowListener.onStarting(cascading.flow.Flow)
Note that no resources or services should be modified by this method. If this Tap instance returns true for
Tap.isReplace()
, then Tap.deleteResource(Object)
will be called by the parent Flow.
sinkConfInit
in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
process
- of type FlowProcessconf
- of type Configpublic TupleEntryIterator openForRead(FlowProcess<org.apache.hadoop.mapred.JobConf> flowProcess, org.apache.hadoop.mapred.RecordReader input) throws IOException
Tap
input
value may be null, if so, sub-classes must inquire with the underlying Scheme
via Scheme.sourceConfInit(cascading.flow.FlowProcess, Tap, Object)
to get the proper
input type and instantiate it before calling super.openForRead()
.
Note the returned iterator will return the same instance of TupleEntry
on every call,
thus a copy must be made of either the TupleEntry or the underlying Tuple
instance if they are to be
stored in a Collection.
openForRead
in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
flowProcess
- of type FlowProcessinput
- of type Input
IOException
- when the resource cannot be openedpublic TupleEntryCollector openForWrite(FlowProcess<org.apache.hadoop.mapred.JobConf> flowProcess, org.apache.hadoop.mapred.OutputCollector output) throws IOException
Tap
SinkMode
setting. If SinkMode is
SinkMode.REPLACE
, this call may fail. See Tap.openForWrite(cascading.flow.FlowProcess)
.
output
value may be null, if so, sub-classes must inquire with the underlying Scheme
via Scheme.sinkConfInit(cascading.flow.FlowProcess, Tap, Object)
to get the proper
output type and instantiate it before calling super.openForWrite()
.
openForWrite
in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
flowProcess
- of type FlowProcessoutput
- of type Output
IOException
- when the resource cannot be openedpublic boolean createResource(org.apache.hadoop.mapred.JobConf conf) throws IOException
Tap
createResource
in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
conf
- of type Config
IOException
- when there is an error making directoriespublic boolean deleteResource(org.apache.hadoop.mapred.JobConf conf) throws IOException
Tap
deleteResource
in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
conf
- of type Config
IOException
- when the resource cannot be deletedpublic boolean resourceExists(org.apache.hadoop.mapred.JobConf conf) throws IOException
Tap
resourceExists
in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
conf
- of type Config
IOException
- when the status cannot be determinedpublic boolean isDirectory(org.apache.hadoop.mapred.JobConf conf) throws IOException
FileType
isDirectory
in interface FileType<org.apache.hadoop.mapred.JobConf>
conf
- of JobConf
IOException
public long getSize(org.apache.hadoop.mapred.JobConf conf) throws IOException
FileType
getSize
in interface FileType<org.apache.hadoop.mapred.JobConf>
conf
- of type Config
IOException
public long getBlockSize(org.apache.hadoop.mapred.JobConf conf) throws IOException
blocksize
specified by the underlying file system for this resource.
conf
- of JobConf
IOException
- whenpublic int getReplication(org.apache.hadoop.mapred.JobConf conf) throws IOException
replication
specified by the underlying file system for
this resource.
conf
- of JobConf
IOException
- whenpublic String[] getChildIdentifiers(org.apache.hadoop.mapred.JobConf conf) throws IOException
FileType
_log
).
getChildIdentifiers
in interface FileType<org.apache.hadoop.mapred.JobConf>
conf
- of JobConf
IOException
public long getModifiedTime(org.apache.hadoop.mapred.JobConf conf) throws IOException
Tap
getModifiedTime
in class Tap<org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector>
conf
- of type Config
IOException
public static org.apache.hadoop.fs.Path getTempPath(org.apache.hadoop.mapred.JobConf conf)
protected String makeTemporaryPathDirString(String name)
|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |