public class Hfs extends Tap<Configuration,RecordReader,OutputCollector> implements FileType<Configuration>
FlowConnector
sub-classes when creating Hadoop executable Flow
instances.
Paths typically should point to a directory, where in turn all the "part" files immediately in that directory will
be included. This is the practice Hadoop expects. Sub-directories are not included and typically result in a failure.
To include sub-directories, Hadoop supports "globing". Globing is a frustrating feature and is supported more
robustly by GlobHfs
and less so by Hfs.
Hfs will accept /*
(wildcard) paths, but not all convenience methods like
jobConf.getSize
will behave properly or reliably. Nor can the Hfs instance
with a wildcard path be used as a sink to write data.
In those cases use GlobHfs since it is a sub-class of MultiSourceTap
.
Optionally use Dfs
or Lfs
for resources specific to Hadoop Distributed file system or
the Local file system, respectively. Using Hfs is the best practice when possible, Lfs and Dfs are conveniences.
Use the Hfs class if the 'kind' of resource is unknown at design time. To use, prefix a scheme to the 'stringPath'. Where
hdfs://...
will denote Dfs, and file://...
will denote Lfs.
Call HfsProps.setTemporaryDirectory(java.util.Map, String)
to use a different temporary file directory path
other than the current Hadoop default path.
By default Cascading on Hadoop will assume any source or sink Tap using the file://
URI scheme
intends to read files from the local client filesystem (for example when using the Lfs
Tap) where the Hadoop
job jar is started. Subsequently Cascading will force any MapReduce jobs reading or writing to file://
resources
to run in Hadoop "standalone mode" so that the file can be read.
To change this behavior, HfsProps.setLocalModeScheme(java.util.Map, String)
to set a different scheme value,
or to "none" to disable entirely for the case the file to be read is available on every Hadoop processing node
in the exact same path.
When using a MapReduce planner, Hfs can optionally combine multiple small files (or a series of small "blocks") into
larger "splits". This reduces the number of resulting map tasks created by Hadoop and can improve application
performance.
This is enabled by calling HfsProps.setUseCombinedInput(boolean)
to true
. By default, merging
or combining splits into large ones is disabled.
Apache Tez planner does not require this setting, it is supported by default and enabled by the application manager.Modifier and Type | Field and Description |
---|---|
protected String |
stringPath
Field stringPath
|
Modifier | Constructor and Description |
---|---|
protected |
Hfs() |
protected |
Hfs(Scheme<Configuration,RecordReader,OutputCollector,?,?> scheme) |
|
Hfs(Scheme<Configuration,RecordReader,OutputCollector,?,?> scheme,
String stringPath)
Constructor Hfs creates a new Hfs instance.
|
|
Hfs(Scheme<Configuration,RecordReader,OutputCollector,?,?> scheme,
String stringPath,
SinkMode sinkMode)
Constructor Hfs creates a new Hfs instance.
|
Modifier and Type | Method and Description |
---|---|
protected void |
applySourceConfInitIdentifiers(FlowProcess<? extends Configuration> process,
Configuration conf,
String... fullIdentifiers) |
boolean |
createResource(Configuration conf) |
boolean |
deleteChildResource(Configuration conf,
String childIdentifier) |
boolean |
deleteChildResource(FlowProcess<? extends Configuration> flowProcess,
String childIdentifier) |
boolean |
deleteResource(Configuration conf) |
long |
getBlockSize(Configuration conf)
Method getBlockSize returns the
blocksize specified by the underlying file system for this resource. |
long |
getBlockSize(FlowProcess<? extends Configuration> flowProcess)
Method getBlockSize returns the
blocksize specified by the underlying file system for this resource. |
String[] |
getChildIdentifiers(Configuration conf) |
String[] |
getChildIdentifiers(Configuration conf,
int depth,
boolean fullyQualified) |
String[] |
getChildIdentifiers(FlowProcess<? extends Configuration> flowProcess) |
String[] |
getChildIdentifiers(FlowProcess<? extends Configuration> flowProcess,
int depth,
boolean fullyQualified) |
protected static boolean |
getCombinedInputSafeMode(Configuration conf) |
protected FileSystem |
getDefaultFileSystem(Configuration configuration) |
URI |
getDefaultFileSystemURIScheme(Configuration configuration)
Method getDefaultFileSystemURIScheme returns the URI scheme for the default Hadoop FileSystem.
|
protected FileSystem |
getFileSystem(Configuration configuration) |
String |
getFullIdentifier(Configuration conf) |
String |
getIdentifier() |
protected static String |
getLocalModeScheme(Configuration conf,
String defaultValue) |
long |
getModifiedTime(Configuration conf) |
Path |
getPath() |
int |
getReplication(Configuration conf)
Method getReplication returns the
replication specified by the underlying file system for
this resource. |
int |
getReplication(FlowProcess<? extends Configuration> flowProcess)
Method getReplication returns the
replication specified by the underlying file system for
this resource. |
long |
getSize(Configuration conf) |
long |
getSize(FlowProcess<? extends Configuration> flowProcess) |
static Path |
getTempPath(Configuration conf) |
URI |
getURIScheme(Configuration jobConf) |
protected static boolean |
getUseCombinedInput(Configuration conf) |
boolean |
isDirectory(Configuration conf) |
boolean |
isDirectory(FlowProcess<? extends Configuration> flowProcess) |
protected String |
makeTemporaryPathDirString(String name) |
protected URI |
makeURIScheme(Configuration configuration) |
TupleEntryIterator |
openForRead(FlowProcess<? extends Configuration> flowProcess,
RecordReader input) |
TupleEntryCollector |
openForWrite(FlowProcess<? extends Configuration> flowProcess,
OutputCollector output) |
void |
resetFileStatuses()
Method resetFileStatuses removes the status cache, if any.
|
boolean |
resourceExists(Configuration conf) |
protected void |
setStringPath(String stringPath) |
protected void |
setUriScheme(URI uriScheme) |
void |
sinkConfInit(FlowProcess<? extends Configuration> process,
Configuration conf) |
void |
sourceConfInit(FlowProcess<? extends Configuration> process,
Configuration conf) |
protected void |
sourceConfInitAddInputPath(Configuration conf,
Path qualifiedPath) |
protected void |
sourceConfInitComplete(FlowProcess<? extends Configuration> process,
Configuration conf) |
protected static void |
verifyNoDuplicates(Configuration conf) |
commitResource, createResource, deleteResource, equals, flowConfInit, getConfigDef, getFullIdentifier, getModifiedTime, getNodeConfigDef, getScheme, getSinkFields, getSinkMode, getSourceFields, getStepConfigDef, getTrace, hasConfigDef, hashCode, hasNodeConfigDef, hasStepConfigDef, id, isEquivalentTo, isKeep, isReplace, isSink, isSource, isTemporary, isUpdate, openForRead, openForWrite, outgoingScopeFor, prepareResourceForRead, prepareResourceForWrite, presentSinkFields, presentSourceFields, resolveIncomingOperationArgumentFields, resolveIncomingOperationPassThroughFields, resourceExists, retrieveSinkFields, retrieveSourceFields, rollbackResource, setScheme, taps, toString
protected String stringPath
protected Hfs()
@ConstructorProperties(value="scheme") protected Hfs(Scheme<Configuration,RecordReader,OutputCollector,?,?> scheme)
@ConstructorProperties(value={"scheme","stringPath"}) public Hfs(Scheme<Configuration,RecordReader,OutputCollector,?,?> scheme, String stringPath)
scheme
- of type SchemestringPath
- of type String@ConstructorProperties(value={"scheme","stringPath","sinkMode"}) public Hfs(Scheme<Configuration,RecordReader,OutputCollector,?,?> scheme, String stringPath, SinkMode sinkMode)
scheme
- of type SchemestringPath
- of type StringsinkMode
- of type SinkModeprotected static String getLocalModeScheme(Configuration conf, String defaultValue)
protected static boolean getUseCombinedInput(Configuration conf)
protected static boolean getCombinedInputSafeMode(Configuration conf)
protected void setStringPath(String stringPath)
protected void setUriScheme(URI uriScheme)
public URI getURIScheme(Configuration jobConf)
protected URI makeURIScheme(Configuration configuration)
public URI getDefaultFileSystemURIScheme(Configuration configuration)
configuration
- of type JobConfprotected FileSystem getDefaultFileSystem(Configuration configuration)
protected FileSystem getFileSystem(Configuration configuration)
public String getIdentifier()
getIdentifier
in class Tap<Configuration,RecordReader,OutputCollector>
public String getFullIdentifier(Configuration conf)
getFullIdentifier
in class Tap<Configuration,RecordReader,OutputCollector>
public void sourceConfInit(FlowProcess<? extends Configuration> process, Configuration conf)
sourceConfInit
in class Tap<Configuration,RecordReader,OutputCollector>
protected static void verifyNoDuplicates(Configuration conf)
protected void applySourceConfInitIdentifiers(FlowProcess<? extends Configuration> process, Configuration conf, String... fullIdentifiers)
protected void sourceConfInitAddInputPath(Configuration conf, Path qualifiedPath)
protected void sourceConfInitComplete(FlowProcess<? extends Configuration> process, Configuration conf)
public void sinkConfInit(FlowProcess<? extends Configuration> process, Configuration conf)
sinkConfInit
in class Tap<Configuration,RecordReader,OutputCollector>
public TupleEntryIterator openForRead(FlowProcess<? extends Configuration> flowProcess, RecordReader input) throws IOException
openForRead
in class Tap<Configuration,RecordReader,OutputCollector>
IOException
public TupleEntryCollector openForWrite(FlowProcess<? extends Configuration> flowProcess, OutputCollector output) throws IOException
openForWrite
in class Tap<Configuration,RecordReader,OutputCollector>
IOException
public boolean createResource(Configuration conf) throws IOException
createResource
in class Tap<Configuration,RecordReader,OutputCollector>
IOException
public boolean deleteResource(Configuration conf) throws IOException
deleteResource
in class Tap<Configuration,RecordReader,OutputCollector>
IOException
public boolean deleteChildResource(FlowProcess<? extends Configuration> flowProcess, String childIdentifier) throws IOException
IOException
public boolean deleteChildResource(Configuration conf, String childIdentifier) throws IOException
IOException
public boolean resourceExists(Configuration conf) throws IOException
resourceExists
in class Tap<Configuration,RecordReader,OutputCollector>
IOException
public boolean isDirectory(FlowProcess<? extends Configuration> flowProcess) throws IOException
isDirectory
in interface FileType<Configuration>
IOException
public boolean isDirectory(Configuration conf) throws IOException
isDirectory
in interface FileType<Configuration>
IOException
public long getSize(FlowProcess<? extends Configuration> flowProcess) throws IOException
getSize
in interface FileType<Configuration>
IOException
public long getSize(Configuration conf) throws IOException
getSize
in interface FileType<Configuration>
IOException
public long getBlockSize(FlowProcess<? extends Configuration> flowProcess) throws IOException
blocksize
specified by the underlying file system for this resource.flowProcess
- IOException
- whenpublic long getBlockSize(Configuration conf) throws IOException
blocksize
specified by the underlying file system for this resource.conf
- of JobConfIOException
- whenpublic int getReplication(FlowProcess<? extends Configuration> flowProcess) throws IOException
replication
specified by the underlying file system for
this resource.flowProcess
- IOException
- whenpublic int getReplication(Configuration conf) throws IOException
replication
specified by the underlying file system for
this resource.conf
- of JobConfIOException
- whenpublic String[] getChildIdentifiers(FlowProcess<? extends Configuration> flowProcess) throws IOException
getChildIdentifiers
in interface FileType<Configuration>
IOException
public String[] getChildIdentifiers(Configuration conf) throws IOException
getChildIdentifiers
in interface FileType<Configuration>
IOException
public String[] getChildIdentifiers(FlowProcess<? extends Configuration> flowProcess, int depth, boolean fullyQualified) throws IOException
getChildIdentifiers
in interface FileType<Configuration>
IOException
public String[] getChildIdentifiers(Configuration conf, int depth, boolean fullyQualified) throws IOException
getChildIdentifiers
in interface FileType<Configuration>
IOException
public long getModifiedTime(Configuration conf) throws IOException
getModifiedTime
in class Tap<Configuration,RecordReader,OutputCollector>
IOException
public static Path getTempPath(Configuration conf)
protected String makeTemporaryPathDirString(String name)
public void resetFileStatuses()
Copyright © 2007-2015 Concurrent, Inc. All Rights Reserved.