cascading.tap.hadoop
Class ZipInputFormat

java.lang.Object
  extended by org.apache.hadoop.mapred.FileInputFormat<LongWritable,Text>
      extended by cascading.tap.hadoop.ZipInputFormat
All Implemented Interfaces:
InputFormat<LongWritable,Text>, JobConfigurable

public class ZipInputFormat
extends FileInputFormat<LongWritable,Text>
implements JobConfigurable

Class ZipInputFormat is an InputFormat for zip files. Each file within a zip file is broken into lines. Either a line feed or a carriage return signals the end of a line. Keys are the position of each line in the file, and values are the text of the line.

If the underlying FileSystem is HDFS or FILE, each ZipEntry is returned as a unique split. Otherwise this input format returns false for isSplitable, iterates over each ZipEntry in turn, and treats all internal files as the 'same' file.
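
The record model described above (position as key, line of text as value, one stream of lines per ZipEntry) can be sketched with the JDK's java.util.zip classes alone. This is an illustration under stated assumptions, not the Cascading implementation: the class ZipLineSketch, its Record type, and the helper zipOf are hypothetical names, positions are counted as byte offsets that restart at zero for each entry, and a carriage return followed by a line feed is treated as a single line ending.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration; not part of Cascading or Hadoop.
final class ZipLineSketch {

    /** One record: the key is the line's byte offset, the value is its text. */
    static final class Record {
        final String entry;
        final long position;
        final String line;

        Record(String entry, long position, String line) {
            this.entry = entry;
            this.position = position;
            this.line = line;
        }
    }

    /** Reads every non-directory entry and breaks it into lines on LF, CR, or CRLF. */
    static List<Record> readRecords(byte[] zipBytes) throws IOException {
        List<Record> records = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream line = new ByteArrayOutputStream();
                long position = 0;   // number of bytes consumed from this entry
                long lineStart = 0;  // offset at which the current line began
                boolean lastWasCR = false;
                int b;
                while ((b = zip.read()) != -1) {
                    position++;
                    if (b == '\n' && lastWasCR) {
                        // LF completing a CRLF pair: the line was already emitted at the CR.
                        lastWasCR = false;
                        lineStart = position;
                        continue;
                    }
                    lastWasCR = (b == '\r');
                    if (b == '\n' || b == '\r') {
                        records.add(new Record(entry.getName(), lineStart, utf8(line)));
                        line.reset();
                        lineStart = position;
                    } else {
                        line.write(b);
                    }
                }
                if (line.size() > 0) {
                    // Final line without a trailing terminator.
                    records.add(new Record(entry.getName(), lineStart, utf8(line)));
                }
            }
        }
        return records;
    }

    private static String utf8(ByteArrayOutputStream buffer) {
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Builds an in-memory zip from (name, content) pairs, for demonstration only. */
    static byte[] zipOf(String... namesAndContents) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (int i = 0; i + 1 < namesAndContents.length; i += 2) {
                zip.putNextEntry(new ZipEntry(namesAndContents[i]));
                zip.write(namesAndContents[i + 1].getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = zipOf("a.txt", "one\ntwo\r\nthree", "b.txt", "x\ry");
        for (Record r : readRecords(zip)) {
            System.out.println(r.entry + "\t" + r.position + "\t" + r.line);
        }
    }
}
```

In this sketch the entry name is carried alongside each record only to make the output readable; the documented key and value types are LongWritable and Text.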


Field Summary
 
Fields inherited from class org.apache.hadoop.mapred.FileInputFormat
LOG
 
Constructor Summary
ZipInputFormat()
           
 
Method Summary
 void configure(JobConf conf)
           
 RecordReader<LongWritable,Text> getRecordReader(InputSplit genericSplit, JobConf job, Reporter reporter)
           
 InputSplit[] getSplits(JobConf job, int numSplits)
          Splits files returned by listPathsInternal(JobConf).
protected  boolean isAllowSplits(FileSystem fs)
           
protected  boolean isSplitable(FileSystem fs, Path file)
          Return true only if the file is in ZIP format.
protected  Path[] listPathsInternal(JobConf jobConf)
           
protected  FileStatus[] listStatus(JobConf jobConf)
           
 
Methods inherited from class org.apache.hadoop.mapred.FileInputFormat
addInputPath, addInputPaths, computeSplitSize, getBlockIndex, getInputPathFilter, getInputPaths, getSplitHosts, setInputPathFilter, setInputPaths, setInputPaths, setMinSplitSize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ZipInputFormat

public ZipInputFormat()
Method Detail

configure

public void configure(JobConf conf)
Specified by:
configure in interface JobConfigurable

isSplitable

protected boolean isSplitable(FileSystem fs,
                              Path file)
Return true only if the file is in ZIP format.

Overrides:
isSplitable in class FileInputFormat<LongWritable,Text>
Parameters:
fs - the file system that the file is on
file - the path that represents this file
Returns:
is this file splitable?
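
One plausible way to decide whether a file "is in ZIP format" is to try reading its first entry. The sketch below uses only the JDK's ZipInputStream; the class ZipFormatCheckSketch and its methods are hypothetical, and it does not reproduce this class's actual check or its FileSystem handling. It assumes that an archive with no entries should count as not being in ZIP format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration; not part of Cascading or Hadoop.
final class ZipFormatCheckSketch {

    /** Returns true only if at least one zip entry can be read from the stream. */
    static boolean looksLikeZip(InputStream in) {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            // getNextEntry() returns null when no valid entry header is found.
            return zip.getNextEntry() != null;
        } catch (IOException e) {
            // Unreadable or malformed input is treated as "not a zip".
            return false;
        }
    }

    /** Builds a one-entry zip in memory, for demonstration only. */
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("data.txt"));
            zip.write("hello\n".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] text = "plain text, not a zip archive".getBytes(StandardCharsets.UTF_8);
        System.out.println(looksLikeZip(new ByteArrayInputStream(sampleZip())));
        System.out.println(looksLikeZip(new ByteArrayInputStream(text)));
    }
}
```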

listPathsInternal

protected Path[] listPathsInternal(JobConf jobConf)
                            throws IOException
Throws:
IOException

listStatus

protected FileStatus[] listStatus(JobConf jobConf)
                           throws IOException
Overrides:
listStatus in class FileInputFormat<LongWritable,Text>
Throws:
IOException

getSplits

public InputSplit[] getSplits(JobConf job,
                              int numSplits)
                       throws IOException
Splits files returned by listPathsInternal(JobConf). Each file is expected to be in zip format, and each split corresponds to a ZipEntry.

Specified by:
getSplits in interface InputFormat<LongWritable,Text>
Overrides:
getSplits in class FileInputFormat<LongWritable,Text>
Parameters:
job - the JobConf data structure, see JobConf
numSplits - the number of splits requested; ignored here
Throws:
IOException - if input files are not in zip format
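
The "one split per ZipEntry" idea can be illustrated without Hadoop by listing one descriptor per entry of an archive. This is a rough sketch, not the method's implementation: the class ZipEntrySplitSketch is hypothetical, entry names stand in for InputSplit objects, and skipping directory entries is an assumption. As in the contract above, input that yields no zip entries is reported with an IOException.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration; not part of Cascading or Hadoop.
final class ZipEntrySplitSketch {

    /** Returns one descriptor (here, just the entry name) per file entry in the archive. */
    static List<String> entrySplits(byte[] zipBytes) throws IOException {
        List<String> splits = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry = zip.getNextEntry();
            if (entry == null) {
                throw new IOException("no entries found; input is not a readable zip file");
            }
            for (; entry != null; entry = zip.getNextEntry()) {
                // Assumption: directory entries carry no lines, so they get no split.
                if (!entry.isDirectory()) {
                    splits.add(entry.getName());
                }
            }
        }
        return splits;
    }

    /** Builds a two-entry zip in memory, for demonstration only. */
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : new String[] {"a.txt", "b.txt"}) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(("contents of " + name + "\n").getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(entrySplits(sampleZip()));
    }
}
```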

getRecordReader

public RecordReader<LongWritable,Text> getRecordReader(InputSplit genericSplit,
                                                       JobConf job,
                                                       Reporter reporter)
                                                throws IOException
Specified by:
getRecordReader in interface InputFormat<LongWritable,Text>
Specified by:
getRecordReader in class FileInputFormat<LongWritable,Text>
Throws:
IOException

isAllowSplits

protected boolean isAllowSplits(FileSystem fs)


Copyright © 2007-2010 Concurrent, Inc. All Rights Reserved.