cascading.scheme.hadoop
Class TextLine

java.lang.Object
  extended by cascading.scheme.Scheme<JobConf,RecordReader,OutputCollector,Object[],Object[]>
      extended by cascading.scheme.hadoop.TextLine
All Implemented Interfaces:
Serializable
Direct Known Subclasses:
TextDelimited

public class TextLine
extends Scheme<JobConf,RecordReader,OutputCollector,Object[],Object[]>

A TextLine is a type of Scheme for plain text files. Files are broken into lines. Either a line-feed or a carriage-return signals the end of a line.

By default, this scheme returns a Tuple with two fields, "offset" and "line".
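To make the default "offset" and "line" pairing concrete, the following is a minimal plain-Java sketch. It does not use Cascading or Hadoop classes and is not how TextLine reads its input; OffsetLineSketch, OffsetLine, and split are hypothetical names, and treating "\r\n" as a single terminator is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class OffsetLineSketch {

    /** Hypothetical holder for one ("offset", "line") pair; not a Cascading Tuple. */
    static final class OffsetLine {
        final long offset;
        final String line;

        OffsetLine(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    /** Splits raw bytes into lines, recording the byte offset at which each line starts. */
    static List<OffsetLine> split(byte[] data) {
        List<OffsetLine> out = new ArrayList<>();
        int start = 0;
        int i = 0;
        while (i < data.length) {
            byte b = data[i];
            if (b == '\n' || b == '\r') {
                out.add(new OffsetLine(start,
                        new String(data, start, i - start, StandardCharsets.UTF_8)));
                // Assumption: "\r\n" counts as a single line terminator.
                if (b == '\r' && i + 1 < data.length && data[i + 1] == '\n') {
                    i++;
                }
                i++;
                start = i;
            } else {
                i++;
            }
        }
        // Emit a final line that has no trailing terminator.
        if (start < data.length) {
            out.add(new OffsetLine(start,
                    new String(data, start, data.length - start, StandardCharsets.UTF_8)));
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "ab\ncd\r\nef".getBytes(StandardCharsets.UTF_8);
        for (OffsetLine ol : split(data)) {
            System.out.println(ol.offset + "\t" + ol.line);
        }
    }
}
```

Each element of the returned list corresponds to one tuple in the default layout: the first value is the byte position where the line begins, the second is the line text without its terminator.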

Many of the constructors take both "sourceFields" and "sinkFields". sourceFields denotes the field names to be used instead of the names "offset" and "line". sinkFields is a selector and is Fields.ALL by default. Any available field names can be given if only a subset of the incoming fields should be written.

If a Fields instance having only one field is passed to the constructor as sourceFields, the returned tuples will contain only the "line" value, under the given field name.
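The one-field versus two-field behavior can be pictured with a small plain-Java sketch. It uses a Map in place of a Cascading tuple; SourceFieldsSketch and toEntry are hypothetical names, and rejecting any other number of field names is an assumption made for the sketch, not something stated above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SourceFieldsSketch {

    /**
     * Builds a name-to-value view of one input line.
     * One field name: only the line text is kept.
     * Two field names: the first names the offset, the second the line text.
     */
    static Map<String, Object> toEntry(String[] sourceFields, long offset, String line) {
        Map<String, Object> entry = new LinkedHashMap<>();
        if (sourceFields.length == 1) {
            entry.put(sourceFields[0], line);
        } else if (sourceFields.length == 2) {
            entry.put(sourceFields[0], offset);
            entry.put(sourceFields[1], line);
        } else {
            // Assumption: other field counts are treated as a usage error.
            throw new IllegalArgumentException("expected one or two source field names");
        }
        return entry;
    }

    public static void main(String[] args) {
        System.out.println(toEntry(new String[] {"text"}, 42L, "hello"));
        System.out.println(toEntry(new String[] {"pos", "text"}, 42L, "hello"));
    }
}
```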

Note that TextLine will concatenate all the Tuple values for the selected fields with a TAB delimiter before writing out the line.
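The sink-side concatenation described above might be sketched as follows in plain Java, without Cascading types. TabJoinSketch and toLine are hypothetical names, a Map stands in for the incoming tuple entry, and writing an empty string for a null value is an assumption, since the text above does not say how nulls are rendered.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TabJoinSketch {

    /** Joins the values of the selected fields, in selector order, with a TAB. */
    static String toLine(Map<String, Object> entry, List<String> sinkFields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < sinkFields.size(); i++) {
            if (i > 0) {
                sb.append('\t');
            }
            Object value = entry.get(sinkFields.get(i));
            // Assumption: a null value is written as an empty string.
            sb.append(value == null ? "" : value.toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("id", 7);
        entry.put("name", "ada");
        entry.put("score", 9.5);
        // Select only a subset of the incoming fields.
        System.out.println(toLine(entry, Arrays.asList("name", "id")));
    }
}
```

Passing every field name in the selector list corresponds to the Fields.ALL default; passing a subset corresponds to supplying explicit sinkFields.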

Note sink compression is TextLine.Compress.DISABLE by default. If null is passed to the constructor for the compression value, it will remain disabled.
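The null-handling rule for the compression value can be illustrated with a small stand-alone sketch. CompressDefaultSketch and its nested Compress enum are hypothetical stand-ins, not the real TextLine or TextLine.Compress; only DISABLE is named in the text above, so ENABLE is an assumed constant added for contrast.

```java
class CompressDefaultSketch {

    /** Hypothetical stand-in for TextLine.Compress; ENABLE is an assumed constant. */
    enum Compress { ENABLE, DISABLE }

    // Compression starts out disabled.
    private Compress sinkCompression = Compress.DISABLE;

    CompressDefaultSketch(Compress sinkCompression) {
        setSinkCompression(sinkCompression);
    }

    /** Ignores null, so the previously held value (DISABLE by default) is kept. */
    void setSinkCompression(Compress sinkCompression) {
        if (sinkCompression != null) {
            this.sinkCompression = sinkCompression;
        }
    }

    Compress getSinkCompression() {
        return sinkCompression;
    }

    public static void main(String[] args) {
        System.out.println(new CompressDefaultSketch(null).getSinkCompression());
        System.out.println(new CompressDefaultSketch(Compress.ENABLE).getSinkCompression());
    }
}
```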

If any of the input files end with ".zip", an error will be thrown.

See Also:
Serialized Form

Nested Class Summary
static class TextLine.Compress
           
 
Field Summary
static Fields DEFAULT_SOURCE_FIELDS
          Field DEFAULT_SOURCE_FIELDS
 
Constructor Summary
TextLine()
          Creates a new TextLine instance that sources "offset" and "line" fields, and sinks all incoming fields, where "offset" is the byte offset in the input file.
TextLine(Fields sourceFields)
          Creates a new TextLine instance.
TextLine(Fields sourceFields, Fields sinkFields)
          Creates a new TextLine instance.
TextLine(Fields sourceFields, Fields sinkFields, int numSinkParts)
          Creates a new TextLine instance.
TextLine(Fields sourceFields, Fields sinkFields, TextLine.Compress sinkCompression)
          Constructor TextLine creates a new TextLine instance.
TextLine(Fields sourceFields, Fields sinkFields, TextLine.Compress sinkCompression, int numSinkParts)
          Constructor TextLine creates a new TextLine instance.
TextLine(Fields sourceFields, int numSinkParts)
          Creates a new TextLine instance.
TextLine(int numSinkParts)
          Creates a new TextLine instance that sources "offset" and "line" fields, and sinks all incoming fields, where "offset" is the byte offset in the input file.
TextLine(TextLine.Compress sinkCompression)
          Creates a new TextLine instance that sources "offset" and "line" fields, and sinks all incoming fields, where "offset" is the byte offset in the input file.
 
Method Summary
 TextLine.Compress getSinkCompression()
          Method getSinkCompression returns the sinkCompression of this TextLine object.
 void presentSinkFields(FlowProcess<JobConf> flowProcess, Tap tap, Fields fields)
          Method presentSinkFields is called after the planner is invoked and all fields are resolved.
 void presentSourceFields(FlowProcess<JobConf> flowProcess, Tap tap, Fields fields)
          Method presentSourceFields is called after the planner is invoked and all fields are resolved.
 void setSinkCompression(TextLine.Compress sinkCompression)
          Method setSinkCompression sets the sinkCompression of this TextLine object.
 void sink(FlowProcess<JobConf> flowProcess, SinkCall<Object[],OutputCollector> sinkCall)
          Method sink writes out the given Tuple found on SinkCall.getOutgoingEntry() to the SinkCall.getOutput().
 void sinkConfInit(FlowProcess<JobConf> flowProcess, Tap<JobConf,RecordReader,OutputCollector> tap, JobConf conf)
          Method sinkConfInit initializes this instance as a sink.
 boolean source(FlowProcess<JobConf> flowProcess, SourceCall<Object[],RecordReader> sourceCall)
          Method source reads a new "record" or value from SourceCall.getInput(), populates the available Tuple via SourceCall.getIncomingEntry(), and returns true on success or false if no more values are available.
 void sourceCleanup(FlowProcess<JobConf> flowProcess, SourceCall<Object[],RecordReader> sourceCall)
          Method sourceCleanup is used to destroy resources created by Scheme.sourcePrepare(cascading.flow.FlowProcess, SourceCall).
 void sourceConfInit(FlowProcess<JobConf> flowProcess, Tap<JobConf,RecordReader,OutputCollector> tap, JobConf conf)
          Method sourceConfInit initializes this instance as a source.
protected  void sourceHandleInput(SourceCall<Object[],RecordReader> sourceCall)
           
 void sourcePrepare(FlowProcess<JobConf> flowProcess, SourceCall<Object[],RecordReader> sourceCall)
          Method sourcePrepare is used to initialize resources needed during each call of Scheme.source(cascading.flow.FlowProcess, SourceCall).
protected  void verify(Fields sourceFields)
           
 
Methods inherited from class cascading.scheme.Scheme
equals, getNumSinkParts, getSinkFields, getSourceFields, getTrace, hashCode, isSink, isSource, isSymmetrical, presentSinkFieldsInternal, presentSourceFieldsInternal, retrieveSinkFields, retrieveSourceFields, setNumSinkParts, setSinkFields, setSourceFields, sinkCleanup, sinkPrepare, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

DEFAULT_SOURCE_FIELDS

public static final Fields DEFAULT_SOURCE_FIELDS
Field DEFAULT_SOURCE_FIELDS

Constructor Detail

TextLine

public TextLine()
Creates a new TextLine instance that sources "offset" and "line" fields, and sinks all incoming fields, where "offset" is the byte offset in the input file.


TextLine

@ConstructorProperties(value="numSinkParts")
public TextLine(int numSinkParts)
Creates a new TextLine instance that sources "offset" and "line" fields, and sinks all incoming fields, where "offset" is the byte offset in the input file.

Parameters:
numSinkParts - of type int

TextLine

@ConstructorProperties(value="sinkCompression")
public TextLine(TextLine.Compress sinkCompression)
Creates a new TextLine instance that sources "offset" and "line" fields, and sinks all incoming fields, where "offset" is the byte offset in the input file.

Parameters:
sinkCompression - of type Compress

TextLine

@ConstructorProperties(value={"sourceFields","sinkFields"})
public TextLine(Fields sourceFields,
                Fields sinkFields)
Creates a new TextLine instance. If sourceFields has one field, only the text line will be returned in the subsequent tuples.

Parameters:
sourceFields - the source fields for this scheme
sinkFields - the sink fields for this scheme

TextLine

@ConstructorProperties(value={"sourceFields","sinkFields","numSinkParts"})
public TextLine(Fields sourceFields,
                Fields sinkFields,
                int numSinkParts)
Creates a new TextLine instance. If sourceFields has one field, only the text line will be returned in the subsequent tuples.

Parameters:
sourceFields - the source fields for this scheme
sinkFields - the sink fields for this scheme
numSinkParts - of type int

TextLine

@ConstructorProperties(value={"sourceFields","sinkFields","sinkCompression"})
public TextLine(Fields sourceFields,
                Fields sinkFields,
                TextLine.Compress sinkCompression)
Constructor TextLine creates a new TextLine instance. If sourceFields has one field, only the text line will be returned in the subsequent tuples.

Parameters:
sourceFields - of type Fields
sinkFields - of type Fields
sinkCompression - of type Compress

TextLine

@ConstructorProperties(value={"sourceFields","sinkFields","sinkCompression","numSinkParts"})
public TextLine(Fields sourceFields,
                Fields sinkFields,
                TextLine.Compress sinkCompression,
                int numSinkParts)
Constructor TextLine creates a new TextLine instance. If sourceFields has one field, only the text line will be returned in the subsequent tuples.

Parameters:
sourceFields - of type Fields
sinkFields - of type Fields
sinkCompression - of type Compress
numSinkParts - of type int

TextLine

@ConstructorProperties(value="sourceFields")
public TextLine(Fields sourceFields)
Creates a new TextLine instance. If sourceFields has one field, only the text line will be returned in the subsequent tuples.

Parameters:
sourceFields - the source fields for this scheme

TextLine

@ConstructorProperties(value={"sourceFields","numSinkParts"})
public TextLine(Fields sourceFields,
                int numSinkParts)
Creates a new TextLine instance. If sourceFields has one field, only the text line will be returned in the subsequent tuples. The resulting data set will have numSinkParts parts.

Parameters:
sourceFields - the source fields for this scheme
numSinkParts - of type int
Method Detail

verify

protected void verify(Fields sourceFields)

getSinkCompression

public TextLine.Compress getSinkCompression()
Method getSinkCompression returns the sinkCompression of this TextLine object.

Returns:
the sinkCompression (type Compress) of this TextLine object.

setSinkCompression

public void setSinkCompression(TextLine.Compress sinkCompression)
Method setSinkCompression sets the sinkCompression of this TextLine object. If null, compression will remain disabled.

Parameters:
sinkCompression - the sinkCompression of this TextLine object.

sourceConfInit

public void sourceConfInit(FlowProcess<JobConf> flowProcess,
                           Tap<JobConf,RecordReader,OutputCollector> tap,
                           JobConf conf)
Description copied from class: Scheme
Method sourceConfInit initializes this instance as a source.

This method is executed client side as a means to provide necessary configuration parameters used by the underlying platform.

It is not intended to initialize resources that would be necessary during the execution of this class, like a "formatter" or "parser".

See Scheme.sourcePrepare(cascading.flow.FlowProcess, SourceCall) if resources must be initialized before use, and Scheme.sourceCleanup(cascading.flow.FlowProcess, SourceCall) if resources must be destroyed after use.

Specified by:
sourceConfInit in class Scheme<JobConf,RecordReader,OutputCollector,Object[],Object[]>
Parameters:
tap - of type Tap
conf - of type JobConf @throws IOException on initialization failure

presentSourceFields

public void presentSourceFields(FlowProcess<JobConf> flowProcess,
                                Tap tap,
                                Fields fields)
Description copied from class: Scheme
Method presentSourceFields is called after the planner is invoked and all fields are resolved. This method presents to the Scheme the actual source fields after any planner intervention.

This method is called after Scheme.retrieveSourceFields(cascading.flow.FlowProcess, cascading.tap.Tap).

Overrides:
presentSourceFields in class Scheme<JobConf,RecordReader,OutputCollector,Object[],Object[]>
Parameters:
flowProcess - of type FlowProcess
tap - of type Tap
fields - of type Fields

presentSinkFields

public void presentSinkFields(FlowProcess<JobConf> flowProcess,
                              Tap tap,
                              Fields fields)
Description copied from class: Scheme
Method presentSinkFields is called after the planner is invoked and all fields are resolved. This method presents to the Scheme the actual sink fields after any planner intervention.

This method is called after Scheme.retrieveSinkFields(cascading.flow.FlowProcess, cascading.tap.Tap).

Overrides:
presentSinkFields in class Scheme<JobConf,RecordReader,OutputCollector,Object[],Object[]>
Parameters:
flowProcess - of type FlowProcess
tap - of type Tap
fields - of type Fields

sinkConfInit

public void sinkConfInit(FlowProcess<JobConf> flowProcess,
                         Tap<JobConf,RecordReader,OutputCollector> tap,
                         JobConf conf)
Description copied from class: Scheme
Method sinkConfInit initializes this instance as a sink.

This method is executed client side as a means to provide necessary configuration parameters used by the underlying platform.

It is not intended to initialize resources that would be necessary during the execution of this class, like a "formatter" or "parser".

See Scheme.sinkPrepare(cascading.flow.FlowProcess, SinkCall) if resources must be initialized before use, and Scheme.sinkCleanup(cascading.flow.FlowProcess, SinkCall) if resources must be destroyed after use.

Specified by:
sinkConfInit in class Scheme<JobConf,RecordReader,OutputCollector,Object[],Object[]>
Parameters:
tap - of type Tap
conf - of type JobConf @throws IOException on initialization failure

sourcePrepare

public void sourcePrepare(FlowProcess<JobConf> flowProcess,
                          SourceCall<Object[],RecordReader> sourceCall)
Description copied from class: Scheme
Method sourcePrepare is used to initialize resources needed during each call of Scheme.source(cascading.flow.FlowProcess, SourceCall).

Be sure to place any initialized objects in the SourceContext so each instance will remain thread-safe.

Overrides:
sourcePrepare in class Scheme<JobConf,RecordReader,OutputCollector,Object[],Object[]>
Parameters:
flowProcess - of type FlowProcess
sourceCall - of type SourceCall

source

public boolean source(FlowProcess<JobConf> flowProcess,
                      SourceCall<Object[],RecordReader> sourceCall)
               throws IOException
Description copied from class: Scheme
Method source reads a new "record" or value from SourceCall.getInput(), populates the available Tuple via SourceCall.getIncomingEntry(), and returns true on success or false if no more values are available.

It's ok to set a new Tuple instance on the incomingEntry TupleEntry, or to simply re-use the existing instance.

Note this is the only time it is safe to modify a Tuple instance handed over via a method call.

This method may optionally throw a TapException if it cannot process a particular instance of data. If the payload Tuple is set on the TapException, that Tuple will be written to any applicable failure trap Tap.

Specified by:
source in class Scheme<JobConf,RecordReader,OutputCollector,Object[],Object[]>
Parameters:
flowProcess - of type FlowProcess
sourceCall - of type SourceCall
Returns:
returns true when a Tuple was successfully read
Throws:
IOException
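The prepare, source, and cleanup contract described for the source side can be sketched as a read loop. This is a simplified illustration in plain Java, not the Cascading call sequence itself: SourceLoopSketch, MiniSource, and readAll are hypothetical names, an Iterator stands in for the per-call context, and a one-element array stands in for the reusable incoming entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SourceLoopSketch {

    /** Hypothetical stand-in for the source side of a scheme; not a Cascading type. */
    static final class MiniSource {
        private final List<String> input;
        private Iterator<String> context; // per-read state, created in prepare()

        MiniSource(List<String> input) {
            this.input = input;
        }

        /** Initializes the resources needed by each call to source(). */
        void prepare() {
            context = input.iterator();
        }

        /** Fills holder[0] and returns true, or returns false when no values remain. */
        boolean source(String[] holder) {
            if (!context.hasNext()) {
                return false;
            }
            holder[0] = context.next();
            return true;
        }

        /** Releases whatever prepare() created. */
        void cleanup() {
            context = null;
        }
    }

    /** Drives the loop: prepare once, call source until it returns false, then clean up. */
    static List<String> readAll(MiniSource source) {
        List<String> out = new ArrayList<>();
        String[] holder = new String[1]; // reused across calls
        source.prepare();
        try {
            while (source.source(holder)) {
                out.add(holder[0]);
            }
        } finally {
            source.cleanup();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(readAll(new MiniSource(Arrays.asList("first", "second"))));
    }
}
```

The holder is reused on every iteration, mirroring the note that the incoming entry may be re-populated in place; the caller copies the value out before the next call.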

sourceHandleInput

protected void sourceHandleInput(SourceCall<Object[],RecordReader> sourceCall)

sourceCleanup

public void sourceCleanup(FlowProcess<JobConf> flowProcess,
                          SourceCall<Object[],RecordReader> sourceCall)
Description copied from class: Scheme
Method sourceCleanup is used to destroy resources created by Scheme.sourcePrepare(cascading.flow.FlowProcess, SourceCall).

Overrides:
sourceCleanup in class Scheme<JobConf,RecordReader,OutputCollector,Object[],Object[]>
Parameters:
flowProcess - of type FlowProcess
sourceCall - of type SourceCall

sink

public void sink(FlowProcess<JobConf> flowProcess,
                 SinkCall<Object[],OutputCollector> sinkCall)
          throws IOException
Description copied from class: Scheme
Method sink writes out the given Tuple found on SinkCall.getOutgoingEntry() to the SinkCall.getOutput().

This method may optionally throw a TapException if it cannot process a particular instance of data. If the payload Tuple is set on the TapException, that Tuple will be written to any applicable failure trap Tap. If not set, the incoming Tuple will be written instead.

Specified by:
sink in class Scheme<JobConf,RecordReader,OutputCollector,Object[],Object[]>
Parameters:
flowProcess - of type FlowProcess
sinkCall - of type SinkCall
Throws:
IOException


Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.