cascading.scheme
Class TextDelimited

java.lang.Object
  extended by cascading.scheme.Scheme
      extended by cascading.scheme.TextLine
          extended by cascading.scheme.TextDelimited
All Implemented Interfaces:
Serializable

public class TextDelimited
extends TextLine

Class TextDelimited is a subclass of TextLine that provides direct support for delimited text files, such as TAB (\t) or COMMA (,) delimited files. Quoted values are optionally supported.

TextDelimited may also be used to skip the "header" in a file, where the header is defined as the very first line in every input file. That is, if the byte offset of the current line from the input is zero (0), that line will be skipped.
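
The header rule above can be sketched without Hadoop, assuming each line arrives paired with the byte offset at which it starts in the file. The class `HeaderSkipSketch` and its method are hypothetical names used for illustration only; they are not part of Cascading.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Cascading code: drop the line whose byte offset is zero.
final class HeaderSkipSketch {
    // offsetToLine maps each line's starting byte offset to its text.
    static List<String> readLines(Map<Long, String> offsetToLine, boolean skipHeader) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<Long, String> entry : offsetToLine.entrySet()) {
            if (skipHeader && entry.getKey() == 0L) {
                continue; // offset zero marks the first line of the file, i.e. the header
            }
            kept.add(entry.getValue());
        }
        return kept;
    }
}
```

Because the test is on the offset rather than on a line counter, the first line of every input file is dropped, while later lines in any file are kept.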

By default headers are not skipped.

By default this Scheme is both strict and safe.

Strict means that if a line of text does not parse into the expected number of fields, this class will throw a TapException. If strict is false, a Tuple will be returned with null values for the missing fields.
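
As a rough illustration of the strict rule, the sketch below pads a short row with nulls when strict is false and fails otherwise. `StrictSketch` is a hypothetical helper, `IllegalStateException` stands in for TapException, and the truncation of extra fields is an assumption, since the text only describes missing fields.

```java
import java.util.Arrays;

// Hypothetical sketch of the strict rule; not the Cascading implementation.
final class StrictSketch {
    static Object[] fitToFields(String[] parsed, int numFields, boolean strict) {
        if (strict && parsed.length != numFields) {
            // Stand-in for TapException, which is not available in this sketch.
            throw new IllegalStateException(
                "expected " + numFields + " fields, found " + parsed.length);
        }
        // Arrays.copyOf pads a shorter array with nulls for the missing fields.
        // Assumption: extra fields are truncated; the text does not cover that case.
        return Arrays.copyOf(parsed, numFields, Object[].class);
    }
}
```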

Safe means that if a field cannot be coerced into the expected type, null will be used for the value. If safe is false, a TapException will be thrown.
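
The safe rule can be illustrated for a single Integer field as follows. `SafeSketch` is a hypothetical helper and `IllegalStateException` stands in for TapException; the real coercion logic in Cascading may differ.

```java
// Hypothetical sketch of the safe rule for one Integer field; not Cascading code.
final class SafeSketch {
    static Integer coerceToInteger(String raw, boolean safe) {
        if (raw == null) {
            return null; // nothing to coerce
        }
        try {
            return Integer.valueOf(raw);
        } catch (NumberFormatException e) {
            if (safe) {
                return null; // safe: a value that cannot be coerced becomes null
            }
            // Stand-in for TapException, which is not available in this sketch.
            throw new IllegalStateException("cannot coerce '" + raw + "' to Integer", e);
        }
    }
}
```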

Also by default, quote strings are not searched for, which improves processing speed. If a file is COMMA delimited but may have COMMAs in a value, the whole value should be surrounded by the quote string, typically double quotes (").
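
To show why quoting matters, here is a minimal sketch of quote-aware splitting using a commonly used lookahead regex. It is not necessarily the pattern that TextDelimited builds internally, and `QuotedSplitSketch` is a hypothetical name.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: split on commas that sit outside double-quoted values.
final class QuotedSplitSketch {
    // Matches a comma only when an even number of double quotes follows it,
    // which means the comma is not inside a quoted value.
    static final Pattern COMMA_OUTSIDE_QUOTES =
        Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    static String[] split(String line) {
        // A negative limit keeps trailing empty fields.
        return COMMA_OUTSIDE_QUOTES.split(line, -1);
    }
}
```

With this pattern, the line `1,"Smith, John",NY` yields three values rather than four. The surrounding quotes remain on the second value; removing them is a separate cleaning step.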

Note that all empty fields in a line will be returned as null unless coerced into a new type.

This Scheme may source/sink Fields.ALL. When Fields.ALL is given on the constructor, the new instance will automatically default to strict == false, as the number of fields parsed is arbitrary or unknown. A type array may not be given either, so all values will be returned as Strings.

See Also:
TextLine, Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class cascading.scheme.TextLine
TextLine.Compress
 
Field Summary
protected  Pattern cleanPattern
          Field cleanPattern
protected  Pattern escapePattern
          Field escapePattern
protected  Pattern splitPattern
          Field splitPattern
 
Fields inherited from class cascading.scheme.TextLine
DEFAULT_SOURCE_FIELDS
 
Constructor Summary
TextDelimited(Fields fields, boolean skipHeader, String delimiter)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, boolean skipHeader, String delimiter, Class[] types)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, boolean skipHeader, String delimiter, String quote)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, boolean skipHeader, String delimiter, String quote, Class[] types)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, boolean skipHeader, String delimiter, String quote, Class[] types, boolean safe)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, String delimiter)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, String delimiter, Class[] types)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, String delimiter, String quote)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, String delimiter, String quote, Class[] types)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, String delimiter, String quote, Class[] types, boolean safe)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, TextLine.Compress sinkCompression, boolean skipHeader, String delimiter)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, TextLine.Compress sinkCompression, boolean skipHeader, String delimiter, boolean strict, String quote, Class[] types, boolean safe)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, TextLine.Compress sinkCompression, boolean skipHeader, String delimiter, Class[] types)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, TextLine.Compress sinkCompression, boolean skipHeader, String delimiter, Class[] types, boolean safe)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, TextLine.Compress sinkCompression, boolean skipHeader, String delimiter, String quote)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, TextLine.Compress sinkCompression, boolean skipHeader, String delimiter, String quote, Class[] types)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, TextLine.Compress sinkCompression, boolean skipHeader, String delimiter, String quote, Class[] types, boolean safe)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, TextLine.Compress sinkCompression, String delimiter)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, TextLine.Compress sinkCompression, String delimiter, Class[] types)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, TextLine.Compress sinkCompression, String delimiter, Class[] types, boolean safe)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, TextLine.Compress sinkCompression, String delimiter, String quote)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, TextLine.Compress sinkCompression, String delimiter, String quote, Class[] types)
          Constructor TextDelimited creates a new TextDelimited instance.
TextDelimited(Fields fields, TextLine.Compress sinkCompression, String delimiter, String quote, Class[] types, boolean safe)
          Constructor TextDelimited creates a new TextDelimited instance.
 
Method Summary
static Object[] cleanSplit(Object[] split, Pattern cleanPattern, Pattern escapePattern, String quote)
          Method cleanSplit will return a quote-free array of String values; the given split array will be updated in place.
static Pattern createCleanPatternFor(String quote)
          Method createCleanPatternFor creates a regex Pattern for removing quote characters from a String.
static Pattern createEscapePatternFor(String quote)
          Method createEscapePatternFor creates a regex Pattern for cleaning quote escapes from a String.
static String[] createSplit(String value, Pattern splitPattern, int numValues)
          Method createSplit will split the given value with the given splitPattern.
static Pattern createSplitPatternFor(String delimiter, String quote)
          Method createSplitPatternFor creates a regex Pattern for splitting a line of text into its component parts using the given delimiter and quote Strings.
 void sink(TupleEntry tupleEntry, OutputCollector outputCollector)
          Method sink writes out the given Tuple instance to the outputCollector.
 Tuple source(Object key, Object value)
          Method source takes the given Hadoop key and value and returns a new Tuple instance.
 
Methods inherited from class cascading.scheme.TextLine
getSinkCompression, setSinkCompression, sinkInit, sourceInit
 
Methods inherited from class cascading.scheme.Scheme
equals, getNumSinkParts, getSinkFields, getSourceFields, getTrace, hashCode, isSink, isSource, isSymmetrical, isWriteDirect, setNumSinkParts, setSinkFields, setSourceFields, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

splitPattern

protected Pattern splitPattern
Field splitPattern


cleanPattern

protected Pattern cleanPattern
Field cleanPattern


escapePattern

protected Pattern escapePattern
Field escapePattern

Constructor Detail

TextDelimited

@ConstructorProperties(value={"fields","delimiter"})
public TextDelimited(Fields fields,
                                                String delimiter)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
delimiter - of type String

TextDelimited

@ConstructorProperties(value={"fields","skipHeader","delimiter"})
public TextDelimited(Fields fields,
                                                boolean skipHeader,
                                                String delimiter)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
skipHeader - of type boolean
delimiter - of type String

TextDelimited

@ConstructorProperties(value={"fields","delimiter","types"})
public TextDelimited(Fields fields,
                                                String delimiter,
                                                Class[] types)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
delimiter - of type String
types - of type Class[]

TextDelimited

@ConstructorProperties(value={"fields","skipHeader","delimiter","types"})
public TextDelimited(Fields fields,
                                                boolean skipHeader,
                                                String delimiter,
                                                Class[] types)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
skipHeader - of type boolean
delimiter - of type String
types - of type Class[]

TextDelimited

@ConstructorProperties(value={"fields","delimiter","quote","types"})
public TextDelimited(Fields fields,
                                                String delimiter,
                                                String quote,
                                                Class[] types)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
delimiter - of type String
quote - of type String
types - of type Class[]

TextDelimited

@ConstructorProperties(value={"fields","skipHeader","delimiter","quote","types"})
public TextDelimited(Fields fields,
                                                boolean skipHeader,
                                                String delimiter,
                                                String quote,
                                                Class[] types)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
skipHeader - of type boolean
delimiter - of type String
quote - of type String
types - of type Class[]

TextDelimited

@ConstructorProperties(value={"fields","delimiter","quote","types","safe"})
public TextDelimited(Fields fields,
                                                String delimiter,
                                                String quote,
                                                Class[] types,
                                                boolean safe)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
delimiter - of type String
quote - of type String
types - of type Class[]
safe - of type boolean

TextDelimited

@ConstructorProperties(value={"fields","skipHeader","delimiter","quote","types","safe"})
public TextDelimited(Fields fields,
                                                boolean skipHeader,
                                                String delimiter,
                                                String quote,
                                                Class[] types,
                                                boolean safe)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
skipHeader - of type boolean
delimiter - of type String
quote - of type String
types - of type Class[]
safe - of type boolean

TextDelimited

@ConstructorProperties(value={"fields","sinkCompression","delimiter"})
public TextDelimited(Fields fields,
                                                TextLine.Compress sinkCompression,
                                                String delimiter)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
sinkCompression - of type Compress
delimiter - of type String

TextDelimited

@ConstructorProperties(value={"fields","sinkCompression","skipHeader","delimiter"})
public TextDelimited(Fields fields,
                                                TextLine.Compress sinkCompression,
                                                boolean skipHeader,
                                                String delimiter)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
sinkCompression - of type Compress
skipHeader - of type boolean
delimiter - of type String

TextDelimited

@ConstructorProperties(value={"fields","sinkCompression","delimiter","types"})
public TextDelimited(Fields fields,
                                                TextLine.Compress sinkCompression,
                                                String delimiter,
                                                Class[] types)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
sinkCompression - of type Compress
delimiter - of type String
types - of type Class[]

TextDelimited

@ConstructorProperties(value={"fields","sinkCompression","skipHeader","delimiter","types"})
public TextDelimited(Fields fields,
                                                TextLine.Compress sinkCompression,
                                                boolean skipHeader,
                                                String delimiter,
                                                Class[] types)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
sinkCompression - of type Compress
skipHeader - of type boolean
delimiter - of type String
types - of type Class[]

TextDelimited

@ConstructorProperties(value={"fields","sinkCompression","delimiter","types","safe"})
public TextDelimited(Fields fields,
                                                TextLine.Compress sinkCompression,
                                                String delimiter,
                                                Class[] types,
                                                boolean safe)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
sinkCompression - of type Compress
delimiter - of type String
types - of type Class[]
safe - of type boolean

TextDelimited

@ConstructorProperties(value={"fields","sinkCompression","skipHeader","delimiter","types","safe"})
public TextDelimited(Fields fields,
                                                TextLine.Compress sinkCompression,
                                                boolean skipHeader,
                                                String delimiter,
                                                Class[] types,
                                                boolean safe)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
sinkCompression - of type Compress
skipHeader - of type boolean
delimiter - of type String
types - of type Class[]
safe - of type boolean

TextDelimited

@ConstructorProperties(value={"fields","delimiter","quote"})
public TextDelimited(Fields fields,
                                                String delimiter,
                                                String quote)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
delimiter - of type String
quote - of type String

TextDelimited

@ConstructorProperties(value={"fields","skipHeader","delimiter","quote"})
public TextDelimited(Fields fields,
                                                boolean skipHeader,
                                                String delimiter,
                                                String quote)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
skipHeader - of type boolean
delimiter - of type String
quote - of type String

TextDelimited

@ConstructorProperties(value={"fields","sinkCompression","skipHeader","delimiter","quote"})
public TextDelimited(Fields fields,
                                                TextLine.Compress sinkCompression,
                                                String delimiter,
                                                String quote)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
sinkCompression - of type Compress
delimiter - of type String
quote - of type String

TextDelimited

public TextDelimited(Fields fields,
                     TextLine.Compress sinkCompression,
                     boolean skipHeader,
                     String delimiter,
                     String quote)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
sinkCompression - of type Compress
skipHeader - of type boolean
delimiter - of type String
quote - of type String

TextDelimited

@ConstructorProperties(value={"fields","sinkCompression","delimiter","quote","types"})
public TextDelimited(Fields fields,
                                                TextLine.Compress sinkCompression,
                                                String delimiter,
                                                String quote,
                                                Class[] types)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
sinkCompression - of type Compress
delimiter - of type String
quote - of type String
types - of type Class[]

TextDelimited

@ConstructorProperties(value={"fields","sinkCompression","skipHeader","delimiter","quote","types"})
public TextDelimited(Fields fields,
                                                TextLine.Compress sinkCompression,
                                                boolean skipHeader,
                                                String delimiter,
                                                String quote,
                                                Class[] types)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
sinkCompression - of type Compress
skipHeader - of type boolean
delimiter - of type String
quote - of type String
types - of type Class[]

TextDelimited

@ConstructorProperties(value={"fields","sinkCompression","delimiter","quote","types","safe"})
public TextDelimited(Fields fields,
                                                TextLine.Compress sinkCompression,
                                                String delimiter,
                                                String quote,
                                                Class[] types,
                                                boolean safe)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
sinkCompression - of type Compress
delimiter - of type String
quote - of type String
types - of type Class[]
safe - of type boolean

TextDelimited

@ConstructorProperties(value={"fields","sinkCompression","skipHeader","delimiter","quote","types","safe"})
public TextDelimited(Fields fields,
                                                TextLine.Compress sinkCompression,
                                                boolean skipHeader,
                                                String delimiter,
                                                String quote,
                                                Class[] types,
                                                boolean safe)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
sinkCompression - of type Compress
skipHeader - of type boolean
delimiter - of type String
quote - of type String
types - of type Class[]
safe - of type boolean

TextDelimited

@ConstructorProperties(value={"fields","sinkCompression","skipHeader","delimiter","strict","quote","types","safe"})
public TextDelimited(Fields fields,
                                                TextLine.Compress sinkCompression,
                                                boolean skipHeader,
                                                String delimiter,
                                                boolean strict,
                                                String quote,
                                                Class[] types,
                                                boolean safe)
Constructor TextDelimited creates a new TextDelimited instance.

Parameters:
fields - of type Fields
sinkCompression - of type Compress
skipHeader - of type boolean
delimiter - of type String
strict - of type boolean
quote - of type String
types - of type Class[]
safe - of type boolean

Method Detail

createEscapePatternFor

public static Pattern createEscapePatternFor(String quote)
Method createEscapePatternFor creates a regex Pattern for cleaning quote escapes from a String.

If quote is null or empty, a null value will be returned.

Parameters:
quote - of type String
Returns:
Pattern

createCleanPatternFor

public static Pattern createCleanPatternFor(String quote)
Method createCleanPatternFor creates a regex Pattern for removing quote characters from a String.

If quote is null or empty, a null value will be returned.

Parameters:
quote - of type String
Returns:
Pattern

createSplitPatternFor

public static Pattern createSplitPatternFor(String delimiter,
                                            String quote)
Method createSplitPatternFor creates a regex Pattern for splitting a line of text into its component parts using the given delimiter and quote Strings. The quote argument may be null.

Parameters:
delimiter - of type String
quote - of type String
Returns:
Pattern

source

public Tuple source(Object key,
                    Object value)
Description copied from class: Scheme
Method source takes the given Hadoop key and value and returns a new Tuple instance.

Overrides:
source in class TextLine
Parameters:
key - of type WritableComparable
value - of type Writable
Returns:
Tuple

createSplit

public static String[] createSplit(String value,
                                   Pattern splitPattern,
                                   int numValues)
Method createSplit will split the given value with the given splitPattern.

Parameters:
value - of type String
splitPattern - of type Pattern
numValues - of type int
Returns:
String[]
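
The description does not say how numValues is used. One plausible reading, shown in the sketch below, is that it is passed as the limit argument of `java.util.regex.Pattern.split`. This is an assumption, and `CreateSplitSketch` is a hypothetical stand-in rather than the Cascading method itself.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: treat numValues as the limit of Pattern.split (an assumption).
final class CreateSplitSketch {
    static String[] createSplit(String value, Pattern splitPattern, int numValues) {
        // With a positive limit n, the pattern is applied at most n - 1 times, so the
        // result has at most n entries and trailing empty strings are preserved.
        return splitPattern.split(value, numValues);
    }
}
```

Under this reading, a line with more delimiters than expected leaves the remainder of the line in the last entry, and a line ending in delimiters keeps its trailing empty values.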

cleanSplit

public static Object[] cleanSplit(Object[] split,
                                  Pattern cleanPattern,
                                  Pattern escapePattern,
                                  String quote)
Method cleanSplit will return a quote-free array of String values; the given split array will be updated in place.

If cleanPattern is null, quote cleaning will not be performed, but all empty String values will be replaced with a null value.

Parameters:
split - of type Object[]
cleanPattern - of type Pattern
escapePattern - of type Pattern
quote - of type String
Returns:
Object[] as a convenience
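
The behavior described for cleanSplit can be sketched as below. The patterns used in the usage note (`^"(.*)"$` to strip surrounding quotes, `""` to match an escaped quote) are assumptions chosen for illustration, and `CleanSplitSketch` is a hypothetical class, not the Cascading source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of in-place quote cleaning; not the Cascading implementation.
final class CleanSplitSketch {
    static Object[] cleanSplit(Object[] split, Pattern cleanPattern,
                               Pattern escapePattern, String quote) {
        for (int i = 0; i < split.length; i++) {
            String value = (String) split[i];
            if (value != null && cleanPattern != null) {
                Matcher quoted = cleanPattern.matcher(value);
                if (quoted.matches()) {
                    value = quoted.group(1); // text between the surrounding quotes
                    if (escapePattern != null) {
                        // Collapse each escaped quote into a single quote string.
                        value = escapePattern.matcher(value)
                            .replaceAll(Matcher.quoteReplacement(quote));
                    }
                }
            }
            // Empty values become null whether or not quote cleaning ran.
            split[i] = (value == null || value.isEmpty()) ? null : value;
        }
        return split; // same array instance, returned as a convenience
    }
}
```

For example, with `Pattern.compile("^\"(.*)\"$")` as cleanPattern and `Pattern.compile("\"\"")` as escapePattern, the value `"Smith, John"` becomes `Smith, John` and an empty String becomes null.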

sink

public void sink(TupleEntry tupleEntry,
                 OutputCollector outputCollector)
          throws IOException
Description copied from class: Scheme
Method sink writes out the given Tuple instance to the outputCollector.

Overrides:
sink in class TextLine
Parameters:
tupleEntry - of type TupleEntry
outputCollector - of type OutputCollector
Throws:
IOException


Copyright © 2007-2010 Concurrent, Inc. All Rights Reserved.