The cascading.operation.regex.RegexSplitter function splits an argument value by a regex pattern String. Internally, this function uses java.util.regex.Pattern#split(), and thus behaves accordingly. By default, this function splits on the TAB character ("\t"). If a known number of values will emerge from this function, it can declare field names. In this case, if the splitter encounters more split values than field names, the remaining values will be discarded; see java.util.regex.Pattern#split( input, limit ) for more information.
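The effect of the limit argument can be seen with plain java.util.regex, without Cascading on the classpath. The sketch below is only an illustration: the class name SplitLimitDemo and its split helper are hypothetical, and it shows what Pattern#split( input, limit ) returns, not how RegexSplitter maps the result onto declared fields.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; it uses only java.util.regex, not Cascading.
final class SplitLimitDemo {
    private static final Pattern TAB = Pattern.compile("\t");

    // Thin wrapper around Pattern#split(input, limit).
    static String[] split(String value, int limit) {
        return TAB.split(value, limit);
    }

    public static void main(String[] args) {
        String value = "a\tb\tc\td";

        // limit 0: split as many times as possible
        System.out.println(Arrays.toString(split(value, 0))); // prints [a, b, c, d]

        // limit 2: at most two elements are returned
        String[] limited = split(value, 2);
        System.out.println(limited.length);                   // prints 2
        System.out.println(limited[0]);                       // prints a
        // the last element holds the rest of the input, still TAB delimited
        System.out.println(limited[1].equals("b\tc\td"));     // prints true
    }
}
```

Note that with a positive limit, Pattern#split itself does not drop anything: the final element carries the unsplit remainder of the input.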
The cascading.operation.regex.RegexParser function is used to extract a regular expression matched value from an incoming argument value. If the regular expression is sufficiently complex, an int array may be provided which specifies which regex groups should be returned into which field names.
// incoming -> "line"
String regex = "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +"
  + "\\\"([^ ]*) ([^ ]*) [^ ]*\\\" ([^ ]*) ([^ ]*).*$";
Fields fieldDeclaration = new Fields( "ip", "time", "method", "event", "status", "size" );
int[] groups = {1, 2, 3, 4, 5, 6};
RegexParser parser = new RegexParser( fieldDeclaration, regex, groups );
assembly = new Each( assembly, new Fields( "line" ), parser );
// outgoing -> "ip", "time", "method", "event", "status", "size"
Above, we parse an Apache log "line" into its parts. Note that the int[] groups array starts at 1, not 0. Group 0 is the whole match, so if it were included, the first field would be a copy of "line" and not "ip".
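The difference between group 0 and the numbered capture groups can be checked directly with java.util.regex.Matcher. The following is a minimal sketch: the class GroupIndexDemo, its extract helper, and the simplified three-field log format are all assumptions made for illustration and are not part of Cascading.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it uses only java.util.regex, not Cascading.
final class GroupIndexDemo {
    // Simplified, made-up log format: "<ip> <method> <status>"
    private static final Pattern LINE =
        Pattern.compile("^([^ ]+) ([^ ]+) ([^ ]+)$");

    // Returns the requested groups of the first match, or null if nothing matches.
    static String[] extract(String line, int[] groups) {
        Matcher matcher = LINE.matcher(line);
        if (!matcher.find()) {
            return null;
        }
        String[] values = new String[groups.length];
        for (int i = 0; i < groups.length; i++) {
            values[i] = matcher.group(groups[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        String line = "68.46.103.112 GET 200";

        // Groups 1..3 are the parenthesized sub-expressions.
        System.out.println(String.join(" | ", extract(line, new int[]{1, 2, 3})));
        // prints 68.46.103.112 | GET | 200

        // Group 0 is the entire match, here the whole line.
        System.out.println(extract(line, new int[]{0})[0].equals(line)); // prints true
    }
}
```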
The cascading.operation.regex.RegexReplace function is used to replace a regex matched value with a replacement value. It may be used in a "replace all" or "replace first" mode. See the java.util.regex.Matcher#replaceAll() and java.util.regex.Matcher#replaceFirst() methods.
// incoming -> "line"
RegexReplace replace = new RegexReplace( new Fields( "clean-line" ), "\\s+", " ", true );
assembly = new Each( assembly, new Fields( "line" ), replace );
// outgoing -> "clean-line"
Above, we replace every run of adjacent whitespace characters with a single space character.
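The two modes correspond to the two Matcher methods named above. As a rough illustration using only java.util.regex (the class ReplaceModeDemo and its replace helper are hypothetical names, not Cascading API):

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it uses only java.util.regex, not Cascading.
final class ReplaceModeDemo {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // replaceAll == true  -> Matcher#replaceAll()
    // replaceAll == false -> Matcher#replaceFirst()
    static String replace(String value, String replacement, boolean replaceAll) {
        return replaceAll
            ? WHITESPACE.matcher(value).replaceAll(replacement)
            : WHITESPACE.matcher(value).replaceFirst(replacement);
    }

    public static void main(String[] args) {
        String line = "a   b   c";
        System.out.println(replace(line, " ", true));  // prints a b c
        System.out.println(replace(line, " ", false)); // prints a b   c
    }
}
```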
The cascading.operation.regex.RegexFilter filter applies a regular expression pattern String against every input Tuple value and filters the Tuple stream accordingly. By default, Tuples that match the given pattern are kept, and Tuples that do not match are filtered out. This can be changed by setting "removeMatch" to true. Also, by default, the whole Tuple is matched against the given pattern String (TAB delimited). If "matchEachElement" is set to true, the pattern is applied to each Tuple value individually. See the java.util.regex.Matcher#find() method.
// incoming -> "ip", "time", "method", "event", "status", "size"
Filter filter = new RegexFilter( "^68\\..*" );
assembly = new Each( assembly, new Fields( "ip" ), filter );
// outgoing -> "ip", "time", "method", "event", "status", "size"
Above we keep all lines where the "ip" address starts with "68.".
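Because matching relies on Matcher#find(), a pattern is satisfied by a match anywhere in the value unless it is anchored, which is why the pattern above begins with "^". The sketch below illustrates this with java.util.regex only. The class FindSemanticsDemo and its found and keep helpers are hypothetical; keep merely mimics the described "removeMatch" inversion on a single TAB-joined string and is not the RegexFilter implementation.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it uses only java.util.regex, not Cascading.
final class FindSemanticsDemo {
    // True if the pattern is found anywhere in the value (Matcher#find()).
    static boolean found(String regex, String value) {
        return Pattern.compile(regex).matcher(value).find();
    }

    // Keep on a match by default; keep on a non-match when removeMatch is true.
    static boolean keep(String regex, String value, boolean removeMatch) {
        return found(regex, value) != removeMatch;
    }

    public static void main(String[] args) {
        System.out.println(found("^68\\..*", "68.46.103.112"));  // prints true
        System.out.println(found("^68\\..*", "168.46.103.112")); // prints false
        // Without the anchor, find() also matches inside "168."
        System.out.println(found("68\\.", "168.46.103.112"));    // prints true

        // A whole tuple rendered as one TAB-delimited string
        String tuple = String.join("\t", "68.46.103.112", "GET", "200");
        System.out.println(keep("^68\\..*", tuple, false));      // prints true
        System.out.println(keep("^68\\..*", tuple, true));       // prints false
    }
}
```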
The cascading.operation.regex.RegexGenerator function will emit a new Tuple for every matched regular expression group, instead of a Tuple with every group as a value.
// incoming -> "line"
String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator( new Fields( "word" ), regex );
assembly = new Each( assembly, new Fields( "line" ), function );
// outgoing -> "word"
Above, each "line" in a document is parsed into individual words, and each word is stored in the "word" field of its own result Tuple. Repeated words are not removed; every match produces a Tuple.
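The one-result-per-match behavior can be approximated with a Matcher#find() loop. This is a sketch under assumptions: the class MatchGeneratorDemo and its words helper are hypothetical, and each list element stands in for one emitted "word" Tuple. The regex is the one from the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it uses only java.util.regex, not Cascading.
final class MatchGeneratorDemo {
    // Same word-boundary pattern as in the RegexGenerator example.
    private static final Pattern WORD =
        Pattern.compile("(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)");

    // Collects one entry per successive match; each entry models one emitted row.
    static List<String> words(String line) {
        List<String> rows = new ArrayList<>();
        Matcher matcher = WORD.matcher(line);
        while (matcher.find()) {
            rows.add(matcher.group());
        }
        return rows;
    }

    public static void main(String[] args) {
        // Punctuation is trimmed by the lookarounds; duplicates are kept.
        System.out.println(words("the quick, brown the")); // prints [the, quick, brown, the]
    }
}
```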
The cascading.operation.regex.RegexSplitGenerator function will emit a new Tuple for every split on the incoming argument value delimited by the given pattern String. The behavior is similar to the RegexSplitter function.
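The contrast with RegexSplitter can be pictured as "one row per split value" rather than "one row holding all split values". The sketch below is an illustration in plain Java: the class SplitRowsDemo and its rows helper are hypothetical, and each single-element list stands in for one emitted Tuple.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; it uses only java.util.regex, not Cascading.
final class SplitRowsDemo {
    // Splits the value and wraps every piece in its own single-element row.
    static List<List<String>> rows(String value, String regex) {
        List<List<String>> rows = new ArrayList<>();
        for (String part : Pattern.compile(regex).split(value)) {
            rows.add(Collections.singletonList(part));
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(rows("one two  three", "\\s+")); // prints [[one], [two], [three]]
    }
}
```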
Copyright © 2007-2008 Concurrent, Inc. All Rights Reserved.