The cascading.operation.regex.RegexSplitter function splits an
argument value based on a regex pattern String. (For the opposite
effect, see the FieldJoiner function.) Internally, this function
uses java.util.regex.Pattern.split(), and it behaves accordingly.
By default, it splits on the TAB character ("\t"). If the number
of values that will emerge from this function is known in
advance, field names can be declared for them. In this case, if
the splitter encounters more split values than field names, the
remaining values are discarded. For more information, see
java.util.regex.Pattern.split( input, limit ).
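The behavior of java.util.regex.Pattern.split( input, limit ) on its own can be sketched in plain Java, without any Cascading classes. The class and method names below (SplitLimitSketch, splitOnTab) are hypothetical and exist only for illustration. This shows what the JDK method does with different limit values; it is not a statement about how RegexSplitter chooses its limit internally.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Cascading.
class SplitLimitSketch {
    // Splits on the TAB character, the default delimiter named in the text.
    static String[] splitOnTab(String input, int limit) {
        return Pattern.compile("\t").split(input, limit);
    }

    public static void main(String[] args) {
        // limit 0: split as often as possible, drop trailing empty strings
        System.out.println(Arrays.toString(splitOnTab("a\tb\tc\td", 0)));   // [a, b, c, d]
        System.out.println(Arrays.toString(splitOnTab("a\tb\t\t", 0)));     // [a, b]

        // negative limit: split as often as possible, keep trailing empties
        System.out.println(splitOnTab("a\tb\t\t", -1).length);              // 4

        // positive limit n: at most n elements; the last one holds the
        // unsplit remainder of the input
        System.out.println(splitOnTab("a\tb\tc\td", 2).length);             // 2
    }
}
```

With a positive limit, the JDK method itself does not drop anything: the final array element carries whatever input was left after the last applied split.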
The cascading.operation.regex.RegexParser function is used to
extract a regex-matched value from an incoming argument value. If
the regular expression is sufficiently complex, an int array may
be provided to specify which regex groups should be returned in
which field names.
// incoming -> "line"
String regex =
"^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +" +
"\\\"([^ ]*) ([^ ]*) [^ ]*\\\" ([^ ]*) ([^ ]*).*$";
Fields fieldDeclaration =
new Fields( "ip", "time", "method", "event", "status", "size" );
int[] groups = {1, 2, 3, 4, 5, 6};
RegexParser parser = new RegexParser( fieldDeclaration, regex, groups );
assembly = new Each( assembly, new Fields( "line" ), parser );
// outgoing -> "ip", "time", "method", "event", "status", "size"
In the example above, a line from an Apache access log is
parsed into its component parts. Note that the int[] groups
array starts at 1, not 0. Group 0 is the whole match, so if
group 0 is listed first, the first field is a copy of "line"
rather than "ip".
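The group numbering follows java.util.regex.Matcher, where group 0 is the entire match and capturing groups are counted from 1. The following is a minimal sketch in plain Java; the class GroupIndexSketch, the method extract, and the sample input are hypothetical and stand in for what a parser with an int[] of group indexes would return.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Cascading.
class GroupIndexSketch {
    // Returns the requested groups of the first match, or null if
    // the regex does not match the input at all.
    static String[] extract(String regex, String input, int[] groups) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (!matcher.find())
            return null;
        String[] values = new String[groups.length];
        for (int i = 0; i < groups.length; i++)
            values[i] = matcher.group(groups[i]);
        return values;
    }

    public static void main(String[] args) {
        String regex = "^([^ ]*) ([^ ]*)$";
        String input = "GET /index.html"; // made-up sample value

        // Groups 1 and 2 are the two parenthesized parts.
        System.out.println(Arrays.toString(
            extract(regex, input, new int[]{1, 2})));  // [GET, /index.html]

        // Group 0 is the whole match, here identical to the input.
        System.out.println(Arrays.toString(
            extract(regex, input, new int[]{0, 1})));  // [GET /index.html, GET]
    }
}
```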
The cascading.operation.regex.RegexReplace function is used to
replace a regex-matched value with a specified replacement value.
It can operate in a "replace all" or "replace first" mode. For
more information, see the methods
java.util.regex.Matcher.replaceAll() and
java.util.regex.Matcher.replaceFirst().
// incoming -> "line"
RegexReplace replace =
new RegexReplace( new Fields( "clean-line" ), "\\s+", " ", true );
assembly = new Each( assembly, new Fields( "line" ), replace );
// outgoing -> "clean-line"
In the example above, each run of adjacent whitespace characters is replaced with a single space character.
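The difference between the two modes can be seen directly with java.util.regex.Matcher. This is a sketch in plain Java under the assumption that a boolean selects between the two JDK methods; the class ReplaceModeSketch and its replace method are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Cascading.
class ReplaceModeSketch {
    // replaceAll == true  -> Matcher.replaceAll()
    // replaceAll == false -> Matcher.replaceFirst()
    static String replace(String input, String regex,
                          String replacement, boolean replaceAll) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return replaceAll
            ? matcher.replaceAll(replacement)
            : matcher.replaceFirst(replacement);
    }

    public static void main(String[] args) {
        String line = "a   b \t c";
        // Every whitespace run collapses to one space.
        System.out.println(replace(line, "\\s+", " ", true));   // a b c
        // Only the first whitespace run is collapsed; the second
        // run (space, TAB, space) is left untouched.
        System.out.println(replace(line, "\\s+", " ", false));
    }
}
```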
The cascading.operation.regex.RegexFilter operation filters a
Tuple stream based on a specified regex value. By default, tuples
that match the given pattern are kept, and tuples that do not
match are filtered out. This can be reversed by setting
"removeMatch" to true. Also, by default, the values of the whole
Tuple are joined with TAB characters and matched against the
given regex String as a single value. If "matchEachElement" is
set to true, the pattern is applied to each Tuple value
individually. For more information, see the
java.util.regex.Matcher.find() method.
// incoming -> "ip", "time", "method", "event", "status", "size"
Filter filter = new RegexFilter( "^68\\..*" );
assembly = new Each( assembly, new Fields( "ip" ), filter );
// outgoing -> "ip", "time", "method", "event", "status", "size"
The example above keeps all tuples in which the IP address starts with "68.".
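Because matching is based on java.util.regex.Matcher.find(), a pattern matches if it occurs anywhere in the value unless it is anchored. The sketch below imitates the keep-or-remove decision in plain Java. The class FindFilterSketch and the method keep are hypothetical, and joining the values with TAB is an assumption modeled on the default described above, not a copy of the RegexFilter code.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Cascading.
class FindFilterSketch {
    // Joins the values with TAB, applies find(), and returns true
    // if the row should be kept. removeMatch reverses the decision.
    static boolean keep(String regex, boolean removeMatch, String... values) {
        String joined = String.join("\t", values);
        boolean found = Pattern.compile(regex).matcher(joined).find();
        return found != removeMatch;
    }

    public static void main(String[] args) {
        System.out.println(keep("^68\\..*", false, "68.46.103.112"));   // true
        System.out.println(keep("^68\\..*", false, "168.46.103.112"));  // false
        System.out.println(keep("^68\\..*", true, "68.46.103.112"));    // false

        // find() matches anywhere in the joined value ...
        System.out.println(keep("GET", false, "68.46.103.112", "GET"));   // true
        // ... but ^ anchors to the start of the joined value, not to
        // the start of each element.
        System.out.println(keep("^GET", false, "68.46.103.112", "GET"));  // false
    }
}
```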
The cascading.operation.regex.RegexGenerator function emits a new
tuple for every string (found in an input tuple) that matches a
specified regex pattern.
// incoming -> "line"
String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator( new Fields( "word" ), regex );
assembly = new Each( assembly, new Fields( "line" ), function );
// outgoing -> "word"
In the example above, each "line" in a document is parsed into individual words, and each word is stored in the "word" field of its own result Tuple.
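The "one result per match" idea corresponds to calling java.util.regex.Matcher.find() in a loop. The following plain-Java sketch uses the same word regex as the example; the class MatchLoopSketch, the method allMatches, and the sample sentences are hypothetical, and a List of Strings stands in for the emitted tuples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Cascading.
class MatchLoopSketch {
    // Collects every non-overlapping match of regex in input,
    // in the order find() reports them.
    static List<String> allMatches(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find())
            matches.add(matcher.group());
        return matches;
    }

    public static void main(String[] args) {
        String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";

        // Punctuation next to a word is not part of the match.
        System.out.println(allMatches(regex, "Hello, world!"));       // [Hello, world]

        // A word that occurs twice is reported twice.
        System.out.println(allMatches(regex, "to be or not to be"));  // [to, be, or, not, to, be]
    }
}
```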
The cascading.operation.regex.RegexSplitGenerator function emits
a new Tuple for every split on the incoming argument value
delimited by the given pattern String. The behavior is similar to
the RegexSplitter function, except that (assuming multiple
matches) RegexSplitter emits a single tuple that may contain
multiple values, and RegexSplitGenerator emits multiple tuples
that each contain only one value, as does RegexGenerator.
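The difference in output shape can be pictured with plain Java collections. In this sketch a List of Strings stands in for one tuple, and a List of such lists stands in for the emitted stream. The class SplitShapeSketch and its methods oneRow and manyRows are hypothetical names chosen for illustration; neither is Cascading code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Cascading.
class SplitShapeSketch {
    // Splitter-like shape: one row holding all split values.
    static List<List<String>> oneRow(String regex, String input) {
        List<List<String>> rows = new ArrayList<>();
        rows.add(Arrays.asList(Pattern.compile(regex).split(input)));
        return rows;
    }

    // Split-generator-like shape: one row per split value.
    static List<List<String>> manyRows(String regex, String input) {
        List<List<String>> rows = new ArrayList<>();
        for (String value : Pattern.compile(regex).split(input))
            rows.add(Collections.singletonList(value));
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(oneRow("\t", "a\tb\tc"));    // [[a, b, c]]
        System.out.println(manyRows("\t", "a\tb\tc"));  // [[a], [b], [c]]
    }
}
```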
Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.