7.6 Regular Expression Operations

RegexSplitter

The cascading.operation.regex.RegexSplitter function splits an argument value by a regex pattern String. Internally, this function uses java.util.regex.Pattern#split() and behaves accordingly. By default, it splits on the TAB character ("\t"). If a known number of values will emerge from this function, it can declare field names. In this case, if the splitter encounters more split values than field names, the remaining values are discarded. See java.util.regex.Pattern#split( input, limit ) for more information.
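The underlying java.util.regex call can be sketched in plain Java. This is a minimal illustration that does not use Cascading; the class SplitSketch, its method names, and the sample value are hypothetical. It shows only what Pattern#split() returns with and without a limit, not how RegexSplitter maps the pieces onto declared fields.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical sketch class; it is not part of Cascading.
class SplitSketch {
  // Splits on TAB, the default delimiter named in the text above.
  static String[] splitAll(String value) {
    return Pattern.compile("\t").split(value);
  }

  // Splits into at most 'limit' pieces. Pattern#split keeps any
  // unsplit remainder of the input in the last piece.
  static String[] splitWithLimit(String value, int limit) {
    return Pattern.compile("\t").split(value, limit);
  }

  public static void main(String[] args) {
    String value = "a\tb\tc\td"; // made-up sample input
    System.out.println(Arrays.toString(splitAll(value)));
    System.out.println(Arrays.toString(splitWithLimit(value, 2)));
  }
}
```

With no limit the sample yields four pieces; with a limit of 2 it yields two pieces, the second holding the rest of the input with its TABs intact.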

RegexParser

The cascading.operation.regex.RegexParser function is used to extract a value matched by a regular expression from an incoming argument value. If the regular expression is sufficiently complex, an int array may be provided that specifies which regex groups should be returned into which field names.

// incoming -> "line"

String regex =
  "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +" +
  "\\\"([^ ]*) ([^ ]*) [^ ]*\\\" ([^ ]*) ([^ ]*).*$";
Fields fieldDeclaration =
  new Fields( "ip", "time", "method", "event", "status", "size" );
int[] groups = {1, 2, 3, 4, 5, 6};
RegexParser parser = new RegexParser( fieldDeclaration, regex, groups );
assembly = new Each( assembly, new Fields( "line" ), parser );

// outgoing -> "ip", "time", "method", "event", "status", "size"

Above, we parse an Apache log "line" into its parts. Note that the int[] groups array starts at 1, not 0. Group 0 is the whole match, so if it were included, the first field would be a copy of "line" and not "ip".
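The difference between group 0 and groups 1 and up can be checked with java.util.regex alone. The following is a minimal sketch, not Cascading code: the class GroupSketch, the helper extract(), and the sample log line are hypothetical, and the use of Matcher#find() here is an assumption of the sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch class; it is not part of Cascading.
class GroupSketch {
  // Same pattern as in the RegexParser example above.
  static final String REGEX =
    "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +" +
    "\\\"([^ ]*) ([^ ]*) [^ ]*\\\" ([^ ]*) ([^ ]*).*$";

  // Returns the text captured by each requested group, in order,
  // or an empty array if the line does not match.
  static String[] extract(String line, int[] groups) {
    Matcher matcher = Pattern.compile(REGEX).matcher(line);
    if (!matcher.find())
      return new String[0];
    String[] values = new String[groups.length];
    for (int i = 0; i < groups.length; i++)
      values[i] = matcher.group(groups[i]);
    return values;
  }

  public static void main(String[] args) {
    // Made-up sample line in Apache log style.
    String line = "68.46.103.112 - - [01/Sep/2007:00:01:03 +0000] "
      + "\"GET /index.html HTTP/1.1\" 200 2326";
    for (String value : extract(line, new int[]{1, 2, 3, 4, 5, 6}))
      System.out.println(value);
    // Group 0 is the entire match, which here is the whole line.
    System.out.println(extract(line, new int[]{0})[0].equals(line));
  }
}
```

For the sample line, groups 1 through 6 yield the ip, time, method, event, status, and size parts, while group 0 returns the full line.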

RegexReplace

The cascading.operation.regex.RegexReplace function is used to replace a regex matched value with a replacement value. It may be used in a "replace all" or "replace first" mode. See the java.util.regex.Matcher#replaceAll() and java.util.regex.Matcher#replaceFirst() methods.

// incoming -> "line"

RegexReplace replace =
  new RegexReplace( new Fields( "clean-line" ), "\\s+", " ", true );
assembly = new Each( assembly, new Fields( "line" ), replace );

// outgoing -> "clean-line"

Above, we replace each run of adjacent whitespace characters with a single space character.
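The two modes correspond to the two Matcher methods named earlier. As a minimal sketch in plain Java (the class ReplaceSketch, the helper replace(), and the sample value are hypothetical and not part of Cascading):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch class; it is not part of Cascading.
class ReplaceSketch {
  // Chooses between "replace all" and "replace first" with a flag.
  static String replace(String value, String regex,
                        String replacement, boolean replaceAll) {
    Matcher matcher = Pattern.compile(regex).matcher(value);
    return replaceAll
      ? matcher.replaceAll(replacement)
      : matcher.replaceFirst(replacement);
  }

  public static void main(String[] args) {
    String value = "a  b \t c"; // made-up sample with mixed whitespace
    System.out.println(replace(value, "\\s+", " ", true));
    System.out.println(replace(value, "\\s+", " ", false));
  }
}
```

In "replace all" mode every whitespace run in the sample collapses to one space; in "replace first" mode only the first run does, and the later run is left as it was.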

RegexFilter

The cascading.operation.regex.RegexFilter filter applies a regular expression pattern String against every input Tuple value and filters the Tuple stream accordingly. By default, Tuples that match the given pattern are kept, and Tuples that do not match are filtered out. This can be changed by setting "removeMatch" to true. Also, by default, the whole Tuple is matched against the given pattern String (TAB delimited). If "matchEachElement" is set to true, the pattern is applied to each Tuple value individually. See the java.util.regex.Matcher#find() method.

// incoming -> "ip", "time", "method", "event", "status", "size"

Filter filter = new RegexFilter( "^68\\..*" );
assembly = new Each( assembly, new Fields( "ip" ), filter );

// outgoing -> "ip", "time", "method", "event", "status", "size"

Above we keep all lines where the "ip" address starts with "68.".
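The keep-or-remove decision can be sketched with Matcher#find() directly. This is a minimal illustration rather than the RegexFilter implementation: the class FilterSketch and the helper keep() are hypothetical, the boolean parameter only mirrors the "removeMatch" setting described earlier, and the addresses are made up.

```java
import java.util.regex.Pattern;

// Hypothetical sketch class; it is not part of Cascading.
class FilterSketch {
  // Returns true if a value should stay in the stream.
  // With removeMatch == false, matching values are kept;
  // with removeMatch == true, matching values are dropped.
  static boolean keep(String value, String regex, boolean removeMatch) {
    boolean matched = Pattern.compile(regex).matcher(value).find();
    return matched != removeMatch;
  }

  public static void main(String[] args) {
    String regex = "^68\\..*"; // same pattern as in the example above
    System.out.println(keep("68.46.103.112", regex, false));
    System.out.println(keep("168.46.103.112", regex, false));
    System.out.println(keep("68.46.103.112", regex, true));
  }
}
```

Because find() looks for the pattern anywhere in the value, the leading "^" anchor is what restricts the match to addresses that start with "68.".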

RegexGenerator

The cascading.operation.regex.RegexGenerator function will emit a new Tuple for every matched regular expression group, instead of a Tuple with every group as a value.

// incoming -> "line"

String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator( new Fields( "word" ), regex );
assembly = new Each( assembly, new Fields( "line" ), function );

// outgoing -> "word"

Above, each "line" in a document is parsed into individual words, and each word is stored in the "word" field of its own result Tuple.
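The one-result-per-match behavior can be approximated in plain Java by calling Matcher#find() in a loop. This is a minimal sketch under that assumption, not the RegexGenerator implementation; the class WordSketch, the helper words(), and the sample sentence are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch class; it is not part of Cascading.
class WordSketch {
  // Same pattern as in the RegexGenerator example above.
  static final String REGEX = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";

  // Collects every match as its own entry, one per "row".
  static List<String> words(String line) {
    List<String> result = new ArrayList<>();
    Matcher matcher = Pattern.compile(REGEX).matcher(line);
    while (matcher.find())
      result.add(matcher.group());
    return result;
  }

  public static void main(String[] args) {
    System.out.println(words("Hello, big world!")); // made-up sample
  }
}
```

For the sample sentence the loop yields "Hello", "big", and "world"; the lookarounds trim the surrounding punctuation because each match must begin and end on a letter.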

RegexSplitGenerator

The cascading.operation.regex.RegexSplitGenerator function will emit a new Tuple for every split on the incoming argument value delimited by the given pattern String. The behavior is similar to the RegexSplitter function.
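The contrast with RegexSplitter can be sketched in plain Java: rather than one result holding all the split values, each split value becomes its own result. The class SplitRowsSketch, the helper rows(), and the sample value are hypothetical, and the sketch uses only Pattern#split(), not Cascading.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch class; it is not part of Cascading.
class SplitRowsSketch {
  // Emits one single-value "row" per split piece.
  static List<String[]> rows(String value, String regex) {
    List<String[]> result = new ArrayList<>();
    for (String piece : Pattern.compile(regex).split(value))
      result.add(new String[]{piece});
    return result;
  }

  public static void main(String[] args) {
    // Made-up sample: three comma-separated values.
    for (String[] row : rows("red,green,blue", ","))
      System.out.println(row[0]);
  }
}
```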

Copyright © 2007-2008 Concurrent, Inc. All Rights Reserved.