3.7 Fields Sets

Cascading applications can perform complex manipulation, or "field algebra", on the fields stored in tuples by using Fields sets. Fields sets are a feature of the Fields class that act as wildcards for referencing sets of field values.

These predefined Fields sets are constant values on the Fields class. They can be used in many places where the Fields class is expected. They are:

Fields.ALL

The cascading.tuple.Fields.ALL constant is a wildcard that represents all the currently available fields.

// incoming -> first, last, age

String expression = "first + \" \" + last";
Fields fields = new Fields( "full" );
ExpressionFunction full =
  new ExpressionFunction( fields, expression, String.class );

assembly =
  new Each( assembly, new Fields( "first", "last" ), full, Fields.ALL );

// outgoing -> first, last, age, full

Fields.RESULTS

The cascading.tuple.Fields.RESULTS constant is used to represent the field names of the current operation's return values. This Fields set may only be used as an output selector on a pipe, causing the pipe to output a tuple containing the operation results.

// incoming -> first, last, age

String expression = "first + \" \" + last";
Fields fields = new Fields( "full" );
ExpressionFunction full =
  new ExpressionFunction( fields, expression, String.class );

Fields firstLast = new Fields( "first", "last" );
assembly =
  new Each( assembly, firstLast, full, Fields.RESULTS );

// outgoing -> full

Fields.REPLACE

The cascading.tuple.Fields.REPLACE constant is used as an output selector to inline-replace values in the incoming tuple with the results of an operation. This convenient Fields set allows operations to overwrite the value stored in the specified field. The current operation must either specify the identical argument selector field names used by the pipe, or use the ARGS Fields set.

// incoming -> first, last, age

// coerce to int
Identity function = new Identity( Fields.ARGS, Integer.class );

Fields age = new Fields( "age" );
assembly = new Each( assembly, age, function, Fields.REPLACE );

// outgoing -> first, last, age

Fields.SWAP

The cascading.tuple.Fields.SWAP constant is used as an output selector to swap the operation arguments with its results. The argument and result fields need not match in either names or number. This is useful when the operation arguments are no longer necessary and the result Fields and values should be appended to the remainder of the input field names and Tuple.

// incoming -> first, last, age

String expression = "first + \" \" + last";
Fields fields = new Fields( "full" );
ExpressionFunction full =
  new ExpressionFunction( fields, expression, String.class );

Fields firstLast = new Fields( "first", "last" );
assembly = new Each( assembly, firstLast, full, Fields.SWAP );

// outgoing -> age, full

Fields.ARGS

The cascading.tuple.Fields.ARGS constant is used to let a given operation inherit the field names of its argument Tuple. This Fields set is a convenience and is typically used when the Pipe output selector is RESULTS or REPLACE. It is specifically used by the Identity Function when coercing values from Strings to primitive types.

// incoming -> first, last, age

// coerce to int
Identity function = new Identity( Fields.ARGS, Integer.class );

Fields age = new Fields( "age" );
assembly = new Each( assembly, age, function, Fields.REPLACE );

// outgoing -> first, last, age

Fields.GROUP

The cascading.tuple.Fields.GROUP constant represents all the fields used as the grouping key in the most recent grouping. If no previous grouping exists in the pipe assembly, GROUP represents all the current field names.

// incoming -> first, last, age

assembly = new GroupBy( assembly, new Fields( "first", "last" ) );

FieldJoiner full = new FieldJoiner( new Fields( "full" ), " " );

assembly = new Each( assembly, Fields.GROUP, full, Fields.ALL );

// outgoing -> first, last, age, full

Fields.VALUES

The cascading.tuple.Fields.VALUES constant represents all the fields not used as grouping fields in a previous Group. That is, if you have fields "a", "b", and "c", and group on "a", Fields.VALUES will resolve to "b" and "c".

// incoming -> first, last, age

assembly = new GroupBy( assembly, new Fields( "age" ) );

FieldJoiner full = new FieldJoiner( new Fields( "full" ), " " );

assembly = new Each( assembly, Fields.VALUES, full, Fields.ALL );

// outgoing -> first, last, age, full
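
The way GROUP and VALUES divide the current fields can be sketched in plain Java. This is only a model of the behavior described above, not the Cascading API: the class GroupValuesSketch and its methods are hypothetical names, fields are represented as lists of names, and "no previous grouping" is modeled as a null grouping key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of how GROUP and VALUES partition field names;
// NOT the Cascading API. Class and method names are hypothetical.
class GroupValuesSketch {

  // GROUP: the grouping key fields, or all current fields when
  // no grouping has happened yet (modeled here as a null key).
  static List<String> group(List<String> current, List<String> groupingKey) {
    return new ArrayList<>(groupingKey == null ? current : groupingKey);
  }

  // VALUES: the current fields that are not part of the grouping key.
  static List<String> values(List<String> current, List<String> groupingKey) {
    List<String> remaining = new ArrayList<>(current);
    remaining.removeAll(groupingKey);
    return remaining;
  }

  public static void main(String[] args) {
    List<String> current = Arrays.asList("first", "last", "age");
    List<String> key = Arrays.asList("age");

    System.out.println(group(current, key));   // [age]
    System.out.println(values(current, key));  // [first, last]
    System.out.println(group(current, null));  // [first, last, age]
  }
}
```

With the grouping on "age" used in the VALUES example, the model resolves VALUES to "first" and "last", which are the fields handed to the FieldJoiner.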

Fields.UNKNOWN

The cascading.tuple.Fields.UNKNOWN constant is used when Fields must be declared, but neither the number of fields nor their names are known. This allows for processing tuples of arbitrary length from an input source or some operation. Use this Fields set with caution.

// incoming -> line

RegexSplitter function = new RegexSplitter( Fields.UNKNOWN, "\t" );

Fields fields = new Fields( "line" );
assembly =
  new Each( assembly, fields, function, Fields.RESULTS );

// outgoing -> unknown
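
One plausible reason for the caution is that the width of the result can differ from line to line. The plain-Java sketch below illustrates this with a tab split similar to the example above. It does not use Cascading at all: UnknownWidthSketch and split are hypothetical names, and the sample lines are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java illustration of variable-width split results; NOT the
// Cascading API. The class and method names are hypothetical.
class UnknownWidthSketch {

  private static final Pattern TAB = Pattern.compile("\t");

  // Splits one line on tabs; the number of values depends on the line.
  static List<String> split(String line) {
    return Arrays.asList(TAB.split(line, -1));
  }

  public static void main(String[] args) {
    System.out.println(split("john\tdoe\t42").size()); // 3
    System.out.println(split("jane\troe").size());     // 2
  }
}
```

Because neither the count nor the names are fixed in advance, any later step that expects a particular number of values has to check for it explicitly.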

Fields.NONE

The cascading.tuple.Fields.NONE constant is used to specify no fields. It is typically used as an argument selector for Operations that do not process any Tuples, such as cascading.operation.Insert.

// incoming -> first, last, age

Insert constant = new Insert( new Fields( "zip" ), "77373" );

assembly = new Each( assembly, Fields.NONE, constant, Fields.ALL );

// outgoing -> first, last, age, zip

The chart below shows common ways to merge input and result fields for the desired output fields. A few minutes with this chart may help clarify the discussion of fields, tuples, and pipes. Also see Each and Every Pipes for details on the different columns and their relationships to the Each and Every pipes and Functions, Aggregators, and Buffers.
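
The merging rules for the four output selectors covered in this section (ALL, RESULTS, REPLACE, and SWAP) can also be sketched in plain Java. This is a simplified model and not the Cascading API: OutputSelectorSketch, Selector, and resolve are hypothetical names, only field names are tracked (not values), and name collisions between incoming and result fields are not checked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of output-selector resolution; NOT the Cascading API.
// The class, enum, and method names here are hypothetical.
class OutputSelectorSketch {

  enum Selector { ALL, RESULTS, REPLACE, SWAP }

  // Returns the outgoing field names for one pipe, given the incoming
  // field names, the argument fields, and the operation's result fields.
  static List<String> resolve(List<String> incoming, List<String> arguments,
                              List<String> results, Selector selector) {
    List<String> outgoing = new ArrayList<>();
    switch (selector) {
      case ALL: // incoming fields, then the results appended
        outgoing.addAll(incoming);
        outgoing.addAll(results);
        break;
      case RESULTS: // only the operation results
        outgoing.addAll(results);
        break;
      case REPLACE: // same names and positions; only the values change
        if (!results.equals(arguments))
          throw new IllegalArgumentException(
              "REPLACE needs result fields identical to the argument fields");
        outgoing.addAll(incoming);
        break;
      case SWAP: // drop the arguments, then append the results
        outgoing.addAll(incoming);
        outgoing.removeAll(arguments);
        outgoing.addAll(results);
        break;
    }
    return outgoing;
  }

  public static void main(String[] args) {
    List<String> incoming = Arrays.asList("first", "last", "age");
    List<String> firstLast = Arrays.asList("first", "last");
    List<String> full = Arrays.asList("full");
    List<String> age = Arrays.asList("age");

    // prints [first, last, age, full]
    System.out.println(resolve(incoming, firstLast, full, Selector.ALL));
    // prints [full]
    System.out.println(resolve(incoming, firstLast, full, Selector.RESULTS));
    // prints [first, last, age]
    System.out.println(resolve(incoming, age, age, Selector.REPLACE));
    // prints [age, full]
    System.out.println(resolve(incoming, firstLast, full, Selector.SWAP));
  }
}
```

The four printed lists match the "outgoing" comments in the ALL, RESULTS, REPLACE, and SWAP examples earlier in this section. In the REPLACE case the result fields are passed as the argument fields, which mirrors an operation declared with the ARGS Fields set.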

Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.