Cascading 3.0 User Guide - Built-In SubAssemblies

Built-in SubAssemblies

There are a number of helper SubAssemblies provided by the core Cascading library.

Many of the assemblies that are described below can be coded to ignore null values. This allows for an optional but closer resemblance to how similar functions in SQL perform.

Optimized Aggregations

The following SubAssemblies are implementations or optimizations of more discrete aggregate functions. Many of these SubAssemblies rely on the AggregateBy base class, which can be subclassed by developers to create custom aggregations leveraging the internal partial aggregation implementation.

Unique

The cascading.pipe.assembly.Unique SubAssembly is used to remove duplicate values in a Tuple stream. Uniqueness is determined by the values of all fields listed in uniqueFields. Use Fields.ALL as the uniqueFields argument to find all distinct Tuples in a stream.

Example 1. Using the Unique SubAssembly
// incoming -> first, last

assembly = new Unique( assembly, new Fields( "first", "last" ) );

// outgoing -> first, last

Unique uses the FirstNBuffer to more efficiently determine unique values.

AggregateBy

The cascading.pipe.assembly.AggregateBy SubAssembly is an implementation of the partial aggregation pattern, and is the base class for built-in and custom partial aggregation implementations like AverageBy or CountBy.

Generally the AggregateBy class is used to combine multiple AggregateBy subclasses into a single Pipe.

Example 2. Composing partial aggregations with AggregateBy
Pipe assembly = new Pipe( "assembly" );

// ...
Fields groupingFields = new Fields( "date" );

// note we do not pass the parent assembly Pipe in
Fields valueField = new Fields( "size" );
Fields sumField = new Fields( "total-size", long.class );
SumBy sumBy = new SumBy( valueField, sumField );

Fields countField = new Fields( "num-events" );
CountBy countBy = new CountBy( countField );

assembly = new AggregateBy( assembly, groupingFields, sumBy, countBy );

To create a custom partial aggregation, subclass the AggregateBy class and implement the appropriate internal interfaces. See the Javadoc for details.

AverageBy

The cascading.pipe.assembly.AverageBy SubAssembly calculates an average of the given valueFields and returns the result in the averageField field. AverageBy may be combined with other AggregateBy subclasses so they may be executed simultaneously over the same grouping.

Example 3. Using AverageBy
Pipe assembly = new Pipe( "assembly" );

// ...
Fields groupingFields = new Fields( "date" );
Fields valueField = new Fields( "size" );
Fields avgField = new Fields( "avg-size" );
assembly = new AverageBy( assembly, groupingFields, valueField, avgField );

CountBy

The cascading.pipe.assembly.CountBy SubAssembly performs a count over the given groupingFields and returns the result in the countField field. CountBy may be combined with other AggregateBy subclasses so they may be executed simultaneously over the same grouping.

Example 4. Using CountBy
Pipe assembly = new Pipe( "assembly" );

// ...
Fields groupingFields = new Fields( "date" );
Fields countField = new Fields( "count" );
assembly = new CountBy( assembly, groupingFields, countField );

SumBy

The cascading.pipe.assembly.SumBy SubAssembly performs a sum over the given valueFields and returns the result in the sumField field. SumBy may be combined with other AggregateBy subclasses so that they may be executed simultaneously over the same grouping.

Example 5. Using SumBy
Pipe assembly = new Pipe( "assembly" );

// ...
Fields groupingFields = new Fields( "date" );
Fields valueField = new Fields( "size" );
Fields sumField = new Fields( "total-size" );
assembly =
  new SumBy( assembly, groupingFields, valueField, sumField, long.class );

FirstBy

The cascading.pipe.assembly.FirstBy SubAssembly is used to return the first encountered value in the given valueFields. FirstBy may be combined with other AggregateBy subclasses so they may be executed simultaneously over the same grouping.

Example 6. Using FirstBy
Pipe assembly = new Pipe( "assembly" );

// ...
Fields groupingFields = new Fields( "date" );
Fields valueField = new Fields( "size" );

// we want the largest size in this grouping
valueField.setComparator( "size", new LongComparator() );

assembly =
  new FirstBy( assembly, groupingFields, valueField );

Note if the valueFields Fields instance has field comparators, they will be used to sort the order of argument values. Otherwise, the fields will not be sorted in any deterministic order.

Stream Shaping

Coerce

The cascading.pipe.assembly.Coerce SubAssembly is used to coerce a set of values from one type to another type — for example, to convert the age field from a String to an Integer.

Example 7. Using Coerce
// incoming -> first, last, age

assembly =
  new Coerce( assembly, new Fields( "age" ), Integer.class );

// outgoing -> first, last, age

Discard

The cascading.pipe.assembly.Discard SubAssembly is used to shape the Tuple stream by discarding all fields given on the constructor. All unlisted fields are retained.

Example 8. Using Discard
// incoming -> first, last, age

assembly = new Discard( assembly, new Fields( "age" ) );

// outgoing -> first, last

Rename

The cascading.pipe.assembly.Rename SubAssembly is used to rename a field.

Example 9. Using Rename
// incoming -> first, last, age

assembly =
  new Rename( assembly, new Fields( "age" ), new Fields( "years" ) );

// outgoing -> first, last, years

Retain

The cascading.pipe.assembly.Retain SubAssembly is used to shape the Tuple stream by retaining all fields given on the constructor. All unlisted fields are discarded.

Example 10. Using Retain
// incoming -> first, last, age

assembly = new Retain( assembly, new Fields( "first", "last" ) );

// outgoing -> first, last