10. Built-in Assemblies

There are a number of helper SubAssemblies provided by the core cascading library.

As of Cascading 2.2, many of the below assemblies can optionally ignore null values. This allows for an optional but closer resemblance to how similar functions in SQL perform.

10.1 AggregateBy

The cascading.pipe.assembly.AggregateBy SubAssembly is an implementation of the Partial Aggregation pattern, and is the base class for built-in and custom partial aggregation implementations like AverageBy or CountBy.

Generally the AggregateBy class is used to combine multiple AggregateBy subclasses into a single Pipe.

Example 10.1. Composing partials with AggregateBy

Pipe assembly = new Pipe( "assembly" );

// ...
Fields groupingFields = new Fields( "date" );

// note we do not pass the parent assembly Pipe in
Fields valueField = new Fields( "size" );
Fields sumField = new Fields( "total-size", long.class );
SumBy sumBy = new SumBy( valueField, sumField );

Fields countField = new Fields( "num-events" );
CountBy countBy = new CountBy( countField );

assembly = new AggregateBy( assembly, groupingFields, sumBy, countBy );

To create a custom partial aggregation, subclass the AggregateBy class and implement the appropriate internal interfaces. See the Javadoc for details.

AverageBy

The cascading.pipe.assembly.AverageBy SubAssembly performs an average over the given valueFields and returns the result in the averageField field. AverageBy may be combined with other AggregateBy subclasses so they may be executed simultaneously over the same grouping.

Example 10.2. Using AverageBy

Pipe assembly = new Pipe( "assembly" );

// ...
Fields groupingFields = new Fields( "date" );
Fields valueField = new Fields( "size" );
Fields avgField = new Fields( "avg-size" );
assembly = new AverageBy( assembly, groupingFields, valueField, avgField );

CountBy

The cascading.pipe.assembly.CountBy SubAssembly performs a count over the given groupingFields and returns the result in the countField field. CountBy may be combined with other AggregateBy subclasses so they may be executed simultaneously over the same grouping.

Example 10.3. Using CountBy

Pipe assembly = new Pipe( "assembly" );

// ...
Fields groupingFields = new Fields( "date" );
Fields countField = new Fields( "count" );
assembly = new CountBy( assembly, groupingFields, countField );

SumBy

The cascading.pipe.assembly.SumBy SubAssembly performs a sum over the given valueFields and returns the result in the sumField field. SumBy may be combined with other AggregateBy subclasses so they may be executed simultaneously over the same grouping.

Example 10.4. Using SumBy

Pipe assembly = new Pipe( "assembly" );

// ...
Fields groupingFields = new Fields( "date" );
Fields valueField = new Fields( "size" );
Fields sumField = new Fields( "total-size" );
assembly =
  new SumBy( assembly, groupingFields, valueField, sumField, long.class );

FirstBy

The cascading.pipe.assembly.FirstBy SubAssembly is used to return the first encountered value in the given valueFields. FirstBy may be combined with other AggregateBy subclasses so they may be executed simultaneously over the same grouping.

Example 10.5. Using FirstBy

Pipe assembly = new Pipe( "assembly" );

// ...
Fields groupingFields = new Fields( "date" );
Fields valueField = new Fields( "size" );

// we want the largest size in this grouping
valueField.setComparator( "size", new LongComparator() );

assembly =
  new FirstBy( assembly, groupingFields, valueField );

Note if the valueFields Fields instance has field comparators, they will be used to sort the argument values to influence what values are seen first. Otherwise the fields will not be sorted in any deterministic order.

Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.