Cascading 3.0 User Guide - Built-In SubAssemblies
Built-in SubAssemblies
There are a number of helper SubAssemblies provided by the core Cascading library.
Many of the assemblies that are described below can be coded to ignore null values. This allows for an optional but closer resemblance to how similar functions in SQL perform. |
Optimized Aggregations
The following SubAssemblies are implementations or optimizations of more discrete aggregate functions. Many of these SubAssemblies rely on the AggregateBy base class, which can be subclassed by developers to create custom aggregations leveraging the internal partial aggregation implementation.
Unique
The cascading.pipe.assembly.Unique SubAssembly is used to remove duplicate values in a Tuple stream. Uniqueness is determined by the values of all fields listed in uniqueFields. Use Fields.ALL as the uniqueFields argument to find all distinct Tuples in a stream.
// incoming -> first, last
assembly = new Unique( assembly, new Fields( "first", "last" ) );
// outgoing -> first, last
Unique uses the FirstNBuffer to more efficiently determine unique values.
AggregateBy
The cascading.pipe.assembly.AggregateBy SubAssembly is an implementation of the partial aggregation pattern, and is the base class for built-in and custom partial aggregation implementations like AverageBy or CountBy.
Generally the AggregateBy class is used to combine multiple AggregateBy subclasses into a single Pipe.
Pipe assembly = new Pipe( "assembly" );
// ...
Fields groupingFields = new Fields( "date" );
// note we do not pass the parent assembly Pipe in
Fields valueField = new Fields( "size" );
Fields sumField = new Fields( "total-size", long.class );
SumBy sumBy = new SumBy( valueField, sumField );
Fields countField = new Fields( "num-events" );
CountBy countBy = new CountBy( countField );
assembly = new AggregateBy( assembly, groupingFields, sumBy, countBy );
To create a custom partial aggregation, subclass the AggregateBy class and implement the appropriate internal interfaces. See the Javadoc for details.
AverageBy
The cascading.pipe.assembly.AverageBy SubAssembly calculates an average of the given valueFields and returns the result in the averageField field. AverageBy may be combined with other AggregateBy subclasses so they may be executed simultaneously over the same grouping.
Pipe assembly = new Pipe( "assembly" );
// ...
Fields groupingFields = new Fields( "date" );
Fields valueField = new Fields( "size" );
Fields avgField = new Fields( "avg-size" );
assembly = new AverageBy( assembly, groupingFields, valueField, avgField );
CountBy
The cascading.pipe.assembly.CountBy SubAssembly performs a count over the given groupingFields and returns the result in the countField field. CountBy may be combined with other AggregateBy subclasses so they may be executed simultaneously over the same grouping.
Pipe assembly = new Pipe( "assembly" );
// ...
Fields groupingFields = new Fields( "date" );
Fields countField = new Fields( "count" );
assembly = new CountBy( assembly, groupingFields, countField );
SumBy
The cascading.pipe.assembly.SumBy SubAssembly performs a sum over the given valueFields and returns the result in the sumField field. SumBy may be combined with other AggregateBy subclasses so that they may be executed simultaneously over the same grouping.
Pipe assembly = new Pipe( "assembly" );
// ...
Fields groupingFields = new Fields( "date" );
Fields valueField = new Fields( "size" );
Fields sumField = new Fields( "total-size" );
assembly =
new SumBy( assembly, groupingFields, valueField, sumField, long.class );
FirstBy
The cascading.pipe.assembly.FirstBy SubAssembly is used to return the first encountered value in the given valueFields. FirstBy may be combined with other AggregateBy subclasses so they may be executed simultaneously over the same grouping.
Pipe assembly = new Pipe( "assembly" );
// ...
Fields groupingFields = new Fields( "date" );
Fields valueField = new Fields( "size" );
// we want the largest size in this grouping
valueField.setComparator( "size", new LongComparator() );
assembly =
new FirstBy( assembly, groupingFields, valueField );
Note if the valueFields Fields instance has field comparators, they will be used to sort the order of argument values. Otherwise, the fields will not be sorted in any deterministic order.
Stream Shaping
Coerce
The cascading.pipe.assembly.Coerce SubAssembly is used to coerce a set of values from one type to another type — for example, to convert the age field from a String to an Integer.
// incoming -> first, last, age
assembly =
new Coerce( assembly, new Fields( "age" ), Integer.class );
// outgoing -> first, last, age
Discard
The cascading.pipe.assembly.Discard SubAssembly is used to shape the Tuple stream by discarding all fields given on the constructor. All unlisted fields are retained.
// incoming -> first, last, age
assembly = new Discard( assembly, new Fields( "age" ) );
// outgoing -> first, last
Rename
The cascading.pipe.assembly.Rename SubAssembly is used to rename a field.
// incoming -> first, last, age
assembly =
new Rename( assembly, new Fields( "age" ), new Fields( "years" ) );
// outgoing -> first, last, years
Retain
The cascading.pipe.assembly.Retain SubAssembly is used to shape the Tuple stream by retaining all fields given on the constructor. All unlisted fields are discarded.
// incoming -> first, last, age
assembly = new Retain( assembly, new Fields( "first", "last" ) );
// outgoing -> first, last