7. Field Typing and Type Coercion

7.1 Field Typing

As of Cascading 2.2, the Fields class can hold type information for each field, and the Cascading planner can propagate that information from source Tap instances, through downstream Operations, to sink Tap instances.

This allows Taps to read and store type information for external systems and applications, enables error detection during joins (for example, detecting non-comparable types), enforces canonical representations within the Tuple (preventing a field from switching arbitrarily between String and Integer types), and allows pluggable coercion from one type to another, even when one or both are not Java primitives.
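
To make the join error detection mentioned above concrete, here is a minimal plain-Java sketch. It does not use Cascading and is not the planner's actual logic: the class JoinTypeCheckSketch and its mismatches method are hypothetical names, and the rule that two key types must be exactly equal is a deliberately strict assumption made for illustration. Fields with no declared type (null) are treated as unknown and skipped.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of Cascading. */
final class JoinTypeCheckSketch {

    /**
     * Compares the declared types of the join keys on each side and
     * returns one message per key position whose types differ.
     * A null type means "no type declared" and is not checked.
     */
    static List<String> mismatches(String[] names, Class<?>[] lhs, Class<?>[] rhs) {
        if (names.length != lhs.length || lhs.length != rhs.length)
            throw new IllegalArgumentException("key arrays must have the same length");

        List<String> problems = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            if (lhs[i] == null || rhs[i] == null)
                continue; // untyped field: nothing to compare
            if (!lhs[i].equals(rhs[i])) // assumption: only identical types are comparable
                problems.add(names[i] + ": " + lhs[i].getSimpleName()
                    + " vs " + rhs[i].getSimpleName());
        }
        return problems;
    }

    public static void main(String[] args) {
        // "id" is a String on one side and an Integer on the other
        System.out.println(mismatches(
            new String[]{"id"},
            new Class<?>[]{String.class},
            new Class<?>[]{Integer.class}));
    }
}
```

Without declared types, a mismatch like this would only surface when the keys are compared at runtime; with them, it can be reported before any data is read.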

To declare types, simply pass type information to the Fields instance either through the constructor or via a fluent API.

Example 7.1. Constructor

Fields resultFields = new Fields( "count", Long.class ); // null is ok

Example 7.2. Fluent

Fields resultFields = new Fields( "count" ).applyTypes( long.class ); // null becomes 0

Note that the first example uses Long.class and the second uses long.class. Because Long is an object type, declaring it tells Cascading that the field may hold null. If the field is declared as long (a primitive), a null value becomes zero.
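
The null rule above can be sketched in plain Java. This is an illustration of the described behavior, not Cascading's implementation: the class NullCoercionSketch and the method coerceNull are hypothetical names, and extending the rule from long to int and double is an assumption, since the text only states it for long.

```java
/** Hypothetical illustration only; not part of Cascading. */
final class NullCoercionSketch {

    /**
     * Returns the value a null would take for the declared type.
     * Object types can hold null, so null is kept.
     * Primitive types cannot, so null falls back to zero.
     */
    static Object coerceNull(Class<?> declaredType) {
        if (!declaredType.isPrimitive())
            return null;                 // e.g. Long.class: null is ok
        if (declaredType == long.class)
            return 0L;                   // the case described in the text
        if (declaredType == int.class)
            return 0;                    // assumption: same rule as long
        if (declaredType == double.class)
            return 0.0d;                 // assumption: same rule as long
        throw new IllegalArgumentException("unhandled primitive: " + declaredType);
    }

    public static void main(String[] args) {
        System.out.println(coerceNull(Long.class)); // object type keeps null
        System.out.println(coerceNull(long.class)); // primitive becomes zero
    }
}
```

The practical consequence is that a field declared with a primitive type cannot distinguish a missing value from a real zero, so an object type is the safer declaration when that difference matters downstream.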

In practice, typed fields can only be used when they declare the results of an operation, for example:

Example 7.3. Declaring Typed Results

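import cascading.pipe.Pipe;
import cascading.pipe.assembly.AggregateBy;
import cascading.pipe.assembly.CountBy;
import cascading.pipe.assembly.SumBy;
import cascading.tuple.Fields;

Pipe assembly = new Pipe( "assembly" );

// ...
Fields groupingFields = new Fields( "date" );

// note we do not pass the parent assembly Pipe in to SumBy or CountBy

// sum the "size" values into "total-size", declared as a primitive long
Fields valueField = new Fields( "size" );
Fields sumField = new Fields( "total-size", long.class );
SumBy sumBy = new SumBy( valueField, sumField );

// count the tuples in each group into "num-events"
Fields countField = new Fields( "num-events" );
CountBy countBy = new CountBy( countField );

// apply both aggregations, grouped by "date"
assembly = new AggregateBy( assembly, groupingFields, sumBy, countBy );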

Here the type information serves two roles. First, it tells a downstream consumer of the field value which type is maintained in the tuple. Second, it gives the SumBy sub-assembly a simpler API: SumBy can read the type information it needs to perform the aggregation directly from the Fields instance.

Note that TextDelimited and other Scheme classes should be given any known type information so that the Cascading planner can maintain it. Custom Scheme types can also read type information from the fields or data sources they represent and hand it to the planner at runtime.
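
As a rough sketch of how a custom Scheme might recover type information from its data source, the plain-Java example below parses a header line that carries a type next to each field name. Everything here is an assumption made for illustration: the "name:type" header format, the class TypedHeaderSketch, and the small table of recognized type names are all hypothetical, and no Cascading classes are used. A column with no type is reported as null, meaning no type is declared.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not part of Cascading. */
final class TypedHeaderSketch {

    // Assumed mapping from type names in the header to Java types.
    private static final Map<String, Class<?>> KNOWN = new HashMap<>();
    static {
        KNOWN.put("String", String.class);
        KNOWN.put("long", long.class);   // primitive: null would become zero
        KNOWN.put("Long", Long.class);   // object: null is preserved
        KNOWN.put("double", double.class);
    }

    /** Extracts the field names from a header such as "date:String,size:long". */
    static String[] names(String header) {
        String[] columns = header.split(",");
        String[] result = new String[columns.length];
        for (int i = 0; i < columns.length; i++)
            result[i] = columns[i].split(":", 2)[0].trim();
        return result;
    }

    /** Extracts the declared type of each column, or null when none is given. */
    static Class<?>[] types(String header) {
        String[] columns = header.split(",");
        Class<?>[] result = new Class<?>[columns.length];
        for (int i = 0; i < columns.length; i++) {
            String[] parts = columns[i].split(":", 2);
            result[i] = parts.length == 2 ? lookup(parts[1].trim()) : null;
        }
        return result;
    }

    private static Class<?> lookup(String name) {
        Class<?> type = KNOWN.get(name);
        if (type == null)
            throw new IllegalArgumentException("unknown type: " + name);
        return type;
    }
}
```

The two parallel arrays returned by names and types are the kind of information a Scheme could use to build a typed Fields instance for the planner.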

Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.