6.8 Custom Types and Serialization

The Tuple class is a generic container for all java.lang.Object instances (1.0 required all objects be of typejava.lang.Comparable). Subsequently any primitive value or custom Class can be stored in a Tuple instance, that is, returned by a Function, Aggregator, or Buffer as a result value.

But for this to work any Class that isn't a primitive value or a Hadoop Writable type will need to have a corresponding Hadoop 'serialization' class registered in the Hadoop configuration files for your cluster. Hadoop Writable types work because there is already a generic serialization implementation built into Hadoop. See the Hadoop documentation for registering a new serialization helper or to create Writable types. Cascading will automatically inherit any registered serialization implementations.

During serialization and deserialization of Tuple instances that contain custom types, the Cascading Tuple serialization framework will need to store the class name (as a String) before serializing the custom object. This can be very space inneficient. To overcome this, custom types can add the SerializationToken Java annotation to the custom type class. The SerializationToken annotation expects two arrays, one of integers named tokens, and one of Class name strings. Both arrays must be the same size, and no token can be less than 128 (the first 128 values are for internal use).

During serialization and deserialization, the token values are used instead of the String Class names to reduce the amount of storage used.

Serialization tokens may also be stored in the Hadoop config files or set as a property passed to the FlowConnector, with the property name cascading.serialization.tokens. The value of this property is a comma separated list of token=classname values.

Note Cascading will natively serialize/deserialize all primitives and byte arrays (byte[]). It also uses the token 127 for the Hadoop BytesWritable class.

Along with custom serialization, Cascading supports lazy deserialization during Tuple comparison when Hadoop sorts keys during the "shuffle" phase. This is accomplished by implementing the StreamComparator interface. See the javadoc for detailed instructions on implementing and the unit tests for examples.

By default Cascading will lazily deserialize each element in the Tuple during sorting for comparison. But the StreamComparator allows for complex/custom Java types to also lazily deserialize fields in the object during comparison.

Copyright © 2007-2008 Concurrent, Inc. All Rights Reserved.