12.2 Custom Types and Serialization

The Tuple class is a generic container for all java.lang.Object instances. Any primitive value or custom class can therefore be stored in a Tuple instance, that is, returned by a Function, Aggregator, or Buffer as a result value.

For this to work in Cascading's Hadoop mode, however, any class that is neither a primitive type nor a Hadoop Writable type requires a corresponding Hadoop serialization class registered in the Hadoop configuration files for your cluster. Hadoop Writable types work because a generic serialization implementation for them is already built into Hadoop. See the Hadoop documentation for information on registering a new serialization helper or creating Writable types. Registered serialization implementations are automatically inherited by Cascading.

During serialization and deserialization of Tuple instances that contain custom types, the Cascading Tuple serialization framework must store the class name (as a String) before serializing the custom object. This can be very space-inefficient. To overcome this, the custom type class can be annotated with the SerializationToken Java annotation. The SerializationToken annotation expects two arrays: one of integers that are used as tokens, and one of class name strings. Both arrays must be the same size. The integer tokens must all have values of 128 or greater, since the first 128 values (0 through 127) are reserved for internal use.

During serialization and deserialization, the token values are used instead of the String Class names, in order to reduce the amount of storage used.
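The saving can be illustrated with a small sketch that uses only the standard java.io classes. It does not reproduce Cascading's actual wire format; it only compares the cost of writing a class name as a String with the cost of writing an integer in its place. The class name com.example.types.CustomerRecord is hypothetical, and the use of writeUTF and writeInt is an assumption made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class TokenSizeSketch {
    // Bytes needed when the type is recorded as a class-name String.
    static int sizeWithClassName(String className) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(className); // 2-byte length prefix plus the encoded characters
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Bytes needed when the type is recorded as an integer token.
    // Assumption: a fixed 4-byte int; a real framework may encode it more compactly.
    static int sizeWithToken(int token) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(token);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        String className = "com.example.types.CustomerRecord"; // hypothetical class
        System.out.println("class name: " + sizeWithClassName(className) + " bytes");
        System.out.println("token:      " + sizeWithToken(128) + " bytes");
    }
}
```

Because this overhead is paid for every custom object written, the difference adds up quickly on large data sets.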

Serialization tokens may also be stored in the Hadoop config files or set as a property passed to the FlowConnector, with the property name cascading.serialization.tokens. The value of this property is a comma-separated list of token=classname pairs.
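As a minimal sketch of the property format, the following plain Java helper builds the comma-separated value and rejects tokens in the reserved range. The helper class, its method, and the two class names are hypothetical and are not part of Cascading. Only the property name and the token=classname format come from the text above. Passing the resulting properties to a FlowConnector is not shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import java.util.StringJoiner;

class SerializationTokenProperty {
    // Property name as given in the text above.
    static final String TOKENS_PROPERTY = "cascading.serialization.tokens";

    // Builds a comma-separated list of token=classname pairs,
    // rejecting tokens below 128 because those values are reserved.
    static String buildTokensValue(Map<Integer, String> tokenToClassName) {
        StringJoiner joined = new StringJoiner(",");
        for (Map.Entry<Integer, String> entry : tokenToClassName.entrySet()) {
            int token = entry.getKey();
            if (token < 128) {
                throw new IllegalArgumentException(
                    "token " + token + " is reserved; use 128 or greater");
            }
            joined.add(token + "=" + entry.getValue());
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the pairs in insertion order.
        Map<Integer, String> tokens = new LinkedHashMap<>();
        tokens.put(128, "com.example.types.CustomerRecord"); // hypothetical class
        tokens.put(129, "com.example.types.OrderLine");      // hypothetical class

        Properties properties = new Properties();
        properties.setProperty(TOKENS_PROPERTY, buildTokensValue(tokens));
        System.out.println(properties.getProperty(TOKENS_PROPERTY));
    }
}
```

Validating the range up front surfaces a reserved token as a clear error when the property is built, rather than as a problem at serialization time.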

Note that Cascading natively serializes and deserializes all primitives. It also handles byte arrays (byte[]) if the developer registers the BytesSerialization class by calling TupleSerializationProps.addSerialization(properties, BytesSerialization.class.getName()). The token 127 is used for the Hadoop BytesWritable class.

By default, Cascading uses lazy deserialization on Tuple elements during comparisons when Hadoop sorts keys during the "shuffle" phase.

Cascading supports custom serialization for custom types, as well as lazy deserialization of custom types during comparisons. This is accomplished by implementing the StreamComparator interface. See the Javadoc for detailed instructions on implementation, and the unit tests for examples.
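The idea behind comparing without full deserialization can be sketched with standard java.io streams. The interface below is a local stand-in, declared here only so the example compiles without Cascading on the classpath. Its single two-stream method is an assumption about the general shape of such a comparator, so consult the Javadoc for the real StreamComparator contract, including any rules about how much of each stream must be consumed. The wire format (two ints followed by a label) and the class names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Local stand-in (not Cascading's interface): compares two serialized values as streams.
interface StreamComparatorSketch<T extends InputStream> {
    int compare(T lhsStream, T rhsStream);
}

class VersionStreamComparator implements StreamComparatorSketch<DataInputStream> {
    // Hypothetical wire format: int major, int minor, UTF label.
    static byte[] serialize(int major, int minor, String label) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(major);
            out.writeInt(minor);
            out.writeUTF(label);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Orders by major, then minor, reading only the leading ints.
    // The label is never decoded and no object is constructed.
    @Override
    public int compare(DataInputStream lhsStream, DataInputStream rhsStream) {
        try {
            int byMajor = Integer.compare(lhsStream.readInt(), rhsStream.readInt());
            if (byMajor != 0) {
                return byMajor; // decided after reading four bytes per side
            }
            return Integer.compare(lhsStream.readInt(), rhsStream.readInt());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper for comparing two serialized byte arrays.
    static int compareBytes(byte[] lhs, byte[] rhs) {
        return new VersionStreamComparator().compare(
            new DataInputStream(new ByteArrayInputStream(lhs)),
            new DataInputStream(new ByteArrayInputStream(rhs)));
    }

    public static void main(String[] args) {
        byte[] older = serialize(1, 2, "stable");
        byte[] newer = serialize(1, 3, "beta");
        System.out.println(compareBytes(older, newer) < 0); // older sorts first
    }
}
```

The benefit comes from skipping work during sorting: when the leading fields decide the order, the remaining bytes of each value are left untouched.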