The Tuple class is a generic container for all java.lang.Object instances (1.0 required that all objects be of type java.lang.Comparable). As a result, any primitive value or custom class can be stored in a Tuple instance, that is, returned by a Function, Aggregator, or Buffer as a result value.
For this to work, any class that is not a primitive value or a Hadoop Writable type needs a corresponding Hadoop 'serialization' class registered in the Hadoop configuration files for your cluster. Hadoop Writable types work because a generic serialization implementation is already built into Hadoop. See the Hadoop documentation for how to register a new serialization helper or create Writable types. Cascading automatically inherits any registered serialization implementations.
During serialization and deserialization of Tuple instances that contain custom types, the Cascading Tuple serialization framework must store the class name (as a String) before serializing the custom object. This can be very space inefficient. To overcome this, add the SerializationToken Java annotation to the custom type's class. The SerializationToken annotation expects two arrays of the same size: one of integers, named tokens, and one of class name strings. No token can be less than 128 (the first 128 values are reserved for internal use). During serialization and deserialization, the token values are used in place of the String class names, which reduces the amount of storage used.
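The annotation rules above can be sketched in a few lines. This is a minimal illustration, not Cascading's own code: it declares a local stand-in annotation so the example compiles without Cascading on the classpath, and the element name classNames, the Person type, and the TokenCheck helper are assumptions made for the example. In a real project you would import Cascading's SerializationToken annotation and check its javadoc for the exact element names.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Local stand-in for Cascading's annotation (assumption: the second
// element is called classNames; only "tokens" is named in the text).
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface SerializationToken {
    int[] tokens();
    String[] classNames();
}

// Hypothetical custom type stored in a Tuple. It lives in the default
// package here, so its fully qualified class name is simply "Person".
@SerializationToken(tokens = {128}, classNames = {"Person"})
class Person {
    final String name;

    Person(String name) {
        this.name = name;
    }
}

// Hypothetical helper that checks the two rules stated in the text:
// both arrays have the same size, and no token is less than 128.
class TokenCheck {
    static boolean isValid(Class<?> type) {
        SerializationToken annotation = type.getAnnotation(SerializationToken.class);
        if (annotation == null) {
            return false; // type carries no token annotation
        }
        if (annotation.tokens().length != annotation.classNames().length) {
            return false; // arrays must be the same size
        }
        for (int token : annotation.tokens()) {
            if (token < 128) {
                return false; // values below 128 are for internal use
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid(Person.class));
    }
}
```

With a token in place, the serialized form can carry the small integer 128 instead of the class name string for every Person stored in a Tuple.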
Serialization tokens may also be stored in the Hadoop config files or set as a property passed to the FlowConnector, using the property name cascading.serialization.tokens. The value of this property is a comma-separated list of token=classname values.
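As a rough illustration of the property format, the sketch below builds a cascading.serialization.tokens value in a java.util.Properties object and parses it back into a map. The parser is hypothetical and written only to make the token=classname format concrete; it is not how Cascading reads the property. The class names com.example.Person and com.example.Address are placeholders, and applying the "no token below 128" rule to the property is an assumption carried over from the annotation rules.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

class TokenProperty {
    static final String PROPERTY_NAME = "cascading.serialization.tokens";

    // Parses "token=classname,token=classname" into a token -> class name map.
    // Hypothetical helper for illustration only.
    static Map<Integer, String> parse(String value) {
        Map<Integer, String> tokens = new LinkedHashMap<>();
        if (value == null || value.trim().isEmpty()) {
            return tokens;
        }
        for (String pair : value.split(",")) {
            String[] parts = pair.trim().split("=", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("expected token=classname, got: " + pair);
            }
            int token = Integer.parseInt(parts[0].trim());
            if (token < 128) {
                // Assumption: the reserved range also applies to this property.
                throw new IllegalArgumentException("token must be 128 or greater: " + token);
            }
            if (tokens.put(token, parts[1].trim()) != null) {
                throw new IllegalArgumentException("duplicate token: " + token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // These properties would be the ones handed to the FlowConnector.
        Properties properties = new Properties();
        properties.setProperty(PROPERTY_NAME,
            "128=com.example.Person,129=com.example.Address");

        Map<Integer, String> tokens = parse(properties.getProperty(PROPERTY_NAME));
        for (Map.Entry<Integer, String> entry : tokens.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

Setting the tokens through a property keeps the mapping outside the custom classes, which can be convenient when the types come from a library you cannot annotate.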
Note: Cascading natively serializes and deserializes all primitives and byte arrays (byte[]). It also uses the token 127 for the Hadoop BytesWritable class.
Along with custom serialization, Cascading supports lazy deserialization during Tuple comparison when Hadoop sorts keys during the "shuffle" phase. This is accomplished by implementing the StreamComparator interface. See the javadoc for detailed implementation instructions and the unit tests for examples. By default, Cascading lazily deserializes each element in the Tuple during sorting for comparison. The StreamComparator additionally lets complex or custom Java types lazily deserialize their own fields during comparison.
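The idea of comparing serialized objects without fully deserializing them can be sketched as follows. This is only an illustration under stated assumptions: the StreamComparator interface declared here is a local stand-in whose single compare method over two input streams is assumed, and the Event type and its byte layout are hypothetical. Consult the Cascading javadoc for the real interface and for any requirements on how much of each stream must be consumed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Local stand-in so the sketch compiles without Cascading (assumed shape).
interface StreamComparator<T extends InputStream> {
    int compare(T lhsStream, T rhsStream);
}

// Hypothetical custom type, serialized as: long id, then a UTF string payload.
class Event {
    final long id;
    final String payload;

    Event(long id, String payload) {
        this.id = id;
        this.payload = payload;
    }

    byte[] toBytes() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeLong(id);      // sort key is written first
            out.writeUTF(payload);  // remainder is not needed for ordering
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static DataInputStream open(byte[] serialized) {
        return new DataInputStream(new ByteArrayInputStream(serialized));
    }
}

// Orders serialized Event objects by id, reading only the leading long
// from each stream and never materializing the payload string.
class EventIdComparator implements StreamComparator<DataInputStream> {
    @Override
    public int compare(DataInputStream lhsStream, DataInputStream rhsStream) {
        try {
            return Long.compare(lhsStream.readLong(), rhsStream.readLong());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class StreamComparatorDemo {
    public static void main(String[] args) {
        byte[] first = new Event(1L, "login").toBytes();
        byte[] second = new Event(2L, "logout").toBytes();
        int result = new EventIdComparator().compare(Event.open(first), Event.open(second));
        System.out.println(result < 0 ? "first sorts before second" : "first does not sort before second");
    }
}
```

Writing the sort key at the front of the serialized form is the design choice that makes this cheap: the comparator can decide the order after reading a few bytes.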
Copyright © 2007-2008 Concurrent, Inc. All Rights Reserved.