The Tuple class is a generic container for all java.lang.Object
instances. Thus any primitive value or custom Class can be stored in a
Tuple instance - that is, returned by a Function, Aggregator, or Buffer
as a result value.
But for this to work when using the Cascading Hadoop mode, any
Class that isn't a primitive type or a Hadoop Writable type requires a
corresponding Hadoop serialization class registered in the Hadoop
configuration files for your cluster. Hadoop Writable types work because
there is already a generic serialization implementation built into
Hadoop. See the Hadoop documentation for information on registering a
new serialization helper or creating Writable types. Registered
serialization implementations are automatically inherited by Cascading.
During serialization and deserialization of Tuple instances that
contain custom types, the Cascading Tuple serialization framework must
store the class name (as a String) before serializing the custom object.
This can be very space-inefficient. To overcome this, a custom type can
declare the SerializationToken Java annotation on its class. The
SerializationToken annotation expects two arrays - one of integers that
are used as tokens, and one of Class name strings. Both arrays must be
the same size. The integer tokens must all have values of 128 or
greater, since the first 128 values are reserved for internal use.
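The two rules above (arrays of equal size, tokens of 128 or greater) can be sketched as follows. This is only an illustration: the annotation declared here is a local stand-in so the sketch compiles without Cascading on the classpath, its element names tokens and classNames are assumed to mirror the real annotation (check the Javadoc), and Person and SerializationTokenCheck are hypothetical names.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Local stand-in for Cascading's SerializationToken annotation (assumption:
// the real annotation exposes the same two array elements).
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface SerializationToken {
    int[] tokens();
    String[] classNames();
}

// Hypothetical custom type: token 128 stands in for the class name "Person".
@SerializationToken(tokens = {128}, classNames = {"Person"})
class Person {
    final String name;

    Person(String name) {
        this.name = name;
    }
}

class SerializationTokenCheck {
    // Returns true when the type carries the annotation, both arrays have
    // the same size, and every token is 128 or greater.
    static boolean isValid(Class<?> type) {
        SerializationToken annotation = type.getAnnotation(SerializationToken.class);
        if (annotation == null) {
            return false;
        }
        if (annotation.tokens().length != annotation.classNames().length) {
            return false;
        }
        for (int token : annotation.tokens()) {
            if (token < 128) {
                return false; // values below 128 are reserved for internal use
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid(Person.class));
    }
}
```

In a real project the stand-in declaration would be dropped in favor of the annotation shipped with Cascading.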
During serialization and deserialization, the token values are
used instead of the String Class names, in order to reduce the amount of
storage used.
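To get a rough sense of the saving, the sketch below compares the bytes needed to write a class name as a length-prefixed string with the bytes needed to write an integer token. It uses the plain java.io.DataOutputStream encoding, which is an assumption made for illustration and not Cascading's actual wire format; example.Person is a hypothetical class name.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class TokenSizeComparison {
    // Bytes used when the class name is written as a length-prefixed string.
    static int classNameBytes(String className) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(className);
        }
        return buffer.size();
    }

    // Bytes used when an integer token is written as a fixed-width int.
    static int tokenBytes(int token) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(token);
        }
        return buffer.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("class name: " + classNameBytes("example.Person") + " bytes");
        System.out.println("token:      " + tokenBytes(128) + " bytes");
    }
}
```

The overhead is paid once per serialized custom object, so the difference grows with the length of the class name and the number of objects written.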
Serialization tokens may also be stored in the Hadoop config files
or set as a property passed to the FlowConnector, with the property name
cascading.serialization.tokens. The value of this property is a
comma-separated list of token=classname values.
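As a minimal sketch of that property format, the helper below joins token and class name pairs into the comma-separated value and stores it under cascading.serialization.tokens. The helper class, its method, and the example.Person and example.Address class names are hypothetical; only the property name and the token=classname format come from the text above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import java.util.StringJoiner;

class SerializationTokensProperty {
    // Property name as given in the text above.
    static final String TOKENS_PROPERTY = "cascading.serialization.tokens";

    // Joins the entries into a comma-separated list of token=classname pairs.
    static String format(Map<Integer, String> tokens) {
        StringJoiner joined = new StringJoiner(",");
        for (Map.Entry<Integer, String> entry : tokens.entrySet()) {
            joined.add(entry.getKey() + "=" + entry.getValue());
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the pairs in insertion order.
        Map<Integer, String> tokens = new LinkedHashMap<>();
        tokens.put(128, "example.Person");   // hypothetical class names
        tokens.put(129, "example.Address");

        Properties properties = new Properties();
        properties.setProperty(TOKENS_PROPERTY, format(tokens));

        // These properties would then be passed to the FlowConnector.
        System.out.println(properties.getProperty(TOKENS_PROPERTY));
        // prints "128=example.Person,129=example.Address"
    }
}
```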
Note that Cascading natively serializes/deserializes all
primitives and byte arrays (byte[]), if the developer registers the
BytesSerialization class by calling
TupleSerializationProps.addSerialization(properties,
BytesSerialization.class.getName()). The token 127 is used for the
Hadoop BytesWritable class.
By default, Cascading uses lazy deserialization on Tuple elements during comparisons when Hadoop sorts keys during the "shuffle" phase.
Cascading supports custom serialization for custom types, as well
as lazy deserialization of custom types during comparisons. This is
accomplished by implementing the StreamComparator interface. See the
Javadoc for detailed instructions on implementation, and the unit tests
for examples.
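The idea behind a stream-based comparison can be sketched as follows: two serialized values are ordered by reading only the bytes needed from each stream, without rebuilding the full objects. The interface declared here is a local stand-in so the sketch compiles without Cascading, and its shape is an assumption to be checked against the Javadoc. The byte layout (an int id followed by a name) and the class name are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Local stand-in for Cascading's StreamComparator interface (assumed shape).
interface StreamComparator<T extends InputStream> {
    int compare(T lhsStream, T rhsStream);
}

class PersonIdStreamComparator implements StreamComparator<InputStream> {
    // Compares two serialized records by their leading int id only;
    // the name that follows the id is never read or deserialized.
    @Override
    public int compare(InputStream lhsStream, InputStream rhsStream) {
        try {
            int lhsId = new DataInputStream(lhsStream).readInt();
            int rhsId = new DataInputStream(rhsStream).readInt();
            return Integer.compare(lhsId, rhsId);
        } catch (IOException exception) {
            throw new UncheckedIOException(exception);
        }
    }

    // Hypothetical serialized layout: the int id first, then the name.
    static InputStream serialize(int id, String name) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(id);
            out.writeUTF(name);
        } catch (IOException exception) {
            throw new UncheckedIOException(exception);
        }
        return new ByteArrayInputStream(buffer.toByteArray());
    }

    public static void main(String[] args) {
        PersonIdStreamComparator comparator = new PersonIdStreamComparator();
        int result = comparator.compare(serialize(1, "Ann"), serialize(2, "Bob"));
        System.out.println(result < 0 ? "lhs sorts first" : "rhs sorts first or equal");
    }
}
```

Putting the sort-relevant field first in the serialized form is what lets the comparison stop early; a real implementation would follow the registration steps described in the Javadoc.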
Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.