The PartitionTap class, a Tap implementation, provides a simple means to break large datasets into smaller sets
based on data item values. This is also commonly called binning the data, where each "bin" of data is
named after some data value(s) shared by the members of that bin. For
example, this is a simple way to organize log files by month and year.
PartitionTap replaces the TemplateTap found in previous versions of Cascading
and adds the ability to be used as both a sink and a source. Previously,
TemplateTap could only be used as a sink.

TextDelimited scheme =
  new TextDelimited( new Fields( "entry" ), "\t" );
Hfs parentTap = new Hfs( scheme, path );

// dirs named "[year]-[month]"
DelimitedPartition partition = new DelimitedPartition( new Fields( "year", "month" ), "-" );

Tap monthsTap = new PartitionTap( parentTap, partition, SinkMode.REPLACE );

In the example above, we construct a parent Hfs tap and pass it to the
constructor of a PartitionTap instance, along with a
cascading.tap.partition.DelimitedPartition "partitioner". If more complex
path formatting is necessary, you may implement the
cascading.tap.partition.Partition interface.
It is important to see in the above example that the parentTap will only
sink the "entry" field to a text delimited file, while the monthsTap expects
"year", "month", and "entry" fields from the tuple stream. Because the "year"
and "month" values are stored in the directory name of each partition when
the PartitionTap is a sink, there is no need to redundantly store them in the
text delimited file.
When the PartitionTap is a source, the directory name is parsed and its
values are added to the outgoing tuple stream.
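
As a rough illustration of this round trip, the following plain-Java sketch joins the partition field values into a directory name and then parses that name back into named values. The class PartitionPathSketch and its methods toPartition and toFields are hypothetical names used only for this illustration. This is not the Cascading DelimitedPartition implementation, which may differ in its details.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in for a delimited partitioner; not the Cascading class.
final class PartitionPathSketch {
    private final List<String> fieldNames;
    private final String delimiter;

    PartitionPathSketch(List<String> fieldNames, String delimiter) {
        this.fieldNames = new ArrayList<>(fieldNames);
        this.delimiter = delimiter;
    }

    // Sink side: join the partition field values into a directory name.
    // Fields that are not partition fields (such as "entry") are ignored here.
    String toPartition(Map<String, String> tuple) {
        List<String> values = new ArrayList<>();
        for (String name : fieldNames) {
            values.add(tuple.get(name));
        }
        return String.join(delimiter, values);
    }

    // Source side: split the directory name back into named values.
    Map<String, String> toFields(String partition) {
        String[] parts = partition.split(Pattern.quote(delimiter), -1);
        if (parts.length != fieldNames.size()) {
            throw new IllegalArgumentException("unexpected partition name: " + partition);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < parts.length; i++) {
            fields.put(fieldNames.get(i), parts[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        PartitionPathSketch partition =
            new PartitionPathSketch(Arrays.asList("year", "month"), "-");

        Map<String, String> tuple = new LinkedHashMap<>();
        tuple.put("year", "2012");
        tuple.put("month", "07");
        tuple.put("entry", "some log line");

        String dir = partition.toPartition(tuple);
        System.out.println(dir);                      // prints 2012-07
        System.out.println(partition.toFields(dir));  // prints {year=2012, month=07}
    }
}
```

In this sketch only the "entry" value would remain to be written into the file itself, since "year" and "month" are recoverable from the directory name.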
Note that you can only create sub-directories to bin data into. Hadoop must still write "part" files into each bin directory, and there is no safe mechanism for manipulating part file names.
One last thing to keep in mind is whether binning happens during the Map
phase or the Reduce phase. By doing a GroupBy on the values used to populate
the partition, binning will happen during the Reduce phase, and will likely
scale much better in cases where a very large number of unique partition
values results in a large number of directories.
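
One way to see why reduce-side binning may scale better is to count "part" files in a toy model. The plain-Java sketch below assumes that each task writes one part file into every bin directory it receives data for, that ungrouped records reach tasks in input order, and that grouping routes all records of a bin to a single task by hash. The class BinningPhaseSketch and all of its methods are hypothetical; none of this is Cascading or Hadoop code, and real jobs will behave differently in their details.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: not Cascading or Hadoop code.
final class BinningPhaseSketch {

    // One empty set of bin names per task.
    static List<Set<String>> newTasks(int taskCount) {
        List<Set<String>> tasks = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            tasks.add(new HashSet<>());
        }
        return tasks;
    }

    // Without grouping: records reach tasks in input order, whatever their bin.
    static List<Set<String>> mapSide(List<String> binOfRecord, int taskCount) {
        List<Set<String>> tasks = newTasks(taskCount);
        for (int i = 0; i < binOfRecord.size(); i++) {
            tasks.get(i % taskCount).add(binOfRecord.get(i));
        }
        return tasks;
    }

    // With grouping on the bin values: all records of a bin reach the same task.
    static List<Set<String>> reduceSide(List<String> binOfRecord, int taskCount) {
        List<Set<String>> tasks = newTasks(taskCount);
        for (String bin : binOfRecord) {
            tasks.get(Math.floorMod(bin.hashCode(), taskCount)).add(bin);
        }
        return tasks;
    }

    // Assumption: each task writes one part file per bin it holds data for.
    static int partFiles(List<Set<String>> tasks) {
        int total = 0;
        for (Set<String> bins : tasks) {
            total += bins.size();
        }
        return total;
    }

    // 240 records cycling through 24 "[year]-[month]" bins.
    static List<String> sampleRecords() {
        List<String> bins = new ArrayList<>();
        for (int year = 2011; year <= 2012; year++) {
            for (int month = 1; month <= 12; month++) {
                bins.add(String.format("%d-%02d", year, month));
            }
        }
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 240; i++) {
            records.add(bins.get(i % bins.size()));
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> records = sampleRecords();
        System.out.println(partFiles(mapSide(records, 5)));    // prints 120
        System.out.println(partFiles(reduceSide(records, 5))); // prints 24
    }
}
```

In this model every ungrouped task ends up writing into all 24 directories, while grouping leaves exactly one writer per directory. The gap widens as the number of unique bin values grows.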
Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.