class provides a simple means to break large datasets into smaller sets
based on data item values. This is also commonly called binning the data, where each "bin" of data is
named after some data value(s) shared by the members of that bin. For
example, this is a simple way to organize log files by month and year.
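The idea of binning can be sketched without Cascading at all. The following is a minimal plain-Java illustration, not the PartitionTap implementation: the `BinningSketch` class, its `bin` method, and the three-element `{year, month, entry}` array layout are hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class BinningSketch {
    // Each record is a hypothetical {year, month, entry} triple.
    // Records are grouped into bins named "[year]-[month]"; only the
    // "entry" value is kept inside the bin, since the bin name already
    // carries the year and month.
    static Map<String, List<String>> bin(List<String[]> records) {
        return records.stream().collect(Collectors.groupingBy(
            r -> r[0] + "-" + r[1],
            TreeMap::new,
            Collectors.mapping(r -> r[2], Collectors.toList())));
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
            new String[] {"2012", "01", "GET /index.html"},
            new String[] {"2012", "02", "GET /about.html"},
            new String[] {"2012", "01", "POST /login"});
        // Print each bin name followed by its members.
        bin(records).forEach((name, members) -> System.out.println(name + " -> " + members));
    }
}
```

Each key of the resulting map plays the role of a bin directory name, and each value holds the entries that would be written into that directory.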
PartitionTap replaces the TemplateTap in previous versions of Cascading
and adds the ability for a PartitionTap instance to be used as both a
sink and a source. Previously, TemplateTap could only be used as a sink.
TextDelimited scheme = new TextDelimited( new Fields( "entry" ), "\t" );
Hfs parentTap = new Hfs( scheme, path );

// dirs named "[year]-[month]"
DelimitedPartition partition = new DelimitedPartition( new Fields( "year", "month" ), "-" );
Tap monthsTap = new PartitionTap( parentTap, partition, SinkMode.REPLACE );
In the example above, we construct a parent Hfs tap and pass it to the
constructor of a PartitionTap instance, along with a DelimitedPartition
"partitioner". If more complex path formatting is necessary, you may
It is important to see in the above example that the parentTap will only
sink "entry" fields to a text delimited file, while the monthsTap expects
"year", "month", and "entry" fields from the tuple stream. Because the
"year" and "month" values are stored in the directory name of each
partition when the PartitionTap is a sink, there is no need to store them
redundantly in the text delimited file.
When the PartitionTap is a source, the directory name of each partition is
parsed and its values are added to the outgoing tuple stream.
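The sink and source behavior described above can be sketched as a round trip. This is a plain-Java model under stated assumptions, not Cascading code: the class `PartitionRoundTripSketch` and its methods `toPartition`, `toStored`, and `fromPartition` are hypothetical, and tuples are modeled as simple maps from field name to value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

final class PartitionRoundTripSketch {
    // Mirrors the example: directories named "[year]-[month]".
    static final List<String> PARTITION_FIELDS = Arrays.asList("year", "month");
    static final String DELIMITER = "-";

    // Sink side: build the directory name from the partition field values.
    static String toPartition(Map<String, String> tuple) {
        List<String> values = new ArrayList<>();
        for (String field : PARTITION_FIELDS) {
            values.add(tuple.get(field));
        }
        return String.join(DELIMITER, values);
    }

    // Sink side: the fields that are written to the file itself,
    // i.e. everything except the partition fields.
    static Map<String, String> toStored(Map<String, String> tuple) {
        Map<String, String> stored = new LinkedHashMap<>(tuple);
        stored.keySet().removeAll(PARTITION_FIELDS);
        return stored;
    }

    // Source side: parse the directory name and add its values back
    // to the fields read from the file.
    static Map<String, String> fromPartition(String dirName, Map<String, String> stored) {
        String[] values = dirName.split(Pattern.quote(DELIMITER));
        Map<String, String> tuple = new LinkedHashMap<>();
        for (int i = 0; i < PARTITION_FIELDS.size(); i++) {
            tuple.put(PARTITION_FIELDS.get(i), values[i]);
        }
        tuple.putAll(stored);
        return tuple;
    }

    public static void main(String[] args) {
        Map<String, String> tuple = new LinkedHashMap<>();
        tuple.put("year", "2012");
        tuple.put("month", "01");
        tuple.put("entry", "GET /index.html");

        String dirName = toPartition(tuple);
        Map<String, String> stored = toStored(tuple);
        System.out.println(dirName + " holds " + stored);
        System.out.println("read back: " + fromPartition(dirName, stored));
    }
}
```

One consequence this sketch makes visible: a partition value that itself contains the delimiter could not be parsed back unambiguously, so the delimiter should be chosen with the data in mind.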
Note that you can only create sub-directories to bin data into. Hadoop must still write "part" files into each bin directory, and there is no safe mechanism for manipulating part file names.
One last thing to keep in mind is whether binning happens during
the Map phase or the Reduce phase. If you apply a GroupBy on the values
used to populate the partition, binning happens during the Reduce phase.
This will likely scale much better when a very large number of unique
partition values results in a large number of directories.
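To give a feel for why grouping first can help, here is a deliberately simplified plain-Java model. It is an assumption-laden sketch, not a description of Hadoop internals: it assumes every task that encounters a partition value writes its own part file into that bin, uses round-robin assignment as a stand-in for records arriving in arbitrary input splits, and uses a hash of the key as a stand-in for GroupBy routing. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class BinningPhaseSketch {
    // Without grouping: records reach tasks in arrival order, so any
    // task may see any partition key and open a part file in that bin.
    static int filesWithoutGrouping(List<String> partitionKeys, int tasks) {
        Set<String> files = new HashSet<>();
        for (int i = 0; i < partitionKeys.size(); i++) {
            int task = i % tasks;
            files.add(partitionKeys.get(i) + "/part-" + task);
        }
        return files.size();
    }

    // With grouping: all records sharing a key are routed to one task,
    // so each bin receives a part file from exactly one task.
    static int filesWithGrouping(List<String> partitionKeys, int tasks) {
        Set<String> files = new HashSet<>();
        for (String key : partitionKeys) {
            int task = Math.floorMod(key.hashCode(), tasks);
            files.add(key + "/part-" + task);
        }
        return files.size();
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
            "2012-01", "2012-02", "2012-01", "2012-02", "2012-01", "2012-02");
        System.out.println("without grouping: " + filesWithoutGrouping(keys, 3));
        System.out.println("with grouping:    " + filesWithGrouping(keys, 3));
    }
}
```

In this model the ungrouped case can approach tasks × bins part files, while the grouped case stays at one part file per bin, which is the intuition behind preferring a GroupBy when the number of unique partition values is large.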
As of Cascading 2.7, the PartitionTap works when Hadoop "CombineFileInputFormat" support is enabled, which allows collections of small files to be read within a single Hadoop input split.
Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.