8.7 PartitionTaps

The PartitionTap Tap class provides a simple means to break large datasets into smaller sets based on data item values. This is also commonly called binning the data, where each "bin" of data is named after some data value(s) shared by the members of that bin. For example, this is a simple way to organize log files by month and year. PartitionTap replaces the TemplateTap in previous versions of Cascading and adds the ability for a PartitionTap instance to be used as both a sink and a source. Previously, TemplateTap could only be used as a sink.

TextDelimited scheme =
  new TextDelimited( new Fields( "entry" ), "\t" );
Hfs parentTap = new Hfs( scheme, path );

// dirs named "[year]-[month]"
DelimitedPartition partition = new DelimitedPartition( new Fields( "year", "month" ), "-" );
Tap monthsTap = new PartitionTap( parentTap, partition, SinkMode.REPLACE );

In the example above, we construct a parent Hfs tap and pass it to the constructor of a PartitionTap instance, along with a cascading.tap.partition.DelimitedPartition "partitioner". If more complex path formatting is necessary, you may implement the cascading.tap.partition.Partition interface.

It is important to see in the above example that the parentTap will only sink "entry" fields to a text delimited file. But the monthsTap expects "year", "month", and "entry" fields from the tuple stream. Here data is stored in the directory name for each partition when the PartitionTap is a sink, there is no need to redundantly store the data in the text delimited file. When reading from a PartitionTap, the directory name will be parsed and its values will be added to the outgoing tuple stream when the PartitionTap is a source.

Note that you can only create sub-directories to bin data into. Hadoop must still write "part" files into each bin directory, and there is no safe mechanism for manipulating part file names.

One last thing to keep in mind is whether binning happens during the Map phase or the Reduce phase. By doing a GroupBy on the values used to populate the template, binning will happen during the Reduce phase, and will likely scale much better in cases where there are a very large number of unique values used in the template resulting in a large number of directories.

Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.