7.6 Template taps

The TemplateTap Tap class provides a simple means to break large datasets into smaller sets based on data item values. This is commonly called partitioning or binning the data, where each "bin" of data is named after some data value(s) shared by the members of that bin. For example, this is a simple way to organize log files by month and year.

TextDelimited scheme =
  new TextDelimited( new Fields( "year", "month", "entry" ), "\t" );
Hfs tap = new Hfs( scheme, path );

String template = "%s-%s"; // dirs named "year-month"
Tap months = new TemplateTap( tap, template, SinkMode.REPLACE );

In the example above, we construct a parent Hfs tap and pass it to the constructor of a TemplateTap instance, along with a String format "template". This format template is populated with values in the order in which they are declared via the Scheme class. If more complex path formatting is necessary, you may subclass TemplateTap.
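The way the template is populated can be sketched in plain Java. This is a minimal illustration, not Cascading code: the class TemplateBinSketch and its methods are hypothetical, and it assumes the template follows java.util.Formatter conventions, as the %s placeholders suggest. Values are supplied in declaration order ("year", "month", "entry"); the template has only two placeholders, so the trailing "entry" value is ignored when the directory name is built.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the binning step; not part of Cascading.
class TemplateBinSketch {

  // Fill the template with the values in declaration order.
  // java.util.Formatter ignores arguments beyond the last placeholder.
  static String binName(String template, Object... values) {
    return String.format(template, values);
  }

  // Group tab-delimited rows by the directory name their values produce.
  static Map<String, List<String>> bin(String template, List<String[]> rows) {
    Map<String, List<String>> bins = new LinkedHashMap<>();
    for (String[] row : rows) {
      String dir = binName(template, (Object[]) row);
      bins.computeIfAbsent(dir, k -> new ArrayList<>()).add(String.join("\t", row));
    }
    return bins;
  }

  public static void main(String[] args) {
    List<String[]> rows = Arrays.asList(
      new String[]{ "2012", "03", "first entry" },
      new String[]{ "2012", "04", "second entry" },
      new String[]{ "2012", "03", "third entry" } );

    // One key per bin directory, here "2012-03" and "2012-04".
    for (Map.Entry<String, List<String>> e : bin("%s-%s", rows).entrySet())
      System.out.println(e.getKey() + " holds " + e.getValue().size() + " row(s)");
  }
}
```

Reordering the declared fields would change which values land in the placeholders, which is why the declaration order matters.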

Note that you can only create sub-directories to bin data into. Hadoop must still write "part" files into each bin directory, and there is no safe mechanism for manipulating part file names.

One last thing to keep in mind is whether binning happens during the Map phase or the Reduce phase. If you perform a GroupBy on the values used to populate the template, binning happens during the Reduce phase. This will likely scale much better when the template is populated from a very large number of unique values, and therefore produces a large number of directories.
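One way to build intuition for the difference is a toy model that counts the part files each approach would create. This is only a sketch under stated assumptions, not Cascading or Hadoop code: the class BinningPhaseSketch is hypothetical, round-robin assignment stands in for input splits, a hash of the bin name stands in for the partitioning that grouping performs, and the "part-N" names are illustrative.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: none of these names exist in Cascading or Hadoop.
class BinningPhaseSketch {

  // Map-side model: records reach tasks in arrival order, so any task
  // may see any bin and must write its own part file into that bin.
  static int partFilesMapSide(List<String> binKeys, int tasks) {
    Set<String> files = new HashSet<>();
    for (int i = 0; i < binKeys.size(); i++)
      files.add(binKeys.get(i) + "/part-" + (i % tasks));
    return files.size();
  }

  // Reduce-side model: grouping sends every record with the same key
  // to the same task, so each bin is written by exactly one task.
  static int partFilesReduceSide(List<String> binKeys, int tasks) {
    Set<String> files = new HashSet<>();
    for (String key : binKeys)
      files.add(key + "/part-" + Math.floorMod(key.hashCode(), tasks));
    return files.size();
  }

  // 12 monthly bins with 10 records each, interleaved as in a mixed log.
  static List<String> sampleKeys() {
    List<String> keys = new ArrayList<>();
    for (int r = 0; r < 10; r++)
      for (int m = 1; m <= 12; m++)
        keys.add(String.format("2012-%02d", m));
    return keys;
  }

  public static void main(String[] args) {
    List<String> keys = sampleKeys();
    System.out.println("map-side part files:    " + partFilesMapSide(keys, 5));
    System.out.println("reduce-side part files: " + partFilesReduceSide(keys, 5));
  }
}
```

With 12 bins and 5 tasks, this model counts 60 part files for map-side binning and 12 for reduce-side binning. The gap widens as the number of unique template values grows, which is consistent with the scaling advice above, although real job behavior depends on the data and cluster configuration.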

Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.