6.5 Template Taps

The TemplateTap class, a kind of Tap, provides a simple way to break a large dataset into smaller sets based on values in the data. This is typically called 'binning' the data, where each 'bin' is named after the values shared by the data in that bin. For example, log entries can be organized into bins by year and month.

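// fields are declared in the order the template will consume them
TextDelimited scheme = new TextDelimited( new Fields( "year", "month", "entry" ), "\t" );
// 'path' is the parent output directory, defined elsewhere
Hfs tap = new Hfs( scheme, path );

String template = "%s-%s"; // dirs named "year-month"
Tap months = new TemplateTap( tap, template, SinkMode.REPLACE );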

In the example above, we construct a parent Hfs Tap and pass it to the constructor of a TemplateTap instance along with a String format 'template'. The template is populated with values in the order their fields are declared in the Scheme, here "year" followed by "month". If more complex path formatting is necessary, you may subclass TemplateTap.
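To make the ordering rule concrete, here is a minimal sketch in plain Java that does not use Cascading at all. It assumes the template follows java.util.Formatter conventions, as the "%s" placeholders suggest. The class TemplateBinSketch, its methods, and the sample records are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TemplateBinSketch {

    // Fills the template with the given values, in the order they are passed.
    static String binFor(String template, Object... values) {
        return String.format(template, values);
    }

    // Groups each record's "entry" value under the bin name built from
    // its "year" and "month" values (columns 0 and 1, in declared order).
    static Map<String, List<String>> bin(String template, String[][] records) {
        Map<String, List<String>> bins = new LinkedHashMap<>();
        for (String[] record : records) {
            String dir = binFor(template, record[0], record[1]);
            bins.computeIfAbsent(dir, k -> new ArrayList<>()).add(record[2]);
        }
        return bins;
    }

    public static void main(String[] args) {
        // Hypothetical records laid out as year, month, entry.
        String[][] records = {
            {"2008", "01", "entry a"},
            {"2008", "02", "entry b"},
            {"2008", "01", "entry c"},
        };
        Map<String, List<String>> bins = bin("%s-%s", records);
        // Prints:
        // 2008-01 -> [entry a, entry c]
        // 2008-02 -> [entry b]
        bins.forEach((dir, entries) -> System.out.println(dir + " -> " + entries));
    }
}
```

Swapping the first two field declarations would produce bins named "month-year" instead, which is why the declaration order in the Scheme matters.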

Note that the template can only name sub-directories to bin data into. Hadoop must still write its 'part' files into each bin directory.

One last thing to keep in mind is whether 'binning' happens during the Map or the Reduce phase. If you do a GroupBy on the values that will be used to populate the template, binning happens during the Reduce phase, which will likely scale much better when there is a very large number of unique grouping keys.
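One way to see the difference is to count 'part' files. The following plain-Java sketch does not use Cascading or Hadoop; it only models how records reach tasks. It assumes that a task writes one part file into every bin it sees, that map tasks receive records by input position, and that grouping sends all records with the same key to a single reduce task. The class BinningPhaseSketch and its methods are hypothetical, and the text above does not say this is the only reason grouping scales better.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class BinningPhaseSketch {

    static List<List<String>> emptyTasks(int numTasks) {
        List<List<String>> tasks = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            tasks.add(new ArrayList<String>());
        }
        return tasks;
    }

    // Map-side model: records reach tasks by position (round-robin stands
    // in for input splits), so any task may see any bin key.
    static List<List<String>> assignByPosition(List<String> keys, int numTasks) {
        List<List<String>> tasks = emptyTasks(numTasks);
        for (int i = 0; i < keys.size(); i++) {
            tasks.get(i % numTasks).add(keys.get(i));
        }
        return tasks;
    }

    // Reduce-side model: after grouping, every record with a given key
    // lands on the same task.
    static List<List<String>> assignByKey(List<String> keys, int numTasks) {
        List<List<String>> tasks = emptyTasks(numTasks);
        for (String key : keys) {
            tasks.get(Math.floorMod(key.hashCode(), numTasks)).add(key);
        }
        return tasks;
    }

    // Each task is assumed to write one part file per distinct bin it sees.
    static int countPartFiles(List<List<String>> tasks) {
        int files = 0;
        for (List<String> task : tasks) {
            files += new HashSet<String>(task).size();
        }
        return files;
    }

    public static void main(String[] args) {
        String[] bins = {"2008-01", "2008-02", "2008-03"};
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 12; i++) {
            keys.add(bins[i % bins.length]); // interleaved input
        }
        // Prints: map-side part files: 12
        System.out.println("map-side part files: " + countPartFiles(assignByPosition(keys, 4)));
        // Prints: reduce-side part files: 3
        System.out.println("reduce-side part files: " + countPartFiles(assignByKey(keys, 4)));
    }
}
```

In this model the map-side count grows with the number of tasks times the number of bins each task sees, while the grouped count stays at one part file per bin.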

Copyright © 2007-2008 Concurrent, Inc. All Rights Reserved.