The TemplateTap Tap class provides a simple means to break large
datasets into smaller sets based on values in the dataset. This is
typically called 'binning' the data, where each 'bin' of data is named
after the values shared by the data in that bin. For example, log files
can be organized by month and year:
// tab-delimited records with three fields
TextDelimited scheme = new TextDelimited( new Fields( "year", "month", "entry" ), "\t" );
Hfs tap = new Hfs( scheme, path );

String template = "%s-%s"; // dirs named "year-month"
Tap months = new TemplateTap( tap, template, SinkMode.REPLACE );
In the above example, we construct a parent Hfs Tap and pass it to the
constructor of a TemplateTap instance along with a String format
'template'. The template is populated with values in the order their
fields are declared in the Scheme class, so the first %s receives "year"
and the second receives "month".
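To make the template step concrete, here is a minimal sketch in plain Java. It is not Cascading code: it assumes the template follows java.util.Formatter conventions (as the "%s-%s" example suggests), and the class TemplateBinningSketch and its helpers binPath and bin are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TemplateBinningSketch {

    // Hypothetical helper: fills the template with values in declaration order.
    static String binPath(String template, Object... values) {
        return String.format(template, values);
    }

    // Groups rows of { year, month, entry } into bins named by the template.
    static Map<String, List<String>> bin(String template, String[][] rows) {
        Map<String, List<String>> bins = new TreeMap<>();
        for (String[] row : rows) {
            // String.format ignores arguments the template does not reference,
            // so with "%s-%s" the third value ("entry") is not part of the name.
            String path = binPath(template, (Object[]) row);
            bins.computeIfAbsent(path, k -> new ArrayList<>()).add(row[2]);
        }
        return bins;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"2008", "01", "first entry"},
            {"2008", "02", "second entry"},
            {"2008", "01", "third entry"},
        };
        // Each map key stands in for one bin directory.
        System.out.println(bin("%s-%s", rows));
    }
}
```

In this model each map key corresponds to one bin directory, and the list under it to the records that would be written there.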
If more complex path formatting is necessary, you may subclass
TemplateTap.
Note that you can only create sub-directories to bin data into. Hadoop must still write 'part' files into each bin directory.
One last thing to keep in mind is whether binning happens during the
Map or the Reduce phase. If you perform a GroupBy on the values used to
populate the template, binning happens during the Reduce phase, which
will likely scale much better when there is a very large number of
unique grouping keys.
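One way to picture the difference is a toy model in plain Java, again not Cascading code. It assumes that every task writes its own 'part' file into each bin it touches, that map tasks receive records by input position (round-robin stands in for input splits here), and that grouping routes all records with the same key to a single reduce task. The class BinningPhaseSketch and the method partFiles are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

public class BinningPhaseSketch {

    // Counts distinct (task, bin) pairs, i.e. the total number of 'part'
    // files written across all bin directories in this toy model.
    static int partFiles(String[] binKeys, int tasks, boolean groupedByKey) {
        Set<String> written = new HashSet<>();
        for (int i = 0; i < binKeys.length; i++) {
            // Grouped: the key alone decides the task.
            // Ungrouped: the record's position in the input decides the task.
            int task = groupedByKey
                ? Math.floorMod(binKeys[i].hashCode(), tasks)
                : i % tasks;
            written.add(task + "/" + binKeys[i]);
        }
        return written.size();
    }

    public static void main(String[] args) {
        String[] keys = new String[120];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = "2008-" + String.format("%02d", i % 12 + 1); // 12 bins
        }
        System.out.println("map side:    " + partFiles(keys, 5, false));
        System.out.println("reduce side: " + partFiles(keys, 5, true));
    }
}
```

Under these assumptions the ungrouped case lets every task touch every bin, while the grouped case leaves each bin with a single writer, which is one plausible reason the Reduce-side approach copes better with many unique keys.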
Copyright © 2007-2008 Concurrent, Inc. All Rights Reserved.