The TemplateTap class (a Tap) provides a simple means to break large datasets into smaller sets
based on data item values. This is commonly called partitioning or
binning the data, where each "bin" of
data is named after some data value(s) shared by the members of that
bin. For example, this is a simple way to organize log files by month
and year.
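// the field order declared here is the order in which the template is filled
TextDelimited scheme =
  new TextDelimited( new Fields( "year", "month", "entry" ), "\t" );
Hfs tap = new Hfs( scheme, path ); // path: parent directory, defined elsewhere
String template = "%s-%s"; // dirs named "year-month"
Tap months = new TemplateTap( tap, template, SinkMode.REPLACE );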
In the example above, we construct a parent Hfs tap and pass it to the
constructor of a TemplateTap instance, along with a String format
"template". This format template is populated with values in the order
in which their fields are declared via the Scheme class. If more complex
path formatting is necessary, you may subclass TemplateTap.
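The way a format template turns record values into bin names can be sketched in plain Java. This is only an illustration, not Cascading code: the class TemplateBinningSketch and its methods binName and bin are hypothetical, and the sketch assumes the template follows java.util.Formatter rules, where arguments beyond those the template consumes are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it does not use or depend on Cascading.
class TemplateBinningSketch {

    // Formats one record's values, in declared field order, into a bin name.
    // With "%s-%s" only the first two values are consumed; the rest are ignored.
    static String binName(String template, Object... valuesInFieldOrder) {
        return String.format(template, valuesInFieldOrder);
    }

    // Groups records by the bin name their values produce.
    static Map<String, List<String[]>> bin(String template, List<String[]> records) {
        Map<String, List<String[]>> bins = new TreeMap<>();
        for (String[] record : records) {
            String name = binName(template, (Object[]) record);
            bins.computeIfAbsent(name, k -> new ArrayList<>()).add(record);
        }
        return bins;
    }

    public static void main(String[] args) {
        // Each record is { year, month, entry }, matching the field order above.
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"2012", "01", "first entry"});
        records.add(new String[] {"2012", "01", "second entry"});
        records.add(new String[] {"2012", "02", "third entry"});

        Map<String, List<String[]>> bins = bin("%s-%s", records);
        for (Map.Entry<String, List<String[]>> e : bins.entrySet()) {
            System.out.println(e.getKey() + " holds " + e.getValue().size() + " record(s)");
        }
        // prints:
        // 2012-01 holds 2 record(s)
        // 2012-02 holds 1 record(s)
    }
}
```

Because the "entry" value comes third, it never reaches the two-placeholder template; a template that needed it, or needed the values in a different order, would have to be written with that field order in mind.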
Note that you can only create sub-directories to bin data into. Hadoop must still write "part" files into each bin directory, and there is no safe mechanism for manipulating part file names.
One last thing to keep in mind is whether binning happens during
the Map phase or the Reduce phase. If you apply a GroupBy on the values
used to populate the template, binning happens during the Reduce phase.
This will likely scale much better when the template values have a very
large number of unique combinations, and therefore produce a large
number of directories.
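One way to see why reduce-side binning may scale better is to count how many part files each approach could write. The following is a plain-Java model, not Cascading or Hadoop code, and every name in it (BinningPhaseSketch, sampleKeys, partFilesMapSide, partFilesReduceSide) is hypothetical. It assumes that each task writes one part file into every bin directory it touches, that map tasks read contiguous chunks of the input, and that grouping sends all records with the same key to a single reduce task.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of map-side versus reduce-side binning.
class BinningPhaseSketch {

    // 96 bin keys: 8 passes over the 12 months of 2012, interleaved.
    static List<String> sampleKeys() {
        List<String> keys = new ArrayList<>();
        for (int pass = 0; pass < 8; pass++) {
            for (int month = 1; month <= 12; month++) {
                keys.add(String.format("2012-%02d", month));
            }
        }
        return keys;
    }

    // Map-side: each task reads a contiguous chunk, so it may see any bin.
    // Returns the number of distinct (task, bin) pairs, i.e. part files.
    static int partFilesMapSide(List<String> binKeys, int tasks) {
        int chunk = (binKeys.size() + tasks - 1) / tasks;
        Set<String> pairs = new HashSet<>();
        for (int i = 0; i < binKeys.size(); i++) {
            pairs.add((i / chunk) + "/" + binKeys.get(i));
        }
        return pairs.size();
    }

    // Reduce-side: after grouping, a key is routed to exactly one task,
    // modeled here as a hash partition on the key.
    static int partFilesReduceSide(List<String> binKeys, int tasks) {
        Set<String> pairs = new HashSet<>();
        for (String key : binKeys) {
            pairs.add(Math.floorMod(key.hashCode(), tasks) + "/" + key);
        }
        return pairs.size();
    }

    public static void main(String[] args) {
        List<String> keys = sampleKeys();
        System.out.println("map-side part files: " + partFilesMapSide(keys, 4));
        System.out.println("reduce-side part files: " + partFilesReduceSide(keys, 4));
        // prints:
        // map-side part files: 48
        // reduce-side part files: 12
    }
}
```

In this model the map-side count grows with tasks times bins, while the reduce-side count stays at one part file per bin, which matches the intuition that grouping first helps most when there are many unique template values.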
Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.