During runtime, Hadoop must be told which application Jar file should be pushed to the cluster. Typically, this is done via the Hadoop API JobConf object. Cascading offers a shorthand for configuring this parameter, demonstrated here:
Properties properties = new Properties();
// pass in the class name of your application
// this will find the parent jar at runtime
properties = AppProps.appProps()
.setName( "sample-app" )
.setVersion( "1.2.3" )
.addTags( "deploy:prod", "team:engineering" )
.setJarClass( Main.class ) // find jar from class
.buildProperties( properties ); // returns a copy
// ALTERNATIVELY ...
// pass in the path to the parent jar
properties = AppProps.appProps()
.setName( "sample-app" )
.setVersion( "1.2.3" )
.addTags( "deploy:prod", "team:engineering" )
.setJarPath( pathToJar ) // set jar path
.buildProperties( properties ); // returns a copy
// pass properties to the connector
FlowConnector flowConnector = new HadoopFlowConnector( properties );
Above we see two ways to set the same property: via the setJarClass() method, and via the setJarPath() method. One is based on a Class name, and the other is based on a literal path.
The first method takes a Class object that owns the "main" function for this application. The assumption here is that Main.class is not located inside a Java Jar stored in the lib folder of the application Jar. If it is, that enclosing Jar is pushed to the cluster instead of the parent application Jar.
The second method simply sets the path to the Java Jar as a property. Only one of these methods needs to be called in your application, but one of them must be called to properly configure Hadoop.
Example 4.1. Configuring the Application Jar with a JobConf
JobConf jobConf = new JobConf();
// pass in the class name of your application
// this will find the parent jar at runtime
jobConf.setJarByClass( Main.class );
// ALTERNATIVELY ...
// pass in the path to the parent jar
jobConf.setJar( pathToJar );
// build the properties object using jobConf as defaults
Properties properties = AppProps.appProps()
.setName( "sample-app" )
.setVersion( "1.2.3" )
.addTags( "deploy:prod", "team:engineering" )
.buildProperties( jobConf );
// pass properties to the connector
FlowConnector flowConnector = new HadoopFlowConnector( properties );
Above we are starting with an existing Hadoop JobConf instance and building a Properties object with it as the source of default values.
Note that AppProps is a fluent helper API for setting properties that define Flows or configure the underlying platform. There are quite a few "Props"-based classes that expose fluent API calls; the most commonly used are listed below.
cascading.property.AppProps | Allows for setting application-specific properties. Some properties, like the application Jar, are required by the underlying platform. Others, like tags, are simple metadata used by compatible management tools.
cascading.flow.FlowConnectorProps | Allows for setting a DebugLevel or AssertionLevel for a given FlowConnector to target. Also allows for setting intermediate DecoratorTap sub-classes to be used, if any.
cascading.flow.FlowProps | Allows for setting any Flow-specific properties, such as the maximum number of concurrent steps to be scheduled, or changing the default Tuple Comparator class.
cascading.cascade.CascadeProps | Allows for setting any Cascade-specific properties, such as the maximum number of concurrent Flows to be scheduled.
cascading.tap.hadoop.HfsProps | Allows for setting Hadoop-specific FileSystem properties, specifically properties around enabling the 'combined input format' support. Combining inputs minimizes the performance penalty of processing large numbers of small files.
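As a sketch of how these classes work together (assuming Cascading 2.x on the classpath; the specific values here are illustrative, not recommendations), each Props class can be chained against the same Properties object before it is handed to the connector:

```java
import java.util.Properties;

import cascading.cascade.CascadeProps;
import cascading.flow.FlowConnector;
import cascading.flow.FlowConnectorProps;
import cascading.flow.FlowProps;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.DebugLevel;
import cascading.property.AppProps;
import cascading.tap.hadoop.HfsProps;

Properties properties = new Properties();

// application metadata, as in the examples above
properties = AppProps.appProps()
  .setName( "sample-app" )
  .setVersion( "1.2.3" )
  .setJarClass( Main.class )
  .buildProperties( properties );

// plan Debug operations into the resulting Flows
properties = FlowConnectorProps.flowConnectorProps()
  .setDebugLevel( DebugLevel.VERBOSE )
  .buildProperties( properties );

// limit how many steps of a Flow may run concurrently
properties = FlowProps.flowProps()
  .setMaxConcurrentSteps( 2 )
  .buildProperties( properties );

// limit how many Flows of a Cascade may run concurrently
properties = CascadeProps.cascadeProps()
  .setMaxConcurrentFlows( 2 )
  .buildProperties( properties );

// combine many small input files into fewer splits
properties = HfsProps.hfsProps()
  .setUseCombinedInput( true )
  .buildProperties( properties );

FlowConnector flowConnector = new HadoopFlowConnector( properties );
```

Because each buildProperties() call returns a copy seeded with the values passed in, the ordering above simply layers each group of settings onto the previous ones.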
Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.