4.3 Configuring

At runtime, Hadoop must be told which application jar file should be pushed to the cluster. Typically, this is done via the Hadoop API JobConf object.

Cascading offers a shorthand for configuring this parameter, demonstrated here:

Properties properties = new Properties();

// pass in the class name of your application
// this will find the parent jar at runtime
properties = AppProps.appProps()
  .setName( "sample-app" )
  .setVersion( "1.2.3" )
  .addTags( "deploy:prod", "team:engineering" )
  .setJarClass( Main.class ) // find jar from class
  .buildProperties( properties ); // returns a copy

// ALTERNATIVELY ...

// pass in the path to the parent jar
properties = AppProps.appProps()
  .setName( "sample-app" )
  .setVersion( "1.2.3" )
  .addTags( "deploy:prod", "team:engineering" )
  .setJarPath( pathToJar ) // set jar path
  .buildProperties( properties ); // returns a copy


// pass properties to the connector
FlowConnector flowConnector = new HadoopFlowConnector( properties );

Above we see two ways to set the same property: via the setJarClass() method, or via the setJarPath() method. One is based on a Class object, the other on a literal path.

The first method takes a Class object that owns the "main" function for this application. The assumption here is that Main.class is not located in a Java Jar stored in the lib folder of the application Jar. If it is, that inner Jar is pushed to the cluster instead of the parent application Jar.

The second method simply sets the path to the Java Jar as a property.

In your application, only one of these methods needs to be called, but one of them must be called for Hadoop to be configured properly.

Example 4.1. Configuring the Application Jar with a JobConf

JobConf jobConf = new JobConf();

// pass in the class name of your application
// this will find the parent jar at runtime
jobConf.setJarByClass( Main.class );

// ALTERNATIVELY ...

// pass in the path to the parent jar
jobConf.setJar( pathToJar );

// build the properties object using jobConf as defaults
Properties properties = AppProps.appProps()
  .setName( "sample-app" )
  .setVersion( "1.2.3" )
  .addTags( "deploy:prod", "team:engineering" )
  .buildProperties( jobConf );

// pass properties to the connector
FlowConnector flowConnector = new HadoopFlowConnector( properties );

Above we start with an existing Hadoop JobConf instance and build a Properties object using it as the default.
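This default-layering follows the semantics of java.util.Properties itself, which can be illustrated without any Cascading classes. The class name and property keys below are illustrative only; the defaults object plays roughly the role the JobConf entries play when passed to buildProperties():

```java
import java.util.Properties;

public class DefaultsDemo {
    public static void main(String[] args) {
        // stands in for the values carried by an existing JobConf
        Properties defaults = new Properties();
        defaults.setProperty( "mapred.reduce.tasks", "4" );

        // layer application settings on top of the defaults
        Properties properties = new Properties( defaults );
        properties.setProperty( "cascading.app.name", "sample-app" );

        // lookups fall through to the defaults when not set locally
        System.out.println( properties.getProperty( "mapred.reduce.tasks" ) ); // 4
        System.out.println( properties.getProperty( "cascading.app.name" ) );  // sample-app
    }
}
```

Note that the defaults are consulted only by getProperty(), not by direct Hashtable access, which is why property objects built this way should be read through the Properties API.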

Note that AppProps is a helper fluent API for setting properties that define Flows or configure the underlying platform. There are quite a few "Props" based classes that expose fluent API calls; the most commonly used are listed below.

cascading.property.AppProps
Allows for setting application-specific properties. Some properties, like the application Jar, are required by the underlying platform; others, like tags, are simple metadata used by compatible management tools.

cascading.flow.FlowConnectorProps
Allows for setting a DebugLevel or AssertionLevel for a given FlowConnector to target. Also allows for setting intermediate DecoratorTap sub-classes, if any are to be used.

cascading.flow.FlowProps
Allows for setting any Flow-specific properties, such as the maximum number of concurrent steps to be scheduled, or changing the default Tuple Comparator class.

cascading.cascade.CascadeProps
Allows for setting any Cascade-specific properties, such as the maximum number of concurrent Flows to be scheduled.

cascading.tap.hadoop.HfsProps
Allows for setting Hadoop-specific FileSystem properties, specifically properties around enabling the 'combined input format' support. Combining inputs minimizes the performance penalty of processing large numbers of small files.
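As a sketch of how these Props classes compose, each exposes a static factory and a buildProperties() call in the same style as AppProps, so several can layer settings into one Properties object before it is handed to the connector. This fragment assumes the Cascading 2.x jars on the classpath; the specific setters shown (FlowProps.setMaxConcurrentSteps(), CascadeProps.setMaxConcurrentFlows()) should be verified against the Javadoc for your release:

```java
Properties properties = new Properties();

// application metadata, including the required jar setting
properties = AppProps.appProps()
  .setName( "sample-app" )
  .setJarClass( Main.class )
  .buildProperties( properties );

// limit how many steps a single Flow may run concurrently
properties = FlowProps.flowProps()
  .setMaxConcurrentSteps( 2 )
  .buildProperties( properties );

// limit how many Flows a Cascade may schedule at once
properties = CascadeProps.cascadeProps()
  .setMaxConcurrentFlows( 2 )
  .buildProperties( properties );

FlowConnector flowConnector = new HadoopFlowConnector( properties );
```

Because each buildProperties() call returns a copy seeded with the passed-in values, the order of the calls does not matter unless two Props classes set the same underlying key.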

Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.