Cascading 3.0 User Guide - Apache Hadoop MapReduce Platform

Apache Hadoop MapReduce Platform

By default, Apache Hadoop provides an API called MapReduce for performing computation at scale.

The following documentation covers details about using Cascading on the MapReduce platform that are not covered in the Apache Hadoop documentation of this guide.

Configuring Applications

At runtime, Hadoop must be told which application JAR file should be pushed to the cluster. Historically, this is done via the Hadoop API JobConf object, as seen in the example below.

In order to remain platform-independent, use the AppProps class as described in Configuring Applications.

If you must use an existing JobConf instance, consider the example below:

Example 1. Configuring the application JAR with a JobConf

JobConf jobConf = new JobConf();

// pass in the class name of your application
// this will find the parent jar at runtime
jobConf.setJarByClass( Main.class );

// ALTERNATIVELY ...

// pass in the path to the parent jar
jobConf.setJar( pathToJar );

// build the properties object using jobConf as defaults
Properties properties = AppProps.appProps()
  .setName( "sample-app" )
  .setVersion( "1.2.3" )
  .addTags( "deploy:prod", "team:engineering" )
  .buildProperties( jobConf );

// pass properties to the connector
FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties );

In the example above we see two ways to use methods to set the same property:

setJarClass(): In this method, you invoke a Class object by name. In the example, the Class that is named owns the "main" function for this application. The assumption here is that Main.class is not located in a Java JAR that is stored in the lib folder of the application JAR. If it is, the dependent lib folder JAR will be pushed to the cluster, not to the parent application JAR (the JAR containing the lib folder).
setJarPath(): This method requires setting a literal path to the Java JAR as a property.

In your application, only one of these methods must be called to properly configure Hadoop.

Creating Flows from a JobConf

If a MapReduce job already exists, then the cascading.flow.hadoop.MapReduceFlow class should be used. To do this, create a Hadoop JobConf instance and simply pass it into the MapReduceFlow constructor. The resulting Flow instance can be used like any other Flow.

Note both multiple MapReduceFlow instances and other Flow instances can be passed to a CascadeConnector to produce a Cascade.

Building

Cascading ships with several JARs and dependencies in the download archive.

Alternatively, Cascading is available via Maven and Ivy through the Conjars repository, along with a number of other Cascading-related projects. See http://conjars.org for more information.

The Cascading Hadoop artifacts include the following:

cascading-core-3.x.y.jar: This JAR contains the Cascading Core class files. It should be packaged with lib/*.jar when using Hadoop.
cascading-hadoop-3.x.y.jar: This JAR contains the Cascading Hadoop 1 specific dependencies. It should be packaged with lib/*.jar when using Hadoop.
cascading-hadoop2-mr1-3.x.y.jar: This JAR contains the Cascading Hadoop 2 specific dependencies. It should be packaged with lib/*.jar when using Hadoop.

Do not package both cascading-hadoop-3.x.y.jar and cascading-hadoop2-mr1-3.x.y.jar JAR files into your application. Choose the version that matches your Hadoop distribution version.

Cascading works with either of the Hadoop processing modes: the default local stand-alone mode and the distributed cluster mode. As specified in the Hadoop documentation, running in cluster mode requires the creation of a Hadoop job JAR that includes the Cascading JARs, plus any needed third-party JARs, in its lib directory. This is true regardless of whether they are Cascading Hadoop-mode applications or raw Hadoop MapReduce applications.