Cascading 3.0 User Guide - Apache Tez Platform

Apache Tez Platform

The following documentation covers details about using Cascading on the Apache Tez platform that are not covered in the Apache Hadoop documentation of this guide.

The most up-to-date information about running Cascading on Apache Tez and supported Tez releases can be found in a GitHub repo README at:

Released Source: https://github.com/cascading/cascading/tree/3.0/cascading-hadoop2-tez
Work-in-Progress Source: https://github.com/cwensel/cascading/tree/wip-3.0/cascading-hadoop2-tez

Apache Tez is a noticeable improvement over MapReduce. Tez’s merits include:

No more "identity mappers" — mappers that simply forward data to a reducer
Support for multiple outputs
No prefixing data with join ordinality
Suppression of sorting when not needed
Removal of HDFS as an intermediate store between jobs

Configuring Applications

During runtime, Hadoop must be told which application JAR file should be pushed to the cluster.

In order to remain platform-independent, the AppProps class should be used as described in the configuring applications for Hadoop documentation.

Building

Cascading ships with several JARs and dependencies in the download archive.

Alternatively, Cascading is available over Maven and Ivy through the Conjars repository, along with a number of other Cascading-related projects. See http://conjars.org for more information.

The Cascading Hadoop artifacts include the following:

cascading-core-3.x.y.jar: This JAR contains the Cascading Core class files. It should be packaged with lib/*.jar when using Hadoop.
cascading-hadoop2-tez-3.x.y.jar: This JAR contains the Cascading Hadoop 2 and Apache Tez specific dependencies. It should be packaged with lib/*.jar when using Hadoop.

Cascading works with either of the Hadoop processing modes — the default local stand-alone mode and the distributed cluster mode. As specified in the Hadoop documentation, running in cluster mode requires the creation of a Hadoop job JAR that includes the Cascading JARs, plus any needed third-party JARs, in its lib directory. This is true regardless of whether they are Cascading Hadoop-mode applications or raw Apache Tez applications.