Cascading for the Impatient, Part 1

Part 1 - Distributed file copy

The lesson today is how to write a simple Cascading app. The goal is clear and concise: create the simplest application possible in Cascading, while following best practices. No bells, no whistles, just good solid code.

https://github.com/Cascading/Impatient/tree/master/part1

Here’s a brief Java program, about a dozen lines long. It copies lines of text from file A to file B. It uses one mapper in Apache Hadoop; no reducer is needed. A conceptual diagram of this implementation in Cascading is shown here:

[Figure: plumb1 (conceptual diagram of the copy flow)]

Certainly this same work could be performed in much quicker ways, such as using cp on Linux. However, this Cascading example is merely a starting point. We’ll build on this example, adding new pieces of code to explore features and strengths of Cascading. We’ll keep building until we have a MapReduce implementation of TF-IDF for scoring the relative “importance” of keywords in a set of documents. In other words, Text Mining 101: what you might find when you peek inside Lucene, for example, or some other text indexing framework. Moreover, we’ll show how to use TDD features of Cascading to build robust MapReduce apps for scale.

Source

As explained in the introduction of the series, all code is hosted on GitHub. If you have not cloned the repository yet, please do so now:

git clone git://github.com/Cascading/Impatient.git

First, we create a source tap to specify the input data. That data happens to be formatted as tab-separated values (TSV) with a header row:

String inPath = args[ 0 ];
Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );
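To make "TSV with a header row" concrete, here is a minimal plain-Java sketch of how such a file can be read: the first line names the fields, and every later line supplies one value per field. This is only an illustration of the data format, not how Cascading's TextDelimited is implemented. The class name TsvHeaderSketch and the sample rows are hypothetical and are not the contents of data/rain.txt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of the data format only; not Cascading code.
class TsvHeaderSketch {

    // Parse TSV lines whose first line is a header row naming the fields.
    static List<Map<String, String>> parse(List<String> lines) {
        List<Map<String, String>> rows = new ArrayList<>();
        if (lines.isEmpty()) {
            return rows;
        }
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        String[] header = lines.get(0).split("\t", -1);
        for (String line : lines.subList(1, lines.size())) {
            String[] values = line.split("\t", -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < header.length; i++) {
                // Missing trailing values are treated as empty strings.
                row.put(header[i], i < values.length ? values[i] : "");
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, made up for this sketch.
        List<String> lines = Arrays.asList("id\ttext", "a1\tfirst line", "a2\tsecond line");
        for (Map<String, String> row : parse(lines)) {
            System.out.println(row);
        }
    }
}
```

In the tap definition, the first constructor argument (true) is what declares that the file carries such a header row, and the second ("\t") is the field delimiter.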

Next we create a sink tap to specify the output data, which is also TSV:

String outPath = args[ 1 ];
Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

Then we create a pipe to connect the taps:

Pipe copyPipe = new Pipe( "copy" );

Here comes the fun part. Get your tool belt ready, because we need to do a little plumbing… Connect the taps and pipes into a flow:

FlowDef flowDef = FlowDef.flowDef()
.addSource( copyPipe, inTap )
.addTailSink( copyPipe, outTap );

The notion of a workflow lives at the heart of Cascading. Instead of thinking in terms of mapper and reducer steps in a MapReduce job, we prefer to think about apps. Real-world apps tend to use lots of job steps. Those steps are connected and have dependencies, which are typically specified by a directed acyclic graph (DAG). Cascading uses FlowDef objects to define how a MapReduce app (that is, a DAG of MapReduce job steps) must be connected.
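The idea of job steps with dependencies forming a DAG can be sketched in plain Java. The example below is a conceptual illustration only and makes no claim about how Cascading plans or schedules flows: it records "must run before" edges between steps and derives one valid execution order with Kahn's topological sort. The class name JobStepDagSketch and the step names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual illustration only; this is not Cascading's planner.
class JobStepDagSketch {

    // Maps each step name to the steps that depend on it.
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    // Number of upstream steps each step is waiting on.
    private final Map<String, Integer> pending = new LinkedHashMap<>();

    void addStep(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
        pending.putIfAbsent(name, 0);
    }

    // Declare that 'to' may only run after 'from' has finished.
    void addDependency(String from, String to) {
        addStep(from);
        addStep(to);
        downstream.get(from).add(to);
        pending.merge(to, 1, Integer::sum);
    }

    // Kahn's algorithm: repeatedly pick a step whose dependencies are all done.
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(pending);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String step = ready.remove();
            order.add(step);
            for (String next : downstream.get(step)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        // If some steps were never ready, the graph contains a cycle.
        if (order.size() != remaining.size()) {
            throw new IllegalStateException("cycle detected: not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        JobStepDagSketch dag = new JobStepDagSketch();
        // Hypothetical step names for a multi-step app.
        dag.addDependency("tokenize", "count");
        dag.addDependency("tokenize", "filter");
        dag.addDependency("count", "join");
        dag.addDependency("filter", "join");
        System.out.println(dag.executionOrder());
    }
}
```

The copy app in this part is the degenerate case: a single step with no dependencies. The graph becomes more interesting as later parts add steps.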

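Now that we have a flow defined, the last line of code runs it. Note that flowConnector is a FlowConnector instance which must be created earlier in the program; see the full source in the repository for its setup: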

flowConnector.connect( flowDef ).complete();

Place all of those source lines into a main method, then build a JAR file. You should be good to go.

If you want to read in more detail about the classes in the Cascading API which were used, see the Cascading 3.0 User Guide and JavaDoc.

Build

To build the sample app from the command line, use:

gradle clean jar

At this point you should have a JAR file which is nearly ready to drop into your Maven repo, but not quite. We provide a community JAR repository for Cascading libraries and extensions at http://conjars.org.

Run

Before running this sample app, you’ll need to have a supported release of Apache Hadoop installed. Here’s what was used to develop and test our example code:

$ hadoop version
Hadoop 2.4.1

Be sure to set your HADOOP_HOME environment variable. Then clear the output directory (Apache Hadoop insists, if you’re running in standalone mode) and run the app:

rm -rf output
hadoop jar ./build/libs/impatient.jar data/rain.txt output/rain

Notice how those command line arguments align with args[] in the source. The file data/rain.txt gets copied, TSV row by TSV row. Output text gets stored in a partition file under the output/rain directory, which you can then verify:

more output/rain/part-00000
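Beyond eyeballing the output, the copy can be checked mechanically. The sketch below is a hypothetical helper, not part of the tutorial's repository, that compares two text files line by line using only the Java standard library. It assumes the run produced a single partition file and that the sink wrote the header row, so the output should match the input line for line; if your run produced several part files, compare against their concatenation instead.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper for checking the copy; not part of the tutorial's repository.
class CopyCheckSketch {

    // Returns the 1-based number of the first differing line, or -1 if the files match.
    static long firstDifference(Path expected, Path actual) throws IOException {
        List<String> a = Files.readAllLines(expected, StandardCharsets.UTF_8);
        List<String> b = Files.readAllLines(actual, StandardCharsets.UTF_8);
        int shared = Math.min(a.size(), b.size());
        for (int i = 0; i < shared; i++) {
            if (!a.get(i).equals(b.get(i))) {
                return i + 1;
            }
        }
        // Same prefix: the files differ only if one has extra lines.
        return a.size() == b.size() ? -1 : shared + 1;
    }

    public static void main(String[] args) throws IOException {
        // Defaults mirror the paths used in the commands above.
        Path in = Paths.get(args.length > 0 ? args[0] : "data/rain.txt");
        Path out = Paths.get(args.length > 1 ? args[1] : "output/rain/part-00000");
        long diff = firstDifference(in, out);
        System.out.println(diff < 0 ? "files match" : "first difference at line " + diff);
    }
}
```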

Here’s a log file from our run of the sample app. If your run looks terribly different, something is probably not set up correctly. Drop us a line on the cascading-user email forum. Plenty of experienced Cascading users are discussing taps and pipes and flows there, and are eager to help.

That’s it in a nutshell, our simplest app possible in Cascading. Not quite a “Hello World”, but more like a “Hi there, bus stop”. Or something.

Driven

Let us get started with accessing your Driven environment. Driven is an application performance management tool that helps you visualize the development, debugging, performance tuning, monitoring, and other operational aspects of data-driven applications built on Cascading.

Driven is free for developer use on driven.io.

To get started, download the Driven plugin.

Note

How does Driven work? When you run your Cascading application, the framework underneath instruments your code and sends telemetry signals to your Driven server, which can reside in the cloud (trial.driven.io) or be self-hosted. The Driven plugin, which resides close to your application, sends only operational metadata to the server for analysis and visualization; no application data is ever transmitted. The operational metadata includes the graph constructed to run the application on the native computational fabric (MapReduce, local, etc.) as well as the counter information collected from the run.

Since this is a very elementary application, we will not have much to analyze; we will wait for later parts of the tutorial for that. However, it is important at this stage to learn how to set up and access your Driven environment.

When you execute your application, you will see the URL in your console output. Your URL will, of course, be different from the one in this screenshot.

[Screenshot: driven-part1-b]

This URL will give you insights about your application, some of which we will cover as part of the tutorial. If you have not installed the Driven plugin, you can still see how Driven will help you visualize such an application by following the link.

Note

To take advantage of additional features such as comparing your application statistics with historical runs, collaborating with teams, and managing multiple applications, we suggest that you get a free API key at http://www.driven.io/choose-trial/. If you have registered and finished the instructions to install the key, log in at https://trial.driven.io to visualize your application flow. We will cover various features of the Driven application in the following tutorials.

[Screenshot: driven-part1-a]

We will get additional insights in later parts as we create more complex applications. In the screenshot, you will see the two key components of the application developer view. The top half helps you visualize the graph associated with your application, showing all the dependencies between different Cascading steps and flows. Clicking on the two taps (green circles) gives you additional attribute information, including a reference to the source code where the Tap was defined.

The bottom half of the screen contains the 'Timeline View', which gives details associated with each flow run. You can click on 'Add Columns' to explore other counters too. As your applications get more complex, these counters will help you determine whether a particular run-time behavior is caused by the code, the infrastructure, or the network.

To learn how best to interpret the timing counters, read the following note on timing durations.

Learn how to implement the classical word count with Cascading in Part 2 of Cascading for the Impatient.