Cascading for the Impatient, Part 1
Part 1 - Distributed file copy
The lesson today is how to write a simple Cascading app. The goal is clear and concise: create the simplest application possible in Cascading, while following best practices. No bangs, no whistles, just good solid code.
Here’s a brief Java program, about a dozen lines long. It copies lines of text
A to file
B. It uses 1 mapper in
Apache Hadoop. No reducer needed. A conceptual
diagram for this implementation in Cascading is shown as:
Certainly this same work could be performed in much quicker ways, such as using
cp on Linux. However this Cascading example is merely a starting point. We’ll
build on this example, adding new pieces of code to explore features and
strengths of Cascading. We’ll keep building until we have a MapReduce
implementation of TF-IDF for scoring the
relative ``importance'' of keywords in a set of documents. In other words, Text
Mining 101. What you might find when you peek inside
Lucene for example, or some other text indexing
framework. Moreover, we’ll show how to use
TDD features of Cascading,
to build robust MapReduce apps for scale.
As explained in the introduction of the series, all code is hosted on github. If you have not cloned the repository yet, please do it now:
git clone git://github.com/Cascading/Impatient.git
First, we create a source tap to specify the input data. That data happens to be formatted as tab-separated values (TSV) with a header row:
String inPath = args[ 0 ]; Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );
Next we create a sink tap to specify the output data, which is also TSV:
String outPath = args[ 1 ]; Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );
Then we create a pipe to connect the taps:
Pipe copyPipe = new Pipe( "copy" );
Here comes the fun part. Get your tool belt ready, because we need to do a little plumbing… Connect the taps and pipes into a flow:
FlowDef flowDef = FlowDef.flowDef() .addSource( copyPipe, inTap ) .addTailSink( copyPipe, outTap );
The notion of a workflow lives at the
heart of Cascading. Instead of thinking in terms of mapper and reducer steps in
a MapReduce job, we prefer to think about apps. Real-world apps tend to use lots
of job steps. Those are connected and have dependencies, which are typically
specified by a directed
acyclic graph (DAG). Cascading uses
objects to define how a MapReduce app - a.k.a., a DAG of MapReduce job steps -
must be connected.
Now that we have a flow defined, the last line of code runs it:
flowConnector.connect( flowDef ).complete();
Place those source lines all into a
Main method, then build a JAR file. You
should be good to go.
To build the sample app from the command line use:
gradle clean jar
Before running this sample app, you’ll need to have a supported release of Apache Hadoop installed. Here’s what was used to develop and test our example code:
$ hadoop version Hadoop 2.4.1
Be sure to set your
HADOOP_HOME environment variable. Then clear the output
directory (Apache Hadoop insists, if you’re running in standalone mode) and run
rm -rf output hadoop jar ./build/libs/impatient.jar data/rain.txt output/rain
Notice how those command line arguments align with
args in the source. The
data/rain.txt gets copied, TSV row by TSV row. Output text gets stored in
the partition file
output/rain which you can then verify:
Here’s a log file from our run of the sample app. If your run looks terribly different, something is probably not set up correctly. Drop us a line on the cascading-user email forum. Plenty of experienced Cascading users are discussing taps and pipes and flows there, and eager to help.
That’s it in a nutshell, our simplest app possible in Cascading. Not quite a ``Hello World'', but more like a `Hi there, bus stop''. Or something.
Let us get started accessing your Driven environment. Driven is an application-performance management application that helps you visualize you application development, debugging, performance tuning, monitoring and other operational aspects for data-driven applications built on Cascading.
Driven is free for developer use on driven.io.
To get started, download the Driven plugin
|How does Driven work? When you run your Cascading application, the framework underneath instruments your code and sends these telemetry signals to your Driven server, which can reside in the Cloud (trial.driven.io) or be self-hosted. The Driven plugin, which resides close your application, only sends the operational meta-data to the server for analysis and visualization — no application data is ever transmitted. The operational meta-data includes the graph it constructs to run the application on the native computational fabric (MapReduce, local..) as well as the counter information it collects from the run.|
Since this is a very elementary application, we will not have much to analyze - we will wait for later parts in the tutorial for that. However, it is important at this stage to learn how to setup and access your Driven environment.
When you execute your application, you will see the URL in your console prompt. Your URL, will of course be different than the one in this screenshot.
This URL will give you insights about your application, some of which we will cover as part of the tutorial. If you have not installed the Driven plugin, you can still see how Driven will help you visualize such an application by following the link.
|To take advantage of additional features such as comparing your application statistics with historical runs, collaborating with teams, and managing multiple applications, we suggest that you get a free API key at http://www.driven.io/choose-trial/. If you have registered and finished the instructions to install the key, log in at https://trial.driven.io to visualize your application flow. We will cover cover various features of the Driven application in the following tutorials.|
We will get additional insights in later parts as we create more complex applications. From the screenshot, you will see two key components as part of the application developer view. The top half will help you visualize the graph associated with your application, showing you all the dependencies between different Cascading steps and flows. Clicking on the two taps (green circles) will give you additional attribute information, including reference to the source code where the Tap was defined.
The bottom half of the screen contains the 'Timeline View', which will give details associated with each flow run. You can click on the 'Add Columns' to explore other counters too. As your applications get more complex, these counters will help you gain insights if a particular run-time behavior is caused by code, the infrastructure, or the network.
To understand how best to understand the timing counters, read the following note on timing durations
Learn how to implement the classical word count with Cascading in Part 2 of Cascading for the Impatient.