Data Processing on Amazon Web Services (AWS)

Prerequisites

In these tutorials you will develop data flows with the Cascading framework and execute them on Hadoop and AWS Elastic MapReduce, so it is important that your environment is configured correctly before you start. Make sure the following components are installed before you begin development:

Environment

This tutorial assumes that you will be running the sample applications from your local machine in a Unix/Linux environment.

JDK

Please run this tutorial using JDK 1.7.
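
You can verify which Java version is on your path like this; the command should report a 1.7 runtime:

$ java -version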

Gradle

To compile your flows, you will need Gradle.

Note
Ensure that you install version 1.9 or 1.11. This exercise is currently not compatible with Gradle 2.x.

You can check your version of gradle like this:

$ gradle -v
------------------------------------------------------------
Gradle 1.9
------------------------------------------------------------

GitHub

The code for this tutorial is hosted on GitHub. Clone it onto your local disk and build it like so:

$ mkdir ./[PATH]/[TO]/cascading-tutorials
$ cd ./[PATH]/[TO]/cascading-tutorials
$ git clone https://github.com/Cascading/tutorials
$ cd tutorials/cascading-aws
$ gradle build

These tutorials also leverage cascading-jdbc, which is hosted on GitHub. If you have not yet set up your AWS Redshift cluster, please do so first (see AWS Setup), then clone the code like so:

$ mkdir ./[PATH]/[TO]/cascading-jdbc
$ cd ./[PATH]/[TO]/cascading-jdbc
$ git clone https://github.com/Cascading/cascading-jdbc.git
$ cd cascading-jdbc

After you have your Redshift cluster connection information (see AWS Redshift Setup), run the following command to build cascading-jdbc-redshift:

$ gradle build -Dcascading.jdbc.url.redshift='jdbc:postgresql://[REDSHIFT_HOST]/[REDSHIFT_DB]?user=[USERNAME]&password=[PASSWORD]' -i

Data

The tutorial requires data files, which are hosted on data.cascading.org:

$ cd ./[PATH]/[TO]/cascading-tutorials/cascading-aws/
$ wget http://data.cascading.org/cascading-aws-data.tgz
$ tar xf cascading-aws-data.tgz
$ rm -f cascading-aws-data.tgz

Hadoop

In part 1 of this tutorial we will demonstrate data migration from Hadoop to AWS Redshift using the Cascading Lingual shell. You may run the sample application for part 1 on the Hadoop distribution of your choice; Cascading is supported on all major Hadoop distributions. If you would like to run part 1 (Cascading Lingual shell), please install the latest stable Hadoop release from the 2.x series.

Parts 2 and 3 of this tutorial will focus on AWS Elastic MapReduce and do not require a Hadoop installation.

$ hadoop version
Hadoop 2.4.1

To run part 1, add the necessary data to HDFS:

$ hadoop fs -copyFromLocal cascading-aws-data/NASA_access_log_Aug95_short.ssv /[HADOOP]/[FILE]/[PATH]
$ hadoop fs -ls /[HADOOP]/[FILE]/[PATH]

Driven

Use Driven to develop and debug Cascading flows. You do not need to make any changes to your existing Cascading applications to integrate with the Driven application.

The cloud version of Driven is free to use for development. Visit Try Driven to access all the developer features and gain visibility into your ETL flows. For more details, see the Driven Documentation.

Note
We will use the cloud-based Driven product for the purposes of this tutorial. Driven will receive data from your application to help you monitor and visualize the development and execution of your application. To use the cloud version of Driven, the Driven plugin must be visible to your Cascading application. In addition, your Cascading application must have network access to the Internet so it can send data to the domain name "trial.driven.io" over port 443 (the standard HTTPS port). Note that the client side of Cascading will be the only Java process attempting to make a remote network connection.
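
If you want to confirm that your machine can reach the Driven cloud service before running the applications, a quick connectivity check (shown here with curl, though any tool that can open an outbound HTTPS connection works) looks like this:

$ curl -I https://trial.driven.io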


You can learn more about using Driven for your Cascading application at Driven Documentation.

Database Workbench

SQL Workbench/J was used to view the AWS Redshift databases and cluster created during this tutorial.
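
For example, a SQL Workbench/J connection profile for Redshift can reuse the same PostgreSQL JDBC driver and credentials used to build cascading-jdbc-redshift above (Redshift listens on port 5439 by default; substitute your own host, database, and credentials):

Driver:   org.postgresql.Driver
URL:      jdbc:postgresql://[REDSHIFT_HOST]/[REDSHIFT_DB]
Username: [USERNAME]
Password: [PASSWORD]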

IDE Support

While an IDE is not strictly required to follow the tutorials, it is certainly useful. You can easily create an IntelliJ IDEA compatible project in each part of the tutorial like this:

$ cd ./[PATH]/[TO]/cascading-tutorials/cascading-aws/
$ gradle cleanIdea idea
$ open *.ipr

If you prefer Eclipse, you can run:

$ gradle eclipse

Next:

AWS Setup