Tutorial for using Cascading with Amazon Redshift

Introduction

Amazon Redshift allows you to develop and manage large-scale data warehouses using your existing business intelligence skill sets in the cloud, bringing all the benefits of Amazon’s powerful cloud infrastructure to meet the demands of your BI infrastructure. In this tutorial, you will use Cascading to move data from one Redshift table to another, and then copy it to S3-backed HFS. This example is particularly relevant when one wants to offload data from an enterprise data warehouse to the Hadoop cluster to do bulk transformations or data processing steps with Cascading that allows you to operationalize the entire flow through a single application.

Finally, feel free to contact us through the Cascading User Group for any questions.

Prerequisites

In order to follow the tutorial, you will also have to have Java 7 as well as gradle < v2.0 installed on your computer.

The code of this tutorial is hosted on github. Please clone it onto your local disk:

$ git clone https://github.com/Cascading/tutorials.git
$ cd tutorials
$ gradle :cascading-redshift:jar

Ensure that code compiles.

Setup AWS and Redshift

Redshift is an AWS-specific tool and hence all the example code makes use of AWS. This tutorial does not cover starting up a Redshift Database, AWS permission rules, and general EC2 management. See the Redshift Documentation for details on how to set that up. In particular, if you are using EMR to run the flow your EMR instances will need to be in a security group that has access to the database and it is strongly suggested that you run your EMR instances in the same availability zone your Redshift database is running in.

Signup and create an S3 instance
Ensure that you have launched the Redshift cluster by completing the Getting Started Guide. Note down the JDBC URL, the database user name, and the database password. You will need it later. If you will be using Elastic Map Reduce, please ensure that you have authorized both the EMR master and slave security groups within the Redshift cluster security group.
Create AWS access key for a new user. Ensure that you give the right permissions to the user to run map reduce jobs on the Amazon cluster.
Since Redshift reads the data initially from S3, you have to provide the AWS access-key/secret-key combination obtained from the previous step:
```
$ export AWS_ACCESS_KEY=<your AWS user access key>
$ export AWS_SECRET_KEY=<your AWS user secret key>
```

Next, install and configure EMR Command Line Tools. Add the tool to your path

$ ./elastic-mapreduce --version
$ export PATH=[path]/elastic-mapreduce-cli:$PATH

Finally, install and configure s3cmd. Ensure that you download the latest version of the tool
```
$ s3cmd --configure
```

Running Redshift in Cascading

The file SampleFlow.java contains three flows to demonstrate Redshift’s integration with Cascading and Cascading’s HFS Tap, demonstrating the entire lifecycle when data is moved from EDW to Hadoop for bulk cleansing, and data is loaded back into EDW

Write data from S3-backed HFS to Redshift. This is important in use cases where after all the transformations are done on Hadoop, data is moved to an enterprise database hosting business intelligence applications.
Write data from one Redshift table to another.
Write data from Redshift to S3-based HFS.

Compile the code and launch an EMR instance by running the following from the tutorials base directory.

$ ./cascading-redshift/src/scripts/emrExample.sh [JDBC URL] [Redshift
DB user] [Redshift DB password] [S3 bucket to read and write data in] [AWS availability
zone]

If the task completes successfully, you will have two tables in Redshift "results" and "results2" a file in your S3 bucket name sampleData.csv and a directory in your S3 bucket named sampleData.csv.out containing the part-XXXX files from the M/R job that extracted and transformed the DB data.

Understanding the Code

Let’s start by inspecting the script file ./cascading-redshift/src/scripts/emrExample.sh

We start with putting sampledata.csv and the compiled jar file in the S3 bucket.

s3cmd put cascading-redshift/data/$DATAFILE s3://$BUCKET/$DATAFILE
s3cmd put $BUILD/$NAME s3://$BUCKET/$NAME

Next, the shell invokes the elastic-mapreduce command to run the compiled jar file. Let’s look inside ./cascading-redshift/src/main/java/redshift/SampleFlow.java. We will not cover the basics of Cascading (and recommend that you use the Impatient Series tutorial for that). Instead, we will focus on specifics for creating a Redshift tap.

First, we need to import the following packages.

import cascading.jdbc.AWSCredentials;
import cascading.jdbc.RedshiftScheme;
import cascading.jdbc.RedshiftTableDesc;
import cascading.jdbc.RedshiftTap;

Next, we set the properties from the parameters passed into the application.

String accessKey = args[ 5 ];
String secretKey = args[ 6 ];

Properties properties = new Properties();
properties.setProperty( "fs.s3n.awsAccessKeyId", accessKey );
properties.setProperty( "fs.s3n.awsSecretAccessKey", secretKey );

Finally, we create the Redshift tap.

String targetRedshiftTable = "results";

RedshiftTableDesc redshiftTableDesc = new RedshiftTableDesc( targetRedshiftTable, fieldNames,
                                                             fieldTypes, distributionKey, sortKeys );

RedshiftScheme redshiftScheme = new RedshiftScheme( sampleFields, redshiftTableDesc );

AWSCredentials awsCredentials = new AWSCredentials( accessKey, secretKey );

Tap outputTableTap = new RedshiftTap( redshiftJdbcUrl, redshiftUsername, redshiftPassword,
                                      tempPath, awsCredentials, redshiftTableDesc,
                                      redshiftScheme, SinkMode.REPLACE, true, false );

Once the tap is created, all data-transformation steps remain the same for Cascading!

Reference for Advanced AWS and Cascading Users

Users who are already familiar with Redshift, Cascading and Lingual can make use of this by adding the compiled library to their existing projects. Libraries for cascading-redshift are hosted on conjars.org and can be included in an existing Maven or Gradle project by adding the conjars repo http://conjars.org/repo/ to your repo list and then adding either

Maven:

<dependency>
<groupId>cascading</groupId>
<artifactId>cascading-jdbc-redshift</artifactId>
<version>2.2</version>
</dependency>

Gradle:

compile group: 'cascading', name: 'cascading-redshift', version: '2.2'