cascading-jdbc-redshift

This code provides support for Amazon’s Redshift for Cascading and Lingual. This allows users to treat Redshift as a Cascading Tap or register Redshift as a Provider in Lingual.

This can be used in an EMR-based workflow, as part of a hadoop-based Cascading Tap, in Lingual’s shell mode for standalone SQL queries, or as part of Lingual’s JDBC mechanism to integrate relational databases with Hadoop flows.

See the specific projects for details and see cascding-jdbc for more examples of JDBC connectors.

Quick Start for Advanced Users

Users who are already familiar with Redshift, Cascading and Lingual can make use of this by adding the compiled library to their existing projects.

Libraries for cascading-redshift are hosted on conjars.org and can be included in an existing Maven or Gradle project by adding the conjars repo http://conjars.org/repo/ to your repo list and then adding either

Maven:

<dependency>
<groupId>cascading</groupId>
<artifactId>cascading-jdbc-redshift</artifactId>
<version>2.2</version>
</dependency>

Gradle:

compile group: 'cascading', name: 'cascading-redshift', version: '2.2'

The source code is available via git clone https://github.com/Cascading/cascading-jdbc and the tutorial files from the tutorial project available via git clone https://github.com/Cascading/tutorial should be sufficient for Cascading and Lingual users who want to know cascading-redshift specific features.

All other users should see the steps in the more detailed Example Code sections below.

Running as a Cascading Flow: SampleFlow.java

Running in EMR: emrExample.sh

Running as a Lingual provider: lingualShellExample.sh

Example Code and Detailed Installation

To run the sample code,git clone https://github.com/Cascading/tutorial and compile the code with the command: gradle :lingual-redshift:sampleCode

This will build all the components needed for the examples. The compile has been tested against Gradle 1.9 and 1.6 but since it doesn’t make use of any complex Gradle tasks beyond the basic compile it is likely to compile properly with any Gradle version. The code supports JDK 1.6.

Set up AWS

Redshift is an AWS-specific tool and hence all the example code makes use of AWS. This tutorial does not cover starting up a Redshift Database, AWS permission rules, and general EC2 management. See the Redshift Documentation for details on how to set that up. In particular, if you are using EMR to run the flow your EMR instances will need to be in a security group that has access to the database and it is strongly suggested that you run your EMR instances in the same availability zone your Redshift database is running in.

Before proceeding with the tutorial you should verify your Redshift setup is complete and accessible using the "psql" tool to log in to Redshift and should run a basic EMR job to confirm that your EMR setup is valid.

Set AWS Credentials

Since Redshift reads the data initially from S3, you have to provide a valid aws access-key/secret-key combination. There are multiple options to do that:

Put them in the redshift-protocol.properties file as awsAccessKey and awsSecretKey
Set them as the environment variables AWS_ACCESS_KEY and AWS_SECRET_KEY
Put them in the your mapred-site.xml file as fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey

If you are running your jobs on Amazon EMR, the credentials will be in the job-conf and will automatically be picked up from there.

Example: Running Redshift in Cascading

An example with Cascading to sink data to Redshift, coping data from one Redshift table to another, and extracting data from Redshift to S3-backed HFS. Before running this code make sure that you have set up AWS and that you have installed the EMR Command Line Tools and have installed and configured s3cmd

The file SampleFlow.java contains three flows to demonstrate integration with Cascading and Cascading’s HFS Tap. You can compile the code and launch an EMR instance by running the following from the cascading-jdbc base directory: ./lingual-redshift/src/main/resources/emrExample.sh [JDBC URL] [Redshift DB user] [Redshift DB password] [S3 bucket to read and write data in] [AWS availability zone]

If the task completes successfully, you will have two tables in S3 "results" and "results2" a file in your S3 bucket name sampleData.csv and a directory in your S3 bucket named sampleData.csv.out containing the part-XXXX files from the M/R job that extracted and transformed the DB data.

Example: Running Redshift as a Lingual Provider

The Redshift code can also be used as a Lingual provider. This requires a install of Lingual , the AWS tools described above, and a run of the Cascading Sample to populate the database with some information.

Running the sample assumes that you do not have an existing Lingual catalog that you need to preserve. If you do, back up your catalog before running it since this sample will re-initialize the catalog from scratch.

To compile and run the lingual example execute the following from the cascading-jdbc base directory: ./lingual-redshift/src/main/resources/lingualShellExample.sh [JDBC URL] [Redshift DB user] [Redshift DB password]

This script registers the provider and issues a "SELECT * FROM results" query to the command line.