Data Processing on Amazon Web Services (AWS)

AWS Setup

Set Up AWS and Redshift

In this tutorial we will create end-to-end data processing workflows using the following AWS products:

AWS

If you have not done so already, please sign up and create an AWS account.

Note
This tutorial does not cover starting up a Redshift database, AWS permission rules, or general EC2 management. See the Redshift Documentation and AWS CLI Documentation for further details.

Redshift

Launch a Redshift cluster by completing the Getting Started Guide. Note the JDBC URL, the database user name, the database password, and the region. You will need them later.

Also note down your AWS Security Credentials.

Because Redshift initially loads the data from S3, you must provide the AWS access key and secret key from those credentials.
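One common way to keep the key pair at hand without pasting it into scripts is to export it as environment variables. The sketch below is an illustration only: `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the variable names the AWS CLI recognizes, but whether the later steps of this tutorial read them is an assumption, and `require_aws_credentials` is a hypothetical helper name.

```shell
# Hypothetical helper: fail early if either credential variable is missing.
# It only checks that the variables are non-empty; it does not validate them
# against AWS.
require_aws_credentials() {
  if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "AWS credentials are not set" >&2
    return 1
  fi
  echo "AWS credentials found"
}

# Usage (replace the placeholders with your own values):
# export AWS_ACCESS_KEY_ID='[ACCESS_KEY]'
# export AWS_SECRET_ACCESS_KEY='[SECRET_KEY]'
# require_aws_credentials
```

Running the check before a long job gives a clear error up front instead of a failure partway through the load from S3.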

Reminder: if you have not done so already, compile cascading-jdbc-redshift by supplying your Redshift connection information to gradle build:

$ cd ./[PATH]/[TO]/cascading-jdbc
$ gradle build -Dcascading.jdbc.url.redshift='jdbc:postgresql://[REDSHIFT_HOST]/[REDSHIFT_DB]?user=[USERNAME]&password=[PASSWORD]' -i

Note
Please ensure that your Redshift security group accepts traffic from external instances. Redshift uses port 5439. For the purposes of this tutorial we have added a Custom TCP rule to allow all traffic to port 5439.
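The connection string passed to gradle build can be assembled from the values noted when the cluster was launched. The following is a minimal sketch, not part of cascading-jdbc: `redshift_jdbc_url` is a hypothetical helper, and it assumes the port is written explicitly in the URL, defaulting to 5439. It does not URL-encode its arguments, so a password containing characters such as `&` or `?` would need encoding first.

```shell
# Hypothetical helper: print a JDBC URL in the form used by the build command.
# Arguments: host, database, user, password, optional port (defaults to 5439).
redshift_jdbc_url() {
  printf 'jdbc:postgresql://%s:%s/%s?user=%s&password=%s\n' \
    "$1" "${5:-5439}" "$2" "$3" "$4"
}

# Usage (replace the bracketed placeholders with your own values):
# gradle build \
#   -Dcascading.jdbc.url.redshift="$(redshift_jdbc_url '[REDSHIFT_HOST]' '[REDSHIFT_DB]' '[USERNAME]' '[PASSWORD]')" -i
```

Quoting the whole `-D` argument keeps the shell from interpreting the `&` in the URL as a background operator.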

AWS Command Line Interface

Next, install and configure the AWS Command Line Interface. The AWS CLI is a unified tool that provides a consistent interface for interacting with all parts of AWS. With it we can control multiple AWS services from the command line and automate them through scripts.

# ensure python installation
$ python --version
Python 2.7.5
# download cli
$ curl "https://s3.amazonaws.com/aws-cli/awscli-bundle.zip" -o "awscli-bundle.zip"
# unzip
$ unzip awscli-bundle.zip
# install
$ sudo ./awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws
# verify install
$ aws help
# configure for your AWS account - ensure that the default region is the same as your Redshift cluster's
$ aws configure
# if you have any S3 buckets you should now be able to list them like so:
$ aws s3 ls
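
Since the default region has to match the region of the Redshift cluster, it can help to check this explicitly after running `aws configure`. The sketch below assumes a hypothetical helper named `check_region`; the only AWS CLI call involved is `aws configure get region`, which prints the configured default region.

```shell
# Hypothetical helper: compare the cluster's region with the CLI's default region.
# Arguments: expected region (the Redshift cluster's), actual region (the CLI's).
check_region() {
  expected_region="$1"
  actual_region="$2"
  if [ "$expected_region" = "$actual_region" ]; then
    echo "OK: default region is $actual_region"
  else
    echo "MISMATCH: CLI uses '$actual_region' but the cluster is in '$expected_region'" >&2
    return 1
  fi
}

# Usage, once the AWS CLI is configured (replace the placeholder region):
# check_region '[REDSHIFT_REGION]' "$(aws configure get region)"
```

The helper takes both regions as arguments so it can be exercised without network access or a configured account.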
