Data Processing on Amazon Web Services (AWS)
AWS Setup
Setup AWS and Redshift
In this tutorial we will create end-to-end data processing workflows using the following AWS products:
-
AWS Command Line Interface - http://aws.amazon.com/cli/
-
AWS S3 - http://aws.amazon.com/s3/
-
AWS Redshift - http://aws.amazon.com/redshift/
-
AWS Elastic MapReduce - http://aws.amazon.com/elasticmapreduce/
AWS
If you have not done so already, please signup and create an AWS account
Note
|
This tutorial does not cover starting up a Redshift Database, AWS permission rules, and general EC2 management. See the Redshift Documentation and AWS CLI Documentation for further details. |
Redshift
Launch a Redshift cluster by completing the Getting Started Guide. Note down the JDBC URL, the database user name, the database password and region. You will need them later.
Also note down your AWS Security Credentials.
Since Redshift reads the data initially from S3, you have to provide the AWS access-key/secret-key combination obtained from the previous step.
Reminder, if you have not done so already, please compile cascading-jdbc-redshift by supplying your Redshift connection information to gradle build:
$ cd ./[PATH]/[TO]/cascading-jdbc
$ gradle build -Dcascading.jdbc.url.redshift='jdbc:postgresql://[REDSHIFT_HOST]/[REDSHIFT_DB]?user=[USERNAME]&password=[PASSWORD]' -i
Note
|
Please ensure that your Redshift security group will accept traffic from exteranl instances. Redshift uses port 5439. For the purposes of this tutorial we have added a Custom TCP rule to allow all traffic to port 5439. |
AWS Command Line Interface
Next, install and configure AWS Command Line Interface. The AWS CLI is a unified tool that provides a consistent interface for interacting with all parts of AWS. With it we can control multiple AWS services from the command line and automate them through scripts.
# ensure python installation
$ python --version
Python 2.7.5
# download cli
$ curl "https://s3.amazonaws.com/aws-cli/awscli-bundle.zip" -o "awscli-bundle.zip"
# unzip
$ unzip awscli-bundle.zip
# install
$ sudo ./awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws
# verify install
$ aws help
# configure to your aws account - ensure that the default region is the same your Redshift cluster
$ aws configure
# if you have any S3 buckets you should now be able to list them like so:
$ aws s3 ls