Cascading on Amazon Web Services

Introduction

Modern data workflows never exist in isolation — often, they are written in multiple languages across a variety of platforms and orchestrated using a variety of tools and services that best fit the needs. These applications need to read from and write to a wide array data stores and warehouses. They also need to be deployed across a number of environments, and once deployed they then require performance tuning and monitoring products to provide compliance for SLAs and cluster governance. Cascading and Driven provide a comprehensive solution for building, deploying, running and managing this new class of enterprise applications.

In this multi-part tutorial we will walk through the creation of end-to-end data processing workflows in AWS environments using popular big data processing technologies. AWS users will be able to use Cascading and Lingual (a Cascading DSL that provides an ANSI-SQL compliant interface to run map reduce applications) to orchestrate a workflow that migrates data between S3 and Redshift, and another example using Lingual on AWS EMR for scalable data processing. The tutorial also shows how to quickly bootstrap EMR clusters and how to gain operational visibility into your application with Driven.

What technologies will we be using?

Prerequisites

  • Install Gradle, Lingual, Cascading-Jdbc, Driven, IDE and other software for running the tutorial

Part 1: Lingual Shell with HDFS and Redshift

  • Using Cascading Lingual shell we will migrate data from HDFS to Redshift using SQL

Part 2: Simple ETL using Lingual JDBC with S3, EMR and Redshift

  • Using Lingual JDBC in Java we will perform a simple ETL process using SQL between S3, EMR and Redshift

Part 3: ETL on EMR with Cascading using S3 and Redshift

  • Using Cascading we will perform a more complex ETL process between S3, EMR and Redshift.

The coding examples in each part refers to a commonly-used operation in ETL, and can be referred directly; you do not need to complete previous parts to use a given example.

Also, while we give references to Cascading Users Guide for the APIs used to implement these tasks, this tutorial is not intended to serve as an introduction to Cascading. For that, we recommend that you follow the Java Developers Guide to ETL with Cascading tutorial.

If you have a question or run into any problems send an email to the cascading-user-list.

Next: