Java Developers Guide to ETL with Cascading
Welcome to the Java Developer’s Guide to ETL, a tutorial that will take you through implementing the commonly-operated ETL tasks with Cascading.
The coding examples in each part refers to a commonly-used operation in ETL, and can be referred directly; you do not need to complete previous parts to use a given example.
Also, while we give references to Cascading Users Guide for the APIs used to implement the ETL tasks, this tutorial is not intended to serve as an introduction to Cascading. For that, we recommend that you follow the Cascading for the Impatient tutorial.
If you have a question or run into any problems send an email to the cascading-user-list.
This tutorial discusses the following topics, which include exercises and links to resource material:
Introduction to ETL
-
Discusses key evaluation criteria for deciding your ETL strategy
-
Explains basic Cascading concepts
-
Evaluates Cascading and Driven as a framework for implementing ETL applications
Prerequisites
-
Install Driven, Gradle, IDE and other software for running the tutorial
Part 1: File Copy
-
Simple ETL application that copies a file from one location to another
-
Filters data to exclude it from processing (bad data)
-
Specifies output format (tab delimited)
-
Bins data based on date ranges
Part 2: Filters
-
Separate unwanted data and store it to a different file for separate analysis
-
Implement in-line data quality checks
-
Perform different processing logic based on content
Part 3: Merge
-
Merge records from multiple input files using MultiSourceTap
Part 4: Count
-
Count number of unique records
-
Report "Top xx" occurrences
Part 5: Moving Averages
-
Implement advanced aggregation techniques using GroupBy() in Cascading
Part 6: Joins
-
Split pipe into different branches based on data content
-
Perform different processing on each branch
-
Join the branches using HashJoin() in Cascading