The Scalding QuickStart Tutorial

This tutorial will teach you how to perform common data processing operations using Scalding. Scalding is a DSL written in Scala for the Cascading data processing framework. In this tutorial, we will build a data processing application in five incremental stages. This division is intentional—​we want you to understand each stage carefully before moving to the next. Knowledge of Scala is useful, but not necessary.

What are we building?

For the purposes of this tutorial, we will simulate analytics for an e-commerce website, where we have identified some "power users"-- users who frequently buy our items and generate revenue for our business. We have recently launched a new product, and we want to gauge interest shown in it by these users. We will assume that the web server logs have been collected through some mechanism in HDFS, and now need to be analyzed.

To solve this problem, we will perform the following operations in Scalding:

Prerequisites

  • Install Driven, Gradle, IDE and other software for running the tutorial

Part 1: Importing & Validating Data

  • Read the input file

  • Filter data to exclude it from processing (bad data)

Part 2: Implementing branches

  • Separate data into two streams for processing

Part 3: Filtering Data

  • Perform data operations such as filtering on Stream 1

Part 4: Implementing custom functions

  • Perform custom data operations on Stream 2

Part 5: Joins

  • Join Stream 1 and Stream 2

  • Perform a grouping

  • Write the output to HDFS

Next: Prerequisites