Java Developers Guide to using Apache Hive with Cascading

Prerequisites

In these tutorials, you will be developing Hive flows with Cascading framework and executing them on local instances of Hive and Hadoop; it is important that your environment is configured correctly before we get started. Here are the components that we want to make sure that you have installed correctly before we get started with development:

GitHub

The code of this tutorial is hosted on GitHub. Clone the code onto your local disk like so:

$ git clone https://github.com/Cascading/tutorials

Data

The tutorial requires data files which are hosted on data.cascading.org:

$ cd tutorials
$ mkdir cascading-hive/data
$ wget http://data.cascading.org/cascading-hive-data.tgz
$ tar xf cascading-hive-data.tgz -C cascading-hive/data

Gradle

To compile your flows, you will need Gradle

Note	Ensure that you install version 1.9 or 1.11. This exercise is currently not compatible with gradle 2.x.

You can check your version of gradle like this:

$ gradle -v
------------------------------------------------------------ +
    Gradle 1.9 +
------------------------------------------------------------ +
...

Hadoop

We will be running the demos on the Hadoop distribution of your choice. Cascading is supported on all major Hadoop distributions, Install the latest stable version from the 2.x series.

$ hadoop version +
Hadoop 2.2.0 +
...

Cascading is compatible with a number of hadoop distributions and versions. You can see on the compatibility page to see if your distribution is supported.

Hive

If you do not already have Hive installed on your local machine please follow the instructions here.

If you have not yet configured your local Hive metastore please follow the instructions here.

Driven

Use Driven to develop and debug Cascading flows. You do not need to make any changes to your existing Cascading applications to integrate with the Driven application.

The cloud version of Driven is free to use for development. Visit Try Driven to access all the developer features and gain visibility into your ETL flows.

Note

We will use the cloud-based Driven product for the purposes of this tutorial. Driven will receive the data from your application to help you monitor and visualize the development and the execution of your application. To use the cloud version of Driven, the Driven plugin must be visible to your Cascading application. In addition, your Cascading application must have network access to the Internet so it may send data to the domain name "trial.driven.io" over port 443 (the default SSL port for HTTP). Note that the client side of Cascading will be the only Java process attempting to make a remote network connection.

part3

You can learn more about using Driven for your Cascading application at Getting Started with Driven.

JDK

Make sure that you are running the tutorial with JDK 1.7. Currently, we are deprecating support for version 1.6 and do not support JDK 1.8.

IDE Support

While an IDE is not strictly required to follow the tutorials, it is certainly useful. You can easily create an IntelliJ IDEA compatible project in each part of the tutorial like this:

$ gradle ideaModule

If you prefer eclipse, you can run:

$ gradle eclipse

Part 1 - Simple file copy from HDFS to Hive