Cascading-Hive - Integration for Cascading and Apache Hive

Introduction to Cascading-Hive

Cascading-Hive provides integration between Cascading and Apache Hive.

This project bridges the gap by making it possible to read and write Hive tables from within Cascading flows and having Hive queries participating in a Cascade.

The features of Cascading-Hive include:

run Hive queries within a Cascade (HiveFlow)
read and write Hive tables within a Cascading Flow (HiveTap, HiveTableDescriptor)
read and write partitioned Hive tables from a Cascading Flow (HivePartitionTap)
deconstruct a Hive view into Taps (HiveViewAnalyzer)

With these features it is possible to combine existing Hive queries with new Cascading code and having the Cascading planner run everything in the correct order.

The project maps Hive concepts like tables and running queries onto Cascading concepts like Taps and Flows.

The code of Cascading-Hive is available on github and the jars are deployed to conjars.

Hive dependencies

Cascading-Hive works with hive-0.10+. When using the maven dependency you have to specify the version of Hive you are using as a runtime dependency yourself. This is done to avoid classpath issues with the various Hadoop and Hive distributions in existence. The demo project introduced below contains an example of that.

Demo applications

The demo directory of the project contains 3 applications using Cascading-Hive demonstrating the usage of the core classes to build applications with Cascading and Apache Hive.

Hive MetaStore.

Please note that some features of Cascading-Hive require a hosted Hive MetaStore. Therefore you should configure one before running any of the demo applications.

You can assemble the hadoop ready jar file of the demos by checking out the repository and running the following commands:

> git clone https://github.com/Cascading/cascading-hive.git
> cd cascading-hive/demo
> gradle jar

Running the demo apps follows the standard hadoop way of running applications:

>  hadoop jar build/libs/cascading-hive-demo-1.0.jar <main-class>

cascading.hive.HiveDemo

This demo shows how Cascading-based data processing and Hive-based processing can be combined to build one app based on both technologies. The application does the following list of things:

define and create a table called dual
load existing data into dual
create a second table called keyvalue
populate the second table via SQL
create a third table called keyvalue2
populate keyvalue2 with a pure Cascading flow that reads from keyvalue
use JDBC to read the data back from keyvalue2

All these steps, except the last, are happening in the same Cascade. Cascading can determine the dependencies between the Hive flows and the Cascading flows and enables you to build more complex applications using Cascading and Hive.

cascading.hive.HivePartitionDemo

This demo focusses on the support for partitioned tables. The app is dealing with a tab separated log file of a simulated cloud service logging customers in regions. The goal is to create a Hive table that is partitioned based on the region where a customer interaction happened. The following steps are involved:

load the log file onto HDFS
create a partitioned Hive table, that uses region as the partitioning column
read all data from the log file and populate the Hive table from a pure cascading flow
read all information about customers having interactions in the ASIA region via JDBC.
do the same as above, but in a Cascading flow

The app creates and reads a partitioned table and uses it via JDBC to demonstrate the support for partitioned tables in Cascading-Hive.

cascading.hive.HiveViewDemo

This demo shows how to create a view within a Cascading based application. The application creates a view based on a partitioned table (similar to HivePartitionDemo) and queries the view via JDBC.

Hive views vs. Cascading.

Views are an interesting construct since they form a hybrid of a ressource and computation. When writing SQL, they behave as if they were a ressource, yet in reality they are ad-hoc computation. Mapping these concepts onto the Cascading world is not easy, since a ressource is represented by a Tap and computation is represented by a Flow. Currently there is no clean way of representing views in Cascading, which is why the support is not as prominent as for regular tables.

Nevertheless has Cascading-Hive a set of helper classes, which enable you to let view-based Hive queries be part of a Cascade.

Testing

Cascading-Hive also provides useful classes, which you can use in your tests, when writing hybrid Hive and Cascading applictions. Using HiveTestCase as a base class for such a test is probably a good idea. Most of the tests in Cascading-Hive are using it.