Introduction to Cascading-Hive
Cascading-Hive provides integration between Cascading and Apache Hive. This project bridges the gap by making it possible to read and write Hive tables from within Cascading flows and to have Hive queries participate in a Cascade.
The features of Cascading-Hive include:

- run Hive queries within a Cascade (HiveFlow)
- read and write Hive tables within a Cascading Flow (HiveTap, HiveTableDescriptor)
- read and write partitioned Hive tables from a Cascading Flow (HivePartitionTap)
- deconstruct a Hive view into Taps (HiveViewAnalyzer)
With these features it is possible to combine existing Hive queries with new Cascading code and have the Cascading planner run everything in the correct order. The project maps Hive concepts like tables and queries onto Cascading concepts like Taps and Flows.
Hive dependencies
Cascading-Hive works with Hive 0.10+. When using the Maven dependency you have to specify the version of Hive you are using as a runtime dependency yourself. This is done to avoid classpath issues with the various Hadoop and Hive distributions in existence. The demo project introduced below contains an example of that.
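As an illustration, a Gradle build for an application using Cascading-Hive might declare the dependencies like this (the version numbers are examples only; substitute the Hive version your distribution ships):

```groovy
dependencies {
  // compile-time dependency on Cascading-Hive itself
  compile 'cascading:cascading-hive:1.0.0'
  // the Hive runtime must be supplied by you, matching your cluster's version
  runtime 'org.apache.hive:hive-exec:0.10.0'
}
```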
Demo applications
The demo directory of the project contains three applications using Cascading-Hive that demonstrate how the core classes are used to build applications with Cascading and Apache Hive.
You can assemble the hadoop ready jar file of the demos by checking out the repository and running the following commands:
> git clone https://github.com/Cascading/cascading-hive.git
> cd cascading-hive/demo
> gradle jar
Running the demo apps follows the standard hadoop way of running applications:
> hadoop jar build/libs/cascading-hive-demo-1.0.jar <main-class>
cascading.hive.HiveDemo
This demo shows how Cascading-based and Hive-based data processing can be combined into one application built on both technologies. The application performs the following steps:
- define and create a table called dual
- load existing data into dual
- create a second table called keyvalue
- populate the second table via SQL
- create a third table called keyvalue2
- populate keyvalue2 with a pure Cascading flow that reads from keyvalue
- use JDBC to read the data back from keyvalue2
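The last step, reading keyvalue2 back over JDBC, might look like the following sketch. It uses the standard HiveServer2 JDBC URL form; the host, port, database, and empty credentials are placeholders for your own setup, and the hive-jdbc driver must be on the classpath with a HiveServer2 instance running.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hedged sketch: read the rows of keyvalue2 back via Hive's JDBC driver.
// Connection details below are placeholders, not part of the demo itself.
public class ReadKeyValue2 {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM keyvalue2")) {
            while (rs.next()) {
                // print each row as tab-separated key and value
                System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
        }
    }
}
```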
All these steps, except the last, happen in the same Cascade. Cascading can determine the dependencies between the Hive flows and the Cascading flows, enabling you to build more complex applications using Cascading and Hive.
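The dependency ordering described above can be illustrated with a simplified, self-contained sketch (this is not the real Cascading planner API; the names are illustrative): each flow names its source and sink taps, and a flow must run after whatever flow produces one of its sources.

```java
import java.util.*;

// Simplified illustration of how a Cascade can order its flows: the Hive
// query that writes "keyvalue" must run before the Cascading flow that
// reads "keyvalue" and writes "keyvalue2". This mimics the idea only; the
// actual Cascading planner works on real Taps and Flows.
public class CascadeOrdering {

    record Flow(String name, Set<String> sources, String sink) {}

    // topological sort: visit each flow's producers before the flow itself
    static List<String> schedule(List<Flow> flows) {
        Map<String, Flow> producerOf = new HashMap<>();
        for (Flow f : flows) producerOf.put(f.sink(), f);
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        for (Flow f : flows) visit(f, producerOf, done, order);
        return order;
    }

    static void visit(Flow f, Map<String, Flow> producerOf,
                      Set<String> done, List<String> order) {
        if (!done.add(f.name())) return; // already scheduled
        for (String src : f.sources()) {
            Flow dep = producerOf.get(src);
            if (dep != null) visit(dep, producerOf, done, order);
        }
        order.add(f.name());
    }

    public static void main(String[] args) {
        List<Flow> flows = List.of(
            new Flow("cascading-flow", Set.of("keyvalue"), "keyvalue2"),
            new Flow("hive-query", Set.of("dual"), "keyvalue"));
        // the Hive query is scheduled first, despite being listed second
        System.out.println(schedule(flows)); // -> [hive-query, cascading-flow]
    }
}
```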
cascading.hive.HivePartitionDemo
This demo focuses on the support for partitioned tables. The app processes a tab-separated log file of a simulated cloud service that logs customer interactions per region. The goal is to create a Hive table that is partitioned by the region where a customer interaction happened. The following steps are involved:
- load the log file onto HDFS
- create a partitioned Hive table that uses region as the partitioning column
- read all data from the log file and populate the Hive table from a pure Cascading flow
- read all information about customers having interactions in the ASIA region via JDBC
- do the same as above, but in a Cascading flow
The app creates and reads a partitioned table and accesses it via JDBC to demonstrate the support for partitioned tables in Cascading-Hive.
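The partition layout at work here is a general Hive convention worth keeping in mind: each value of the partition column becomes a column=value subdirectory under the table's directory, so the partition column is never stored in the data files themselves. The following self-contained sketch (warehouse path and table name are hypothetical) shows the path that a region partition maps to:

```java
// Illustration of Hive's on-disk layout for a partitioned table: partition
// column values become "column=value" subdirectories of the table directory.
public class PartitionLayout {

    static String partitionPath(String warehouse, String table,
                                String column, String value) {
        return warehouse + "/" + table + "/" + column + "=" + value;
    }

    public static void main(String[] args) {
        // hypothetical warehouse directory and table name
        System.out.println(
            partitionPath("/user/hive/warehouse", "logs", "region", "ASIA"));
        // -> /user/hive/warehouse/logs/region=ASIA
    }
}
```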
cascading.hive.HiveViewDemo
This demo shows how to create a view within a Cascading-based application. The application creates a view based on a partitioned table (similar to HivePartitionDemo) and queries the view via JDBC.
Testing
Cascading-Hive also provides useful classes that you can use in your tests when writing hybrid Hive and Cascading applications. Using HiveTestCase as a base class for such a test is probably a good idea; most of the tests in Cascading-Hive use it.