3.4 Platforms

Cascading supports pluggable planners that allow it to execute on different platforms. Each planner is invoked through an associated FlowConnector subclass. Currently, the following three planners are provided:

LocalFlowConnector

The cascading.flow.local.LocalFlowConnector provides a "local" mode planner for running Cascading completely in memory on the current computer. This allows for fast execution of Flows against local files or any other compatible custom Tap and Scheme classes.

The local mode planner and platform were not designed to scale beyond the memory, CPU, or disk available on the current machine. As a result, memory-intensive processes that use GroupBy, CoGroup, or HashJoin are likely to fail against even moderately large files.

Local mode is useful for development, testing, and interactive data exploration against sample sets.

HadoopFlowConnector

The cascading.flow.hadoop.HadoopFlowConnector provides a planner for running Cascading on an Apache Hadoop 1 cluster. This allows Cascading to execute against extremely large data sets over a cluster of computing nodes.

Hadoop2MR1FlowConnector

The cascading.flow.hadoop2.Hadoop2MR1FlowConnector provides a planner for running Cascading on an Apache Hadoop 2 cluster. This class is roughly equivalent to the HadoopFlowConnector described above, except that it uses Hadoop 2-specific properties and is compiled against the Hadoop 2 API binaries.

Cascading's support for pluggable planners allows a pipe assembly to be executed on an arbitrary platform, using platform-specific Tap and Scheme classes that hide the platform-related I/O details from the developer. For example, Hadoop uses org.apache.hadoop.mapred.InputFormat to read data, whereas local mode needs only a java.io.FileInputStream. This detail is hidden from developers unless they are creating custom Tap and Scheme classes.
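
The idea of hiding platform I/O behind a common abstraction can be illustrated with a minimal, self-contained Java sketch. It does not use the Cascading API: the LineSource interface and the LocalFileSource and InMemorySource classes are hypothetical names invented for this illustration, standing in loosely for the role that Tap and Scheme classes play. The counting logic is written once and runs unchanged against either source; only the source implementation knows that a java.io.FileInputStream is involved.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PlatformIoSketch {

  // Hypothetical abstraction (not a Cascading type): hides how lines are obtained.
  interface LineSource {
    List<String> readLines() throws IOException;
  }

  // "Local mode" style source: reads a file through java.io.FileInputStream.
  static class LocalFileSource implements LineSource {
    private final String path;

    LocalFileSource(String path) {
      this.path = path;
    }

    @Override
    public List<String> readLines() throws IOException {
      List<String> lines = new ArrayList<>();
      try (BufferedReader reader = new BufferedReader(
          new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
          lines.add(line);
        }
      }
      return lines;
    }
  }

  // Stand-in for a different platform's reader; here simply an in-memory list.
  static class InMemorySource implements LineSource {
    private final List<String> lines;

    InMemorySource(List<String> lines) {
      this.lines = new ArrayList<>(lines);
    }

    @Override
    public List<String> readLines() {
      return lines;
    }
  }

  // Platform-independent logic: it sees only LineSource, never the underlying I/O.
  static int countNonEmptyLines(LineSource source) throws IOException {
    int count = 0;
    for (String line : source.readLines()) {
      if (!line.trim().isEmpty()) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) throws IOException {
    List<String> sample = Arrays.asList("alpha", "", "beta");

    // Write the sample to a temporary file so the file-backed source has input.
    Path tmp = Files.createTempFile("platform-io-sketch", ".txt");
    Files.write(tmp, sample, StandardCharsets.UTF_8);

    // The same logic runs against both sources.
    System.out.println(countNonEmptyLines(new LocalFileSource(tmp.toString())));
    System.out.println(countNonEmptyLines(new InMemorySource(sample)));

    Files.delete(tmp);
  }
}
```

In Cascading itself, the equivalent switch is made by choosing the Tap and Scheme classes that match the platform and passing the same pipe assembly to the corresponding FlowConnector.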

Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.