Cascading supports pluggable planners that allow it to execute on differing platforms. Planners are invoked by an associated FlowConnector subclass. Currently, planners are provided for two platforms, local mode and Apache Hadoop, through the three FlowConnector subclasses described below:
The cascading.flow.local.LocalFlowConnector provides a "local" mode planner for running Cascading completely in memory on the current computer. This allows for fast execution of Flows against local files or any other compatible custom Tap and Scheme classes.
The local mode planner and platform were not designed to scale beyond available memory, CPU, or disk on the current machine. Thus any memory-intensive processes that use GroupBy, CoGroup, or HashJoin are likely to fail against moderately large files.
Local mode is useful for development, testing, and interactive data exploration against sample sets.
The cascading.flow.hadoop.HadoopFlowConnector provides a planner for running Cascading on an Apache Hadoop 1 cluster. This allows Cascading to execute against extremely large data sets over a cluster of computing nodes.
The cascading.flow.hadoop2.Hadoop2MR1FlowConnector provides a planner for running Cascading on an Apache Hadoop 2 cluster. This class is roughly equivalent to the HadoopFlowConnector above, except that it uses Hadoop 2-specific properties and is compiled against the Hadoop 2 API binaries.
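As a rough illustration of choosing among the connectors listed above, the following sketch maps a platform key to the corresponding FlowConnector class name and checks whether that class is on the classpath. The ConnectorNames helper and the keys "local", "hadoop", and "hadoop2-mr1" are hypothetical and are not part of Cascading. Only the three class names come from this section.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Cascading: maps a platform key to the
// FlowConnector class name described in this section.
class ConnectorNames {
    // The keys are illustrative assumptions; only the class names come from the text.
    private static final Map<String, String> CONNECTORS = new LinkedHashMap<>();

    static {
        CONNECTORS.put("local", "cascading.flow.local.LocalFlowConnector");
        CONNECTORS.put("hadoop", "cascading.flow.hadoop.HadoopFlowConnector");
        CONNECTORS.put("hadoop2-mr1", "cascading.flow.hadoop2.Hadoop2MR1FlowConnector");
    }

    // Returns the connector class name for a platform key, or fails loudly.
    static String connectorClassFor(String platform) {
        String name = CONNECTORS.get(platform);
        if (name == null) {
            throw new IllegalArgumentException("Unknown platform: " + platform);
        }
        return name;
    }

    // Reports whether a class can be loaded, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ConnectorNames.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String platform : CONNECTORS.keySet()) {
            String name = connectorClassFor(platform);
            System.out.println(platform + " -> " + name
                + " (on classpath: " + isOnClasspath(name) + ")");
        }
    }
}
```

Running main without the Cascading jars on the classpath should report every connector as unavailable. With the appropriate platform jar present, the matching connector could then be instantiated and used to connect a Flow.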
Cascading's support for pluggable planners allows a pipe assembly to be executed on an arbitrary platform, using platform-specific Tap and Scheme classes that hide the platform-related I/O details from the developer. For example, Hadoop uses org.apache.hadoop.mapred.InputFormat to read data, but local mode is happy with a java.io.FileInputStream. This detail is hidden from developers unless they are creating custom Tap and Scheme classes.
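The idea of hiding platform I/O behind a common abstraction can be sketched in plain Java. This is a simplified analogy only. LineSource, LocalFileSource, InMemorySource, and LineCounter are hypothetical names invented for illustration and do not reflect the actual Tap or Scheme APIs. The point is that countLines never learns which kind of stream it is reading.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the role a Tap plays; not the Cascading Tap API.
interface LineSource {
    InputStream open() throws IOException;
}

// "Local" style source: reads through a plain java.io.FileInputStream.
class LocalFileSource implements LineSource {
    private final String path;

    LocalFileSource(String path) {
        this.path = path;
    }

    @Override
    public InputStream open() throws IOException {
        return new FileInputStream(path);
    }
}

// A second, illustrative "platform": the data already lives in memory.
class InMemorySource implements LineSource {
    private final byte[] bytes;

    InMemorySource(String text) {
        this.bytes = text.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public InputStream open() {
        return new ByteArrayInputStream(bytes);
    }
}

// Platform-agnostic logic: it depends only on the LineSource interface.
class LineCounter {
    static long countLines(LineSource source) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(source.open(), StandardCharsets.UTF_8))) {
            long count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        }
    }
}
```

Swapping LocalFileSource for InMemorySource changes where the bytes come from but not the counting logic, which is roughly the separation that platform-specific Tap and Scheme classes give a pipe assembly.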
Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.