Cascading 3.2 User Guide - Configuring
Configuring
Introduction
Cascading provides a number of ways to pass properties down for use by built-in or custom operations and integrations, to Cascading internals, or to the underlying platform.
Additionally, the scope of the configuration properties can be application-wide, limited to a specific process level (a Flow, a FlowStep, or a FlowNode), or specific to a given Pipe or Tap instance.
Creating Properties
Cascading is very configurable. You can configure aspects of Cascading by creating a property set. Each property consists of a key-value pair. The key is a String, such as cascading.app.appjar.path. The value is also a String; for this key, it is the path to the "app JAR."
Property sets can be managed directly in a java.util.Properties instance. This class has methods that allow *.properties files to be loaded from disk or other locations.
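As a minimal sketch of loading a property set, the following uses only java.util.Properties from the JDK. The class name PropertiesLoadingSketch and the JAR path are hypothetical; in practice the text would usually come from a file or classpath resource rather than a String.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper class, not part of Cascading.
class PropertiesLoadingSketch {

    // Parses text in the standard *.properties format into a Properties instance.
    static Properties parse(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));
        } catch (IOException exception) {
            throw new UncheckedIOException(exception);
        }
        return properties;
    }

    public static void main(String[] args) {
        // Illustrative contents only; the path is an assumption.
        String text = "cascading.app.appjar.path=/opt/app/sample-app.jar\n";
        Properties properties = parse(text);
        System.out.println(properties.getProperty("cascading.app.appjar.path"));
    }
}
```

To read from disk instead, the same load() call accepts a Reader or InputStream opened on the file.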
Properties implements Map<Object,Object> (it extends Hashtable<Object,Object>). Any Map type can be used, so long as both keys and values are stored as type java.lang.String.
Properties can be nested via the constructor, creating a hierarchy of default key values. However, calling keySet() and other Map-specific API calls does not return the nested (default) keys. See Properties.propertyNames() or Properties.stringPropertyNames().
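The difference between Map-style access and property-style access on a nested Properties instance can be seen in a small JDK-only sketch. The class name NestedPropertiesSketch, the key example.key, and the path value are hypothetical.

```java
import java.util.Properties;

// Hypothetical demonstration class, not part of Cascading.
class NestedPropertiesSketch {

    // Builds a Properties instance whose defaults come from a parent instance.
    static Properties nested() {
        Properties defaults = new Properties();
        defaults.setProperty("cascading.app.appjar.path", "/opt/app/default.jar");

        Properties properties = new Properties(defaults); // nest via the constructor
        properties.setProperty("example.key", "value");   // hypothetical key
        return properties;
    }

    public static void main(String[] args) {
        Properties properties = nested();
        // Map-style view: only keys set directly on this instance.
        System.out.println(properties.keySet());
        // Property-style view: includes keys inherited from the defaults.
        System.out.println(properties.stringPropertyNames());
        // getProperty() falls back to the defaults when the key is not set locally.
        System.out.println(properties.getProperty("cascading.app.appjar.path"));
    }
}
```

Code that copies a Properties instance into another Map should therefore iterate over stringPropertyNames() rather than the Map view if the defaults must be preserved.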
Props
All frequently used Cascading properties are encapsulated in subsystem-specific Props subclasses. All Props classes are fluent-style interfaces for creating property sets. For example, the AppProps class can be used for setting application-level configuration settings.
Properties properties = AppProps.appProps()
  .setName( "sample-app" )
  .setVersion( "1.2.3" )
  .addTags( "deploy:prod", "team:engineering" )
  .setJarClass( Main.class ) // (1) identify the app JAR by a class it contains
  // ALTERNATIVELY ...
  .setJarPath( pathToJar ) // (2) identify the app JAR by its path
  .buildProperties();
The Props interface allows both for setting the available values and for populating either a Properties or ConfigDef instance for use by Cascading.
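To illustrate the fluent style in isolation, here is a plain-Java sketch of a builder that accumulates values and then emits a Properties instance. It is not Cascading's implementation: the class SamplePropsSketch and the keys example.app.name and example.app.version are hypothetical stand-ins.

```java
import java.util.Properties;

// Hypothetical fluent builder that mimics the shape of a Props class.
class SamplePropsSketch {
    private String name;
    private String version;

    // Static factory, so a call chain can start without "new".
    static SamplePropsSketch sampleProps() {
        return new SamplePropsSketch();
    }

    // Each setter returns "this" so that calls can be chained.
    SamplePropsSketch setName(String name) {
        this.name = name;
        return this;
    }

    SamplePropsSketch setVersion(String version) {
        this.version = version;
        return this;
    }

    // Copies only the values that were actually set into a new Properties instance.
    Properties buildProperties() {
        Properties properties = new Properties();
        if (name != null)
            properties.setProperty("example.app.name", name);       // hypothetical key
        if (version != null)
            properties.setProperty("example.app.version", version); // hypothetical key
        return properties;
    }

    public static void main(String[] args) {
        Properties properties = sampleProps()
            .setName("sample-app")
            .setVersion("1.2.3")
            .buildProperties();
        System.out.println(properties.getProperty("example.app.name"));
    }
}
```

The benefit of this pattern is that callers never need to know the String keys; the builder owns the mapping from typed setters to property names.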
There are various "Props-based" classes that expose fluent API calls. The following table lists the most commonly used classes for creating property sets with fluent-style interfaces.
Class | Description
cascading.property.AppProps | Allows for setting application-specific properties. Some properties are required by the underlying platform, like the application JAR. Others are simple metadata used by compatible management tools, like tags.
cascading.flow.FlowConnectorProps | Allows for setting a DebugLevel or AssertionLevel for a given FlowConnector to target. Also allows for setting intermediate DecoratorTap subclasses to be used, if any.
cascading.flow.FlowProps | Allows for setting any Flow-specific properties, like the maximum concurrent steps to be scheduled, or changing the default Tuple Comparator class.
cascading.flow.FlowRuntimeProps | Allows for setting specific runtime properties, like the level of parallelization to use.
cascading.cascade.CascadeProps | Allows for setting any Cascade-specific properties, like the maximum concurrent Flows to be scheduled.
cascading.tap.TrapProps | Allows for fine-grained configuration of what diagnostic data traps should capture on failures.
cascading.tuple.collect.SpillableProps | Allows for fine-grained control over how to manage spilling of data during certain operators that accumulate data, specifically the spill thresholds and the compression codecs to use.
cascading.pipe.assembly.AggregateByProps | Allows for fine-grained control over the underlying caches used.
Passing Properties
Properties can be applied to the following scopes:
- Application: through a default, shared Properties instance
- Flow: through a Flow-specific Properties instance passed to the proper FlowConnector
- FlowStep: through either Pipe.getStepConfigDef() or Tap.getStepConfigDef()
- FlowNode: through either Pipe.getNodeConfigDef() or Tap.getNodeConfigDef()
- Pipe: through Pipe.getConfigDef()
- Tap: through Tap.getConfigDef()
In the cases of FlowStep and FlowNode, the Pipe and Tap instances that are coded to run on those levels are inspected for property settings. The property settings are merged and, if possible, applied to the underlying configuration for that level.
FlowNode-level properties cannot be applied to the MapReduce platform.
Planner Properties
Properties passed to a FlowConnector are checked by the Cascading query planner. In addition, these properties are pushed directly to the underlying platform as defaults. Any such properties can be overridden by a given scoped ConfigDef instance.
ConfigDef
The ConfigDef class supports the creation of a configuration properties template. The template can then be applied to an existing properties configuration set.
There are three property modes: Mode.DEFAULT, Mode.REPLACE, and Mode.UPDATE.
- A DEFAULT property is only applied if there is no existing value in the property set.
- A REPLACE property is always applied, overriding any previous values.
- An UPDATE property is always applied to an existing property, usually when the property key represents a list of values.
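The three modes can be pictured with a plain-Java sketch that applies a key-value pair to a Map under each mode. This is not Cascading's ConfigDef code: the class ConfigModeSketch and its apply() helper are hypothetical, and the UPDATE branch assumes the existing value is a comma-separated list to which the new value is appended.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the mode semantics; not the ConfigDef implementation.
class ConfigModeSketch {
    enum Mode { DEFAULT, REPLACE, UPDATE }

    // Applies one templated property to an existing configuration according to the mode.
    static void apply(Map<String, String> config, Mode mode, String key, String value) {
        switch (mode) {
            case DEFAULT:
                // Only applied when the key has no existing value.
                config.putIfAbsent(key, value);
                break;
            case REPLACE:
                // Always applied, overriding any previous value.
                config.put(key, value);
                break;
            case UPDATE:
                // Assumption: the value is a comma-separated list and the new entry is appended.
                config.merge(key, value, (oldValue, newValue) -> oldValue + "," + newValue);
                break;
        }
    }

    public static void main(String[] args) {
        Map<String, String> config = new LinkedHashMap<>();
        config.put("example.codecs", "gzip");                    // hypothetical key
        apply(config, Mode.DEFAULT, "example.codecs", "snappy"); // ignored, a value exists
        apply(config, Mode.UPDATE, "example.codecs", "snappy");  // appended to the list
        apply(config, Mode.REPLACE, "example.threshold", "50000");
        System.out.println(config);
    }
}
```

In this picture, DEFAULT is the safest choice for library code, because it never clobbers a value the application has already set.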
The following examples show using the ConfigDef at different scopes.
// Example 1: FlowStep scope, via getStepConfigDef(), applied as a DEFAULT
Pipe join =
  new HashJoin( lhs, common, rhs, common, declared, new InnerJoin() );

SpillableProps props = SpillableProps.spillableProps()
  .setCompressSpill( true )
  .setMapSpillThreshold( 50 * 1000 );

props.setProperties( join.getStepConfigDef(), ConfigDef.Mode.DEFAULT );

// Example 2: Pipe scope, via getConfigDef(), applied as a REPLACE
Pipe join =
  new HashJoin( lhs, common, rhs, common, declared, new InnerJoin() );

SpillableProps props = SpillableProps.spillableProps()
  .setCompressSpill( true )
  .setMapSpillThreshold( 50 * 1000 );

props.setProperties( join.getConfigDef(), ConfigDef.Mode.REPLACE );