All input data comes in from, and all output data goes out to, some instance of cascading.tap.Tap. A tap represents a data resource, such as a file on the local file system, on a Hadoop distributed file system, or on Amazon S3. A tap can be read from, which makes it a source, or written to, which makes it a sink. More commonly, taps act as both sinks and sources when shared between flows.
The platform on which your application is running (Cascading local or Hadoop) determines which specific classes you can use. Details are provided in the sections below.
If the Tap is about where the data is and how to access it, the Scheme is about what the data is and how to read it. Every Tap must have a Scheme that describes the data. Cascading provides four Scheme classes:
TextLine reads and writes raw text files and returns tuples which, by default, contain two fields specific to the platform used. The first field is either the byte offset or the line number, and the second field is the actual line of text. When written to, all Tuple values are converted to Strings delimited with the TAB character (\t). A TextLine scheme is provided for both the local and Hadoop modes.
TextDelimited reads and writes character-delimited files in standard formats such as CSV (comma-separated values), TSV (tab-separated values), and so on. When written to, all Tuple values are converted to Strings and joined with the specified character delimiter. This Scheme can optionally handle quoted values with custom quote characters. Further, TextDelimited can coerce each value to a primitive type when reading a text file. A TextDelimited scheme is provided for both the local and Hadoop modes.
SequenceFile is based on the Hadoop Sequence file, which is a binary format. When written to or read from, all Tuple values are saved in their native binary form. This is the most efficient file format - but be aware that the resulting files are binary and can only be read by Hadoop applications running on the Hadoop platform.
Like the SequenceFile Scheme, WritableSequenceFile is based on the Hadoop Sequence file, but it was designed to read and write key and/or value Hadoop Writable objects directly. This is very useful if you have sequence files created by other applications. During writing (sinking), specified key and/or value fields are serialized directly into the sequence file. During reading (sourcing), the key and/or value objects are deserialized and wrapped in a Cascading Tuple object and passed to the downstream pipe assembly. This class is only available when running on the Hadoop platform.
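The read and write behavior described for the two text schemes can be pictured with a small plain-Java sketch. This is an illustration only, not Cascading code: the class SchemeSketch and its methods are hypothetical names, offsets are counted in characters rather than bytes, and quoted values are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java illustration only; none of these helpers are Cascading classes.
final class SchemeSketch {

    // TextLine-style read: one (offset, line) tuple per line of text.
    // Offsets are counted in chars, which matches bytes only for single-byte text.
    static List<Object[]> readLines(String text) {
        List<Object[]> tuples = new ArrayList<>();
        if (text.isEmpty()) {
            return tuples;
        }
        long offset = 0;
        for (String line : text.split("\n")) {
            tuples.add(new Object[] { offset, line });
            offset += line.length() + 1; // +1 accounts for the newline
        }
        return tuples;
    }

    // TextLine-style write: every value becomes a String, joined with TAB.
    static String writeLine(Object... values) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                out.append('\t');
            }
            out.append(String.valueOf(values[i]));
        }
        return out.toString();
    }

    // TextDelimited-style read: split on the delimiter, then coerce each value.
    // Only int, double, and String are supported in this sketch.
    static Object[] readDelimited(String line, String delimiter, Class<?>... types) {
        String[] parts = line.split(Pattern.quote(delimiter), -1);
        if (parts.length != types.length) {
            throw new IllegalArgumentException("expected " + types.length + " values");
        }
        Object[] tuple = new Object[parts.length];
        for (int i = 0; i < parts.length; i++) {
            if (types[i] == int.class) {
                tuple[i] = Integer.parseInt(parts[i]);
            } else if (types[i] == double.class) {
                tuple[i] = Double.parseDouble(parts[i]);
            } else {
                tuple[i] = parts[i];
            }
        }
        return tuple;
    }

    public static void main(String[] args) {
        for (Object[] tuple : readLines("first line\nsecond\n")) {
            System.out.println(tuple[0] + " -> " + tuple[1]);
        }
        System.out.println(writeLine("a", 1, 2.5));
        System.out.println(readDelimited("widget,3,9.99", ",",
                String.class, int.class, double.class).length);
    }
}
```

The sketch keeps the two concerns separate on purpose: the TextLine-style helpers know nothing about the structure inside a line, while the TextDelimited-style helper is where splitting and type coercion happen.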
There's a key difference between the TextLine and SequenceFile schemes. With the SequenceFile scheme, data is stored as binary tuples, which can be read without having to be parsed. But with the TextLine option, Cascading must parse each line into a Tuple before processing it, causing a performance hit.
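The difference between reading binary values and parsing text can be sketched in plain Java. This is an analogy only: it uses java.io data streams, not the Hadoop Sequence file format, and the class ParseCostSketch and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Analogy only: java.io data streams stand in for a binary tuple format.
final class ParseCostSketch {

    // Binary write: values are stored in their native binary form.
    static byte[] writeBinary(int id, double score) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(id);
            out.writeDouble(score);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Binary read: values come back typed, with no string parsing step.
    static Object[] readBinary(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return new Object[] { in.readInt(), in.readDouble() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Text read: the line must be split and every field parsed from characters.
    static Object[] readText(String line) {
        String[] parts = line.split("\t");
        return new Object[] { Integer.parseInt(parts[0]), Double.parseDouble(parts[1]) };
    }

    public static void main(String[] args) {
        Object[] fromBinary = readBinary(writeBinary(7, 0.5));
        Object[] fromText = readText("7\t0.5");
        System.out.println(fromBinary[0].equals(fromText[0]) && fromBinary[1].equals(fromText[1]));
    }
}
```

Both paths yield the same values; the text path simply does extra work (splitting and number parsing) on every record.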
Depending on which platform you use (Cascading local or Hadoop), the classes you use to specify schemes will vary. Platform-specific details for each standard scheme are shown below.
Table 3.2. Platform-specific tap scheme classes
Description | Cascading local platform | Hadoop platform |
Package Name | cascading.scheme.local | cascading.scheme.hadoop |
Read lines of text | TextLine | TextLine |
Read delimited text (CSV, TSV, etc) | TextDelimited | TextDelimited |
Cascading proprietary efficient binary | | SequenceFile |
External Hadoop application binary (custom Writable type) | | WritableSequenceFile |
The following sample code creates a new Hadoop FileSystem Tap that can read and write raw text files. Since only one field name is provided, the "offset" field is discarded, resulting in an input tuple stream with only "line" values.
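The listing itself does not appear in this text. A minimal sketch of what such code might look like follows, assuming the Cascading 2.x Hadoop classes named in the tables above are on the classpath. The wrapper class TextLineTapExample and the path argument are placeholders introduced for illustration.

```java
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

// Sketch only: requires the Cascading and Hadoop libraries on the classpath.
final class TextLineTapExample {

    // 'path' is a placeholder for the file or directory to read or write.
    static Tap createTap(String path) {
        // Only the "line" field is named, so the "offset" field is discarded.
        return new Hfs(new TextLine(new Fields("line")), path);
    }
}
```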
Here are the most commonly used tap types:
The cascading.tap.local.FileTap tap is used with the Cascading local platform to access files on the local file system.
The cascading.tap.hadoop.Hfs tap uses the current Hadoop default file system when running on the Hadoop platform.
If Hadoop is configured for "Hadoop local mode" (not to be confused with Cascading local mode), its default file system is the local file system. If configured for distributed mode, its default file system is typically the Hadoop distributed file system.
Note that Hadoop can be forced to use an external file system by specifying a prefix to the URL passed into a new Hfs tap. For instance, using "s3://somebucket/path" tells Hadoop to use the S3 FileSystem implementation to access files in an Amazon S3 bucket. More information on this can be found in the Javadoc.
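The role of the URL prefix can be illustrated with plain Java. This sketch only extracts the prefix with java.net.URI; it does not call Hadoop, and the class FileSystemPrefixSketch and the "default" label are hypothetical names used here to show which paths carry an explicit file system prefix and which fall back to the default.

```java
import java.net.URI;

// Illustration only: shows how a URL prefix distinguishes an explicit
// file system from the platform default. No Hadoop classes are used.
final class FileSystemPrefixSketch {

    // Returns the URL prefix (scheme), or "default" when the path has none.
    // Paths containing characters that are illegal in a URI (such as spaces)
    // make URI.create throw IllegalArgumentException.
    static String fileSystemFor(String path) {
        String scheme = URI.create(path).getScheme();
        return scheme == null ? "default" : scheme;
    }

    public static void main(String[] args) {
        System.out.println(fileSystemFor("s3://somebucket/path"));
        System.out.println(fileSystemFor("some/relative/path"));
    }
}
```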
Also provided are four utility taps:
The cascading.tap.MultiSourceTap is used to tie multiple tap instances into a single tap for use as an input source. The only restriction is that all the tap instances passed to a new MultiSourceTap share the same Scheme classes (not necessarily the same Scheme instance).
The cascading.tap.MultiSinkTap is used to tie multiple tap instances into a single tap for use as output sinks. At runtime, for every Tuple output by the pipe assembly, each child tap to the MultiSinkTap will sink the Tuple.
The cascading.tap.hadoop.TemplateTap is used to sink tuples into directory paths based on the values in the Tuple. See Template Taps below for more.
The cascading.tap.hadoop.GlobHfs tap accepts Hadoop-style "file globbing" expression patterns. This allows for multiple paths to be used as a single source, where all paths match the given pattern. This tap is only available when running on the Hadoop platform.
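The idea behind file globbing, where one pattern selects several paths to act as a single source, can be sketched with the glob support in java.nio.file. This is an approximation: Hadoop's glob syntax is not identical to Java's, and the class GlobSketch and the sample paths are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: uses Java's glob matcher, not Hadoop's.
final class GlobSketch {

    // Returns the candidate paths that match the glob pattern, in input order.
    static List<String> select(String glob, List<String> candidates) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> matched = new ArrayList<>();
        for (String candidate : candidates) {
            if (matcher.matches(Paths.get(candidate))) {
                matched.add(candidate);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "logs/2012-01/part-00000",
                "logs/2012-02/part-00001",
                "logs/2011-12/part-00000",
                "logs/2012-01/_SUCCESS");
        // Every path that matches is treated as part of one logical source.
        System.out.println(select("logs/2012-*/part-*", paths));
    }
}
```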
Depending on which platform you use (Cascading local or Hadoop), the classes you use to specify file systems will vary. Platform-specific details for each standard tap type are shown below.
Table 3.3. Platform-specific details for setting file system
Description | Either platform | Cascading local platform | Hadoop platform |
Package Name | cascading.tap | cascading.tap.local | cascading.tap.hadoop |
File access | | FileTap | Hfs |
Multiple Taps as single source | MultiSourceTap | | |
Multiple Taps as single sink | MultiSinkTap | | |
Bin/Partition data into multiple files | | TemplateTap | TemplateTap |
Pattern match multiple files/dirs | | | GlobHfs |
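The "bin/partition" behavior listed for TemplateTap can be pictured with a plain-Java sketch that routes each tuple to a path built from some of its values. It assumes the path is derived with a java.util.Formatter-style template. The class TemplateBinningSketch, its method, and its parameters are hypothetical and are not the TemplateTap API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: groups tuples in memory instead of writing files.
final class TemplateBinningSketch {

    // Builds a path for each tuple from the values at 'positions' and
    // collects the tuples that share a path into the same bin.
    static Map<String, List<Object[]>> bin(String basePath, String template,
                                           List<Object[]> tuples, int... positions) {
        Map<String, List<Object[]>> bins = new LinkedHashMap<>();
        for (Object[] tuple : tuples) {
            Object[] pathValues = new Object[positions.length];
            for (int i = 0; i < positions.length; i++) {
                pathValues[i] = tuple[positions[i]];
            }
            String path = basePath + "/" + String.format(template, pathValues);
            bins.computeIfAbsent(path, key -> new ArrayList<>()).add(tuple);
        }
        return bins;
    }

    public static void main(String[] args) {
        List<Object[]> tuples = new ArrayList<>();
        tuples.add(new Object[] { "us", "2012", 10 });
        tuples.add(new Object[] { "uk", "2012", 20 });
        tuples.add(new Object[] { "us", "2012", 30 });
        // Tuples with the same country and year land under the same directory.
        System.out.println(bin("output", "%s-%s", tuples, 0, 1).keySet());
    }
}
```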
Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.