Example 3.11. Overwriting An Existing Resource
Tap tap =
  new Hfs( new TextLine( new Fields( "line" ) ), path, SinkMode.REPLACE );
All applications created with Cascading read data from one or more
sources, process it, then write data to one or more sinks. This is done
via the various Tap classes, where each class
abstracts different types of back-end systems that store data as files,
tables, blobs, and so on. But in order to sink data, some systems
require that the resource (e.g., a file) not exist before processing
begins, so it must be removed (deleted) first. Other systems allow
appending to or updating a resource (typical with database tables).
When creating a new Tap instance, a
SinkMode may be provided so that the Tap will
know how to handle any existing resources. Note that not all Taps
support all SinkMode values - for example, Hadoop
does not support appends (updates) from a MapReduce job.
The available SinkModes are:
SinkMode.KEEP: This is the default behavior. If the resource exists, attempting to write over it will fail.
SinkMode.REPLACE: This allows Cascading to delete the file immediately after the Flow is started.
SinkMode.UPDATE: Allows for new tap types that can update or append - for example, to update or add records in a database. Each tap may implement this functionality in its own way. Cascading recognizes this update mode, and if a resource exists, will not fail or attempt to delete it.
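The three modes can be illustrated with a minimal sketch against a local file. This is only an analogy built on java.nio.file, not Cascading's implementation: the Mode enum and the prepareSink() helper are hypothetical names introduced here, and a real Tap decides for itself how an existing resource is handled.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SinkModeSketch {
    // Hypothetical stand-in for cascading.tap.SinkMode; not the real enum.
    enum Mode { KEEP, REPLACE, UPDATE }

    /**
     * Prepares a local file for use as a sink (hypothetical helper).
     * Returns true if an existing resource was deleted.
     */
    static boolean prepareSink(Path sink, Mode mode) throws IOException {
        if (!Files.exists(sink)) {
            return false; // nothing to protect or remove
        }
        switch (mode) {
            case KEEP:
                // Writing over an existing resource fails.
                throw new FileAlreadyExistsException(sink.toString());
            case REPLACE:
                // The existing resource is removed before writing.
                Files.delete(sink);
                return true;
            default:
                // UPDATE: neither fail nor delete; the tap appends or updates.
                return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path sink = Files.createTempFile("sink", ".txt");
        System.out.println(prepareSink(sink, Mode.UPDATE));  // false
        System.out.println(prepareSink(sink, Mode.REPLACE)); // true
        System.out.println(Files.exists(sink));              // false
    }
}
```

Calling prepareSink() with Mode.KEEP on an existing file throws, which mirrors the default "fail if the resource exists" behavior described above.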
Note that Cascading itself only uses these labels internally to know
when to automatically call deleteResource() on the Tap or to leave the
Tap alone. It is up to the Tap implementation to actually perform a
write or update when processing starts. Thus, when start() or
complete() is called on a Flow, any sink Tap labeled SinkMode.REPLACE
will have its deleteResource() method called.
Conversely, if a Flow is in a Cascade and the Tap is set to
SinkMode.KEEP or SinkMode.REPLACE, deleteResource() will be called if
and only if the sink is stale (i.e., older than the source). This allows
a Cascade to behave like a "make" or "ant" build file, only running
Flows that should be run. For more information, see Skipping Flows.
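The rules above for when deleteResource() is called can be summarized as a small decision function. This is a sketch of the logic as described in this section, not Cascading's source code: the Mode enum, shouldDelete(), and isStale() are hypothetical names, and staleness is reduced to a comparison of two modification timestamps.

```java
public class StaleSinkSketch {
    // Hypothetical stand-in for cascading.tap.SinkMode; not the real enum.
    enum Mode { KEEP, REPLACE, UPDATE }

    /** A sink is stale when it is older than its source. */
    static boolean isStale(long sinkModified, long sourceModified) {
        return sinkModified < sourceModified;
    }

    /**
     * Models whether the sink would be deleted before processing.
     * inCascade distinguishes a Flow run by a Cascade from a Flow
     * started directly with start() or complete().
     */
    static boolean shouldDelete(Mode mode, boolean inCascade,
                                long sinkModified, long sourceModified) {
        if (mode == Mode.UPDATE) {
            return false; // UPDATE never deletes the resource
        }
        if (inCascade) {
            // KEEP and REPLACE: delete if and only if the sink is stale.
            return isStale(sinkModified, sourceModified);
        }
        // Standalone Flow: only REPLACE triggers deletion.
        return mode == Mode.REPLACE;
    }

    public static void main(String[] args) {
        long older = 1000L;
        long newer = 2000L;
        // Sink older than source, inside a Cascade: stale, so deleted.
        System.out.println(shouldDelete(Mode.KEEP, true, older, newer));    // true
        // Sink newer than source, inside a Cascade: up to date, kept.
        System.out.println(shouldDelete(Mode.REPLACE, true, newer, older)); // false
        // Standalone Flow with REPLACE: deleted regardless of age.
        System.out.println(shouldDelete(Mode.REPLACE, false, newer, older)); // true
    }
}
```

The second case is what gives a Cascade its "make"-like behavior: an up-to-date sink is left in place, and the Flow that would produce it does not need to run.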
It's also important to understand how Hadoop deals with
directories. By default, Hadoop cannot source data from directories with
nested sub-directories, and it cannot write to directories that already
exist. However, the good news is that you can simply point the
Hfs tap to a directory of data files, and they
are all used as input - there's no need to enumerate each individual
file into a MultiSourceTap. If there are nested
directories, use GlobHfs.
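The difference between reading a flat directory and reading nested directories can be pictured with a local file-system analogy. The sketch below uses only java.nio.file and does not call Hadoop or Cascading; flatInputs() and globInputs() are hypothetical helpers, and the glob syntax is Java's, which may differ in detail from the patterns GlobHfs accepts.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DirectoryInputSketch {
    /** Files directly inside dir; sub-directories are not descended into. */
    static List<Path> flatInputs(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.filter(Files::isRegularFile)
                          .sorted()
                          .collect(Collectors.toList());
        }
    }

    /** Files under root whose path relative to root matches the glob. */
    static List<Path> globInputs(Path root, String glob) throws IOException {
        PathMatcher matcher = root.getFileSystem().getPathMatcher("glob:" + glob);
        try (Stream<Path> entries = Files.walk(root)) {
            return entries.filter(Files::isRegularFile)
                          .filter(p -> matcher.matches(root.relativize(p)))
                          .sorted()
                          .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("input");
        Files.createFile(root.resolve("part-00000.txt"));
        Path nested = Files.createDirectories(root.resolve("2012"));
        Files.createFile(nested.resolve("part-00001.txt"));

        // Only the top-level file is picked up by a flat listing.
        System.out.println(flatInputs(root).size());            // 1
        // A glob with one directory level reaches the nested file.
        System.out.println(globInputs(root, "*/*.txt").size()); // 1
    }
}
```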
Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.