Example 3.11. Overwriting An Existing Resource
Tap tap =
new Hfs( new TextLine( new Fields( "line" ) ), path, SinkMode.REPLACE );
All applications created with Cascading read data from one or more
sources, process it, then write data to one or more sinks. This is done
via the various Tap classes, where each class abstracts a different type
of back-end system that stores data as files, tables, blobs, and so on.
But in order to sink data, some systems require that the resource (e.g.,
a file) not exist before processing begins, so it must be removed
(deleted) first. Other systems may allow for appending to or updating a
resource (typical with database tables).
When creating a new Tap instance, a SinkMode may be provided so that the
Tap will know how to handle any existing resources. Note that not all
Taps support all SinkMode values. For example, Hadoop does not support
appends (updates) from a MapReduce job.
The available SinkModes are:
SinkMode.KEEP
This is the default behavior. If the resource exists, attempting to write over it will fail.
SinkMode.REPLACE
This allows Cascading to delete the file immediately after the Flow is started.
SinkMode.UPDATE
Allows for new tap types that can update or append - for example, to update or add records in a database. Each tap may implement this functionality in its own way. Cascading recognizes this update mode, and if a resource exists, will not fail or attempt to delete it.
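The three modes can be read as a small decision made before processing begins. The following is a minimal sketch of that decision in plain Java, not Cascading's implementation: the Mode enum, the Store class, and the prepareSink method are hypothetical names introduced only for illustration, and the in-memory set stands in for a real back-end system.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative model only: none of these types are Cascading classes.
class SinkModeSketch {
    // Mirrors the three labels listed above.
    enum Mode { KEEP, REPLACE, UPDATE }

    // Hypothetical stand-in for a back-end system: a set of existing resource names.
    static final class Store {
        private final Set<String> resources = new HashSet<>();
        boolean exists(String name) { return resources.contains(name); }
        void create(String name) { resources.add(name); }
        void delete(String name) { resources.remove(name); }
    }

    // Decide what happens to an existing sink resource before processing begins.
    // Returns a short description of the action taken.
    static String prepareSink(Store store, String name, Mode mode) {
        if (!store.exists(name)) {
            return "nothing to do";
        }
        switch (mode) {
            case KEEP:
                // Default behavior: refuse to write over an existing resource.
                throw new IllegalStateException("resource exists: " + name);
            case REPLACE:
                // Remove the existing resource so it can be written again.
                store.delete(name);
                return "deleted";
            case UPDATE:
                // Neither fail nor delete; the tap itself handles the update.
                return "left in place";
            default:
                throw new AssertionError(mode);
        }
    }

    public static void main(String[] args) {
        Store store = new Store();
        store.create("output");
        System.out.println(prepareSink(store, "output", Mode.UPDATE));  // left in place
        System.out.println(prepareSink(store, "output", Mode.REPLACE)); // deleted
        System.out.println(prepareSink(store, "output", Mode.KEEP));    // nothing to do
    }
}
```

In this model, KEEP only fails when the resource is actually present, which is why the last call in main succeeds after the REPLACE call has removed it.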
Note that Cascading itself only uses these labels internally to know
when to automatically call deleteResource() on the Tap or to leave the
Tap alone. It is up to the Tap implementation to actually perform a
write or update when processing starts. Thus, when start() or complete()
is called on a Flow, any sink Tap labeled SinkMode.REPLACE will have its
deleteResource() method called.
Conversely, if a Flow is in a Cascade and the Tap is set to
SinkMode.KEEP or SinkMode.REPLACE, deleteResource() will be called if
and only if the sink is stale (i.e., older than the source). This allows
a Cascade to behave like a "make" or "ant" build file, only running
Flows that should be run. For more information, see Skipping Flows.
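The staleness rule above can be sketched as a timestamp comparison. This is a simplified model in plain Java rather than Cascading's own logic: the isStale method and the MISSING constant are hypothetical, and treating a sink that does not exist yet as stale is an assumption made here for the sake of the example.

```java
// Illustrative model only: not part of the Cascading API.
class StaleSinkSketch {
    // Assumed marker for a sink that does not exist yet.
    static final long MISSING = -1L;

    // A sink is treated as stale when it is missing (an assumption of this
    // sketch) or when it is older than at least one of its sources.
    static boolean isStale(long sinkModified, long... sourcesModified) {
        if (sinkModified == MISSING) {
            return true;
        }
        for (long sourceModified : sourcesModified) {
            if (sinkModified < sourceModified) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Sink written after both sources: up to date, so the Flow could be skipped.
        System.out.println(isStale(200L, 100L, 150L)); // false
        // One source changed after the sink was written: the Flow should run.
        System.out.println(isStale(120L, 100L, 150L)); // true
    }
}
```

As with a "make" target, only a true result would lead to the sink being deleted and the Flow being run again.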
It's also important to understand how Hadoop deals with directories. By
default, Hadoop cannot source data from directories with nested
sub-directories, and it cannot write to directories that already exist.
However, the good news is that you can simply point the Hfs tap to a
directory of data files, and they are all used as input. There's no need
to enumerate each individual file into a MultiSourceTap. If there are
nested directories, use GlobHfs.
Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.