3.6 Sink modes

Example 3.11. Overwriting An Existing Resource

// path is a String naming the sink resource; REPLACE removes any existing copy.
Tap tap =
  new Hfs( new TextLine( new Fields( "line" ) ), path, SinkMode.REPLACE );


All applications created with Cascading read data from one or more sources, process it, then write data to one or more sinks. This is done via the various Tap classes, each of which abstracts a different type of back-end system that stores data as files, tables, blobs, and so on. To sink data, some systems require that the resource (e.g., a file) not exist before processing begins, so any existing resource must be removed (deleted) first. Other systems allow a resource to be appended to or updated (typical with database tables).

When creating a new Tap instance, a SinkMode may be provided so that the Tap will know how to handle any existing resources. Note that not all Taps support all SinkMode values - for example, Hadoop does not support appends (updates) from a MapReduce job.

The available SinkModes are:

SinkMode.KEEP

This is the default behavior. If the resource exists, attempting to write over it will fail.

SinkMode.REPLACE

If the resource exists, Cascading deletes it immediately after the Flow is started.

SinkMode.UPDATE

Allows for new tap types that can update or append - for example, to update or add records in a database. Each tap may implement this functionality in its own way. Cascading recognizes this update mode, and if a resource exists, will not fail or attempt to delete it.

Note that Cascading itself uses these labels internally only to decide whether to call deleteResource() on the Tap automatically or to leave the Tap alone. It is up to the Tap implementation to actually perform a write or update when processing starts. Thus, when start() or complete() is called on a Flow, any sink Tap labeled SinkMode.REPLACE will have its deleteResource() method called.
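The rules for a standalone Flow can be summarized as a small decision table. The sketch below is a self-contained model, not Cascading code: SinkModeSketch, Action, and onFlowStart are hypothetical names introduced only for illustration, and the model assumes that nothing special happens when the resource does not yet exist.

```java
// Simplified, self-contained model of the sink-mode rules for a standalone Flow.
// None of these types are Cascading classes; all names here are hypothetical.
class SinkModeSketch {
  enum SinkMode { KEEP, REPLACE, UPDATE }

  enum Action { PROCEED, DELETE_THEN_PROCEED, FAIL }

  // What happens to a sink when the Flow starts, given whether its resource exists.
  static Action onFlowStart(SinkMode mode, boolean resourceExists) {
    if (!resourceExists) {
      return Action.PROCEED; // assumption: nothing to protect, replace, or update
    }
    switch (mode) {
      case KEEP:
        return Action.FAIL; // default: never write over an existing resource
      case REPLACE:
        return Action.DELETE_THEN_PROCEED; // models deleteResource() being called
      case UPDATE:
        return Action.PROCEED; // left alone; the Tap decides how to append or update
      default:
        throw new IllegalArgumentException("unknown mode: " + mode);
    }
  }

  public static void main(String[] args) {
    for (SinkMode mode : SinkMode.values()) {
      System.out.println(mode + " with existing resource: " + onFlowStart(mode, true));
    }
  }
}
```

In this model only REPLACE ever leads to a delete, which mirrors the statement that Cascading uses the labels solely to decide whether to call deleteResource().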

Conversely, if a Flow is in a Cascade and the Tap is set to SinkMode.KEEP or SinkMode.REPLACE, deleteResource() will be called if and only if the sink is stale (i.e., older than the source). This allows a Cascade to behave like a "make" or "ant" build file, only running Flows that should be run. For more information, see Skipping Flows.
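The staleness rule can be illustrated with plain timestamps. This is a minimal sketch under stated assumptions rather than Cascading's implementation: StaleSinkSketch, isStale, and shouldDelete are hypothetical names, modification times are epoch milliseconds, a sink with several sources is compared against the newest one, and SinkMode.UPDATE is assumed never to trigger a delete.

```java
// Simplified model of the staleness rule a Cascade applies to a Flow's sink.
// Hypothetical names; timestamps are plain epoch milliseconds.
class StaleSinkSketch {
  enum SinkMode { KEEP, REPLACE, UPDATE }

  // A sink is stale when it is older than the newest source (assumption for
  // the multi-source case; the text only speaks of "the source").
  static boolean isStale(long sinkModified, long... sourceModified) {
    long newestSource = Long.MIN_VALUE;
    for (long modified : sourceModified) {
      newestSource = Math.max(newestSource, modified);
    }
    return sinkModified < newestSource;
  }

  // Inside a Cascade, KEEP and REPLACE sinks are deleted only when stale.
  // Treating UPDATE as "never delete" is an assumption of this sketch.
  static boolean shouldDelete(SinkMode mode, long sinkModified, long... sourceModified) {
    if (mode == SinkMode.UPDATE) {
      return false;
    }
    return isStale(sinkModified, sourceModified);
  }
}
```

A Flow whose sink is not stale has nothing to delete and nothing to recompute, which is what lets the Cascade skip it in the same way a build tool skips up-to-date targets.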

It's also important to understand how Hadoop deals with directories. By default, Hadoop cannot source data from directories with nested sub-directories, and it cannot write to directories that already exist. However, you can point the Hfs tap to a directory of data files, and all of them are used as input; there is no need to enumerate each individual file into a MultiSourceTap. If there are nested directories, use GlobHfs.

Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.