10.9 Handling Good and Bad Data

When processing raw data streams, it is very common to encounter data that is corrupt or malformed in some way. For instance, an upstream web crawler may have fetched bad content, or a bug may have leaked into a browser widget that sends user behavior information back for analysis. Whatever the cause, it is good practice to define a set of rules for identifying and discarding questionable records.

It is tempting to simply throw an exception and have a Trap capture the offending Tuple. But Traps were not designed as a filtering mechanism, and valuable information, such as which rule the record violated, would be lost.

Instead of Traps, use filters. Create a SubAssembly that applies the rules to the stream, setting a Boolean field that marks each Tuple as good or bad. After all the rules are applied, split the stream based on the value of that field. Consider also setting a reason field that states why the Tuple was marked bad.
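The mark-then-split pattern can be sketched in plain Java, independent of the Cascading API. This is only an illustration under stated assumptions: the class GoodBadSplit, the types Marked and Rule, and the two example rules are hypothetical names invented for this sketch, and an in-memory list stands in for the Tuple stream. In a real flow the same logic would live in operations inside a SubAssembly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class GoodBadSplit {

    /** A raw record plus the marker fields that the rules set. */
    static final class Marked {
        final String line;
        boolean good = true;   // flipped to false by the first rule that fails
        String reason = null;  // why the record was marked bad, if it was

        Marked(String line) { this.line = line; }
    }

    /** Hypothetical rule: a reason label and a test that is true for bad input. */
    static final class Rule {
        final String reason;
        final Predicate<String> isBad;

        Rule(String reason, Predicate<String> isBad) {
            this.reason = reason;
            this.isBad = isBad;
        }
    }

    /** Two illustrative rules; real rules depend on the data being cleaned. */
    static List<Rule> exampleRules() {
        return Arrays.asList(
            new Rule("empty line", s -> s.trim().isEmpty()),
            new Rule("expected 2 tab-separated fields",
                     s -> s.split("\t", -1).length != 2));
    }

    /** Apply every rule to every record, marking records instead of dropping them. */
    static List<Marked> applyRules(List<String> lines, List<Rule> rules) {
        List<Marked> out = new ArrayList<>();
        for (String line : lines) {
            Marked m = new Marked(line);
            for (Rule rule : rules) {
                // Keep the first failure's reason; later rules do not overwrite it.
                if (m.good && rule.isBad.test(line)) {
                    m.good = false;
                    m.reason = rule.reason;
                }
            }
            out.add(m);
        }
        return out;
    }

    /** The split step: select the records whose marker equals the requested value. */
    static List<Marked> select(List<Marked> marked, boolean good) {
        List<Marked> out = new ArrayList<>();
        for (Marked m : marked) {
            if (m.good == good) {
                out.add(m);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("alice\t42", "", "no-separator-here");
        List<Marked> marked = applyRules(raw, exampleRules());

        for (Marked m : select(marked, true)) {
            System.out.println("good: " + m.line);
        }
        for (Marked m : select(marked, false)) {
            System.out.println("bad:  [" + m.line + "] reason: " + m.reason);
        }
    }
}
```

Because bad records are marked rather than thrown away, the bad branch keeps both the original record and the reason it failed, which is the information a Trap would not preserve. This sketch records only the first failing rule per record; collecting every failed rule would be an equally reasonable choice.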

Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.