8.9 Handling Good and Bad Data

When processing raw data streams, it is very common to encounter data that is corrupt or malformed in some way. This may be because bad content was fetched off the web by a crawler/fetcher upstream, or because a bug leaked into a browser widget that sends user behavior information back for analysis. Whatever the use case, there is likely a set of rules that determine how to identify a questionable record and whether to keep or discard it.

It is tempting to simply throw an exception and have a Trap capture the offending Tuple, but Traps were not designed as a filtering mechanism, and consequently much valuable information would be lost.

Instead, create a SubAssembly that applies the rules to the stream and sets a boolean field marking each Tuple as good or bad. After all the rules are applied, split the stream based on the value of that good/bad field. Optionally, set a reason field that records why the Tuple was marked bad.
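
The mark-then-split approach can be sketched in plain Java, independent of the Cascading API. This is only an illustration of the pattern, not Cascading code: the class GoodBadSplit, the types Marked and Rule, and the two sample rules are all hypothetical names and assumptions made for this example. In a real flow, each rule would be an operation inside the SubAssembly, and the split would be done by branching the stream on the good/bad field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class GoodBadSplit {

    // Hypothetical record: the raw line plus the good/bad marker and an optional reason.
    static final class Marked {
        final String line;
        boolean good = true;
        String reason = null;

        Marked(String line) {
            this.line = line;
        }
    }

    // Hypothetical rule: a description and a test that returns true when the line passes.
    static final class Rule {
        final String description;
        final Predicate<String> passes;

        Rule(String description, Predicate<String> passes) {
            this.description = description;
            this.passes = passes;
        }
    }

    // Sample rules, assumed for illustration only.
    static List<Rule> defaultRules() {
        return Arrays.asList(
            new Rule("empty line", s -> !s.trim().isEmpty()),
            new Rule("expected 3 tab-separated fields", s -> s.split("\t", -1).length == 3));
    }

    // Apply every rule to every line. Nothing is thrown away here: a failing
    // line is only marked bad, and the first failing rule is kept as the reason.
    static List<Marked> mark(List<String> lines, List<Rule> rules) {
        List<Marked> out = new ArrayList<>();
        for (String line : lines) {
            Marked m = new Marked(line);
            for (Rule rule : rules) {
                if (m.good && !rule.passes.test(line)) {
                    m.good = false;
                    m.reason = rule.description;
                }
            }
            out.add(m);
        }
        return out;
    }

    // Split step: select one side of the stream by the value of the marker.
    static List<Marked> select(List<Marked> marked, boolean good) {
        List<Marked> out = new ArrayList<>();
        for (Marked m : marked) {
            if (m.good == good) {
                out.add(m);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a\tb\tc", "", "a\tb");
        List<Marked> marked = mark(lines, defaultRules());

        System.out.println("good: " + select(marked, true).size());
        for (Marked m : select(marked, false)) {
            System.out.println("bad: [" + m.line + "] reason: " + m.reason);
        }
    }
}
```

Because the bad records survive with their reason attached, they can be written to their own output and inspected later, which is the information that would be lost if an exception and a Trap were used instead.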

Copyright © 2007-2008 Concurrent, Inc. All Rights Reserved.