Raw data streams very often contain records that are corrupt or malformed in some way. A crawler or fetcher upstream may have pulled bad content off the web, or a bug may have leaked into a browser widget that sends user behavior information back for analysis. Whatever the cause, there is likely a set of rules that determines how to identify a questionable record and whether to keep or discard it.
It is tempting to simply throw an exception and have a Trap capture the offending Tuple, but Traps were not designed as a filtering mechanism, and much valuable information would be lost as a result.
Instead, create a SubAssembly that applies rules to the stream by setting a boolean field that marks each Tuple as good or bad. After all the rules are applied, split the stream based on the value of that field. Optionally, set a reason field that records why the Tuple was marked bad.
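The mark-then-split idea can be sketched in plain Java, independent of any Cascading classes. This is only an illustration under stated assumptions: `RecordValidator`, `Marked`, `addRule`, `mark`, and `select` are hypothetical names invented for this sketch, and a record is modeled as a simple `Map` rather than a Tuple. In a real flow the same steps would be expressed as pipe operations inside the SubAssembly.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Plain-Java illustration of the mark-then-split pattern.
// NOT the Cascading API: every name here is hypothetical.
final class RecordValidator {

    /** A record together with its good/bad flag and an optional reason. */
    static final class Marked {
        final Map<String, String> record;
        final boolean good;
        final String reason; // null when the record is good

        Marked(Map<String, String> record, boolean good, String reason) {
            this.record = record;
            this.good = good;
            this.reason = reason;
        }
    }

    // Reason text -> predicate that returns true when the record violates the rule.
    // LinkedHashMap keeps the rules in the order they were added.
    private final Map<String, Predicate<Map<String, String>>> rules = new LinkedHashMap<>();

    RecordValidator addRule(String reason, Predicate<Map<String, String>> isBad) {
        rules.put(reason, isBad);
        return this;
    }

    /** Marks a record as bad with the first violated rule's reason, else as good. */
    Marked mark(Map<String, String> record) {
        for (Map.Entry<String, Predicate<Map<String, String>>> rule : rules.entrySet()) {
            if (rule.getValue().test(record)) {
                return new Marked(record, false, rule.getKey());
            }
        }
        return new Marked(record, true, null);
    }

    /** Marks every record; nothing is dropped at this stage. */
    List<Marked> markAll(List<Map<String, String>> records) {
        List<Marked> out = new ArrayList<>();
        for (Map<String, String> record : records) {
            out.add(mark(record));
        }
        return out;
    }

    /** Splits the marked stream: pass true for the good branch, false for the bad one. */
    static List<Marked> select(List<Marked> marked, boolean good) {
        List<Marked> out = new ArrayList<>();
        for (Marked m : marked) {
            if (m.good == good) {
                out.add(m);
            }
        }
        return out;
    }
}
```

Because marking and splitting are separate steps, the bad branch still carries each complete record along with the reason it was rejected, so it can be written out and inspected later instead of being discarded.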
Copyright © 2007-2008 Concurrent, Inc. All Rights Reserved.