To use XML Operations in a Cascading application, include the
cascading-xml-x.y.z.jar in the project. When using the TagSoupParser
operation, this module requires the TagSoup library, which provides
support for HTML and XML "tidying". More information is available at
the TagSoup website, http://home.ccil.org/~cowan/XML/tagsoup/.

The cascading.operation.xml.XPathParser function uses one or more XPath
expressions, passed into the constructor, to extract node values from
an XML document contained in the passed Tuple argument, and places the
results into new fields in the current Tuple. In this way, it
effectively parses an XML document into a table of fields, creating one
Tuple field value for every given XPath expression. Each resulting Node
is converted to a String containing an XML document. If the result of
an XPath expression is a NodeList, only the first Node is used for the
field value and the rest are ignored. If only the text values are
required, search on the text() nodes. To handle every Node in a
NodeList, consider using XPathGenerator.

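The "first Node only" behavior described above can be illustrated with a minimal sketch that uses only the JDK's javax.xml.xpath API. This is not Cascading's implementation: the class FirstNodeSketch, the method firstNodeAsString, and the sample XML are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; it uses only JDK XML APIs, not Cascading.
class FirstNodeSketch {

    // Evaluate one XPath expression and return the first matching node as a
    // String. Returns null when nothing matches.
    static String firstNodeAsString(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(expression, doc, XPathConstants.NODESET);
        if (nodes.getLength() == 0) {
            return null;
        }
        Node first = nodes.item(0); // any further nodes are ignored
        short type = first.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.ATTRIBUTE_NODE) {
            return first.getNodeValue(); // plain value, no markup
        }
        // Other nodes (typically elements): serialize the subtree as XML.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(first), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<people><person><name>Ada</name></person>"
            + "<person><name>Grace</name></person></people>";
        // Element match: the first <name> element, as XML.
        System.out.println(firstNodeAsString(xml, "/people/person/name"));
        // text() match: only the text of the first <name> element.
        System.out.println(firstNodeAsString(xml, "/people/person/name/text()"));
    }
}
```

In this sketch the second person never reaches the output, which is the situation where XPathGenerator may be the better fit.
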
Similar to XPathParser, the cascading.operation.xml.XPathGenerator
function emits a new Tuple for every Node returned by the given XPath
expression from the XML in the current Tuple.

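The one-result-per-Node idea can be sketched with the JDK's javax.xml.xpath API alone. This is an illustration under assumptions, not Cascading code: PerNodeSketch and valueOfEachNode are hypothetical names, and for brevity the sketch collects each node's text content into a list rather than emitting Tuples.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; it uses only JDK XML APIs, not Cascading.
class PerNodeSketch {

    // Return one entry per node matched by the expression, in document order.
    static List<String> valueOfEachNode(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(expression, doc, XPathConstants.NODESET);
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // Simplification: keep the text content instead of serialized XML.
            rows.add(nodes.item(i).getTextContent());
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<people><person><name>Ada</name></person>"
            + "<person><name>Grace</name></person></people>";
        System.out.println(valueOfEachNode(xml, "/people/person/name")); // prints [Ada, Grace]
    }
}
```
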
The filter cascading.operation.xml.XPathFilter removes a Tuple if the
specified XPath expression returns false. Set the removeMatch parameter
to true if the filter should be reversed, i.e., to remove those Tuples
where the XPath expression returns true and keep only those where it
returns false.

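The decision rule can be sketched with the JDK's javax.xml.xpath API. This reflects one reading of the description above (by default a Tuple is removed when the expression is false, and removeMatch set to true flips that); it is not taken from the Cascading source. FilterSketch and its isRemove method are hypothetical names for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; it uses only JDK XML APIs, not Cascading.
class FilterSketch {

    // Decide whether a record holding this XML would be removed.
    // Assumed rule: remove when the boolean XPath result equals removeMatch.
    static boolean isRemove(String xml, String expression, boolean removeMatch) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        // A node-set converts to true when it is non-empty.
        boolean matches = (Boolean) XPathFactory.newInstance().newXPath()
            .evaluate(expression, doc, XPathConstants.BOOLEAN);
        return matches == removeMatch;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<order status=\"paid\"/>";
        // Default direction: a matching document is kept.
        System.out.println(isRemove(xml, "/order[@status='paid']", false)); // prints false
        // Reversed direction: the same matching document is removed.
        System.out.println(isRemove(xml, "/order[@status='paid']", true));  // prints true
    }
}
```
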
The cascading.operation.xml.TagSoupParser function uses the TagSoup
library to convert incoming HTML to clean XHTML. Use the
setFeature( feature, value ) method to set TagSoup-specific features,
which are documented on the TagSoup website.

Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.