9.8 XML Operations

To use XML Operations in a Cascading application, include the cascading-xml-x.y.z.jar in the project. When using the TagSoupParser operation, this module requires the TagSoup library, which provides support for HTML and XML "tidying". More information is available at the TagSoup website, http://home.ccil.org/~cowan/XML/tagsoup/.

XPathParser

The cascading.operation.xml.XPathParser function takes one or more XPath expressions in its constructor and uses them to extract node values from an XML document contained in the argument Tuple, placing each result into a new field in the current Tuple. In effect, it parses an XML document into a table of fields, creating one Tuple field value for every given XPath expression. Each matched Node is converted to a String containing an XML document. If an XPath expression returns a NodeList, only the first Node is used for the field value and the rest are ignored. If only the text values are required, select the text() nodes; to handle every Node in a NodeList, consider using XPathGenerator instead.
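The "one field per expression, first Node only" behavior described above can be sketched with the JDK's own javax.xml.xpath API. This is an illustration of the semantics, not Cascading's implementation: the class XPathParserSketch, its parse method, the sample XML, the choice to omit the XML declaration, and the choice to leave unmatched fields null are all assumptions made for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration; not part of Cascading.
public class XPathParserSketch {

  // Returns one String per XPath expression, built from the first matching Node only.
  public static String[] parse(String xml, String... paths) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      XPath xpath = XPathFactory.newInstance().newXPath();
      Transformer serializer = TransformerFactory.newInstance().newTransformer();
      // Assumption: omit the XML declaration to keep the output short.
      serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

      String[] fields = new String[paths.length];
      for (int i = 0; i < paths.length; i++) {
        NodeList nodes = (NodeList) xpath.evaluate(paths[i], doc, XPathConstants.NODESET);
        if (nodes.getLength() == 0) {
          continue; // no match: this sketch leaves the field null
        }
        // Serialize only the first Node; any further matches are ignored.
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(nodes.item(0)), new StreamResult(out));
        fields[i] = out.toString();
      }
      return fields;
    } catch (Exception e) {
      throw new IllegalArgumentException("could not evaluate XPath", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<catalog><item><name>Widget</name></item>"
        + "<item><name>Gadget</name></item></catalog>";
    // Two expressions produce two field values; both expressions match two nodes,
    // but only the first match of each is used.
    String[] fields = parse(xml, "//name/text()", "//item");
    System.out.println(fields[0]);
    System.out.println(fields[1]);
  }
}
```

Selecting //name/text() yields only the text of the first name element, while //item yields the first item element serialized as XML, which mirrors the distinction between text values and Node values made above.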

XPathGenerator

Like XPathParser, the cascading.operation.xml.XPathGenerator function evaluates an XPath expression against the XML in the current Tuple, but it emits a new Tuple for every Node that the expression returns.
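The "one result per Node" behavior can be sketched in the same way with the JDK's javax.xml.xpath API. Again, this only illustrates the semantics and is not Cascading's implementation: the class XPathGeneratorSketch, its generate method, and the sample XML are hypothetical, and each returned String stands in for one emitted Tuple.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration; not part of Cascading.
public class XPathGeneratorSketch {

  // Returns one serialized String for every Node matched by the expression.
  public static List<String> generate(String xml, String path) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      XPath xpath = XPathFactory.newInstance().newXPath();
      Transformer serializer = TransformerFactory.newInstance().newTransformer();
      // Assumption: omit the XML declaration to keep the output short.
      serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

      NodeList nodes = (NodeList) xpath.evaluate(path, doc, XPathConstants.NODESET);
      List<String> rows = new ArrayList<>();
      for (int i = 0; i < nodes.getLength(); i++) {
        // Unlike the parser sketch, every Node in the list produces an output row.
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(nodes.item(i)), new StreamResult(out));
        rows.add(out.toString());
      }
      return rows;
    } catch (Exception e) {
      throw new IllegalArgumentException("could not evaluate XPath", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<catalog><item><name>Widget</name></item>"
        + "<item><name>Gadget</name></item></catalog>";
    // The expression matches two name elements, so two rows are produced.
    for (String row : generate(xml, "//name")) {
      System.out.println(row);
    }
  }
}
```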

XPathFilter

The cascading.operation.xml.XPathFilter filter removes a Tuple if the specified XPath expression returns false. Set the removeMatch parameter to true to reverse the filter, i.e., to remove Tuples where the XPath expression returns true and keep only those where it returns false.
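The removal decision can be sketched as a boolean XPath evaluation compared against the removeMatch flag. This is an illustration with the JDK's javax.xml.xpath API under the assumption that a Tuple is removed on a false result by default and on a true result when removeMatch is set; the class XPathFilterSketch, its isRemove method, and the sample XML are hypothetical and are not Cascading's implementation.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration; not part of Cascading.
public class XPathFilterSketch {

  // Returns true when the document should be removed from the stream.
  public static boolean isRemove(String xml, String path, boolean removeMatch) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      XPath xpath = XPathFactory.newInstance().newXPath();
      // A node-set result converts to true when it is non-empty.
      boolean matched = (Boolean) xpath.evaluate(path, doc, XPathConstants.BOOLEAN);
      // removeMatch == false: remove when the expression is false.
      // removeMatch == true:  remove when the expression is true.
      return matched == removeMatch;
    } catch (Exception e) {
      throw new IllegalArgumentException("could not evaluate XPath", e);
    }
  }

  public static void main(String[] args) {
    String active = "<order status=\"active\"/>";
    String closed = "<order status=\"closed\"/>";
    String path = "/order[@status='active']";
    System.out.println(isRemove(active, path, false)); // expression true, default mode
    System.out.println(isRemove(closed, path, false)); // expression false, default mode
    System.out.println(isRemove(active, path, true));  // expression true, reversed mode
  }
}
```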

TagSoupParser

The cascading.operation.xml.TagSoupParser function uses the TagSoup library to convert incoming HTML to clean XHTML. Use the setFeature( feature, value ) method to set TagSoup-specific features, which are documented on the TagSoup website.

Copyright © 2007-2012 Concurrent, Inc. All Rights Reserved.