Using the XML Normalizer Operator

Introduction

The XML Normalizer operator is a global Java operator that parses a designated field containing a string in XML format, and emits one tuple for each top-level element extracted from the XML string field. Each emitted tuple contains a user-defined set of string fields parsed from the input XML string, plus an optional field that reports any XML parsing errors. All fields in the input tuple other than the XML string field are optionally passed through unchanged to each emitted tuple, except input fields of type tuple or list, which are not supported and are emitted as null.

In the operator's Properties view, you define the top-level XML node to be parsed. You can specify an XML node at any level, so you are not restricted to parsing from the field's true top-level XML node on down. For example, in the following XML fragment (taken from this operator's sample), the fragment's true top-level XML node is <transactions>, but you are interested in parsing the individual <trade> nodes. You would therefore specify trade in the XML element being parsed field in the operator's Properties view.

<?xml version="1.0" encoding="UTF-8"?>
<transactions>
  <trade>
    <symbol market="NASDAQ">MSFT</symbol>
    <price>
      <value>25.48</value>
      <currency>USD</currency>
    </price>
    <volume>2000</volume>
  </trade>
  <trade>
    <symbol market="NYSE">IBM</symbol>
    <price>
      <value>164.25</value>
      <currency>USD</currency>
    </price>
    <volume>5000</volume>
  </trade>
  ...
</transactions>

You specify the set of XML element values to be extracted using XPath selection specifications in the operator's Properties view. The operator emits one tuple per designated top-level node, with each extracted field emitted as a string value. You can optionally append a field to contain the text of any XML parsing error encountered. After that field, you can optionally append all other non-XML fields from the input tuple; the same values are appended to every emitted tuple.
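The normalization described above can be approximated outside StreamBase with a minimal Python sketch. It is not the operator's implementation: the function name normalize, the output field names, and the FeedId pass-through field are hypothetical, and the sample XML is condensed to a single line.

```python
import xml.etree.ElementTree as ET

XML_FIELD = (
    '<transactions>'
    '<trade><symbol market="NASDAQ">MSFT</symbol>'
    '<price><value>25.48</value><currency>USD</currency></price>'
    '<volume>2000</volume></trade>'
    '<trade><symbol market="NYSE">IBM</symbol>'
    '<price><value>164.25</value><currency>USD</currency></price>'
    '<volume>5000</volume></trade>'
    '</transactions>'
)

def normalize(xml_text, element, paths, passthrough):
    """Return one dict per <element>; every extracted value stays a string."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.iter(element):
        # One output field per requested path, in the requested order.
        row = {name: node.findtext(path) for name, path in paths.items()}
        # Hypothetical pass-through fields: same values on every emitted row.
        row.update(passthrough)
        rows.append(row)
    return rows

rows = normalize(
    XML_FIELD,
    "trade",
    {"Symbol": "symbol", "Price": "price/value", "Volume": "volume"},
    {"FeedId": 7},
)
for row in rows:
    print(row)
```

Two rows result, one per <trade> node, and the numeric-looking values remain strings.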

Placing an XML Normalizer Operator on the Canvas

Select the XML Normalizer operator from the Insert an Operator or Adapter dialog, which you invoke with one of the following methods:

Drag the Adapters, Java Operators token from the Operators and Adapters drawer of the Palette view to the canvas.
Click in the canvas where you want to place the operator, and invoke the keyboard shortcut O V.
From the top-level menu, invoke Insert → Operator → Java.

From the Insert an Operator or Adapter dialog that opens, select XML Normalizer, then double-click it or click OK.

Limitations

Consider the following limitations when planning to use the XML Normalizer operator.

No Line Breaks in Incoming XML String Field Under Many Circumstances

For some input environments, you must strip line ending characters from the incoming XML string before it arrives at the XML Normalizer operator. For example, the XML field shown above must be input as one long string: <transactions><trade><symbol market="NASDAQ">MSFT</symbol><price><value>... and so on.

These environments include the Manual Input view on Windows, the sbc enqueue command line, and CSV input files that contain embedded XML fields.

The Manual Input view on Windows sees the first line ending character as the end of the current field. (The same is not true of the Manual Input view in Studio on Linux, which accepts line ending characters as part of the field.) CSV files are by definition line-oriented, with one line for each CSV record; thus any XML field embedded in a CSV file must have its line ending characters stripped. In general, the sbc enqueue command line is also line-oriented, and sees the first line ending character as the end of the first input line.
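One way to prepare such input is to collapse the XML to a single line before it reaches a line-oriented source. The following Python sketch assumes that whitespace between tags is insignificant for your data (no mixed content or CDATA sections that depend on it); the function name to_single_line is hypothetical.

```python
import re

def to_single_line(xml_text):
    """Collapse pretty-printed XML to one line with no line endings."""
    # Remove whitespace-only runs that sit between one tag and the next.
    compact = re.sub(r">\s+<", "><", xml_text.strip())
    # Replace any remaining line endings (inside text or tags) with a space.
    return re.sub(r"\s*[\r\n]+\s*", " ", compact)

sample = """<transactions>
  <trade>
    <symbol market="NASDAQ">MSFT</symbol>
  </trade>
</transactions>"""

one_line = to_single_line(sample)
print(one_line)
```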

No Support for List or Tuple Data Types

The XML Normalizer operator does not support tuples with fields of list or tuple data types. The consequences are:

The input field containing the XML data to be parsed must be a string field before it reaches the XML Normalizer operator. If a market data feed delivers XML data in tuple or list format, you must first convert that data into a string field.
If you elect to pass the input tuple's non-XML fields to the output tuple, any fields of type list or tuple are stripped of their contents and passed as null.
The operator does not emit a tuple field containing the parsed XML elements. Instead, it emits a series of string fields, one per requested element. In downstream processing, you can map those string fields into a single tuple field.

All XML Fields Emitted as Strings

Each field parsed from the incoming XML string is emitted as a string field, including fields that contain numeric data. In downstream processing, you can use a Map operator to convert numeric data in XML fields to a StreamBase numeric data type.
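The kind of downstream conversion described above is sketched here in Python rather than in a Map operator expression. The helper to_number and the field names are hypothetical; the point is only that null values need to be preserved and each string is converted to the numeric type you choose.

```python
def to_number(text, kind):
    """Convert an extracted string to int or float; keep null as None."""
    if text is None:
        return None
    return kind(text.strip())

# A row as the operator would emit it: every value is a string.
row = {"Symbol": "MSFT", "Price": "25.48", "Volume": "2000"}

typed = {
    "Symbol": row["Symbol"],
    "Price": to_number(row["Price"], float),
    "Volume": to_number(row["Volume"], int),
}
print(typed)
```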

Input Tuple Field Order is Rearranged on Output

The operator always emits a tuple in the following field order, independent of the placement of the XML string field in the input tuple:

Each requested field parsed from the XML string in requested order.
Optionally, a field you name to contain the text of any XML parsing error encountered.
Optionally, all other non-XML fields in the input tuple, in incoming field order.

This means that if you start out with fields A, B, and C in the input tuple, of which C is the XML string field, and you elect to include the non-XML fields in the output, the output tuple's fields are in the following order: C1, C2, ... Cn, A, B, where C1 through Cn are the requested XML elements originally in field C. You can use a Map operator downstream to restore the original field order, if your application requires it.
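The ordering rule can be expressed as a small Python sketch. The function output_field_order and its parameter names are hypothetical and serve only to illustrate the three-part order listed above.

```python
def output_field_order(input_fields, xml_field, parsed_fields,
                       error_field=None, pass_input=True):
    """Return output field names in the order described above."""
    order = list(parsed_fields)          # 1. requested XML fields, in requested order
    if error_field:
        order.append(error_field)        # 2. optional error message field
    if pass_input:
        # 3. optional non-XML input fields, in incoming order
        order.extend(f for f in input_fields if f != xml_field)
    return order

print(output_field_order(["A", "B", "C"], "C", ["C1", "C2", "C3"]))
print(output_field_order(["A", "C", "B"], "C", ["C1", "C2"], error_field="XmlError"))
```

Moving the XML string field within the input tuple, as in the second call, does not change where the parsed fields land in the output.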

Parsing Stops at the First XML Error

The XML Normalizer operator stops after encountering the first XML parsing error in the input field. Verify upstream that the input field contains well-formed, valid XML before passing it to the XML Normalizer operator for parsing.
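An upstream check of this kind might look like the following Python sketch. It tests well-formedness only; Python's ElementTree does not validate against a DTD or schema, so validity checking would need a separate tool. The function name first_xml_error is hypothetical.

```python
import xml.etree.ElementTree as ET

def first_xml_error(xml_text):
    """Return the parser's message for the first error, or None if well-formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None

good = "<transactions><trade><volume>2000</volume></trade></transactions>"
bad = "<transactions><trade><volume>2000</trade></transactions>"

print(first_xml_error(good))
print(first_xml_error(bad))
```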

Properties View Settings

This section describes the properties you can set for an XML Normalizer operator, using the various tabs of the Properties view in StreamBase Studio.

General Tab

Name: Use this field to specify or change the component's name, which must be unique in the application. The name must contain only alphabetic characters, numbers, and underscores, and no hyphens or other special characters. The first character must be alphabetic or an underscore.

Operator: A read-only field that shows the formal name of the operator.

Class: A field that shows the fully qualified class name that implements the functionality of this operator. Use this class name when loading the operator in StreamSQL programs with the APPLY JAVA statement. You can right-click this field and select Copy from the context menu to place the full class name in the system clipboard.

Start with application: If this field is set to Yes or to a module parameter that evaluates to true, an instance of this operator starts as part of the containing StreamBase Server. If this field is set to No or to a module parameter that evaluates to false, the operator is loaded with the server, but does not start until you send an sbadmin resume command, or until you start the component with StreamBase Manager. With this option set to No or false, the operator does not start even if the application as a whole is suspended and later resumed. The recommended setting is selected by default.

Enable Error Output Port: Select this check box to add an Error Port to this component. In the EventFlow canvas, the Error Port shows as a red output port, always the last port for the component. See Using Error Ports and Error Streams to learn about Error Ports.

Description: Optionally enter text to briefly describe the component's purpose and function. In the EventFlow canvas, you can see the description by pressing Ctrl while the component's tooltip is displayed.

Operator Properties Tab

This section describes the properties on the Operator Properties tab in the Properties view for the XML Normalizer operator. Enter all text fields as string literals, not as expressions.

Input field

Specifies the name of the string field in the input tuple that contains a string of well-formed, valid XML to be parsed.

XML element being parsed

Specifies the name (without angle brackets) of the XML element in the incoming XML field for which you want one tuple emitted for each occurrence in the field.

List of elements to be returned

A grid in which you enter a sequence of XPath-compliant specifications for selecting values from the incoming XML string. Specify each XPath specification in the order in which you want them to appear in the output tuple. The operator supports a subset of XPath-compliant parsing strings, in one of the following formats:

Simple node: The name of an XML element one level down from the top-level element specified in the XML element being parsed field. For the XML example shown above, symbol, price, and volume are simple nodes.
Hierarchical node: The path to an XML element farther down the XML hierarchy than a simple node, using a slash as a path separator. For the example above, price/value and price/currency are hierarchical nodes.
Attribute node: The path to an attribute for an element at any level, using an at-sign to designate the attribute name. For the example above, symbol@market is a valid attribute node.
Attribute predicate: The XPath notation for an attribute with a particular value. For the example above, symbol[@market="NYSE"] and symbol[@market="NASDAQ"] are valid attribute predicates. These specifications return the symbol whose market attribute matches the specification, or return null for non-matching nodes. Attribute predicates do not return the specified value of the attribute ("NYSE" or "NASDAQ" for the example), but return the text of the element whose attribute matches ("MSFT" or "IBM" for the example).
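The four specification formats can be illustrated with a Python sketch that resolves each one against a single <trade> element. This is an approximation using Python's ElementTree, not the operator's own parser; the function name select is hypothetical, and the handling of the symbol@market form is an assumption about how that notation maps to an element path plus an attribute name.

```python
import xml.etree.ElementTree as ET

TRADE = ET.fromstring(
    '<trade><symbol market="NYSE">IBM</symbol>'
    '<price><value>164.25</value><currency>USD</currency></price>'
    '<volume>5000</volume></trade>'
)

def select(node, spec):
    """Resolve one selection string against an element; None if nothing matches."""
    if "[" not in spec and "@" in spec:
        # Attribute node such as symbol@market: element path, then attribute name.
        path, attr = spec.split("@", 1)
        path = path.rstrip("/")
        target = node.find(path) if path else node
        return None if target is None else target.get(attr)
    # Simple nodes, hierarchical nodes, and attribute predicates are all
    # path expressions that ElementTree's limited XPath support accepts.
    return node.findtext(spec)

for spec in ("volume", "price/currency", "symbol@market",
             'symbol[@market="NYSE"]', 'symbol[@market="NASDAQ"]'):
    print(spec, "->", select(TRADE, spec))
```

The last two lines of output show the predicate behavior described above: the matching predicate yields the element text IBM, and the non-matching predicate yields None.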

List of output fields for parsed elements

A grid in which you specify the output field name for each XPath-selected value specified in the previous grid. This grid must contain one field name for each line in the List of elements to be returned grid.

Pass input fields

Select this check box (the default state) to include all non-XML fields from the input tuple in each output tuple. Clear the check box to send only the extracted XML fields to each output tuple.

Field for per-tuple error message (Optional)

If this property is blank (the default state), XML parsing errors halt further parsing, but go unreported. To report errors, enter the name of a field to be appended to the output tuple after the parsed XML fields. This creates an appended string field that contains null for successfully parsed XML nodes. In the event of an XML parsing error, the output tuple contains null for all requested XML element values (even for fields that were successfully parsed), and this error message field contains the text of the XML parsing error. Since the operator halts after the first parsing error, in practice, this field is only non-null for the last emitted tuple, which is the one with the parsing error.
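The error-field behavior described above can be mimicked with an incremental parser, as in the following Python sketch. It is an illustration under assumptions, not the operator's implementation: the function name normalize_with_errors and the field name XmlError are hypothetical, and the error text comes from Python's parser rather than from StreamBase.

```python
import io
import xml.etree.ElementTree as ET

def normalize_with_errors(xml_text, element, paths, error_field):
    """Emit one row per parsed element, then one all-null row carrying the error."""
    rows = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    try:
        for _event, node in ET.iterparse(source, events=("end",)):
            if node.tag == element:
                row = {name: node.findtext(path) for name, path in paths.items()}
                row[error_field] = None      # null for successfully parsed nodes
                rows.append(row)
    except ET.ParseError as exc:
        # Parsing halts here: all requested values are null, error text is set.
        row = {name: None for name in paths}
        row[error_field] = str(exc)
        rows.append(row)
    return rows

# The second <trade> is malformed: <symbol> is never closed.
broken = ("<transactions><trade><symbol>MSFT</symbol></trade>"
          "<trade><symbol>IBM</trade></transactions>")

rows = normalize_with_errors(broken, "trade", {"Symbol": "symbol"}, "XmlError")
for row in rows:
    print(row)
```

Only the final row has a non-null error field, and its Symbol value is null even though the text IBM was read before the error occurred.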

Concurrency Tab

Use the Concurrency tab to specify parallel regions for this instance of this component, or multiplicity options, or both. The Concurrency tab settings are described in Concurrency Options, and dispatch styles are described in Dispatch Styles.

Caution

Concurrency settings are not suitable for every application, and using these settings requires a thorough analysis of your application. For details, see Execution Order and Concurrency, which includes important guidelines for using the concurrency options.