Using the TERR Operator

Introduction

The TIBCO StreamBase® TIBCO Enterprise Runtime for R operator (hereafter, the TERR operator) allows StreamBase to use TIBCO's implementation of the R language to analyse and manipulate data.

Placing a TERR Operator on the Canvas

The TERR operator is a member of the Java Operators group in the Palette view in StreamBase Studio. Select the operator from the Insert an Operator or Adapter dialog. Invoke the dialog with one of the following methods:

Drag the Adapters, Java Operators token from the Operators and Adapters drawer of the Palette view to the canvas.
Click on the canvas where you want to place the operator, then invoke the keyboard shortcut O V.
From the top-level menu, invoke Insert → Operator → Java.

Prerequisites

To run correctly, the operator requires that the machine running StreamBase Server and your application has a 64-bit version of TERR version 2.7 or later installed locally. The TERR operator has been tested and validated with TERR versions 2.7, 3.0, 3.1, and 3.2.

The TERR bin directory does not need to be in the system PATH, and no environment variables are required. The TERR operator recognizes and honors the TERR_HOME environment variable if set, and if it points to the local TERR installation directory; however, setting TERR_HOME is not required.

TIBCO customers can download TERR from http://edelivery.tibco.com, or download an evaluation copy of TERR from the TIBCO Access Point.

For Linux

TERR is only provided for 64-bit Linux. Download the tar file provided; untar the file into a temporary local directory, and run the ./INSTALL file provided. The default installation directory is /opt/tibco/terrver, where ver is the TERR version number.

For Windows

Download the zip file provided; unzip the file to find a single installer executable. Run this installer and accept its suggested default location (C:\Program Files\TIBCO\terrver) or install into the currently recommended location (C:\TIBCO\terrver), where ver is the TERR version number.

On Windows, the TERR installer provides both 32-bit and 64-bit versions of the TERR runtime code. When run on 64-bit Windows, the 64-bit version of TERR is automatically used. Since StreamBase now supports only 64-bit Windows, it uses the 64-bit version of TERR.

To connect StreamBase and its TERR operator to your local TERR installation, you must either:

Set the TERR Home property in the Engine Options tab of each TERR operator's Properties view, providing the full, absolute path to the TERR installation directory.
Set the TERR_HOME environment variable to point to the full, absolute path to the TERR installation directory. Use this method if you anticipate using many TERR operator instances in your StreamBase applications.
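
The lookup order for the TERR installation directory (the operator's TERR Home property first, then the TERR_HOME environment variable, otherwise an error) can be sketched in Python as follows. This is only an illustration of the documented precedence, not the operator's actual code. The function name resolve_terr_home and the use of ValueError to stand in for a typecheck error are hypothetical.

```python
import os


def resolve_terr_home(property_value=None, environ=None):
    """Return the TERR installation directory using the documented precedence."""
    env = os.environ if environ is None else environ
    # 1. A non-empty TERR Home operator property is used first.
    if property_value:
        return property_value
    # 2. Otherwise fall back to the TERR_HOME environment variable.
    env_value = env.get("TERR_HOME")
    if env_value:
        return env_value
    # 3. Neither is set: this stands in for the operator's typecheck error.
    raise ValueError("TERR Home is not set and TERR_HOME is not defined")
```

Passing the environment as a parameter keeps the sketch easy to exercise without changing the real process environment.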

How the Operator Works

This operator creates an external TERR process, which it then uses to run R scripts and retrieve the results. For each input tuple, the values in the tuple are merged with the supplied script, and the script is sent to the TERR process for execution. When the script finishes, the result is retrieved and translated into an output tuple.

The script can be supplied as data in the operator itself or can be read from a resource file. The script is a standard R script with one addition: fields in the script that are to be replaced by tuple data must have the format $[name] where name is the name of an input tuple field. The data in the tuple can be any supported simple data type or list of a supported simple data type. The supported StreamBase data types for input are: string, int, boolean, double, list, and timestamp (which is converted to a string). TERR result types supported are: string, int, boolean, double, array (list), byte, factor, and dataFrame.

Script substitution is done on a purely textual basis and it is up to R to parse the results. For example, the string "3", the integer 3 and the float 3 all appear the same when inserted into a script. On input, lists are handled a little differently, with the outer brackets removed. For example, the integer list [1,2,3] is converted to the string "1,2,3". Check the context of every substitution variable carefully. In general, if a substitution can be either a scalar value or a list, place it in a c() construct such as c($[var]), which handles both the scalar and list cases.

Booleans are treated specially. On input, a true is converted to TRUE, false is converted to FALSE and null is converted to NA; on output, the opposite conversion is done. For numbers, null is converted to NA.
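
The textual substitution described above can be modeled with a short Python sketch. It is a simplified illustration under stated assumptions, not the operator's implementation: the names PLACEHOLDER, to_r_text, and render_script are hypothetical; string values are wrapped in double quotes, following the worked example in this section; null is rendered as NA for every type, although the text states this only for booleans and numbers; and timestamps are omitted.

```python
import re

# Matches substitution markers of the form $[name].
PLACEHOLDER = re.compile(r"\$\[(\w+)\]")


def to_r_text(value):
    """Render one tuple field value as the text inserted into the script."""
    if value is None:
        return "NA"  # assumption: null becomes NA for all types
    if isinstance(value, bool):  # checked before numbers: bool is an int subtype
        return "TRUE" if value else "FALSE"
    if isinstance(value, str):
        return '"' + value + '"'  # assumption: no escaping is applied
    if isinstance(value, list):
        # Lists lose their outer brackets; elements are comma-separated.
        return ", ".join(to_r_text(item) for item in value)
    return str(value)


def render_script(template, tuple_fields):
    """Replace every $[name] in the template with the matching field's text."""
    return PLACEHOLDER.sub(lambda m: to_r_text(tuple_fields[m.group(1)]), template)


script = "n <- c($[names])\ns <- c($[vals])\nb <- c($[bools])"
fields = {"names": ["one", "two"], "vals": [2, 4], "bools": [True, None]}
print(render_script(script, fields))
```

Because a list is rendered without brackets, the same template works for scalar and list inputs only when the marker sits inside c().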

The operator can be configured to watch the script and dataset files. If either one changes, it is reloaded before the next tuple is processed. This is useful for script development as the script can be in an editor, changes made and the file saved before sending a new tuple.

Note

The file system monitoring feature of the TERR operator is supported on local file systems only, and not on remote mounts. This feature is based on code internal to the TERR operator, and does not depend on the TIBCO StreamBase® File Monitor Adapter.

A single TERR instance can be shared among multiple operators within a container, or each operator can have its own instance. If sharing, take care to ensure that the same initial dataset is specified for each use. This is because the TERR instance is started and initialized by the first instance of the operator to run and the other operators will use the already existing TERR instance. The various parameters used in starting the TERR instance should also be the same for the same reason. Startup parameters are discussed later.
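
The reason shared operators should agree on their startup settings can be shown with a toy model in Python. This is a conceptual sketch only: the class InstanceRegistry, its method, and the dataset file names are hypothetical and do not correspond to any StreamBase or TERR API.

```python
class InstanceRegistry:
    """Toy model: the first operator to request a name starts the instance."""

    def __init__(self):
        self._instances = {}

    def get_or_start(self, name, dataset, engine_params):
        if name not in self._instances:
            # First caller: the instance is started with this caller's settings.
            self._instances[name] = {"dataset": dataset, "engine_params": engine_params}
        # Later callers receive the existing instance; their settings are ignored.
        return self._instances[name]


registry = InstanceRegistry()
first = registry.get_or_start("shared", "model_a.RData", "")
second = registry.get_or_start("shared", "model_b.RData", "")
print(second["dataset"])  # model_a.RData: the second operator's dataset was not loaded
```

In this model the second operator silently runs against the first operator's dataset, which is the situation the guidance above is meant to avoid.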

An example script along with some input tuples and results are shown next. First, a TERR script:

n <- $[names]
s <- $[vals]
b <- $[bools]
result <- data.frame(n, s, b)

The input tuple (names="one", vals=2, bools=true) results in:

n <- "one"
s <- 2
b <- TRUE
result <- data.frame(n, s, b)

The input tuple (names=["one","two"], vals=[2,4], bools=[true,null]) results in a TERR parse error, because by default TERR considers each comma as the start of a new field. The solution is to enclose the field variables in c() constructs in your TERR script, like so:

n <- c($[names])
s <- c($[vals])
b <- c($[bools])
result <- data.frame(n, s, b)

In this case, with the same input tuple as above, the results are:

n <- c("one", "two")
s <- c(2, 4)
b <- c(TRUE, NA)
result <- data.frame(n, s, b)

Properties View Settings

This section describes the properties you can set for the TERR operator, using the various tabs of the Properties view in StreamBase Studio.

In the tables in this section, the Property column shows each property name as found in the one or more adapter properties tabs of the Properties view for this adapter.

Use the StreamSQL names of the adapter's properties when using this adapter in a StreamSQL program with the APPLY JAVA statement.

General Tab

Name: Use this field to specify or change the component's name, which must be unique in the application. The name must contain only alphabetic characters, numbers, and underscores, and no hyphens or other special characters. The first character must be alphabetic or an underscore.

Operator: A read-only field that shows the formal name of the operator.

Class: A field that shows the fully qualified class name that implements the functionality of this operator. Use this class name when loading the operator in StreamSQL programs with the APPLY JAVA statement. You can right-click this field and select Copy from the context menu to place the full class name in the system clipboard.

Start with application: If this field is set to Yes or to a module parameter that evaluates to true, an instance of this operator starts as part of the containing StreamBase Server. If this field is set to No or to a module parameter that evaluates to false, the adapter is loaded with the server, but does not start until you send an sbadmin resume command, or until you start the component with StreamBase Manager. With this option set to No or false, the operator does not start even if the application as a whole is suspended and later resumed. The recommended setting is selected by default.

Enable Error Output Port: Select this check box to add an Error Port to this component. In the EventFlow canvas, the Error Port shows as a red output port, always the last port for the component. See Using Error Ports and Error Streams to learn about Error Ports.

Description: Optionally enter text to briefly describe the component's purpose and function. In the EventFlow canvas, you can see the description by pressing Ctrl while the component's tooltip is displayed.
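
The naming rule for the Name field can be expressed as a regular expression. The Python sketch below assumes that "alphabetic characters" means the ASCII letters A to Z and a to z; the names NAME_PATTERN and is_valid_component_name are hypothetical, and uniqueness within the application is not checked.

```python
import re

# First character: a letter or underscore; the rest: letters, digits, underscores.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_component_name(name):
    """Return True if the name satisfies the character rules for component names."""
    return NAME_PATTERN.fullmatch(name) is not None
```

For example, TERR_1 and _scorer pass this check, while 1terr (leading digit) and terr-op (hyphen) do not.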

Operator Properties Tab

Property	Data Type	Default	Description	StreamSQL Property
Log Level	drop-down list	INFO	Controls the level of verbosity the adapter uses to send notifications to the console. This setting can be higher than the containing application's log level. If set lower, the system log level will be used. Available values, in increasing order of verbosity, are: OFF, ERROR, WARN, INFO, DEBUG, TRACE, and ALL.	LogLevel
Reload files when changed	check box	Cleared	If selected, the data file and the script file (if selected) are monitored for changes. If either one changes, it is loaded the next time a tuple is to be processed.	WatchFiles
Output Status Tuples	check box	Cleared	Select this check box to have a status tuple emitted on the status output stream for each input tuple. The status tuple includes any errors generated by the script for this tuple.	SendStatusTuples
TERR instance to use	string	Cleared	The name of the TERR instance to use in this operator.	WhichTERRInstances

Model Tab

Property	Data Type	Default	Description	StreamSQL Property
Load saved R datasets from file into engine	check box	Cleared	Determines whether an initial dataset is loaded into the TERR instance when started. If one is initially loaded and Reload files when changed is also selected, the dataset is reloaded if changed on disk.	LoadModel
Data file	drop-down list	Cleared	The name of a resource file to load on TERR initialization. The drop-down list shows all resource files found on the current project's resource search path.	ScriptModel

Script Options Tab

Property	Data Type	Default	Description	StreamSQL Property
Result variable	string	Cleared	The name of the R variable that will be retrieved as the result of the script.	ResultVarName
Character Set	Drop-down list	UTF-8	Specifies the character set to be used when reading a script file from disk.	TerrCharset
Script Source	Radio button	Script Text	The source from which to get the script to send to TERR. File: Read the script from a resource file. Script text: Use the local text box to enter the script.	ScriptSource
Script file	Drop-down list	Cleared	Active only when Script Source is File. In the drop-down list, select the resource file that contains the script.	ScriptLocation
Script text	string	Empty	Active only when Script Source is Script text. Specifies content of the script to be used to process input tuples.	ScriptText

Engine Options Tab

These options are used to set up the TERR instance.

Property	Data Type	Default	Description	StreamSQL Property
TERR Home	string	Cleared	The directory in which TERR is installed. If no value is specified, the operator uses the value of the TERR_HOME environment variable, if present. If neither this field nor TERR_HOME is specified, the result is a typecheck error.	TerrHome
Processor Affinity	string	Cleared	A zero-based, comma-separated list of integers representing processor cores that the TERR process should execute on. Hyperthreaded cores count as cores. An invalid core number for the current CPU causes a typecheck error. For example, on a four-core (2C, 2T) machine, the entry `2,3` specifies an affinity to run on the third and fourth cores.	ProcessorAffinity
TERR engine parameters	string	Cleared	A parameter string sent to the TERR engine on initialization. See the TERR documentation for usage. Either type a value in the field, or select from a list of values you have entered in the containing project's sbconf file as `<adapter-configurations>` elements.	TerrEngineParameters
TERR environment	table	Cleared	A set of Key-Value pairs that are used in the initial startup environment for the TERR engine. See the TERR documentation for usage.	TerrEnvironment
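
The format of the Processor Affinity value (a zero-based, comma-separated list of core numbers, each of which must exist on the current machine) can be illustrated with a small Python parser. This is a sketch of the documented rules only; the function name parse_processor_affinity is hypothetical, ValueError stands in for the typecheck error, and treating an empty value as "no affinity" is an assumption.

```python
import os


def parse_processor_affinity(text, core_count=None):
    """Parse a string such as '2,3' into a list of zero-based core numbers."""
    if core_count is None:
        core_count = os.cpu_count() or 1  # cpu_count() can return None
    if not text.strip():
        return []  # assumption: a cleared property means no affinity is applied
    cores = []
    for part in text.split(","):
        core = int(part.strip())  # raises ValueError for non-integer entries
        if not 0 <= core < core_count:
            raise ValueError(f"core {core} is outside the range 0..{core_count - 1}")
        cores.append(core)
    return cores
```

On a machine that exposes four cores, parse_processor_affinity("2,3", core_count=4) selects the third and fourth cores, while an entry of 4 is rejected because core numbering starts at zero.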

Java Options Tab

These options are used to adjust the Java environment for the TERR instance.

Property	Data Type	Default	Description	StreamSQL Property
Java installation to use	Radio button	StreamBase	Determines where to get the Java version to use to run TERR. In most cases the default, StreamBase, is appropriate, but the TERR operator is not required to use the same Java version as StreamBase, because they run in different JVM instances. StreamBase: Use the same Java version as StreamBase. Custom: Use the directory entered in the next option as the Java home for TERR.	UseSBJava
Java home	string	Cleared	Active only when Java installation to use is set to Custom. Specifies the directory containing the Java installation to use to run the TERR instance.	JavaHOME

Schemas Tab

Use the Schemas tab to specify the schema of the output tuple for this adapter.

For general instructions on using the Edit Schema tab, see the Properties: Edit Schema Tab section of the Defining Input Streams page.

A custom schema should use the same field names as the default schema, because the operator matches on those names when it fills the output tuple.

Concurrency Tab

Use the Concurrency tab to specify parallel regions for this instance of this component, or multiplicity options, or both. The Concurrency tab settings are described in Concurrency Options, and dispatch styles are described in Dispatch Styles.

Caution

Concurrency settings are not suitable for every application, and using these settings requires a thorough analysis of your application. For details, see Execution Order and Concurrency, which includes important guidelines for using the concurrency options.

Input Port

The TERR operator has one input port, whose schema is determined by the TERR script loaded at operator start time. Any updated script loaded dynamically during run time must specify inputs that have at least the same field names and data types as the initially loaded script.

At run time, input tuples must have at least one field whose name and data type exactly match one of the fields specified by the TERR script. Input tuples do not need to fill all fields in the TERR script, and the field order of input tuples does not need to match the TERR script's field order.

Output Ports

The TERR operator has two output ports: an optional status port, and a result port.

Status Port

The status port emits tuples that describe the status of processing each input tuple. It is only present when the Output Status Tuples option is selected. The schema of the output tuple consists of four strings:

Field Name	Field Type	Description
Type	String	The type of report, usually `Status`.
Action	String	The action that caused the report.
Message	String	The result reported by TERR of running the script. Examples: `Success`, `Parse Error`.
Object	String	Any extra information about the operator.

Result Port

The default schema of the result port consists of:

The input tuple, which is passed through unchanged.
One added field named terrResult, which is a tuple of tuples and contains the results of TERR processing the provided script for each input tuple.

Because the TERR operator cannot know the data type of the TERR result in advance, the terrResult field contains a subfield for each possible TERR result data type. Only one terrResult subfield is filled in per input tuple, depending on the value assigned to the result variable by the script. The other subfields are left empty (null) for each input tuple.

The top-level schema of the terrResult field is shown in the following table:

Subfield Name	Field Type	Description
double	tuple	The result was a double or array of doubles.
integer	tuple	The result was an integer or array of integers.
boolean	tuple	The result was a boolean or array of booleans.
string	tuple	The result was a string or list of strings.
dataFrame	tuple	The result was an R dataFrame, which is comparable to a StreamBase tuple.
byte	tuple	The result was a byte or array of bytes (returned as StreamBase ints).
list	tuple	The result was an R array, comparable to a StreamBase list.
factor	tuple	The result was an R factor.

Each terrResult subfield is a tuple of lists. The first third-level field of every terrResult subfield is named names. This is a list of one or more names of the result fields of the input script.

The scalar subfields

The five scalar subfields of terrResult are double, integer, boolean, string, and byte. The schema of each of these subfields is the same: a list of returned script field names and a list of returned values corresponding to each of the names. If the result is a scalar, it is still returned as a list of one item. If the result is a multi-dimensional array, it is flattened to a vector.

Third-level Field Name	Type	Description
names	list of strings	List of one or more TERR script field names.
values	list of `type`	List of returned values for each script field in `names`.
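
Because only one terrResult subfield is filled in per input tuple, code that consumes the result first has to find that subfield. The Python sketch below models a result tuple as a dictionary, which is an assumption made for illustration; the names SUBFIELDS and filled_subfield and the sample values are hypothetical.

```python
# Subfield names of terrResult, as listed in the table above.
SUBFIELDS = ("double", "integer", "boolean", "string",
             "dataFrame", "byte", "list", "factor")


def filled_subfield(terr_result):
    """Return (subfield_name, subfield_value) for the single non-null subfield."""
    filled = [(name, terr_result[name])
              for name in SUBFIELDS
              if terr_result.get(name) is not None]
    if len(filled) != 1:
        raise ValueError(f"expected exactly one filled subfield, found {len(filled)}")
    return filled[0]


# A scalar double result is still delivered as a list of one item.
terr_result = {name: None for name in SUBFIELDS}
terr_result["double"] = {"names": ["result"], "values": [3.5]}

kind, payload = filled_subfield(terr_result)
print(kind, payload["values"][0])
```

With a custom schema that contains only the expected subfield, this lookup is unnecessary because the result type is known in advance.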

The dataFrame subfield

The dataFrame subfield of terrResult consists of a list of names, plus zero or more lists of integers, doubles, logicals (booleans), factors, strings, or bytes. If more than one of a type occurs, the resulting list contains the concatenation of the lists.

Third-level Field Name	Type	Description
names	list of strings	List of one or more TERR script field names.
integers	list of tuples	List of zero or more returned integer names-values pairs.
doubles	list of tuples	List of zero or more returned double names-values pairs.
logicals	list of tuples	List of zero or more returned boolean names-values pairs.
factors	list of tuples	List of zero or more returned R factor names-indexes-levels triplets.
strings	list of tuples	List of zero or more returned string names-values pairs.
bytes	list of tuples	List of zero or more returned byte names-values pairs.

The list subfield

The list subfield of terrResult consists of a list of names, plus zero or more lists of tuples of the five scalar types plus factors. The schema of the list subfield is the same as for the dataFrame subfield.

The factor subfield

The factor subfield of terrResult consists of three lists: names, indexes, and levels. See the TERR documentation for an explanation of the R factor datatype.

Third-level Field Name	Type	Description
names	list of strings	List of one or more TERR script field names.
indexes	list of integers	List of zero or more returned index values.
levels	list of strings	List of zero or more returned level values.
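
A factor's indexes and levels lists can be combined to recover the original labels. The Python sketch below assumes that the indexes are 1-based positions into levels, as R factor codes are, and that a missing value arrives as null; the source does not state either point, and the function name decode_factor is hypothetical.

```python
def decode_factor(indexes, levels):
    """Map 1-based factor indexes to their level labels; null stays null."""
    return [None if index is None else levels[index - 1] for index in indexes]


labels = decode_factor([1, 2, 1, None], ["low", "high"])
print(labels)
```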

You can specify a custom schema that contains only the result data you know to expect. For a custom schema, the names of the fields must be the same as in the default schema described above, and the sub-schemas must also match exactly.

Typechecking and Error Handling

Typechecking fails if any required fields are not filled in. It also fails if the input schema does not contain all the replacement variables that the script needs. All specified dataset and script files are checked for existence and typechecking fails if any file is not accessible. A result variable must be present, although the TERR script is not checked to see if it uses it. A script must be specified either as local data or as a resource file. The TerrHome parameter must be set so that the process can be started.

All errors in the execution of the script are logged and an optional status tuple is emitted.
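
The typecheck rule that the input schema must contain every replacement variable used by the script can be sketched in Python as follows. This models the rule only and is not the operator's code; the names PLACEHOLDER and missing_variables are hypothetical, and data type checks are left out.

```python
import re

# Matches substitution markers of the form $[name].
PLACEHOLDER = re.compile(r"\$\[(\w+)\]")


def missing_variables(script, input_field_names):
    """Return the sorted script variables that have no matching input field."""
    needed = set(PLACEHOLDER.findall(script))
    return sorted(needed - set(input_field_names))


script = "n <- c($[names])\ns <- c($[vals])\nresult <- data.frame(n, s)"
print(missing_variables(script, ["names"]))
```

An empty list means the schema satisfies this rule; any names returned correspond to the variables that would cause typechecking to fail.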

Specifying Custom Default Property Values

The TERR operator uses ConfigurationChooserPropertyDescriptors for some of its properties. This means that it can read default values for these properties from the containing application's StreamBase configuration file.

To use this feature, an <adapter-configurations> section must be present in the configuration file, with at least one child element of the form <configuration type="terrstring">, where terrstring is one of terrConfigHome, terrConfigEngineParams, terrConfigJavaHome, terrConfigJavaOptions, or terrConfigInstance. In each of these configurations, list the choices to be presented in the form <choice id="valueX">valueX</choice>. Alternatively, indirection can be used. See the Javadoc documentation in the StreamBase Client API on ConfigurationChooserPropertyDescriptors for further information.

See the sbd.sbconf file in the TERR Operator sample for an example of this feature.
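
To show the shape of the configuration section described above, the following Python sketch generates an <adapter-configurations> fragment with the standard library. It follows only the element and attribute names given in this section; the function name build_adapter_configurations and the sample path are hypothetical, and the sbd.sbconf file in the sample remains the authoritative example.

```python
import xml.etree.ElementTree as ET


def build_adapter_configurations(choices_by_type):
    """Build an <adapter-configurations> fragment from {type: [choice, ...]}."""
    root = ET.Element("adapter-configurations")
    for config_type, choices in choices_by_type.items():
        config = ET.SubElement(root, "configuration", {"type": config_type})
        for value in choices:
            # Each choice uses the same text for its id and its content.
            choice = ET.SubElement(config, "choice", {"id": value})
            choice.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical installation path offered as a choice for the TERR Home property.
print(build_adapter_configurations({"terrConfigHome": ["/opt/tibco/terr32"]}))
```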

Suspend and Resume Behavior

On suspend, the TERR operator finishes processing the current tuple, outputs the result tuple, then pauses waiting for input.

On resumption, the TERR operator continues processing with the next input tuple.

The TERR instance remains running during suspend.

TERR Operator Sample

The StreamBase installation includes a sample demonstrating the use of this operator. To load the sample in StreamBase, select File → Load StreamBase Sample and look under the Extending StreamBase section for an entry called TERR Operator.