Using the TERR Operator

Introduction

The TIBCO StreamBase® TIBCO Enterprise Runtime for R operator (hereafter, the "TERR operator") allows StreamBase to use TIBCO's implementation of the R language to analyse and manipulate data.

Placing a TERR Operator on the Canvas

The TERR operator is a member of the Java Operators group in the Palette view in StreamBase Studio. Select the operator from the Insert an Operator or Adapter dialog. Invoke the dialog with one of the following methods:

  • Drag the Adapters, Java Operators token from the Operators and Adapters drawer of the Palette view to the canvas.

  • Click on the canvas where you want to place the operator, then invoke the keyboard shortcut O V.

  • From the top-level menu, invoke Insert → Operator → Java.

When the dialog is open, enter terr in the search field to narrow the list of operators. Then select the Enterprise Runtime for R operator.

Prerequisites

To run correctly, the operator requires a 64-bit version of TERR, version 2.7 or later, installed locally on the machine that runs StreamBase Server and your application. The TERR operator has been tested and validated with TERR versions 2.7, 3.0, 3.1, and 3.2.

The TERR bin directory does not need to be in the system PATH, and no environment variables are required. The TERR operator recognizes and honors the TERR_HOME environment variable if set, and if it points to the local TERR installation directory; however, setting TERR_HOME is not required.

TIBCO customers can download TERR from http://edelivery.tibco.com, or download an evaluation copy of TERR from the TIBCO Access Point.

For Linux

TERR is only provided for 64-bit Linux. Download the tar file provided; untar the file into a temporary local directory, and run the ./INSTALL file provided. The default installation directory is /opt/tibco/terrver, where ver is the TERR version number.

For Windows

Download the zip file provided; unzip the file to find a single installer executable. Run this installer and accept its suggested default location (C:\Program Files\TIBCO\terrver) or install into the currently recommended location (C:\TIBCO\terrver), where ver is the TERR version number.

On Windows, the TERR installer provides both 32-bit and 64-bit versions of the TERR runtime code. When run on 64-bit Windows, the 64-bit version of TERR is automatically used. Since StreamBase now supports only 64-bit Windows, it uses the 64-bit version of TERR.

To connect StreamBase and its TERR operator to your local TERR installation, you must either:

  • Set the TERR Home property in the Engine Options tab of each TERR operator's Properties view, providing the full, absolute path to the TERR installation directory.

  • Set the TERR_HOME environment variable to point to the full, absolute path to the TERR installation directory. Use this method if you anticipate using many TERR operator instances in your StreamBase applications.
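The lookup order implied by these two options can be modelled in a few lines of Python. This is an illustrative sketch, not the operator's code: the function name resolve_terr_home and its arguments are hypothetical, and the precedence shown (operator property first, then TERR_HOME, otherwise an error) follows the description of the TERR Home property in the Engine Options Tab section.

```python
import os


def resolve_terr_home(property_value, environ=None):
    """Return the TERR installation directory, or raise if none is configured.

    Hypothetical helper: property_value stands for the operator's TERR Home
    property; environ defaults to the process environment.
    """
    environ = os.environ if environ is None else environ
    if property_value:                      # the per-operator property wins
        return property_value
    env_value = environ.get("TERR_HOME")    # fall back to the environment variable
    if env_value:
        return env_value
    # The operator reports this situation as a typecheck error.
    raise ValueError("TERR Home is not set and TERR_HOME is not defined")
```

Passing the environment as an argument keeps the sketch easy to exercise without changing the real process environment.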

How the TERR Operator Works

This operator allows a stream of tuples to be operated on by an external TERR process, with the results then passed on as another stream of tuples. The operator can work in two different modes, static and dynamic. These mode names come from the way the input script is treated during the operation.

Under both modes of operation, the operator can be configured to watch the script and dataset files. If either one changes, it is reloaded before the next tuple is processed. This is useful during script development: you can keep the script open in an editor, make changes, and save the file before sending the next tuple. If a directory rather than a single file is specified for the datasets, the directory is watched for new or changed files that have the .RData extension. If any such change is seen, all affected files are reloaded. Note that if the file system is case sensitive (such as on OS X and Linux), then the extension must be exactly .RData. Files with extensions .rdata, .Rdata, and so on are not loaded.

Note

The file system monitoring feature of the TERR operator is supported on local file systems only, and not on remote mounts. This feature is based on code internal to the TERR operator, and does not depend on the TIBCO StreamBase® File Monitor Adapter.

A single TERR instance can be shared among multiple operators within a container, or each operator can have its own instance. If sharing, take care to ensure that the same initial dataset is specified for each use. This is because the TERR instance is started and initialized by the first instance of the operator to run, and the other operators will use the already existing TERR instance. The various parameters used in starting the TERR instance should also be the same, for the same reason. Startup parameters are discussed later.

Loading Files or Directories

The TERR operator lets you specify whether the item to watch is a directory or a file. A specified file to be loaded can have any file name. If it can be loaded, it is; otherwise, an error is logged.

If you specify a directory, only files with the .RData extension (note capitalization) are monitored; files of other types in the specified directory are ignored when they change, are added, or are deleted.
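The extension rule can be illustrated with a short Python sketch. It is a model of the rule only, under the assumption that the match is an exact string comparison on the extension; the function name rdata_files is hypothetical and this is not how the operator itself is implemented.

```python
from pathlib import Path


def rdata_files(directory):
    """List files in directory whose extension is exactly '.RData'."""
    # Path.suffix keeps the original capitalization, so the comparison below
    # is case sensitive regardless of the underlying file system.
    return sorted(p.name for p in Path(directory).iterdir()
                  if p.is_file() and p.suffix == ".RData")
```

With files named model.RData, other.rdata, third.Rdata, and notes.txt in the directory, only model.RData is selected.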

If a change is detected, the changed files are added to the changed list for all TERR instances that are available to the operator. Files can be added multiple times, but there can be at most one copy in the list at any time. When a TERR instance is acquired by an operator, the list is checked, the files loaded, and the list cleared. If a file is modified after it has been loaded, it loads again the next time the TERR instance is acquired.
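The bookkeeping described above (queue a changed file at most once, then load and clear when the instance is acquired) can be sketched as follows. The class and method names are hypothetical, and the load callback stands in for whatever the operator does to load a file into TERR.

```python
class ChangedFileList:
    """Per-instance list of files waiting to be reloaded (illustrative only)."""

    def __init__(self):
        self._pending = []          # insertion-ordered, no duplicates

    def mark_changed(self, path):
        # A file may be reported many times, but is queued at most once.
        if path not in self._pending:
            self._pending.append(path)

    def acquire(self, load):
        """Simulate an operator acquiring the instance: load, then clear."""
        to_load, self._pending = self._pending, []
        for path in to_load:
            load(path)
        return to_load
```

Because the list is cleared on acquisition, a file that changes again afterwards is queued again and reloaded on the next acquisition.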

Dynamic Mode Operation

Dynamic mode is the default mode of operation. In dynamic mode, input tuples are merged with the supplied script via textual replacement. With each input tuple, the values in the tuple are merged into the script by replacing each tag with the text value of the corresponding tuple field. This is done for each available tag in the script, and the script is then sent to the TERR process for execution. When the script finishes, the result is retrieved and translated into an output tuple. Tags are of the form $[tagname] where tagname is the name of the StreamBase field.

The script can be supplied as data in the operator itself, or it can be read from a resource file. The script is a standard R script with one addition: fields in the script that are to be replaced by tuple data must have the format $[name] where name is the name of an input tuple field. The data in the tuple can be any supported simple data type or a list of a supported simple data type. The supported StreamBase data types for input are: string, int, boolean, double, list, and timestamp (which is converted to a string), although any type can be used as long as it converts to a string. TERR result types supported are: string, int, boolean, double, array (list), byte, factor, and dataFrame.

Script substitution is done on a purely textual basis and it is up to R to parse the results. For example, the integer 3 and the float 3 both appear the same when inserted into a script. On input, lists are handled a little differently, with the outer brackets removed. For example, the list of integers [1,2,3] is converted to the string "1,2,3". Check the context of each substitution variable carefully. In general, if a substitution can be either a scalar value or a list, place it in a c() construct such as c($[var]), which handles both the scalar and the list case.

In addition, remember that TERR parses the merged script, and when TERR sees a number, it defaults to treating it as a double. Because of this behavior, you must write your script defensively. If you want an int to be returned, you must tell the parser so by coercing the value with the as.integer() function. (In R, integer(n) does not coerce; it creates a zero-filled vector of length n.) The same is true for any numeric type: unless you want a double, coerce the returned value to the desired type.

Strings and timestamps are pasted with their surrounding quotes, "string", and so are usable immediately as strings. Other non-numeric values, such as blob, function, capture, and tuple, arrive as strings without surrounding quotes, so you may need to add the quotes before use.

Booleans are treated specially. On input, a true is converted to TRUE, false is converted to FALSE and null is converted to NA; on output, the opposite conversion is done. For numbers, null is converted to NA.

An example script, along with some input tuples and their results, is shown next. First, a TERR script:

n <- $[names]
s <- $[vals]
b <- $[bools]
result <- data.frame(n, s, b)

The input tuple (names="one", vals=2, bools=true) results in:

n <- "one"
s <- 2
b <- TRUE
result <- data.frame(n, s, b)

Here, s is a double in TERR whether the StreamBase field type is double or int. To make sure s is an integer, coerce it with as.integer() in the script, using either the scalar form or the c() form:

s <- as.integer($[vals])
s <- as.integer(c($[vals]))

The input tuple (names=["one","two"], vals=[2,4], bools=[true,null]) results in a TERR parse error, because by default TERR considers each comma as the start of a new field. The solution is to enclose the field variables in c() constructs in your TERR script, like so:

n <- c($[names])
s <- c($[vals])
b <- c($[bools])
result <- data.frame(n, s, b)

In this case, with the same input tuple as above, the results are:

n <- c("one", "two")
s <- c(2, 4)
b <- c(TRUE, NA)
result <- data.frame(n, s, b)
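The textual merge shown in these examples can be modelled with a small Python sketch. It is not the operator's implementation: the names TAG, to_r_text, and merge are hypothetical, the tag pattern assumes field names made of letters, digits, and underscores, list items are joined with a comma and a space to match the expanded scripts above, and quotes inside string values are not escaped.

```python
import re

TAG = re.compile(r"\$\[(\w+)\]")    # matches $[tagname]


def to_r_text(value):
    """Render one StreamBase value as the text pasted into the script."""
    if value is None:
        return "NA"                      # documented for booleans and numbers
    if isinstance(value, bool):          # test bool before int: bool is an int in Python
        return "TRUE" if value else "FALSE"
    if isinstance(value, str):
        return '"' + value + '"'         # strings keep their surrounding quotes
    if isinstance(value, list):          # lists lose their outer brackets
        return ", ".join(to_r_text(v) for v in value)
    return str(value)


def merge(script, fields):
    """Replace every $[name] tag with the text of the matching tuple field."""
    return TAG.sub(lambda m: to_r_text(fields[m.group(1)]), script)
```

Calling merge("n <- c($[names])", {"names": ["one", "two"]}) yields the text n <- c("one", "two"), which is why wrapping a tag in c() works for both scalar and list fields.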

Static Mode Operation

Static mode is similar to dynamic operation, the difference being in how tuples are sent to TERR. Where dynamic mode uses text substitution, in static mode, the script is not modified. Instead, the fields in the input tuple are converted directly to global TERR variables. The script is then run in that environment and the result variable retrieved and converted to the output tuple. This allows the script to be very short; a simple function call is sufficient as long as the function is defined in the initially loaded model. Having the values directly converted to TERR variables greatly increases both the speed of processing and the size of the input that can be processed for each tuple.

A change in the input schema is required to implement this. All the tuple entries that are to be read into the TERR process must be in a top-level tuple named terrVars. Each element in this tuple is converted to a TERR variable. If it is a simple type (int, long, double, string, bool) or a list of a simple type, it is converted automatically. If it is a complex or enhanced type (DataFrame, Factor, or a simple type with a names column), a tuple using a special schema must be provided.

A list of ints can be sent using the tuple (1) or (list (1, 2, 3)) or the enhanced form (tuple myInts (names=["one","two"], vals=[1,2], terrType="integer")). The supported terrType values are integer, double, logical, string, list, byte, dataFrame, and factor. Note that timestamp is not available in this mode; timestamps must be converted to strings in StreamBase before being sent to the operator. Use the Input proposed schemas button to see the actual formats.
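As a rough illustration of the enhanced form, the following Python sketch builds one terrVars entry as a plain dictionary and rejects terrType values outside the supported list. The field names names, vals, and terrType come from the text above; the function name enhanced_var and the use of dictionaries to stand for StreamBase tuples are assumptions made for the example, and the special schemas for dataFrame and factor are not modelled.

```python
SUPPORTED_TERR_TYPES = {"integer", "double", "logical", "string",
                        "list", "byte", "dataFrame", "factor"}


def enhanced_var(names, vals, terr_type):
    """Build the enhanced form of one terrVars entry as a plain dict."""
    if terr_type not in SUPPORTED_TERR_TYPES:
        raise ValueError("unsupported terrType: " + terr_type)
    return {"names": list(names), "vals": list(vals), "terrType": terr_type}


# A static-mode input tuple: everything TERR should see lives under terrVars.
input_tuple = {
    "terrVars": {"myInts": enhanced_var(["one", "two"], [1, 2], "integer")}
}
```

A timestamp value would be rejected by this check, which mirrors the requirement to convert timestamps to strings before they reach the operator.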

Once the variables have been sent to the TERR process, the script is executed and the result is retrieved.

Properties View Settings

This section describes the properties you can set for the TERR operator, using the various tabs of the Properties view in StreamBase Studio.

In the tables in this section, the Property column shows each property name as found in the one or more adapter properties tabs of the Properties view for this adapter.

Use the StreamSQL names of the adapter's properties when using this adapter in a StreamSQL program with the APPLY JAVA statement.

General Tab

Name: Use this field to specify or change the component's name, which must be unique in the application. The name must contain only alphabetic characters, numbers, and underscores, and no hyphens or other special characters. The first character must be alphabetic or an underscore.

Operator: A read-only field that shows the formal name of the operator.

Class: A field that shows the fully qualified class name that implements the functionality of this operator. Use this class name when loading the operator in StreamSQL programs with the APPLY JAVA statement. You can right-click this field and select Copy from the context menu to place the full class name in the system clipboard.

Start with application: If this field is set to Yes or to a module parameter that evaluates to true, an instance of this operator starts as part of the containing StreamBase Server. If this field is set to No or to a module parameter that evaluates to false, the adapter is loaded with the server, but does not start until you send an sbadmin resume command, or until you start the component with StreamBase Manager. With this option set to No or false, the operator does not start even if the application as a whole is suspended and later resumed. The recommended setting is selected by default.

Enable Error Output Port: Select this check box to add an Error Port to this component. In the EventFlow canvas, the Error Port shows as a red output port, always the last port for the component. See Using Error Ports and Error Streams to learn about Error Ports.

Description: Optionally enter text to briefly describe the component's purpose and function. In the EventFlow canvas, you can see the description by pressing Ctrl while the component's tooltip is displayed.

Operator Properties Tab

Property | Data Type | Default | Description | StreamSQL Property
Log Level | drop-down list | INFO | Controls the level of verbosity the adapter uses to send notifications to the console. This setting can be higher than the containing application's log level. If set lower, the system log level will be used. Available values, in increasing order of verbosity, are: OFF, ERROR, WARN, INFO, DEBUG, TRACE, and ALL. | LogLevel
Reload files when changed | check box | Cleared | If selected, the data file and the script file (if selected) are monitored for changes. If either one changes, it is loaded the next time a tuple is to be processed. | WatchFiles
Output Status Tuples | check box | Cleared | Select this check box to have a status tuple emitted on the status output stream for each input tuple. The status tuple includes any errors generated by the script for this tuple. | SendStatusTuples
Enable command port | check box | Cleared | Enables an input stream that allows control of the TERR instance. The input schema is a string item named command. The only currently accepted value is reset, which causes the TERR instance to be restarted. | EnableCommands
Enable passthrough | check box | Cleared | When enabled, the input tuple is mirrored to the output unless a custom output schema is used. | EnablePassthrough
Enable telemetry | check box | Cleared | When enabled, a tuple is emitted on the status output giving timings for processing the current tuple. | EnableTelemetry
Processing mode | Radio button | Dynamic | Controls which mode is used to process the tuple. | ProcessingMode
TERR instance to use | string | None | A comma-delimited list of the TERR instances assigned to this operator instance. | WhichTERRInstances

Model Tab

Property | Data Type | Default | Description | StreamSQL Property
Load saved R datasets from file into engine | check box | Cleared | Determines whether an initial dataset is loaded into the TERR instance when started. If one is initially loaded and Reload files when changed is also selected, the dataset is reloaded if changed on disk. | LoadModel
Load entire Directory | check box | Cleared | Determines whether the initial dataset is a single file or all the .RData files in the specified directory. Note that if a directory is specified, only files with the extension .RData are loaded or watched (capitalization matters on file systems that are case sensitive). | DataIsDir
RData file | drop-down list | None | The name of a resource file to load on TERR initialization. The drop-down lists all the resource files on the current project's resource search path. Only available if Load entire Directory is cleared. | ScriptModel
Directory of RData files | drop-down list | None | The directory containing the files to be loaded on TERR initialization. The drop-down contains all the resource directories in the project. All files with .RData extensions in this directory are loaded into the TERR process on initialization. | ModelDirectory

Script Options Tab

Property | Data Type | Default | Description | StreamSQL Property
Result variable | string | None | The name of the R variable that is to be retrieved as the result of the script. | ResultVarName
Character Set | Drop-down list | UTF-8 | Specifies the character set to be used when reading a script file from disk. | TerrCharset
Script Source | Radio button | Script Text | The source from which to get the script to send to TERR. File: read the script from a resource file. Script text: use the local text box to enter the script. | ScriptSource
Script file | Drop-down list | None | Active only when Script Source is File. In the drop-down list, select the resource file that contains the script. | ScriptLocation
Script text | string | None | Active only when Script Source is Script text. Specifies content of the script to be used to process input tuples. | ScriptText

Engine Options Tab

These options are used to set up the TERR instance.

Property | Data Type | Default | Description | StreamSQL Property
TERR Home | string | None | Specifies the full path to the directory in which TERR is installed. If no value is specified, the operator uses the value of the TERR_HOME environment variable, if present. If neither this field nor TERR_HOME is specified, the result is a typecheck error. | TerrHome
Processor Affinity | string | None | A zero-based, comma-separated list of integers representing processor cores that the TERR process should execute on. Hyperthreaded cores count as cores. An invalid core number for the current CPU causes a typecheck error. For example, on a four-core (2C, 2T) machine, the entry 2,3 specifies an affinity to run on the third and fourth cores. | ProcessorAffinity
TERR engine parameters | string | None | A parameter string sent to the TERR engine on initialization. See the TERR documentation for usage. Either type a value in the field, or select from a list of values you have entered in the containing project's sbconf file as <adapter-configurations> elements. | TerrEngineParameters
TERR environment | grid | None | A set of key-value pairs that are used in the initial startup environment for the TERR engine. See the TERR documentation for usage. | TerrEnvironment
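The format of the Processor Affinity value can be illustrated with a short Python sketch. The function name parse_affinity is hypothetical, whitespace around the commas is tolerated as an assumption, and the ValueError stands in for the typecheck error the operator reports for an invalid core number.

```python
def parse_affinity(text, core_count):
    """Parse a zero-based, comma-separated core list such as '2,3'."""
    cores = []
    for item in text.split(","):
        core = int(item.strip())            # raises ValueError on non-integers
        if not 0 <= core < core_count:
            raise ValueError("invalid core number: %d" % core)
        cores.append(core)
    return cores
```

On a machine with four cores, the entry 2,3 parses to the third and fourth cores, while the entry 4 is rejected because core numbers start at zero.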

Java Options Tab

These options are used to adjust the Java environment for the TERR instance.

Property | Data Type | Default | Description | StreamSQL Property
Java installation to use | Radio button | StreamBase | Determines where to get the Java version to use to run TERR. In most cases the default StreamBase is appropriate, but it is not required that the TERR operator uses the same Java version as StreamBase, because they run in different JVM instances. StreamBase: use the same Java version as StreamBase. Custom: use the directory entered in the next option as the JavaHome for TERR. | UseSBJava
Java home | string | Cleared | Active only when Java installation to use is set to Custom. Specifies the directory containing the Java installation to use to run the TERR instance. | JavaHOME

Schemas Tab

Use the Schemas tab to specify the schema of the output tuple for this adapter.

For general instructions on using the Edit Schema tab, see the Properties: Edit Schema Tab section of the Defining Input Streams page.

The Import proposed schemas link can be used to import schemas for the various TERR output types.

A custom schema should use the same field names as the generic schema does, because the operator uses those field names to fill the output tuple.

Concurrency Tab

Use the Concurrency tab to specify parallel regions for this instance of this component, or multiplicity options, or both. The Concurrency tab settings are described in Concurrency Options, and dispatch styles are described in Dispatch Styles.

Caution

Concurrency settings are not suitable for every application, and using these settings requires a thorough analysis of your application. For details, see Execution Order and Concurrency, which includes important guidelines for using the concurrency options.

The TERR operator has been enhanced to work with a multiplicity greater than 1 and parallelism enabled. If you assign multiple named instances of TERR to a single operator and set the component to run in a parallel region, then each instance of the operator uses its own TERR instance, if sufficient instances were assigned. If fewer TERR instances were assigned than there are parallel operator instances, the ones available are shared among the operator instances. If more are assigned, the excess instances are unused (but will have been created). It is therefore important to assign the correct number of TERR instances.
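The three cases above (enough, too few, and too many TERR instances) can be illustrated with a small Python model. The function name assign_instances is hypothetical, and the round-robin sharing it uses is an assumption: the text only states that the available instances are shared, not how.

```python
def assign_instances(terr_instances, operator_count):
    """Map each parallel operator instance (by index) to a TERR instance name."""
    if not terr_instances:
        raise ValueError("at least one TERR instance is required")
    # Round-robin sharing is an assumption; the source only says that the
    # available instances are shared when there are too few.
    mapping = [terr_instances[i % len(terr_instances)]
               for i in range(operator_count)]
    unused = [name for name in terr_instances if name not in mapping]
    return mapping, unused
```

With three operator instances and two TERR instances, one TERR instance serves two operators; with two operator instances and three TERR instances, the third TERR instance is created but never used.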

Input Ports

The TERR operator has two input ports: a data port, and an optional command port.

Data Port

There are two possible schemas for the data port. The first is for when the operator is in dynamic mode, where the schema is determined by the TERR script loaded at operator start time. Any updated script loaded dynamically during run time must specify inputs that have at least the same field names and data types as the initially loaded script. There may be additional fields, which are ignored.

At run time, input tuples must have at least one field whose name and data type exactly match one of the fields specified by the TERR script. Input tuples do not need to fill all fields in the TERR script, and the field order of input tuples does not need to match the TERR script's field order.

The second schema for the data port, used in static mode, must have a tuple field named terrVars which contains the variables to be sent to the TERR process. Any other fields are ignored.

Command Port

The command port has one required item, a string named command which currently accepts a single value, reset. This causes the TERR instance to be stopped and restarted.

Output Ports

The TERR operator has two output ports: an optional status port, and a result port.

Status Port

The status port emits tuples that describe the processing status for each input tuple. It is only present when the Output Status Tuples property is selected. The schema of the output tuple consists of four strings:

Field Name | Field Type | Description
Type | String | The type of report, usually Status.
Action | String | The action that caused the report.
Message | String | The result reported by TERR of running the script. Examples: Success, Parse Error.
Object | String | Any extra information about the operator.

Result Port

The default schema of the result port consists of:

  • The input tuple, which is passed through unchanged if the Enable Passthrough property is set.

  • A field named terrResult, which is a tuple of tuples, and contains the results of TERR processing the provided script for each input tuple.

Because the TERR operator cannot know the data type of the TERR result in advance, the terrResult field contains a subfield for each possible TERR result data type. Only one terrResult subfield is filled in per input tuple, depending on the value assigned to the result variable by the script. The other subfields are left empty (null) for each input tuple.

The top-level schema of the terrResult field is shown in the following table:

Subfield Name | Field Type | Description
double | tuple | The result was a double or array of doubles.
integer | tuple | The result was an integer or array of integers.
boolean | tuple | The result was a boolean or array of booleans.
string | tuple | The result was a string or list of strings.
dataFrame | tuple | The result was an R dataFrame, which is comparable to a StreamBase tuple.
byte | tuple | The result was a byte or array of bytes (returned as StreamBase ints).
list | tuple | The result was an R array, comparable to a StreamBase list.
factor | tuple | The result was an R factor.

Each terrResult subfield is a tuple of lists. The first third-level field of every terrResult subfield is named names. This is a list of one or more names of the result fields of the input script.

The scalar subfields

The five scalar subfields of terrResult are double, integer, boolean, string, and byte. The schema of each of these subfields is the same: a list of returned script field names and a list of returned values corresponding to each of the names. If the result is a scalar, it is still returned as a list of one item. If the result is a multi-dimensional array, it is flattened to a vector.

Third-level Field Name | Type | Description
names | list of strings | List of one or more TERR script field names.
values | list of type | List of returned values for each script field in names.
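The shape of a scalar result can be sketched in Python, with dictionaries standing in for StreamBase tuples. The subfield and field names follow the tables above; the function name scalar_result is hypothetical, and flattening of multi-dimensional arrays is not modelled.

```python
SUBFIELDS = ("double", "integer", "boolean", "string",
             "dataFrame", "byte", "list", "factor")


def scalar_result(subfield, names, values):
    """Build a terrResult-shaped dict with exactly one subfield filled in."""
    if subfield not in ("double", "integer", "boolean", "string", "byte"):
        raise ValueError("not a scalar subfield: " + subfield)
    if not isinstance(values, list):
        values = [values]                  # a scalar is returned as a list of one
    result = {name: None for name in SUBFIELDS}   # unused subfields stay null
    result[subfield] = {"names": list(names), "values": values}
    return result
```

For a script whose result variable holds the single double 3.5, only the double subfield is filled in, and its values list has one element.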

The dataFrame subfield

The dataFrame subfield of terrResult consists of a list of names, plus zero or more lists of integers, doubles, logicals (booleans), factors, strings, or bytes. If more than one of a type occurs, the resulting list contains the concatenation of the lists.

Third-level Field Name | Type | Description
names | list of strings | List of one or more TERR script field names.
integers | list of tuples | List of zero or more returned integer name-value pairs.
doubles | list of tuples | List of zero or more returned double name-value pairs.
logicals | list of tuples | List of zero or more returned boolean name-value pairs.
factors | list of tuples | List of zero or more returned R factor name-index-level triplets.
strings | list of tuples | List of zero or more returned string name-value pairs.
bytes | list of tuples | List of zero or more returned byte name-value pairs.

The list subfield

The list subfield of terrResult consists of a list of names, plus zero or more lists of tuples of the five scalar types plus factors. The schema of the list subfield is the same as for the dataFrame subfield.

The factor subfield

The factor subfield of terrResult consists of three lists: names, indexes, and levels. See the TERR documentation for an explanation of the R factor data type.

Third-level Field Name | Type | Description
names | list of strings | List of one or more TERR script field names.
indexes | list of integers | List of zero or more returned index values.
levels | list of strings | List of zero or more returned level values.

You can specify a custom schema that contains only the result data you know to expect. For a custom schema, the names of the fields must be the same as in the default schema described above, and the sub-schemas must also match exactly.

Typechecking and Error Handling

Typechecking fails if any required fields are not filled in. It also fails if the input schema does not contain all the replacement variables that the script needs. All specified dataset and script files and directories are checked for existence and typechecking fails if any file or directory is not accessible. A result variable must be present, although the TERR script is not checked to see if it uses it. A script must be specified either as local data or as a resource file. The TerrHome parameter must be set so that the process can be started.
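The replacement-variable check can be approximated outside StreamBase with a few lines of Python, for example to review a script before deploying it. This is a sketch only: the function name missing_tags is hypothetical, and the tag pattern assumes field names made of letters, digits, and underscores.

```python
import re


def missing_tags(script, schema_fields):
    """Return the $[name] tags in script that the input schema does not supply."""
    tags = re.findall(r"\$\[(\w+)\]", script)
    wanted = list(dict.fromkeys(tags))         # unique, in order of first use
    return [name for name in wanted if name not in schema_fields]
```

An empty result means every tag in the script has a matching input field; extra schema fields that the script does not reference are not reported.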

All errors in the execution of the script are logged and an optional status tuple is emitted.

Specifying Custom Default Property Values

The TERR operator uses ConfigurationChooserPropertyDescriptors for some of its properties. This means that it can read default values for these properties from the containing application's StreamBase configuration file.

To use this feature, an <adapter-configurations> section must be present in the configuration file, with at least one child element of the form <configuration type="terrstring">, where terrstring is one of terrConfigHome, terrConfigEngineParams, terrConfigJavaHome, terrConfigJavaOptions, or terrConfigInstance. In each of these configurations, list the choices to be presented in the form <choice id="valueX">valueX</choice>. Alternatively, indirection can be used. See the Javadoc documentation in the StreamBase Client API on ConfigurationChooserPropertyDescriptors for further information.

See the sbd.sbconf file in the TERR Operator sample for an example of this feature.

Suspend and Resume Behavior

On suspend, the TERR operator finishes processing the current tuple, outputs the result tuple, then pauses waiting for input.

On resumption, the TERR operator continues processing with the next input tuple.

The TERR instance remains running during suspend.

TERR Operator Sample

The StreamBase installation includes a sample demonstrating the use of this operator. To load the sample in StreamBase, select File → Load StreamBase Sample and look under the Extending StreamBase section for an entry called TERR Operator.