The TIBCO StreamBase® TIBCO Enterprise Runtime for R operator (hereafter, the "TERR operator") allows StreamBase to use TIBCO's implementation of the R language to analyse and manipulate data.
The TERR operator is a member of the Java Operators group in the Palette view in StreamBase Studio. Select the operator from the Insert an Operator or Adapter dialog. Invoke the dialog with one of the following methods:
-
Drag the Adapters, Java Operators token from the Operators and Adapters drawer of the Palette view to the canvas.
-
Click on the canvas where you want to place the operator, then invoke the keyboard shortcut O V.
-
From the top-level menu, invoke → → .
When the dialog is open, enter terr in the search field to narrow the list of operators. Then select the Enterprise Runtime for R operator.
To run correctly, the operator requires a 64-bit installation of TERR version 2.7 or later on the machine that runs StreamBase Server and your application. The TERR operator has been tested and validated with TERR versions 2.7, 3.0, 3.1, and 3.2.
The TERR bin directory does not need to be in the system PATH, and no environment
variables are required. The TERR operator recognizes and honors the TERR_HOME
environment variable if set, and if it points to the local TERR installation
directory; however, setting TERR_HOME is not required.
TIBCO customers can download TERR from http://edelivery.tibco.com, or download an evaluation copy of TERR from the TIBCO Access Point.
- For Linux
-
TERR is only provided for 64-bit Linux. Download the tar file provided; untar the file into a temporary local directory, and run the ./INSTALL file provided. The default installation directory is /opt/tibco/terrver, where ver is the TERR version number.
- For Windows
-
Download the zip file provided; unzip the file to find a single installer executable. Run this installer and accept its suggested default location (C:\Program Files\TIBCO\terrver) or install into the currently recommended location (C:\TIBCO\terrver), where ver is the TERR version number.

On Windows, the TERR installer provides both 32-bit and 64-bit versions of the TERR runtime code. When run on 64-bit Windows, the 64-bit version of TERR is automatically used. Since StreamBase now supports only 64-bit Windows, it uses the 64-bit version of TERR.
To connect StreamBase and its TERR operator to your local TERR installation, you must either:
-
Set the TERR Home property in the Engine Options tab of each TERR operator's Properties view, providing the full, absolute path to the TERR installation directory.
-
Set the TERR_HOME environment variable to point to the full, absolute path to the TERR installation directory. Use this method if you anticipate using many TERR operator instances in your StreamBase applications.
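The lookup order implied by these two options (the operator property first, then the TERR_HOME environment variable) can be sketched as follows. This is only an illustration in Python, not the operator's actual code; the helper name resolve_terr_home is hypothetical.

```python
import os


def resolve_terr_home(property_value, environ=None):
    """Return the TERR installation directory, or None if it is not configured.

    Hypothetical helper: prefers the operator's TERR Home property and
    falls back to the TERR_HOME environment variable.
    """
    env = os.environ if environ is None else environ
    if property_value:  # a value set on the operator wins
        return property_value
    return env.get("TERR_HOME") or None  # an empty string counts as unset
```

A None result corresponds to the case in which neither value is available, which the operator reports as a typecheck error.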
This operator allows a stream of tuples to be operated on by an external TERR process, with the results then passed on as another stream of tuples. The operator can work in two different modes, static and dynamic. These mode names come from the way the input script is treated during the operation.
Under both modes of operation, the operator can be configured to watch the script
and dataset files. If either one changes, it is reloaded before the next tuple is
processed. This is useful for script development: you can keep the script open in
an editor, make changes, and save the file before sending a new tuple. If a
directory rather than a single file is specified for the datasets, the directory is
watched for additions of and changes to any file that has the .RData extension. If
any such change is seen, all affected files are reloaded. Note that if the file
system is case sensitive (such as on OS X and Linux), then the extension must be
exactly .RData. Files with extensions .rdata, .Rdata, and so on are not loaded.
Note
The file system monitoring feature of the TERR operator is supported on local file systems only, and not on remote mounts. This feature is based on code internal to the TERR operator, and does not depend on the TIBCO StreamBase® File Monitor Adapter.
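The extension rule can be illustrated with a short sketch. The Python below is not the operator's code; select_rdata is a hypothetical helper that mimics the exact-match behavior described above for case-sensitive file systems.

```python
def select_rdata(filenames):
    """Keep only the names whose extension is exactly .RData (case-sensitive)."""
    return [name for name in filenames if name.endswith(".RData")]


# Example: only the first name would be loaded or watched.
candidates = ["model.RData", "old.rdata", "other.Rdata", "notes.txt"]
selected = select_rdata(candidates)
```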
A single TERR instance can be shared among multiple operators within a container, or each operator can have its own instance. If sharing, take care to ensure that the same initial dataset is specified for each use. This is because the TERR instance is started and initialized by the first instance of the operator to run, and the other operators will use the already existing TERR instance. The various parameters used in starting the TERR instance should also be the same, for the same reason. Startup parameters are discussed later.
The TERR operator lets you specify whether the item to watch is a directory or a file. A specified file to be loaded can have any file name. If it can be loaded, it is; otherwise, an error is logged.
If you specify a directory, only files with .RData extensions (note capitalization)
are monitored; other file types in the specified directory that change, are added,
or are deleted are ignored.
If a change is detected, the changed files are added to the changed list for all TERR instances that are available to the operator. Files can be added multiple times, but there can be at most one copy in the list at any time. When a TERR instance is acquired by an operator, the list is checked, the files loaded, and the list cleared. If a file is modified after it has been loaded, it loads again the next time the TERR instance is acquired.
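The bookkeeping described in this paragraph can be modeled as a small per-instance list. The sketch below is a hypothetical Python model of that behavior (the class and method names are invented for illustration), not the operator's implementation.

```python
class ChangedFileList:
    """Pending reloads for one TERR instance (hypothetical structure)."""

    def __init__(self):
        self._pending = []  # insertion-ordered, no duplicates

    def mark_changed(self, path):
        # A file may be reported many times but is listed at most once.
        if path not in self._pending:
            self._pending.append(path)

    def acquire(self, load):
        """Called when an operator acquires the instance: load, then clear."""
        to_load, self._pending = self._pending, []
        for path in to_load:
            load(path)
        return to_load
```

A file that changes again after acquire() simply re-enters the list and is loaded on the next acquisition.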
Dynamic mode is the default mode of operation. In dynamic mode, input tuples are
merged with the supplied script via textual replacement. With each input tuple, the
values in the tuple are merged into the script by replacing a tag with the text
value of the corresponding tuple field. This is done for each available tag in the
script, and the script is then sent to the TERR process for execution. When the
script finishes, the result is retrieved and translated into an output tuple. Tags
are of the form $[tagname], where tagname is the name of the StreamBase field.
The script can be supplied as data in the operator itself, or it can be read from a
resource file. The script is a standard R script with one addition: fields in the
script that are to be replaced by tuple data must have the format $[name],
where name is the name of an input tuple field. The data in the tuple can be
any supported simple data type or a
list of a supported simple data type. The supported StreamBase data types for input
are: string, int, boolean, double, list, and timestamp (which is converted to a
string), although any type can be used as long as it converts to a string. TERR
result types supported are: string, int, boolean, double, array (list), byte,
factor, and dataFrame.
Script substitution is done on a purely textual basis and it is up to R to parse
the results. For example, the integer 3 and the float 3 both appear the same when
inserted into a script. On input, lists are handled a little differently, with the
outer brackets removed. For example, the list of integers [1,2,3] is converted to
the string "1,2,3". Check the context in which each substitution variable is used.
In general, if a substitution can be either a scalar value or a list, it should
probably be placed in a c() construct such as c($[var]), which handles both scalar
and list cases.
In addition, remember that the TERR parser parses the substituted script, and when TERR sees a number, it defaults to treating it as a double. Because of this behavior, you must write your script defensively. If you want an int to be returned, you must say so explicitly; do this by using the as.integer() function. The same is true for any numeric type: unless you want a double, coerce the returned value to the desired type.
Strings and timestamps are pasted with their surrounding quotes, as in "string", and so are usable immediately as strings. Other non-numeric values, such as blob, function, capture, and tuple, arrive as a string without the surrounding quotes, so you may need to add quotes before use.
Booleans are treated specially. On input, true is converted to TRUE, false is
converted to FALSE, and null is converted to NA; on output, the opposite conversion
is done. For numbers, null is converted to NA.
An example script, along with some input tuples and results, is shown next. First, a TERR script:

    n <- $[names]
    s <- $[vals]
    b <- $[bools]
    result <- data.frame(n, s, b)

The input tuple (names="one", vals=2, bools=true) results in:

    n <- "one"
    s <- 2
    b <- TRUE
    result <- data.frame(n, s, b)
Here, the type of s is double for either numeric StreamBase type, double or int. To make sure s is an integer, use one of the following forms; the second also handles list input:

    s <- as.integer($[vals])
    s <- as.integer(c($[vals]))
The input tuple (names=["one","two"], vals=[2,4], bools=[true,null]) results in a
TERR parse error, because by default TERR considers each comma as the start of a
new field. The solution is to enclose the field variables in c() constructs in your
TERR script, like so:

    n <- c($[names])
    s <- c($[vals])
    b <- c($[bools])
    result <- data.frame(n, s, b)

In this case, with the same input tuple as above, the results are:

    n <- c("one", "two")
    s <- c(2, 4)
    b <- c(TRUE, NA)
    result <- data.frame(n, s, b)
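The substitution rules and the example above can be mimicked outside StreamBase with a few lines of Python. This is a rough sketch for experimenting with scripts, not the operator's implementation: the function names are hypothetical, the ", " list separator is an assumption, quotes inside strings are not escaped, timestamps are not handled, and a field missing from the tuple raises a KeyError.

```python
import re


def to_r_text(value):
    """Render one StreamBase-style value as the text pasted into the script."""
    if value is None:
        return "NA"
    if isinstance(value, bool):  # checked before numbers: bool is an int in Python
        return "TRUE" if value else "FALSE"
    if isinstance(value, str):
        return '"' + value + '"'  # strings arrive with surrounding quotes
    if isinstance(value, list):
        # Lists lose their outer brackets; the script supplies c(...) itself.
        return ", ".join(to_r_text(item) for item in value)
    return str(value)


def substitute(script, fields):
    """Replace every $[name] tag with the text form of the matching field."""
    return re.sub(r"\$\[(\w+)\]", lambda m: to_r_text(fields[m.group(1)]), script)


script = "n <- c($[names])\ns <- c($[vals])\nb <- c($[bools])"
merged = substitute(
    script, {"names": ["one", "two"], "vals": [2, 4], "bools": [True, None]}
)
```

Printing merged shows the script text that would be handed to TERR, which makes it easier to spot a tag that needs a c() wrapper or an explicit type coercion.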
Static mode is similar to dynamic operation, the difference being in how tuples are sent to TERR. Where dynamic mode uses text substitution, in static mode, the script is not modified. Instead, the fields in the input tuple are converted directly to global TERR variables. The script is then run in that environment and the result variable retrieved and converted to the output tuple. This allows the script to be very short; a simple function call is sufficient as long as the function is defined in the initially loaded model. Having the values directly converted to TERR variables greatly increases both the speed of processing and the size of the input that can be processed for each tuple.
A change in the input schema is required to implement this. All the tuple entries
that are to be read into the TERR process must be in a top-level tuple named
terrVars. Each element in this tuple is converted to a TERR variable. If it is a
simple type (int, long, double, string, bool) or a list of a simple type, it is
automatically converted. If it is a complex or enhanced type (DataFrame, Factor, or
a simple type with a names column), a tuple using a special schema must be provided.
A list of ints can be sent using the tuple (1) or (list (1, 2, 3)) or the enhanced
form (tuple myInts (names=["one","two"], vals=[1,2], terrType="integer")). The
supported terrType values are integer, double, logical, string, list, byte,
dataFrame, and factor.
Note that timestamp is not available in this mode; timestamps must be converted to
strings in StreamBase before being sent to the operator. Use the Input proposed
schemas button to see the actual formats.
Once the variables have been sent to the TERR process, the script is executed and the result is retrieved.
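As a mental model only, a static-mode input tuple can be pictured as a nested structure in which everything destined for TERR sits under terrVars. The Python sketch below uses plain dicts; the helper enhanced_var and the variable names are hypothetical, the equal-length check on names and vals is an assumption, and the real field layout should be taken from the proposed schemas in StreamBase Studio.

```python
SUPPORTED_TERR_TYPES = {
    "integer", "double", "logical", "string",
    "list", "byte", "dataFrame", "factor",
}


def enhanced_var(names, vals, terr_type):
    """Build the enhanced form of one terrVars entry as a plain dict."""
    if terr_type not in SUPPORTED_TERR_TYPES:
        raise ValueError("unsupported terrType: " + terr_type)
    if len(names) != len(vals):  # assumption: one name per value
        raise ValueError("names and vals must have the same length")
    return {"names": names, "vals": vals, "terrType": terr_type}


# Everything TERR should see lives under terrVars; other fields are ignored.
input_tuple = {
    "terrVars": {
        "count": 1,            # simple type, converted automatically
        "samples": [1, 2, 3],  # list of a simple type
        "myInts": enhanced_var(["one", "two"], [1, 2], "integer"),
    },
    "ignored": "fields outside terrVars are not sent to TERR",
}
```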
This section describes the properties you can set for the TERR operator, using the various tabs of the Properties view in StreamBase Studio.
In the tables in this section, the Property column shows each property name as it appears in the tabs of the Properties view for this operator.
Use the StreamSQL names of the adapter's properties when using this adapter in a StreamSQL program with the APPLY JAVA statement.
Name: Use this field to specify or change the component's name, which must be unique in the application. The name must contain only alphabetic characters, numbers, and underscores, and no hyphens or other special characters. The first character must be alphabetic or an underscore.
Operator: A read-only field that shows the formal name of the operator.
Class: A field that shows the fully qualified class name that implements the functionality of this operator. Use this class name when loading the operator in StreamSQL programs with the APPLY JAVA statement. You can right-click this field and select Copy from the context menu to place the full class name in the system clipboard.
Start with application: If this field is set to Yes or to a module parameter that evaluates to true, an instance of this operator starts as part of the containing StreamBase Server. If this field is set to No or to a module parameter that evaluates to false, the adapter is loaded with the server, but does not start until you send an sbadmin resume command, or until you start the component with StreamBase Manager. With this option set to No or false, the operator does not start even if the application as a whole is suspended and later resumed. The recommended setting is selected by default.
Enable Error Output Port: Select this check box to add an Error Port to this component. In the EventFlow canvas, the Error Port shows as a red output port, always the last port for the component. See Using Error Ports and Error Streams to learn about Error Ports.
Description: Optionally enter text to briefly describe the component's purpose and function. In the EventFlow canvas, you can see the description by pressing Ctrl while the component's tooltip is displayed.
Property | Data Type | Default | Description | StreamSQL Property |
---|---|---|---|---|
Log Level | drop-down list | INFO | Controls the level of verbosity the adapter uses to send notifications to the console. This setting can be higher than the containing application's log level. If set lower, the system log level will be used. Available values, in increasing order of verbosity, are: OFF, ERROR, WARN, INFO, DEBUG, TRACE, and ALL. | LogLevel |
Reload files when changed | check box | Cleared | If selected, the data file and the script file (if selected) are monitored for changes. If either one changes, it is loaded the next time a tuple is to be processed. | WatchFiles |
Output Status Tuples | check box | Cleared | Select this check box to have a status tuple emitted on the status output stream for each input tuple. The status tuple includes any errors generated by the script for this tuple. | SendStatusTuples |
Enable command port | check box | Cleared | Enables an input stream that allows control of the TERR instance. The input schema is a string item named command. The only currently accepted value is reset, which causes the TERR instance to be restarted. | EnableCommands |
Enable passthrough | check box | Cleared | When enabled, the input tuple is mirrored to the output unless a custom output schema is used. | EnablePassthrough |
Enable telemetry | check box | Cleared | When enabled, a tuple is emitted on the status output giving timings for processing the current tuple. | EnableTelemetry |
Processing mode | Radio button | Dynamic | Controls which mode is used to process the tuple. | ProcessingMode |
TERR instance to use | string | None | A comma-delimited list of the TERR instances assigned to this operator instance. | WhichTERRInstances |
Property | Data Type | Default | Description | StreamSQL Property |
---|---|---|---|---|
Load saved R datasets from file into engine | check box | Cleared | Determines whether an initial dataset is loaded into the TERR instance when started. If one is initially loaded and Reload files when changed is also selected, the dataset is reloaded if changed on disk. | LoadModel |
Load entire Directory | check box | Cleared | Determines whether the initial dataset is a single file or all the .RData files in the specified directory. Note that if a directory is specified, only files with the extension .RData are loaded or watched; the capitalization matters on case-sensitive file systems. | DataIsDir |
RData file | drop-down list | None | The name of a resource file to load on TERR initialization. The drop-down contains all the files that are resources to choose from on the current project's resource search path. Only available if Load Entire Directory is false. | ScriptModel |
Directory of RData files | drop-down list | None | The directory containing the files to be loaded on TERR initialization. The drop-down contains all the resource directories in the project. All files with .RData extensions in this directory are loaded into the TERR process on initialization. | ModelDirectory |
Property | Data Type | Default | Description | StreamSQL Property |
---|---|---|---|---|
Result variable | string | None | The name of the R variable that is to be retrieved as the result of the script. | ResultVarName |
Character Set | Drop-down list | UTF-8 | Specifies the character set to be used when reading a script file from disk. | TerrCharset |
Script Source | Radio button | Script Text | The source from which to get the script to send to TERR. | ScriptSource |
Script file | Drop-down list | None | Active only when Script Source is File. In the drop-down list, select the resource file that contains the script. | ScriptLocation |
Script text | string | None | Active only when Script Source is Script text. Specifies content of the script to be used to process input tuples. | ScriptText |
These options are used to set up the TERR instance.
Property | Data Type | Default | Description | StreamSQL Property |
---|---|---|---|---|
TERR Home | string | None | Specifies the full path to the directory in which TERR is installed. If no value is specified, the operator uses the value of the TERR_HOME environment variable, if present. If neither this field nor TERR_HOME is specified, the result is a typecheck error. | TerrHome |
Processor Affinity | string | None | A zero-based, comma-separated list of integers representing processor cores that the TERR process should execute on. Hyperthreaded cores count as cores. An invalid core number for the current CPU causes a typecheck error. For example, on a four-core (2C, 2T) machine, the entry | ProcessorAffinity |
TERR engine parameters | string | None | A parameter string sent to the TERR engine on initialization. See the TERR documentation for usage. Either type a value in the field, or select from a list of values you have entered in the containing project's sbconf file as | TerrEngineParameters |
TERR environment | grid | None | A set of Key-Value pairs that are used in the initial startup environment for the TERR engine. See the TERR documentation for usage. | TerrEnvironment |
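
The format accepted by the Processor Affinity property in the table above can be illustrated with a small parser. The Python below is a sketch of the stated rules (zero-based, comma-separated, hyperthreaded cores counted as cores), not the operator's validation code; parse_affinity is a hypothetical name, and the core count is passed in explicitly rather than detected.

```python
def parse_affinity(text, core_count):
    """Parse a comma-separated, zero-based core list; reject invalid cores."""
    cores = []
    for part in text.split(","):
        core = int(part.strip())  # raises ValueError on non-integers
        if not 0 <= core < core_count:
            raise ValueError("core %d does not exist on this machine" % core)
        cores.append(core)
    return cores


# On a machine with four logical cores, valid core numbers are 0 through 3.
pinned = parse_affinity("0, 2", 4)
```

The ValueError here stands in for the typecheck error that the operator reports for an invalid core number.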
These options are used to adjust the Java environment for the TERR instance.
Property | Data Type | Default | Description | StreamSQL Property |
---|---|---|---|---|
Java installation to use | Radio button | StreamBase | Determines where to get the Java version to use to run TERR. In most cases the default StreamBase is appropriate, but it is not required that the TERR operator uses the same Java version as StreamBase, because they run in different JVM instances. | UseSBJava |
Java home | string | Cleared | Active only when Java installation to use is set to Custom. Specifies the directory containing the Java installation to use to run the TERR instance. | JavaHOME |
Use the Schemas tab to specify the schema of the output tuple for this adapter.
For general instructions on using the Edit Schema tab, see the Properties: Edit Schema Tab section of the Defining Input Streams page.
The Import proposed schemas link can be used to import schemas for the various TERR output types.
A custom schema should use the same field names as the generic schema does, because the operator looks for those field names when filling the output tuple.
Use the Concurrency tab to specify parallel regions for this instance of this component, or multiplicity options, or both. The Concurrency tab settings are described in Concurrency Options, and dispatch styles are described in Dispatch Styles.
Caution
Concurrency settings are not suitable for every application, and using these settings requires a thorough analysis of your application. For details, see Execution Order and Concurrency, which includes important guidelines for using the concurrency options.
The TERR operator has been enhanced to work with a multiplicity greater than 1 and parallelism enabled. If you assign multiple named instances of TERR to a single operator and set the component to run in a parallel region, then each instance of the operator will use its own TERR instance, if sufficient instances were assigned. If fewer TERR instances were assigned than there are parallel operator instances, the ones available are shared among the operator instances. If more are assigned, the excess instances are unused (but will have been created). It is therefore important to assign the correct number of TERR instances.
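The three cases (enough, too few, and too many TERR instances) can be pictured with the following Python sketch. The text does not say how shared instances are distributed, so the round-robin mapping here is purely an assumption, and assign_instances is a hypothetical name.

```python
def assign_instances(terr_instances, operator_count):
    """Map each parallel operator index to a TERR instance name.

    Round-robin sharing is an assumption made for this sketch.
    """
    if not terr_instances:
        raise ValueError("at least one TERR instance is required")
    assignment = {
        op: terr_instances[op % len(terr_instances)]
        for op in range(operator_count)
    }
    unused = terr_instances[operator_count:]  # created but never used
    return assignment, unused
```

With two TERR instances and three parallel operator instances, one TERR instance serves two operators; with three TERR instances and two operators, the third is started but idle.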
The TERR operator has two input ports: a data port, and an optional command port.
There are two possible schemas for the data port. The first is for when the operator is in dynamic mode, where the schema is determined by the TERR script loaded at operator start time. Any updated script loaded dynamically during run time must specify inputs that have at least the same field names and data types as the initially loaded script. There may be additional fields, which are ignored.
At run time, input tuples must have at least one field whose name and data type exactly match one of the fields specified by the TERR script. Input tuples do not need to fill all fields in the TERR script, and the field order of input tuples does not need to match the TERR script's field order.
The second schema, used in static mode, must have a tuple field named terrVars,
which contains the variables to be sent to the TERR process. Any other fields are
ignored.
The TERR operator has two output ports: an optional status port, and a result port.
The status port emits tuples that describe the processing status for each input tuple. It is only present when the Output Status Tuples property is selected. The schema of the output tuple consists of four strings:
Field Name | Field Type | Description |
---|---|---|
Type | String | The type of report, usually Status. |
Action | String | The action that caused the report. |
Message | String | The result reported by TERR of running the script. Examples: Success, Parse Error. |
Object | String | Any extra information about the operator. |
The default schema of the result port consists of:
-
The input tuple, which is passed through unchanged if the Enable Passthrough property is set.
-
A field named terrResult, which is a tuple of tuples and contains the results of TERR processing the provided script for each input tuple.
Because the TERR operator cannot know the data type of the TERR result in advance,
the terrResult field contains a subfield for each possible TERR result data type.
Only one terrResult subfield is filled in per input tuple, depending on the value
assigned to the result variable by the script. The other subfields are left empty
(null) for each input tuple. The top-level schema of the terrResult field is shown
in the following table:
Subfield Name | Field Type | Description |
---|---|---|
double | tuple | The result was a double or array of doubles. |
integer | tuple | The result was an integer or array of integers. |
boolean | tuple | The result was a boolean or array of booleans. |
string | tuple | The result was a string or list of strings. |
dataFrame | tuple | The result was an R dataFrame, which is comparable to a StreamBase tuple. |
byte | tuple | The result was a byte or array of bytes (returned as StreamBase ints). |
list | tuple | The result was an R array, comparable to a StreamBase list. |
factor | tuple | The result was an R factor. |
Each terrResult subfield is a tuple of lists. The first third-level field of every
terrResult subfield is named names. This is a list of one or more names of the
result fields of the input script.
- The scalar subfields
-
The five scalar subfields of terrResult are double, integer, boolean, string, and byte. The schema of each of these subfields is the same: a list of returned script field names and a list of returned values corresponding to each of the names. If the result is a scalar, it is still returned as a list of one item. If the result is a multi-dimensional array, it is flattened to a vector.

Third-level Field Name | Type | Description |
---|---|---|
names | list of strings | List of one or more TERR script field names. |
values | list of type | List of returned values for each script field in names. |

- The dataFrame subfield
-
The dataFrame subfield of terrResult consists of a list of names, plus zero or more lists of integers, doubles, logicals (booleans), factors, strings, or bytes. If more than one of a type occurs, the resulting list contains the concatenation of the lists.

Third-level Field Name | Type | Description |
---|---|---|
names | list of strings | List of one or more TERR script field names. |
integers | list of tuples | List of zero or more returned integer name-value pairs. |
doubles | list of tuples | List of zero or more returned double name-value pairs. |
logicals | list of tuples | List of zero or more returned boolean name-value pairs. |
factors | list of tuples | List of zero or more returned R factor name-index-level triplets. |
strings | list of tuples | List of zero or more returned string name-value pairs. |
bytes | list of tuples | List of zero or more returned byte name-value pairs. |

- The list subfield
-
The list subfield of terrResult consists of a list of names, plus zero or more lists of tuples of the five scalar types plus factors. The schema of the list subfield is the same as for the dataFrame subfield.
- The factor subfield
-
The factor subfield of terrResult consists of three lists: names, indexes, and levels. See the TERR documentation for an explanation of the R factor data type.

Third-level Field Name | Type | Description |
---|---|---|
names | list of strings | List of one or more TERR script field names. |
indexes | list of integers | List of zero or more returned index values. |
levels | list of strings | List of zero or more returned level values. |
You can specify a custom schema that contains only the result data you know to expect. For a custom schema, the names of the fields must be the same as in the default schema described above, and the sub-schemas must also match exactly.
Typechecking fails if any required fields are not filled in. It also fails if the input schema does not contain all the replacement variables that the script needs. All specified dataset and script files and directories are checked for existence, and typechecking fails if any file or directory is not accessible. A result variable must be specified, although the TERR script is not checked to see if it uses it. A script must be specified either as local data or as a resource file. The TERR home directory must be set, through either the TerrHome property or the TERR_HOME environment variable, so that the process can be started.
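The replacement-variable check can be approximated before deployment by scanning a dynamic-mode script for $[name] tags and comparing them with the input schema's field names. The Python below is a sketch of that idea only; missing_fields is a hypothetical helper, and it assumes tag names consist of letters, digits, and underscores.

```python
import re

TAG = re.compile(r"\$\[(\w+)\]")


def missing_fields(script, schema_fields):
    """Return tag names used in the script that the input schema lacks."""
    wanted = []
    for name in TAG.findall(script):
        if name not in wanted:  # keep first-seen order, drop repeats
            wanted.append(name)
    return [name for name in wanted if name not in schema_fields]


script = "n <- $[names]\ns <- $[vals]\nt <- $[vals]"
problems = missing_fields(script, {"names"})
```

An empty result means every tag in the script has a matching schema field; extra schema fields are harmless.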
All errors in the execution of the script are logged and an optional status tuple is emitted.
The TERR operator uses ConfigurationChooserPropertyDescriptors for some of its properties. This means that it can read default values for these properties from the containing application's StreamBase configuration file.
To use this feature, an <adapter-configurations> section must be present in the
configuration file, with at least one child element of the form
<configuration type="terrstring">, where terrstring is one of terrConfigHome,
terrConfigEngineParams, terrConfigJavaHome, terrConfigJavaOptions, or
terrConfigInstance. In each of these configurations, list the choices to be
presented in the form <choice id="valueX">valueX</choice>. Alternatively,
indirection can be used.
See the Javadoc documentation in the StreamBase Client API on
ConfigurationChooserPropertyDescriptors for further information.
See the sbd.sbconf
file in the TERR Operator sample for
an example of this feature.
On suspend, the TERR operator finishes processing the current tuple, outputs the result tuple, then pauses waiting for input.
On resumption, the TERR operator continues processing with the next input tuple.
The TERR instance remains running during suspend.
The StreamBase installation includes a sample demonstrating the use of this operator. To load the sample in StreamBase, select → and look under the Extending StreamBase section for an entry called TERR Operator.