The TIBCO Streaming® Classification Trees Operator is used to build classification tree models.
This section describes the properties you can set for this operator, using the various tabs of the Properties view in StreamBase Studio.
Name: Use this required field to specify or change the name of this instance of this component, which must be unique in the current EventFlow module. The name must contain only alphabetic characters, numbers, and underscores, and no hyphens or other special characters. The first character must be alphabetic or an underscore.
Operator: A read-only field that shows the formal name of the operator.
Class name: Shows the fully qualified class name that implements the functionality of this operator. If you need to reference this class name elsewhere in your application, you can right-click this field and select Copy from the context menu to place the full class name in the system clipboard.
Start options: This field provides a link to the Cluster Aware tab, where you configure the conditions under which this operator starts.
Enable Error Output Port: Select this check box to add an Error Port to this component. In the EventFlow canvas, the Error Port shows as a red output port, always the last port for the component. See Using Error Ports to learn about Error Ports.
Description: Optionally enter text to briefly describe the component's purpose and function. In the EventFlow Editor canvas, you can see the description by pressing Ctrl while the component's tooltip is displayed.
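
The naming rule given for the Name field can be checked before a component is renamed. The following is a minimal sketch, not part of the product: the function name `is_valid_component_name` is hypothetical, and it assumes that "alphabetic" means the ASCII letters A to Z and a to z.

```python
import re

# Assumption: "alphabetic" means ASCII letters only.
# First character: a letter or underscore; the rest: letters, digits, underscores.
_NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_component_name(name):
    """Return True if name satisfies the rule described for the Name field."""
    return _NAME_PATTERN.fullmatch(name) is not None


print(is_valid_component_name("ClassificationTree_1"))  # letters, digits, underscore
print(is_valid_component_name("my-tree"))               # contains a hyphen
```

Uniqueness within the EventFlow module is a separate requirement that this sketch does not check.
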
Property | Description |
---|---|
Log Level | Controls the level of verbosity the operator uses to send notifications to the console. This setting can be higher than the containing application's log level. If set lower, the system log level is used. Available values, in increasing order of verbosity, are: OFF, ERROR, WARN, INFO, DEBUG, TRACE. |
Max depth | Specify the maximum depth of the tree. |
Maximum number of leaf nodes | Specify the maximum number of leaf nodes allowed for the tree. |
Minimum size of leaf nodes | Specify the minimum size of leaf nodes. If a split results in leaf nodes with sizes smaller than this value, then the split will not be considered. |
Prune tree | If enabled, the tree will be pruned. This can be useful to avoid overfitting. |
Split rule | Specify the split rule used to evaluate and select the best split variable at each node: Gini, Entropy, or Classification Error. |
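
To illustrate what the three split rules measure, the sketch below uses the standard textbook definitions of Gini impurity, entropy, and classification error for the class labels in a node. This page does not document the operator's internal formulas, so treat these as assumptions (including the base-2 logarithm). The function names are hypothetical, and `split_allowed` mirrors the Minimum size of leaf nodes rule described in the table above.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the proportion of each class among the labels in a node."""
    counts = Counter(labels)
    total = len(labels)
    return [n / total for n in counts.values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits (base-2 logarithm is an assumption)."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def classification_error(labels):
    """Classification error: 1 minus the largest class proportion."""
    return 1.0 - max(class_proportions(labels))


def weighted_impurity(left, right, measure):
    """Size-weighted impurity of the two child nodes of a candidate split."""
    n = len(left) + len(right)
    return (len(left) / n) * measure(left) + (len(right) / n) * measure(right)


def split_allowed(left, right, min_leaf_size):
    """A split is considered only if both children meet the minimum size."""
    return len(left) >= min_leaf_size and len(right) >= min_leaf_size


node = ["yes", "yes", "no", "no"]
print(gini(node), entropy(node), classification_error(node))
```

Under each rule, a lower weighted impurity for the child nodes indicates a better candidate split; a split that separates the classes perfectly has a weighted impurity of zero.
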
Property | Description |
---|---|
Use last k rows for testing, k=: | Specify the last 'k' rows at the end of the incoming data to use as a hold-out or test sample, that is, the last 'k' rows will not be used in estimation of the model's parameters. Predictions and model summary measures for the test data are available. |
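
The hold-out behavior described above amounts to splitting the incoming rows at a fixed offset from the end. The following is a minimal sketch under that reading; `split_holdout` is a hypothetical helper, not an operator API.

```python
def split_holdout(rows, k):
    """Split rows into (training, testing), where testing is the last k rows.

    The testing rows are excluded from parameter estimation and are used
    only to evaluate the trained model.
    """
    if k < 0 or k > len(rows):
        raise ValueError("k must be between 0 and the number of rows")
    cut = len(rows) - k
    return rows[:cut], rows[cut:]


train, test = split_holdout(list(range(10)), 3)
print(train)  # rows used for estimation
print(test)   # last 3 rows, held out for testing
```

Because the split is positional, the order in which rows arrive determines which rows end up in the test sample.
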
Property | Description |
---|---|
How to handle unmatched categories: | If Set predictions to missing data is specified, then cases with new categorical levels that were not observed in the training data are ignored. If Stop scoring and display error is selected, and the scoring data contains a categorical level that was not observed in the training data, then no predictions are created and an error is displayed. |
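
The two policies for unmatched categories can be sketched as follows. This is an illustration of the described behavior, not the operator's implementation: `find_unmatched`, `score`, and the `predict` callback are hypothetical names, and a missing prediction is represented here as `None`.

```python
def find_unmatched(row, known_levels):
    """Return the fields in row whose value was not seen during training."""
    return {
        field: value
        for field, value in row.items()
        if field in known_levels and value not in known_levels[field]
    }


def score(rows, known_levels, predict, stop_on_unmatched=False):
    """Score rows under one of the two unmatched-category policies.

    stop_on_unmatched=False: rows with unseen levels get a missing prediction.
    stop_on_unmatched=True:  any unseen level raises an error before any
                             prediction is produced.
    """
    if stop_on_unmatched:
        for row in rows:
            unmatched = find_unmatched(row, known_levels)
            if unmatched:
                raise ValueError(f"unmatched categories: {unmatched}")
    return [
        None if find_unmatched(row, known_levels) else predict(row)
        for row in rows
    ]


known = {"color": {"red", "blue"}}
rows = [{"color": "red"}, {"color": "green"}]
print(score(rows, known, lambda row: "yes"))
```
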
Property | Description |
---|---|
Predictors | Specify the list of predictor variables. Regular expression matching is supported. Integral and string variables are assumed to be categorical variables and will be coded according to the option specified on the Operator properties tab. |
Response | Specify the single categorical response variable. |
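
As an illustration of selecting predictors by regular expression, the sketch below matches patterns against field names. It makes two assumptions that this page does not confirm: that a pattern must match the whole field name, and that Python's `re` syntax stands in for whatever regular expression dialect the operator uses. The function and field names are hypothetical.

```python
import re


def select_predictors(field_names, patterns, response):
    """Return the fields matched by any pattern, excluding the response.

    Assumption: a pattern must match the entire field name.
    """
    selected = []
    for name in field_names:
        if name == response:
            continue  # the response variable is never a predictor
        if any(re.fullmatch(pattern, name) for pattern in patterns):
            selected.append(name)
    return selected


fields = ["age", "income", "region_code", "region_name", "churn"]
print(select_predictors(fields, ["age", "region_.*"], response="churn"))
```
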
The Code Select tab allows you to run a statistical operator on a subset of the incoming data where the selected cases take on specific values of one or more categorical variables.
Column | Description |
---|---|
Field | The name of the categorical variable |
Codes | Specify the codes to use in the analysis. |
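
The effect of a Code Select configuration can be pictured as a row filter. The sketch below is illustrative only: `select_cases` is a hypothetical helper, and it assumes that when several fields are listed, a row must match the codes for every field.

```python
def select_cases(rows, codes_by_field):
    """Keep the rows whose value for each listed field is one of its codes.

    Assumption: conditions on multiple fields are combined with AND.
    """
    return [
        row
        for row in rows
        if all(row.get(field) in codes for field, codes in codes_by_field.items())
    ]


rows = [
    {"region": "east", "tier": 1},
    {"region": "west", "tier": 1},
    {"region": "east", "tier": 2},
]
print(select_cases(rows, {"region": {"east"}, "tier": {1}}))
```
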
Use the settings in this tab to allow this operator or adapter to start and stop based on conditions that occur at runtime in a cluster with more than one node. During initial development of the fragment that contains this operator or adapter, and for maximum compatibility with TIBCO Streaming releases before 10.5.0, leave the Cluster start policy control in its default setting, Start with module.
Cluster awareness is an advanced topic that requires an understanding of StreamBase Runtime architecture features, including clusters, quorums, availability zones, and partitions. See Cluster Awareness Tab Settings on the Using Cluster Awareness page for instructions on configuring this tab.
Use the Concurrency tab to specify parallel regions for this instance of this component, or multiplicity options, or both. The Concurrency tab settings are described in Concurrency Options, and dispatch styles are described in Dispatch Styles.
Caution
Concurrency settings are not suitable for every application, and using these settings requires a thorough analysis of your application. For details, see Execution Order and Concurrency, which includes important guidelines for using the concurrency options.
The operator expects that the response variable to be analyzed is of type 'int', 'long', or 'string', continuous predictors are of type 'double', and categorical predictors are of type 'int', 'long', or 'string'. The two input ports are described below.
Port | Description |
---|---|
Training/Testing | Input data associated with the training and testing data. Incoming training data will be used to estimate model parameters, whereas incoming testing data will be used as a hold-out to evaluate the performance of the trained model. |
Scoring | Incoming scoring data will be scored by the trained model. |
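
The type expectations for the input schema can be summarized as a small check. This is a sketch only: the type names are taken from the text above, while the helper names and the example schema are hypothetical.

```python
# Types named in the text above.
CATEGORICAL_TYPES = {"int", "long", "string"}
CONTINUOUS_TYPES = {"double"}


def classify_predictor(type_name):
    """Return how a predictor of the given field type would be treated."""
    if type_name in CATEGORICAL_TYPES:
        return "categorical"
    if type_name in CONTINUOUS_TYPES:
        return "continuous"
    raise ValueError(f"unsupported predictor type: {type_name}")


def is_valid_response_type(type_name):
    """The response variable must be categorical: int, long, or string."""
    return type_name in CATEGORICAL_TYPES


# Hypothetical input schema: field name -> field type.
schema = {"age": "double", "region": "string", "tier": "int"}
print({name: classify_predictor(t) for name, t in schema.items()})
```
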
The three output ports are described below.
Port | Description |
---|---|
Model summary | The output tuple consists of the incoming data passed through, along with a list of analytic results. |
Train/Test predictions | Predictions and predicted probabilities for both training and testing data. |
Scoring predictions | Predictions and predicted probabilities for the scoring data. |