Workspace
The Statistica Workspace contains the symbolic representation of the flow of the input data through the data filtering or cleaning and analytic nodes to the results nodes.
The Node Browser contains many nodes that can be connected to one or more data sources and moved (dragged-and-dropped) from one workspace to another. These nodes can be viewed as individual analysis objects that encapsulate the complete functionality of the Statistica analysis and data management routines.
Data sources can be connected to data cleaning and filtering nodes as well as analysis nodes for subsequent transformations, subsampling, or analyses.
Although the provided workspace nodes already offer many features, you can also enhance them. This enables you to configure analysis, graph, and data cleansing nodes to better suit your needs.
Let us begin by opening a new workspace. Click the Node Browser button to open the Node Browser. Ensure that the current selection in the box at the top is All Procedures. Select the User-Defined Subset node in the Data Cleaning and Filtering folder and click Insert into workspace.
Begin customizing the node by renaming it. Right-click the User-Defined Subset node and select Rename. Name the node Recode MD and click OK. The node will be now be renamed. Right-click on the Recode MD node and select Edit Code. A macro window will open with the default code for the User-Defined Subset node:
'This function accepts as input an 'InputDescriptor (DataIn), 'applies a filter/subset operation, and returns the 'filtered data in DataOut Private Sub SubsetNode( _ DataIn() As InputDescriptor, _ DataOut() As InputDescriptor) ReDim DataOut(LBound(DataIn()) To UBound(DataIn())) As InputDescriptor For i=LBound(DataIn()) To UBound(DataIn()) Set DataOut(i)=DataIn(i).Clone() Next i End Sub
Because this node is a subset node, it appears in the Data Preparation, Cleaning and Transformation pane of the workspace. When the project is ran, the SubsetNode() function is automatically called, instead of the Main() function. The SubsetNode() function takes as its arguments DataIn() and DataOut().
DataIn() represents the collection of documents which are connected to this node in the workspace. For example, if three spreadsheets were inserted into the workspace and connected to this node, then when the node is executed, DataIn() will be an array of three InputDescriptor objects.
An InputDescriptor is an object which represents the dataset in a workspace. The InputDescriptor object contains a Datasource property that returns the actual spreadsheet document that the node represents. The InputDescriptor object also contains configuration information for the node as well. This information includes variable selections, code selections, case selection conditions, case weights, etc.
The DataOut() parameter of the SubsetNode() function is the output from the node. Depending on the purpose of the node, the number of items in the DataOut() collection may match the number of items in the DataIn() collection. For example, if you are merging spreadsheets, you might have three inputs to a node and only one output. In our example, we are going to perform a missing data replacement operation on each input spreadsheet. For each member in the DataIn() collection, we will have a respective item in the DataOut() collection.
As written, the code for this node matches our needs. It makes a clone of each InputDescriptor. When the InputDescriptor.Clone() function is called, a duplicate copy of the InputDescriptor is created, including the data source, variable selections, etc. We will need to add code after the clone process to modify the data source of the cloned InputDescriptor.
We will be using the ReplaceMD() function of the spreadsheet object to replace the missing data within the selected variables with the mean of the variable(s). We will make the assumption that the variables we will perform this operation on are selected as continuous dependent variables.
Modify the macro by adding the “Dim spr As Spreadsheet” line (as indicated below). Then, start typing the “spr.ReplaceMD(” line.
Private Sub SubsetNode( _ DataIn() As InputDescriptor, _ DataOut() As InputDescriptor) ReDim DataOut(LBound(DataIn()) To UBound(DataIn())) _ As InputDescriptor Dim spr As Spreadsheet For i=LBound(DataIn()) To UBound(DataIn()) Set DataOut(i)=DataIn(i).Clone() Set spr = DataOut(i).DataSource spr.ReplaceMD( Next i End Sub
Once you type the left parenthesis, the macro editor will display tooltip text that shows the expected parameters for the ReplaceMD() function. Notice that the parameters include the variable list, inclusion statement, includes list, exclusion statement, exclude list, and weight. Of all these parameters, only the variable list is a required parameter (as indicated by the brackets around the remaining parameters). For this example, we will use all parameters except the weight. Leave the “spr.ReplaceMD(” line of code incomplete and add the declarations for the new variables to store the needed values:
Private Sub SubsetNode( _ DataIn() As InputDescriptor, _ DataOut() As InputDescriptor)ReDim DataOut(LBound(DataIn()) To UBound(DataIn())) _ As InputDescriptor Dim spr As Spreadsheet Dim VariableString As String Dim SCIncludeExpression As String Dim SCIncludeList As String Dim SCExcludeExpression As String Dim SCExcludeList As String For i=LBound(DataIn()) To UBound(DataIn()) Set DataOut(i)=DataIn(i).Clone() Set spr = DataOut(i).DataSource spr.ReplaceMD(Next i End Sub
For each InputDescriptor, we will need to retrieve the values for the newly added variables before calling the ReplaceMD() function:
Private Sub SubsetNode( _ DataIn() As InputDescriptor, _ DataOut() As InputDescriptor) ReDim DataOut(LBound(DataIn()) To UBound(DataIn())) _ As InputDescriptor Dim spr As Spreadsheet Dim VariableString As String Dim SCIncludeExpression As String Dim SCIncludeList As String Dim SCExcludeExpression As String Dim SCExcludeList As String For i=LBound(DataIn()) To UBound(DataIn()) Set DataOut(i)=DataIn(i).Clone() Set spr = _ DataOut(i).DataSource VariableString = _ DataOut(i).ContinuousDependentVariablesString SCIncludeExpression = _ DataOut(i).SelectionCondition.IncludeExpression SCIncludeList = _ DataOut(i).SelectionCondition.IncludeList SCExcludeExpression = _ DataOut(i).SelectionCondition.ExcludeExpression SCExcludeList = _ DataOut(i).SelectionCondition.ExcludeList spr.ReplaceMD( Next i End Sub
The last change will be to pass these parameters into the ReplaceMD() function and reset them after each iteration:
Private Sub SubsetNode( _ DataIn() As InputDescriptor, _ DataOut() As InputDescriptor) ReDim DataOut(LBound(DataIn()) To UBound(DataIn())) As InputDescriptor Dim spr As Spreadsheet Dim VariableString As String Dim SCIncludeExpression As String Dim SCIncludeList As String Dim SCExcludeExpression As String Dim SCExcludeList As String For i=LBound(DataIn()) To UBound(DataIn()) Set DataOut(i)=DataIn(i).Clone() Set spr = DataOut(i).DataSource VariableString = _ DataOut(i).ContinuousDependentVariablesString SCIncludeExpression = _ DataOut(i).SelectionCondition.IncludeExpression SCIncludeList = _ DataOut(i).SelectionCondition.IncludeList SCExcludeExpression = _ DataOut(i).SelectionCondition.ExcludeExpression SCExcludeList = _ DataOut(i).SelectionCondition.ExcludeList spr.ReplaceMD (VariableString,SCIncludeExpression, _ SCIncludeList,SCExcludeExpression,SCExcludeList) VariableString = "" SCIncludeExpression = "" SCIncludeList = "" SCExcludeExpression = "" SCExcludeList = "" Next i End Sub
As written above, the selection conditions will always be applied in the ReplaceMD() function call. To change this, apply the assignment only if the selection conditions are enabled in the InputDescriptor:
Private Sub SubsetNode( _ DataIn() As InputDescriptor, _ DataOut() As InputDescriptor) ReDim DataOut(LBound(DataIn()) To UBound(DataIn())) _ As InputDescriptor Dim spr As Spreadsheet Dim VariableString As String Dim SCIncludeExpression As String Dim SCIncludeList As String Dim SCExcludeExpression As String Dim SCExcludeList As String For i=LBound(DataIn()) To UBound(DataIn()) Set DataOut(i)=DataIn(i).Clone() Set spr = DataOut(i).DataSource VariableString = _ DataOut(i).ContinuousDependentVariablesString If DataOut(i).SelectionCondition.Enabled = True Then SCIncludeExpression = _ DataOut(i).SelectionCondition.IncludeExpression SCIncludeList = _ DataOut(i).SelectionCondition.IncludeList SCExcludeExpression = _ DataOut(i).SelectionCondition.ExcludeExpression SCExcludeList = _ DataOut(i).SelectionCondition.ExcludeList End If spr.ReplaceMD(VariableString,SCIncludeExpression, _ SCIncludeList,SCExcludeExpression,SCExcludeList) VariableString = "" SCIncludeExpression = "" SCIncludeList = "" SCExcludeExpression = "" SCExcludeList = "" Next i End Sub
The code is now complete. To test the node, insert a dataset with missing data into the workspace, connect it to the node, select the variable containing missing data as a continuous, dependent variable, and then run the project. With the variables selected as continuous dependents, their missing data will be replaced with their respective means. The node also accommodates multiple InputDescriptors and selection conditions.