Structure and user interface of Statistica Data Miner
Statistica Data Miner is based on libraries of more than 250 different nodes that contain the complete functionality of Statistica, as well as specialized methods and functions for data mining.
Input data
First, variables are selected from a standard variable selection dialog box. You can specify continuous and categorical dependent and predictor variables, codes, case selection conditions, case weights, etc. Thus, the description of the input, or what is referred to as the Data Input Descriptor in the subsequent text, is serviced by a common dialog box.
A note on variable names
In Statistica (e.g., in macros and spreadsheet formulas), you can refer to variables by their names or numbers (e.g., v1, v2, v3, ...); v0 is the case number. For some design-syntax based modules (e.g., GLM), the Vx convention is ambiguous because it is used generically to reference variables by number. Hence, variable names such as V1, V2, etc., are not suitable when those modules will be referenced in the Data Miner analysis. Additionally, repeated variable names (within one spreadsheet) are not recommended for syntax-based modules.
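The naming rules above can be checked before a spreadsheet is used with syntax-based modules. The following is a minimal sketch in Python, not part of Statistica: the function name check_variable_names is hypothetical, and treating names as case-insensitive when looking for repeats is an assumption.

```python
import re
from collections import Counter

# Hypothetical helper (not a Statistica API): flags names that collide with
# the generic "v<number>" reference convention, and names that are repeated.
_VX_PATTERN = re.compile(r"^[vV]\d+$")

def check_variable_names(names):
    """Return (ambiguous, repeated) for a list of spreadsheet variable names."""
    # Names such as V1 or v12 look like references to variables by number.
    ambiguous = [n for n in names if _VX_PATTERN.match(n.strip())]
    # Assumption: names are compared case-insensitively when detecting repeats.
    counts = Counter(n.strip().lower() for n in names)
    repeated = sorted(n for n, c in counts.items() if c > 1)
    return ambiguous, repeated

ambiguous, repeated = check_variable_names(["Age", "V1", "Income", "income", "v12"])
print(ambiguous)  # ['V1', 'v12']
print(repeated)   # ['income']
```

Renaming the flagged variables before building the project avoids the ambiguity described above.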
The Node Browser: selecting analyses
Next, select from the library of available scripts the desired type of analysis. Data Miner uses a flexible node browser for this purpose that is fully customizable for particular projects or jobs.
The Parameters dialog box
Statistica Data Miner also contains dialog boxes for communicating with the analysis scripts and property files, for example, to modify the parameters of the analysis. Those dialog boxes use the information in the .dmi files regarding defaults, types, enumerated constants, and user access (full access, read-only access, or hidden, i.e., no access).
Nodes. Nodes are the individual icons that connect the input (data) to the output (results). Data flows through the nodes, where they are transformed, analyzed, etc. In Statistica Data Miner, the nodes are classified according to how they function: Data Acquisition nodes (specification of input data), Data preparation, cleaning, and transformation nodes, and Data analysis, modeling, classification, and forecasting nodes.
The general purpose of Statistica Data Miner is to connect to one or more data sources, select variables or cases (observations) of interest, apply various data verification and cleaning methods, and then perform analyses on those data sources and produce final results tables or graphs that extract essential information from the input data. All this can be accomplished in an efficient and convenient user interface that provides the means to quickly change analytic options, switch input data, or move from a project that estimates the best parameters (e.g., rules for classification of some observed data) to one that deploys those estimates in order to, for example, classify new data.
The general architecture of Statistica Data Miner
Data acquisition
Each analysis starts with the definition of the input data. Click the New Data Source button to select the data for the analyses. You can specify Statistica input data spreadsheets or data sources representing connections to databases on remote servers (Streaming Database Connector). To specify connections to external databases, in the Create New Document dialog box, select the Streaming DB Connector tab, and click OK to display the StreamingDB spreadsheet/interface, where you can specify queries to the data.
The data InputDescriptor object and external databases
What is placed into the Data Acquisition box is actually a descriptor of the input data, and not necessarily the data spreadsheet itself. This distinction is very important, as it holds one of the keys to the power and versatility of the Statistica Data Miner system. Any input data source that can be mapped into the data InputDescriptor object can be used in Data Miner. The data InputDescriptor object is further described in How to Write .svx Scripts for Data Miner, for example, in the context of analysis nodes.
Dirty nodes: specifying variables, case selections, etc
In addition to the actual data, the InputDescriptor contains information about the types and nature of the variables that will be used in subsequent analyses. Initially, when you first specify an input data file, those variables are not known yet. Hence, and this is a convention used throughout Statistica Data Miner, the input to the analysis is not fully specified, and the respective icon in the project workspace is marked as dirty by showing a red box around it.
The icon shown to the left is a dirty input data icon; no variables have been specified for the analyses yet. When you double-click on the dirty input data icon, a variable selection dialog box is displayed in which you can specify various lists of variables and codes for the analyses. Once the variable selection is complete, the red box around the icon will be removed (see the icon to the right in the illustration), and you have a clean icon that is updated and ready to be connected to subsequent analysis nodes.
Data preparation, cleaning, transformation
Data cleaning is an often neglected but extremely important step in the data mining process. The old adage "garbage in, garbage out" is particularly applicable to typical data mining projects where large data sets collected via some automatic methods (e.g., via the web) serve as the input into the analyses. Often, the method by which the data were gathered was not tightly controlled, and so the data may contain out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), etc. Analyzing data that have not been carefully screened for such problems can produce highly misleading results. You can access numerous nodes on the Data tab or in the Data folder in the Node Browser for variable transformations, filtering, recoding, subsets, sampling, etc. The Data Health Check Summary node is also a useful tool to examine the data.
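The kind of screening described above can be illustrated outside of Statistica. The following is a minimal sketch in Python that flags the two problems used as examples in the text; the function name screen_records and the record layout (one dictionary per case) are assumptions made for illustration, and the sketch does not reproduce what the Data Health Check Summary node does.

```python
def screen_records(records):
    """Return a list of (case_index, reason) for cases that fail basic checks.

    Hypothetical helper: each record is a dict keyed by variable name.
    """
    problems = []
    for i, rec in enumerate(records):
        income = rec.get("Income")
        # Out-of-range value: income cannot be negative.
        if income is not None and income < 0:
            problems.append((i, "out-of-range value: Income=%s" % income))
        # Impossible combination of two categorical values.
        if rec.get("Gender") == "Male" and rec.get("Pregnant") == "Yes":
            problems.append((i, "impossible combination: Gender=Male, Pregnant=Yes"))
    return problems

records = [
    {"Income": 52000, "Gender": "Female", "Pregnant": "Yes"},
    {"Income": -100, "Gender": "Male", "Pregnant": "No"},
    {"Income": 31000, "Gender": "Male", "Pregnant": "Yes"},
]
for index, reason in screen_records(records):
    print(index, reason)
```

Cases flagged in this way would typically be corrected, recoded as missing, or excluded before they reach the analysis nodes.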
Statistica Feature Selection and Variable Screening
This tool is indispensable for very large input data sets (containing hundreds of thousands of potential predictor variables). The interactive Statistica Feature Selection and Variable Screening facility is a tool for mining huge, terabyte-sized databases that are typically connected to Statistica via the streaming database connector, so that the data do not have to be copied onto your local machine and all queries of the large database to retrieve individual records can be performed on the server, using the database-specific optimized tools. The Feature Selection/Screening module quickly searches through thousands or hundreds of thousands of predictors for regression or classification problems to find those that are likely best suited for this task. The algorithms implemented in the program are general, and they do not assume any particular type or nature of relationships (e.g., linear, quadratic, monotone, non-monotone, etc.).
Data analysis, modeling, classification, forecasting
This constitutes the meat of the analysis: Input data from any source, after appropriate data cleaning and transformations have been applied, are used as the input into subsequent analysis nodes, which extract the nuggets of information contained in the data. All Statistica analytic procedures can be used for this purpose, from simple descriptive statistics, tabulation, or graphical analyses, to complex neural network algorithms, general linear, generalized linear, generalized additive models, etc. Even survival analysis techniques for censored observations can be incorporated, as can quality control charting procedures for monitoring ongoing active data streams.
Data analysis nodes can be selected from among the large library of analytic routines contained in the Data Miner directories of your Statistica installation. They connect to a complete data InputDescriptor, and produce either results nuggets, or data InputDescriptors that can serve as the source for subsequent analyses. For example, you can generate predicted and residual values via multiple regression, and those predicted values can then be connected to subsequent data cleaning or analytic nodes.
Analysis nodes with automatic deployment
Specialized analytic nodes are available that will automatically generate information for deployment. After these nodes have estimated a model, they make available to all other nodes in the current Data Miner project the information necessary to produce predicted values for new data. Nodes are available that will combine the deployment information to compute, for example, a single predicted classification via voting (bagging, averaging), etc.
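Combining predictions from several models by voting or averaging can be sketched as follows. This is a minimal illustration in Python, not the implementation used by the Data Miner nodes; the function names vote and average are hypothetical, and the tie-breaking rule (the class predicted first wins) is an assumption.

```python
from collections import Counter
from statistics import mean

def vote(predicted_classes):
    """Return the class predicted by the most models for one observation.

    Assumption: ties go to the class that appears first in the input.
    """
    return Counter(predicted_classes).most_common(1)[0][0]

def average(predicted_values):
    """Return the mean of the models' continuous predictions for one observation."""
    return mean(predicted_values)

# Three hypothetical models classify the same new observation.
print(vote(["Yes", "No", "Yes"]))  # Yes
# Three hypothetical models predict a continuous value for the same observation.
print(average([1.0, 2.0, 3.0]))    # 2.0
```

Voting applies to classification problems and averaging to regression problems; in both cases the result is a single prediction per observation.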
How deployment information is stored
The deployment information for the nodes located in the Deployment folder of the Node Browser is stored in various forms locally along with each node, as well as globally, visible to other nodes in the same project. This is an important point to remember, because for Classification and Regression, the Node Browser contains a Compute Prediction from All Models node. This node computes predictions based on all deployment information currently available in the global dictionary, which can be reviewed via the Edit Global Dictionary Parameters dialog box. Therefore, when building models for deployment using these options, ensure that all deployment information is up to date (i.e., based on models trained on the most current set of data). See Examples 3 and 4 for illustrations on how to deploy projects.
Predicting new observations, when observed values are not (yet) available
One of the main purposes of predictive data mining (see Concepts in Data Mining) is to allow for accurate prediction (predicted classification) of new observations, for which observed values or classifications are not (yet) available. An example of such an application is presented in Example 3 (see also Example 4). When connecting data for deployment (prediction or predicted classification), ensure that the structure of the input file for deployment is the same as that used for building the models (see also option Data for deployed project; do not re-estimate models in the Select dependent variables and predictors dialog box). Specifically, ensure that the same numbers and types of predictor variables are specified, that a (continuous or categorical) dependent variable is specified (even if all values for that variable are missing), and that the variable names match those in the data file used to build the models (this is particularly important for the deployment of neural networks, which will rely on this information).
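The structural requirements listed above (same number of variables, same types, matching names) can be checked before a data file is connected for deployment. The following is a minimal sketch in Python; the function name check_deployment_structure and the representation of each variable as a (name, type) pair are assumptions for illustration, not part of Statistica.

```python
def check_deployment_structure(training_vars, deployment_vars):
    """Compare two ordered lists of (name, type) pairs and return any issues.

    Hypothetical helper: an empty list means the structures match.
    """
    issues = []
    if len(training_vars) != len(deployment_vars):
        issues.append(
            "number of variables differs: %d vs %d"
            % (len(training_vars), len(deployment_vars))
        )
    # Compare variables position by position, up to the shorter list.
    for pos, (train, deploy) in enumerate(zip(training_vars, deployment_vars), start=1):
        if train[0] != deploy[0]:
            issues.append("variable %d: name %r does not match %r" % (pos, deploy[0], train[0]))
        if train[1] != deploy[1]:
            issues.append("variable %d: type %r does not match %r" % (pos, deploy[1], train[1]))
    return issues

training = [("Age", "continuous"), ("Gender", "categorical"), ("Income", "continuous")]
deployment = [("Age", "continuous"), ("Sex", "categorical"), ("Income", "categorical")]
for issue in check_deployment_structure(training, deployment):
    print(issue)
```

In this example the check reports one name mismatch and one type mismatch, both of which would need to be resolved before the deployed models are applied.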
Using text variables or text values in data miner projects
When using text variables or variables with text values in data miner projects with deployment, or projects that compare predicted classifications from different nodes via Goodness of Fit nodes, you should be careful to review the results to ensure that the coding of categorical variables (with text values) is consistent across nodes. Generally, the program will automatically ensure that identical coding is used across all nodes in the same data miner project. However, when using numeric variables with text labels as input data sources marked for deployment (see the Select dependent variables and predictors topic), or in some special data analysis scenarios, it is important that you understand how such values are handled in Statistica Data Miner. For additional details, refer to Working with Text Variables and Text Values: Ensuring Consistent Coding (see also, How to Write .svx Scripts for Data Miner).
Numeric variables with text values
With Statistica data spreadsheets, you can specify numeric variables (of type integer, double, etc.), and attach text labels to specific values. All analyses will be performed based on the numeric representations, and not on the text representations (which are only used to label results). Therefore, when using numeric variables with text labels as categorical predictors (or dependent variables) in input data sources marked for deployment, you must use the same coding (number-label associations) as was used when the analysis nodes (modules) generated the respective deployment information (i.e., as was used in the training data). For example, suppose a training data set contained a categorical predictor variable Gender, with Male coded as 1 and Female coded as 2, and you computed a linear model based on that coding. When applying the linear model to a new data set to compute predicted values, the same coding must be used. Otherwise, misleading results may be computed.
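The Gender example can be made concrete. The following is a minimal sketch in Python of recoding a new data set whose number-label associations differ from those of the training data; the function name recode_to_training and the dictionary representation of the codings are assumptions for illustration, not a Statistica facility.

```python
def recode_to_training(values, new_labels, training_codes):
    """Map numeric codes in new data onto the codes used in the training data.

    values:         numeric codes as stored in the new data
    new_labels:     code -> text label, as defined in the new data
    training_codes: text label -> code, as defined in the training data
    """
    return [training_codes[new_labels[v]] for v in values]

# Training data: Male coded as 1, Female coded as 2.
training_codes = {"Male": 1, "Female": 2}
# New data uses the opposite association: 1 is Female, 2 is Male.
new_labels = {1: "Female", 2: "Male"}

print(recode_to_training([1, 2, 2], new_labels, training_codes))  # [2, 1, 1]
```

Without this recoding, a model applied to the raw codes 1, 2, 2 would treat the first case as Male and the other two as Female, which is the reverse of what the labels say.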
Text variables
Text variables (containing text values only) may exist in Statistica data spreadsheets, and they commonly occur in streaming database connectors. Any module (or spreadsheet function) executing inside a data miner workspace will use a program-generated coding scheme that is applied consistently to all nodes in the same project. Therefore, when using variables of type text as categorical predictors (or dependent variables) in data mining projects that generate deployment information, any input data sources from which you may want to compute predicted values (deployment) also have to use text variables in the respective places of the categorical predictor list. For example, if you computed a linear model based on a list of predictors that included a text variable Gender (Male, Female), then, when applying this model (during deployment), the program also expects a variable of type text, with the text values Male or Female.
To summarize these rules, numeric variables with text labels are always treated as numeric variables, consistently, by all Data Miner nodes and Statistica modules. When using variables of type text, the coding of individual text values (as levels of a categorical predictor variable) is consistent for all nodes inside a particular Data Miner project; note, however, that this coding might be different when you perform interactive analyses using any of the Statistica modules. Again, for additional detailed information about these issues, refer to Working with Text Variables and Text Values: Ensuring Consistent Coding (see also How to Write .svx Scripts for Data Miner).
Reports
Finally, the Data Miner project reveals the nuggets of information that heretofore lay undetected in the data. Reports are produced by practically all analytic nodes of Statistica Data Miner: Parameter estimates, classification statistics, descriptive statistics and graphics, graphical summaries, etc. These results are placed into the Reporting Documents folder in the workspace. You can use the options available in the Options dialog box - Data Miner tab to customize your program.
See Statistica Data Miner Summary, Data Mining with Statistica Data Miner, and Getting Started with Statistica Data Miner. See also, Using Statistica Data Miner with Extremely Large Data Sets, How to Write .svx Scripts for Data Miner, and Global Dictionary.