Data Mining Very Large Data Sets (Databases): Scalability of Statistica Data Miner

An important issue in data mining is how the various techniques for exploratory (EDA), visual, and particularly predictive data mining (see Concepts in Data Mining) perform when applied to extremely large data sets. In many domains of application it is not uncommon to deal with data sets in the multiple-gigabyte range, with tens of millions of observations. Analyzing data sets of this size requires some planning to avoid unnecessary performance bottlenecks and inappropriate analytic choices. For example, using advanced neural network techniques to analyze all 20 million observations is simply inappropriate because a) in some cases it could take several days to complete, even on a dedicated supercomputer-class machine, and b) the same information can be extracted much more quickly by first applying an appropriate sub-sampling method and then analyzing a reasonable subset of the input data.

Statistica Data Miner uses a number of technologies specifically developed to optimize the processing of large data sets, and it is designed to handle even the largest-scale computational problems based on very large databases. However, to take full advantage of the computational power of Statistica Data Miner, a number of issues still need to be considered when planning data mining projects for very large data sets. The following paragraphs discuss various strategies and (unique) tools available in Statistica Data Miner that you can use to analyze and build models from huge source data sets.

Connecting to Data

Most likely, data sets that are very large (in the gigabyte range) will reside on a (remote) server, and it is not practical or even desirable to copy those data onto a designated computer for data mining. As an alternative, you can use the Streaming Database Connector facilities to connect to the data. These unique tools enable you to select the variables (fields) of interest from the database, and many subsequent analyses (including many graphs) will then process those data in a single pass through all observations in the database, without the need to create a local copy of the data (on your computer or server). See Select a New Data Source and Streaming Database Connector for additional details.
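
For readers who want to see the underlying idea in code, the following is a minimal, generic Python sketch of this kind of streaming access (it does not use the Statistica API; the connection string, table, and column names are hypothetical). It fetches rows from a server-side query in batches and computes an aggregate in a single pass, without ever materializing a local copy of the data:

    import pyodbc  # generic ODBC access; not the Statistica API

    # Hypothetical connection string, table, and column names.
    conn = pyodbc.connect("DSN=warehouse;UID=reader;PWD=secret")
    cursor = conn.cursor()
    cursor.execute("SELECT amount FROM transactions")  # select only the fields of interest

    # A single pass through all observations: keep running totals
    # instead of creating a local copy of the data.
    count, total = 0, 0.0
    while True:
        rows = cursor.fetchmany(10_000)  # stream the result set in modest batches
        if not rows:
            break
        for (amount,) in rows:
            count += 1
            total += float(amount)

    print(f"n = {count}, mean = {total / count:.4f}")
    conn.close()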

Random Sub-Sampling

We cannot stress enough the importance and utility of random sub-sampling. For example, by properly sampling only 100 observations (from millions), you can compute a very reliable estimate of the mean. One of the rules of statistical sampling that untrained observers often find counterintuitive is that the reliability and validity of results depend, among many other things, on the size of the random sample, and not on the size of the population from which it is taken. In other words, a mean estimated from 100 randomly sampled observations is equally accurate (i.e., falls within the same confidence limits) whether the sample was taken from 1,000 cases or 100 billion cases. Put another way, given a certain (reasonable) degree of accuracy, there is no need to process and include all observations in the final computations (for estimating the mean, fitting models, etc.).
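
The following minimal Python simulation (with arbitrary, illustrative parameters) demonstrates the point: with the sample size held fixed at n = 100, the variability of the estimated mean is essentially the same whether the population contains ten thousand or one million cases:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100        # fixed sample size
    reps = 2_000   # number of repeated samples per population

    for pop_size in (10_000, 1_000_000):
        population = rng.normal(loc=50.0, scale=10.0, size=pop_size)
        means = [rng.choice(population, size=n, replace=False).mean()
                 for _ in range(reps)]
        print(f"population {pop_size:>9,}: sd of sample means = {np.std(means):.3f}")

    # Both lines print roughly sigma / sqrt(n) = 10 / 10 = 1.0: the accuracy
    # of the estimate is driven by n, not by the size of the population.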

Statistica Data Miner contains nodes in the Data Cleaning and Filtering folder for drawing a random sample from the original input data (database connection). Note that Statistica employs a very high-quality random number generator algorithm (validated using the DIEHARD suite of tests), which ensures that the selection of observations will not be biased.
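
Statistica's internal sampling implementation is not documented here, but reservoir sampling (Algorithm R) is one standard way to draw an unbiased random sample in a single pass through a stream of unknown length; the following Python sketch illustrates the idea:

    import random

    def reservoir_sample(stream, k, seed=42):
        """Draw a uniform random sample of k items in one pass (Algorithm R)."""
        rng = random.Random(seed)
        sample = []
        for i, item in enumerate(stream):
            if i < k:
                sample.append(item)          # fill the reservoir first
            else:
                # Keep item i with probability k / (i + 1); this preserves
                # a uniform selection probability for every item seen so far.
                j = rng.randint(0, i)
                if j < k:
                    sample[j] = item
        return sample

    # Works on any iterable, e.g. a database cursor, without knowing its length.
    print(reservoir_sample(range(10_000_000), k=5))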

Statistica Power Analysis

For detailed planning of sub-sampling in predictive data mining, you can also use the Statistica Power Analysis facilities, which provide very valuable information about the relationship between sample size, the effect sizes that would be of interest to you, and the statistical power to detect those effects with different statistical techniques. Power analysis methods have been popular in applied and survey research for a number of years, but they have not yet become common in data mining, even though they can be extremely useful here, particularly in the context of very large data sets.
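
As an illustration of the kind of question power analysis answers (using the open-source statsmodels library, not the Statistica Power Analysis module), the following Python snippet relates sample size, effect size, and power for a two-sample t-test:

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # How many cases per group are needed to detect a small effect (d = 0.2)
    # with 80% power at alpha = 0.05?
    n = analysis.solve_power(effect_size=0.2, power=0.80, alpha=0.05)
    print(f"required n per group: {n:.0f}")   # roughly 394

    # Conversely: with only 100 cases per group, how much power do we have?
    power = analysis.solve_power(effect_size=0.2, nobs1=100, alpha=0.05)
    print(f"power at n = 100: {power:.2f}")   # roughly 0.29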

Algorithms for Incremental (vs. Non-Incremental) Learning

Statistica Data Miner contains a large selection of (learning) algorithms for regression and classification problems. These algorithms can be divided into those that require one or perhaps two complete passes through the input data, and those that require iterative, repeated access to the data to complete the estimation. The former are sometimes referred to as incremental learning algorithms because they complete the computations necessary to fit the respective models by processing one case at a time, each time refining the solution; when all cases have been processed, only a few additional computations are necessary to produce the final results. Non-incremental learning algorithms are those that must process all observations in each iteration of an iterative procedure for refining the solution. Obviously, incremental learning algorithms are usually much faster than non-incremental algorithms, and for extremely large data sets, non-incremental algorithms may not be applicable at all (without sub-sampling first).
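
The following Python sketch (an illustrative toy on synthetic data, not any particular Statistica algorithm) contrasts the two styles for ordinary least-squares regression: the incremental version accumulates sufficient statistics in a single pass over chunks, while the non-incremental version must revisit the full data set on every iteration:

    import numpy as np

    def fit_one_pass(chunks):
        """Incremental: accumulate X'X and X'y chunk by chunk; one pass plus
        a small solve at the end yields the exact least-squares fit."""
        xtx, xty = None, None
        for X, y in chunks:                       # each chunk is seen exactly once
            if xtx is None:
                xtx = np.zeros((X.shape[1], X.shape[1]))
                xty = np.zeros(X.shape[1])
            xtx += X.T @ X
            xty += X.T @ y
        return np.linalg.solve(xtx, xty)

    def fit_multi_pass(X, y, lr=0.05, epochs=500):
        """Non-incremental: gradient descent revisits the full data on every
        iteration, so the whole data set must remain accessible throughout."""
        beta = np.zeros(X.shape[1])
        for _ in range(epochs):
            beta -= lr * (X.T @ (X @ beta - y)) / len(y)
        return beta

    rng = np.random.default_rng(1)
    X = rng.normal(size=(10_000, 3))
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=10_000)
    chunks = [(X[i:i + 1_000], y[i:i + 1_000]) for i in range(0, 10_000, 1_000)]
    print(fit_one_pass(chunks))     # both recover approximately [2.0, -1.0, 0.5],
    print(fit_multi_pass(X, y))     # but only the first needs a single pass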

However, as explained in the preceding paragraphs on Random Sub-Sampling, in most if not all cases it is not useful to process every single observation in a very large database; doing so simply wastes processing resources and time. With careful planning and sub-sampling, the same information can be extracted in much less time, or much more information can be extracted in the time it would take to include all observations in the analyses.

Incremental algorithms

Statistica Data Miner includes several extremely powerful, efficient, and fast algorithms for regression and classification that analyze the data in a single pass through all (or a sub-sample of) observations. For example, Statistica GDA extends the general linear model to classification problems (General Discriminant Function Analysis models). This method (unique to Statistica Data Miner) is the fastest algorithm for classification available and does not require that the source data be copied to a local computer or server; huge databases can be processed in place. It yields outstanding accuracy for predictive classification in most cases, and various options are available for requesting best-subset or stepwise selection of predictor effects. The implementation of stepwise and best-subset selection of predictors for regression problems using the general linear model (GRM, GLM) is equally unique: it, too, is an incremental learning algorithm that performs stepwise and best-subset selection of predictor effects (categorical/class variables are moved in and out of models as multiple-degree-of-freedom effects).
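
Statistica GDA itself is not reproduced here; as a generic stand-in, the following sketch uses scikit-learn's SGDClassifier with partial_fit (on synthetic data) to show how an incremental linear classifier is refined chunk by chunk, e.g., fed directly from a database cursor, without ever holding all observations in memory:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier(random_state=0)   # a linear classifier trained incrementally
    classes = np.array([0, 1])            # must be declared on the first partial_fit

    rng = np.random.default_rng(2)
    for _ in range(100):                  # e.g., 100 chunks of 1,000 cases each
        X = rng.normal(size=(1_000, 5))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        clf.partial_fit(X, y, classes=classes)  # refine the model, discard the chunk

    X_test = rng.normal(size=(1_000, 5))
    y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)
    print(f"holdout accuracy: {clf.score(X_test, y_test):.3f}")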

Feature selection and variable screening

The Statistica Feature Selection and Variable Screening methods require two complete passes through the data. After completing those, the program can select likely predictors (for classification or regression problems) from among hundreds of thousands or even millions of continuous or categorical candidate predictors. The screening is performed by applying a grid to each predictor and then computing statistical indicators of relationship (e.g., Chi-square) that do not assume any particular functional relationship between the predictors and the outcome variable of interest. In other words, this method will not bias the selection for or against any subsequent analytic techniques that may be applied. The method is based on the algorithm also implemented (and tested and proven) in CHAID, and as applied in the Feature Selection and Variable Screening module and node, it can be considered a very efficient incremental learning algorithm as well (hence, huge data sets do not pose a problem). Once an initial selection of predictors from a (huge) list of candidates has been made, various options are available for refining the selection further, for example, by applying Multivariate Adaptive Regression Splines (MARSplines), C&RT methods, etc.
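
The exact Statistica/CHAID implementation is not reproduced here, but the following Python sketch (using scipy and synthetic data) illustrates the screening idea: bin a continuous predictor into a grid and compute a Chi-square statistic against a categorical outcome, without assuming any functional form for the relationship:

    import numpy as np
    from scipy.stats import chi2_contingency

    def screen_predictor(x, y, n_bins=10):
        """Bin a continuous predictor into a grid and test its association
        with a categorical outcome via Chi-square (no functional form assumed)."""
        edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
        x_binned = np.digitize(x, edges[1:-1])          # bin indices 0 .. n_bins-1
        table = np.column_stack([np.bincount(x_binned[y == cls], minlength=n_bins)
                                 for cls in np.unique(y)])
        chi2, p, _, _ = chi2_contingency(table)
        return chi2, p

    rng = np.random.default_rng(3)
    y = rng.integers(0, 2, size=5_000)
    x_informative = rng.normal(loc=y, scale=1.0)   # related to the outcome
    x_noise = rng.normal(size=5_000)               # unrelated to the outcome
    print(screen_predictor(x_informative, y))      # large chi2, tiny p-value
    print(screen_predictor(x_noise, y))            # small chi2, large p-value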

Non-incremental algorithms

Statistica Data Miner includes a large number of very advanced algorithms for fitting complex, highly nonlinear models, such as neural networks and tree-building methods (C&RT, CHAID); see Data Mining Tools. These methods require multiple passes through the data, and some reasonable sub-sampling is usually necessary to apply them to huge data sets. To reiterate (see Random Sub-Sampling above), applying, for example, neural network methods to data sets with millions of observations is not only impractical, but also not useful, even if the computing resources to do so were available. The additional information that could be gained by processing 10 million observations, as compared to 1,000 randomly sampled observations, is marginal at best; yet such computations may tie up your computing resources for days, or even weeks, without generating correspondingly useful information.

Choosing between algorithms

There is another issue to consider: do the more complex, non-incremental learning methods actually produce better (more accurate, more interpretable) predictions? For example, if the relationships in your data are mostly of the kind "the more of x, the more of y," then there is little reason to use anything but linear models. Of course, it could be argued that strong curvilinearity might be present, but in our experience, 1) such higher-order polynomial effects are usually negligible, and 2) they can be modeled quite well as quadratic effects using linear (incremental) methods, as the sketch below shows. Advanced, highly nonlinear methods (such as Neural Networks or Multivariate Adaptive Regression Splines) usually only do better where strong non-monotone relationships exist ("the more of x, the more of y, but only up to a point, after which the relationship reverses..."), perhaps with many breakpoints. Given the domain-specific knowledge you have acquired through experience, how many such highly nonlinear, non-monotone relationships would you expect to find in your data? In our experience, highly complex, nonlinear, non-monotone relationships between predictors and outcome variables are generally the exception; exceptions to this rule do exist, however, in domains such as chemical engineering and electronics manufacturing, where random sampling followed by sophisticated learning algorithms is routinely used in predictive quality/process control.
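
As an illustration of point 2 (synthetic data, arbitrary coefficients), the following Python sketch shows a curvilinear relationship captured simply by adding a squared term to a linear model, which remains linear in its coefficients and can therefore still be fit with fast (incremental) linear methods:

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.uniform(-3, 3, size=2_000)
    y = 1.0 + 2.0 * x - 0.8 * x**2 + rng.normal(scale=0.5, size=2_000)  # curvilinear

    # Adding x^2 as a predictor keeps the model linear in its coefficients.
    X_linear = np.column_stack([np.ones_like(x), x])
    X_quadratic = np.column_stack([np.ones_like(x), x, x**2])

    for name, X in [("linear", X_linear), ("quadratic", X_quadratic)]:
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        print(f"{name:9s} model: residual sd = {resid.std():.3f}")
    # The quadratic term recovers the curvature; no nonlinear learner is needed.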

To summarize, Statistica Data Miner includes very fast, efficient, unique (but proven) incremental learning algorithms that you can use to process your entire database (in place) to identify likely predictors and to detect important (interpretable) relationships between variables. For complex, difficult estimation and prediction problems, Statistica Data Miner includes the most comprehensive selection of techniques and supplementary utilities (e.g., random sub-sampling) for building models of various types and complexities. In general, you should always consider random sub-sampling, regardless of the type of methods you are using, because of the significant savings in processing resources that can be realized without sacrificing much, if any, useful information.

See also, Data Mining Definition, Data Mining with Statistica Data Miner, Structure and User Interface of Statistica Data Miner, Statistica Data Miner Summary, and Getting Started with Statistica Data Miner.