Data Mining Overview

Data mining is an analytic process designed to explore large amounts of data (typically business or market related) in search of consistent patterns and systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction; predictive data mining is the most common type of data mining and has the most direct business applications.

The process consists of three basic stages:

  1. Initial exploration,
  2. Model building or pattern identification with validation/verification, and
  3. Deployment (that is, the application of the model to new data in order to generate predictions).

Stage 1: Exploration

This stage usually starts with data preparation, which may involve cleaning data, transforming data, selecting subsets of records, and, in the case of data sets with large numbers of variables (fields), performing some preliminary feature selection operations to bring the number of variables into a manageable range (depending on the statistical methods being considered). Then, depending on the nature of the analytic problem, this first stage of the data mining process can involve anything from a simple choice of straightforward predictors for a regression model to elaborate exploratory analyses using a wide variety of graphical and statistical methods, carried out in order to identify the most relevant variables and determine the complexity and the general nature of the models that can be taken into account in the next stage.

Stage 2: Model building and validation

This stage involves considering various models and choosing the best one based on their predictive performance (i.e., explaining the variability in question and producing stable results across samples). This may sound like a simple operation, but it sometimes involves a very elaborate process. A variety of techniques have been developed to achieve this goal, many of which are based on so-called "competitive evaluation of models," that is, applying different models to the same data set and then comparing their performance to choose the best. These techniques, which are often considered the core of Predictive Data Mining, include: Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.
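
The following is a minimal sketch of competitive evaluation of models, assuming scikit-learn and a synthetic data set (neither is prescribed by the text): several candidate classifiers are applied to the same data and compared by cross-validated accuracy, and the best-performing one is selected.

```python
# Competitive evaluation of models: apply different models to the same data
# set and compare their performance (5-fold cross-validated accuracy).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(),
}

# Estimate each model's predictive performance, then choose the best.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores)
print("selected model:", best)
```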

Stage 3: Deployment

This final stage involves using the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome.

The concept of data mining is becoming increasingly popular as a business information management tool where it is expected to reveal knowledge structures that can guide decisions in conditions of limited certainty. Recently, there has been increased interest in developing new analytic techniques specifically designed to address the issues relevant to business data mining, but data mining is still based in large part on the conceptual principles of statistics including the traditional Exploratory Data Analysis (EDA) and modeling, and it shares with them both general approaches and specific techniques.

However, an important general difference in focus and purpose between data mining and traditional Exploratory Data Analysis (EDA) is that data mining is more oriented toward applications than toward the basic nature of the underlying phenomena. In other words, data mining is relatively less concerned with identifying the specific relations between the involved variables. For example, uncovering the nature of the underlying functions or the specific types of interactive, multivariate dependencies between variables is not the main goal of data mining. Instead, the focus is on producing a solution that can generate useful predictions. Therefore, data mining accepts, among other approaches, a black box approach to data exploration or knowledge discovery and uses not only the traditional exploratory data analysis techniques but also techniques such as Neural Networks, which can generate valid predictions but are not capable of identifying the specific nature of the interrelations between the variables on which the predictions are based.

Data mining is often considered to be "a blend of statistics, AI [artificial intelligence], and database research" (Pregibon, 1997, p. 8), which until very recently was not commonly recognized as a field of interest for statisticians and was even considered by some "a dirty word in Statistics" (Pregibon, 1997, p. 8). Due to its applied importance, however, the field has emerged as a rapidly growing major area (also in statistics) where important theoretical advances are being made (see, for example, the recent annual International Conferences on Knowledge Discovery and Data Mining, cohosted in 1997 by the American Statistical Association).

For information on data mining techniques, see Exploratory Data Analysis (EDA) and Data Mining Techniques. Representative selections of articles on data mining can be found in the proceedings of the American Association for Artificial Intelligence (AAAI) Workshops on Knowledge Discovery in Databases, published by AAAI Press (e.g., Piatetsky-Shapiro, 1993; Fayyad & Uthurusamy, 1994).

There are numerous books that review the theory and practice of data mining; the following books offer a sample of recent general books on data mining, representing a variety of approaches and perspectives:

Berry, M. J. A., & Linoff, G. S. (2000). Mastering data mining. New York: Wiley.

Edelstein, H. A. (1999). Introduction to data mining and knowledge discovery (3rd ed.). Potomac, MD: Two Crows Corp.

Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in knowledge discovery & data mining. Cambridge, MA: MIT Press.

Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. New York: Morgan Kaufmann.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer.

Pregibon, D. (1997). Data mining. Statistical Computing and Graphics, 7, 8.

Weiss, S. M., & Indurkhya, N. (1997). Predictive data mining: A practical guide. New York: Morgan Kaufmann.

Westphal, C., & Blaxton, T. (1998). Data mining solutions. New York: Wiley.

Witten, I. H., & Frank, E. (2000). Data mining. New York: Morgan Kaufmann.

Crucial Concepts in Data Mining

Stacked Generalization

See Stacking.

Voting

See Bagging.

Feature Selection

One of the preliminary stages in the process of data mining, applicable when the data set includes more variables than can be included (or would be efficient to include) in the actual model-building phase (or even in the initial exploratory operations).
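
A minimal sketch of such a preliminary feature selection step, assuming scikit-learn and a synthetic data set (an illustration only, not a prescribed method): a simple univariate filter keeps the k variables most strongly associated with the outcome before any model building takes place.

```python
# Preliminary feature selection: reduce hundreds of candidate predictors
# to a manageable number before model building.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data set with many candidate predictors.
X, y = make_classification(n_samples=300, n_features=200, n_informative=10,
                           random_state=0)

selector = SelectKBest(score_func=f_classif, k=20)
X_reduced = selector.fit_transform(X, y)

print("variables before:", X.shape[1])          # 200
print("variables after: ", X_reduced.shape[1])  # 20
```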

Bagging (Voting, Averaging)

The concept of bagging (voting for classification problems, averaging for regression-type problems with continuous dependent variables of interest) applies to the area of predictive data mining, to combine the predicted classifications (or predictions) from multiple models, or from the same type of model trained on different learning data. It is also used to address the inherent instability of results when complex models are applied to relatively small data sets. Suppose your data mining task is to build a model for predictive classification, and the data set from which to train the model (the learning data set, which contains observed classifications) is relatively small. You could repeatedly sub-sample (with replacement) from the data set and apply, for example, a tree classifier (C&RT or CHAID) to the successive samples. In practice, very different trees will often be grown for the different samples, illustrating the instability of models often evident with small data sets. One method of deriving a single prediction (for new observations) is to use all trees found in the different samples and to apply some simple voting: the final classification is the one most often predicted by the different trees. Note that some weighted combination of predictions (weighted vote, weighted average) is also possible and commonly used. A sophisticated (machine learning) algorithm for generating weights for weighted prediction or voting is the Boosting procedure.
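
A minimal hand-rolled sketch of bagging by simple voting, assuming scikit-learn decision trees and a small synthetic learning set (both are assumptions made for illustration): the data are repeatedly sub-sampled with replacement, a tree is grown on each sample, and the trees vote on new observations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=100, n_features=10, random_state=0)
X_new = X[:5]  # pretend these are new observations to classify

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample (with replacement)
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    trees.append(tree)

# Simple (unweighted) voting: the final class is the one predicted most often.
votes = np.array([tree.predict(X_new) for tree in trees])
final = [np.bincount(col).argmax() for col in votes.T.astype(int)]
print(final)
```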

Boosting

The concept of boosting applies to the area of Predictive Data Mining, to generate multiple models or classifiers (for prediction or classification), and to derive weights to combine the predictions from those models into a single prediction or predicted classification (see also Bagging).

A simple algorithm for boosting works like this: Start by applying some method (e.g., a tree classifier such as C&RT or CHAID) to the learning data, where each observation is assigned an equal weight. Compute the predicted classifications, and apply weights to the observations in the learning sample that are inversely proportional to the accuracy of the classification. In other words, assign greater weight to those observations that are difficult to classify (where the misclassification rate is high), and lower weight to those that are easy to classify (where the misclassification rate is low). In the context of C&RT, for example, different misclassification costs (for the different classes) can be applied, inversely proportional to the accuracy of prediction in each class. Then apply the classifier again to the weighted data (or with different misclassification costs), and continue with the next iteration (application of the analysis method for classification to the re-weighted data).

Boosting will generate a sequence of classifiers, where each consecutive classifier in the sequence is an expert in classifying observations that were not well classified by those preceding it. During deployment (for prediction or classification of new cases), the predictions from the different classifiers can then be combined (e.g., via voting, or some weighted voting procedure) to derive a single best prediction or classification.

Note that boosting can also be applied to learning methods that do not explicitly support weights or misclassification costs. In that case, random subsampling can be applied to the learning data in the successive steps of the iterative boosting procedure, where the probability for selection of an observation into the subsample is inversely proportional to the accuracy of the prediction for that observation in the previous iteration (in the sequence of iterations of the boosting procedure).
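
Below is a simplified sketch of boosting by re-weighting, assuming scikit-learn trees that accept observation weights and a synthetic data set. The weight update shown is purely illustrative (misclassified observations simply receive greater weight each round); published boosting algorithms such as AdaBoost use specific, theoretically derived updates and combination weights.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=1)
weights = np.ones(len(y))  # start with equal weights for all observations
classifiers = []

for _ in range(10):
    clf = DecisionTreeClassifier(max_depth=1, random_state=0)
    clf.fit(X, y, sample_weight=weights)
    pred = clf.predict(X)
    # Observations that are hard to classify receive greater weight in the
    # next iteration; easy ones receive less (illustrative update only).
    weights = np.where(pred != y, weights * 2.0, weights * 0.8)
    weights /= weights.sum()
    classifiers.append(clf)

# Deployment: combine the sequence of classifiers by simple voting.
votes = np.array([clf.predict(X[:5]) for clf in classifiers])
print([np.bincount(col).argmax() for col in votes.T.astype(int)])
```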

Stacking (Stacked Generalization)

The concept of stacking (short for Stacked Generalization) applies to the area of Predictive Data Mining, to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different.

Suppose your data mining project includes tree classifiers, such as C&RT or CHAID, linear discriminant analysis, and Neural Networks. Each computes predicted classifications for a cross-validation sample, from which overall goodness-of-fit statistics (misclassification rates) can be computed. Experience has shown that combining the predictions from multiple methods often yields more accurate predictions than can be derived from any one method (see Witten and Frank, 2000). In stacking, the predictions from different classifiers are used as input into a meta-learner, which attempts to combine the predictions to create a final best predicted classification. So, for example, the predicted classifications from the tree classifiers, linear model, and the neural network classifier(s) can be used as input variables into a neural network meta-classifier, which will attempt to "learn" from the data how to combine the predictions from the different models to yield maximum classification accuracy.

Other methods for combining the predictions from multiple models or methods (e.g., from multiple data sets used for learning) are Boosting and Bagging (Voting).
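
A minimal sketch of stacking, assuming scikit-learn and a synthetic data set (the particular base models and meta-learner below are illustrative choices, not prescribed by the text): the cross-validated predictions of several quite different base classifiers become the input variables of a meta-learner, which learns how to combine them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=15, random_state=0)
X_train, X_new, y_train, _ = train_test_split(X, y, random_state=0)

base_models = [DecisionTreeClassifier(random_state=0),
               LinearDiscriminantAnalysis(),
               MLPClassifier(max_iter=2000, random_state=0)]

# Level 0: cross-validated predictions of each base model on the learning data.
level0 = np.column_stack([cross_val_predict(m, X_train, y_train, cv=5)
                          for m in base_models])

# Level 1: the meta-learner is trained on the base models' predictions.
meta = LogisticRegression().fit(level0, y_train)

# Deployment: refit the base models on all learning data, then feed their
# predictions for new cases into the meta-learner.
new_inputs = np.column_stack([m.fit(X_train, y_train).predict(X_new)
                              for m in base_models])
print(meta.predict(new_inputs)[:10])
```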

Meta-Learning

The concept of meta-learning applies to the area of Predictive Data Mining, to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different. In this context, this procedure is also referred to as Stacking (Stacked Generalization).

Suppose your data mining project includes tree classifiers, such as C&RT and CHAID, linear discriminant analysis, and Neural Networks. Each computes predicted classifications for a cross-validation sample, from which overall goodness-of-fit statistics (misclassification rates) can be computed. Experience has shown that combining the predictions from multiple methods often yields more accurate predictions than can be derived from any one method (see Witten and Frank, 2000). The predictions from different classifiers can be used as input into a meta-learner, which will attempt to combine the predictions to create a final best predicted classification. So, for example, the predicted classifications from the tree classifiers, linear model, and the neural network classifier(s) can be used as input variables into a neural network meta-classifier, which will attempt to "learn" from the data how to combine the predictions from the different models to yield maximum classification accuracy.

One can apply meta-learners to the results from different meta-learners to create meta-meta-learners, and so on; in practice, however, the exponential increase in the amount of data processing required to derive an accurate prediction yields less and less marginal utility.

Drill-Down Analysis

The concept of drill-down analysis applies to the area of data mining and denotes the interactive exploration of data, in particular of large databases. The process of drill-down analysis begins by considering some simple break-downs of the data by a few variables of interest (e.g., gender, geographic region). Various statistics, tables, histograms, and other graphical summaries can be computed for each group. Next, one may want to drill down to expose and further analyze the data underneath one of the categorizations; for example, one might want to further review the data for males from the Midwest. Again, various statistical and graphical summaries can be computed for those cases only, which might suggest further break-downs by other variables (e.g., income, age). At the lowest (bottom) level are the raw data: for example, you may want to review the addresses of male customers from one region, within a certain income group, etc., and offer those customers specific services of particular utility to that group.
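
A minimal sketch of drill-down analysis using pandas and a small made-up customer table (both are assumptions; the text does not prescribe any particular tool or data): start from a coarse break-down, drill into one subgroup, and finish at the raw records.

```python
import pandas as pd

customers = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M", "F"],
    "region": ["Midwest", "Midwest", "East", "East", "Midwest", "West"],
    "income": [42000, 51000, 38000, 62000, 45500, 58000],
    "age":    [34, 41, 29, 52, 38, 47],
})

# Top level: a simple break-down by a few variables of interest.
print(customers.groupby(["gender", "region"])["income"].agg(["count", "mean"]))

# Drill down: restrict attention to males from the Midwest and summarize again.
subset = customers[(customers.gender == "M") & (customers.region == "Midwest")]
print(subset[["income", "age"]].describe())

# Bottom level: the raw records for that group.
print(subset)
```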

Deployment

The concept of deployment in Predictive Data Mining refers to the application of a model for prediction or classification to new data. After a satisfactory model or set of models has been identified (trained) for a particular application, one usually wants to deploy those models so that predictions or predicted classifications can quickly be obtained for new data. For example, a credit card company may want to deploy a trained model or set of models (e.g., neural networks, a meta-learner) to quickly identify transactions that have a high probability of being fraudulent. Statistica Data Miner includes a complete deployment engine with various options for deploying solutions derived from data mining projects.
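
A minimal sketch of deployment, assuming scikit-learn and joblib, with a synthetic data set and a hypothetical file name ("fraud_model.joblib"): a model trained during the data mining project is saved once and later loaded to score new cases as they arrive.

```python
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Training side: fit the selected model and persist it.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)
dump(model, "fraud_model.joblib")

# Deployment side: load the trained model and score new cases quickly.
deployed = load("fraud_model.joblib")
X_new, _ = make_classification(n_samples=5, n_features=8, random_state=1)
print(deployed.predict_proba(X_new)[:, 1])  # estimated probability of class 1
```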

Predictive Data Mining

The term Predictive Data Mining is usually applied to data mining projects whose goal is to identify a statistical or neural network model, or set of models, that can be used to predict some response of interest. For example, a credit card company may want to engage in predictive data mining to derive a (trained) model or set of models (e.g., neural networks, a meta-learner) that can quickly identify transactions which have a high probability of being fraudulent. Other types of data mining projects may be more exploratory in nature (e.g., to identify clusters or segments of customers), in which case drill-down descriptive and exploratory methods would be applied. Data reduction is another possible objective for data mining (e.g., to aggregate or amalgamate the information in very large data sets into useful and manageable chunks).

Data Preparation (in Data Mining)

Data preparation and cleaning is an often neglected but extremely important step in the data mining process. The old saying "garbage in, garbage out" is particularly applicable to typical data mining projects, where large data sets collected via some automatic method (e.g., via the Web) serve as the input into the analyses. Often, the method by which the data were gathered was not tightly controlled, and so the data may contain out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), and the like. Analyzing data that have not been carefully screened for such problems can produce highly misleading results, in particular in Predictive Data Mining.
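
A minimal data-screening sketch using pandas and a small made-up table (both are assumptions for illustration), flagging out-of-range values and impossible combinations of the kind mentioned above before the data enter any model building.

```python
import pandas as pd

raw = pd.DataFrame({
    "income":   [52000, -100, 61000, 48000],
    "gender":   ["F", "M", "M", "M"],
    "pregnant": ["No", "No", "Yes", "No"],
})

# Out-of-range values: income cannot be negative.
bad_income = raw["income"] < 0

# Impossible data combinations: Gender = Male together with Pregnant = Yes.
bad_combo = (raw["gender"] == "M") & (raw["pregnant"] == "Yes")

print(raw[bad_income | bad_combo])      # records to review or correct
clean = raw[~(bad_income | bad_combo)]  # screened data for further analysis
print(clean)
```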

Data Reduction (for Data Mining)

The term Data Reduction in the context of data mining is usually applied to projects where the goal is to aggregate or amalgamate the information contained in large data sets into manageable (smaller) information nuggets. Data reduction methods can include simple tabulation, aggregation (computing descriptive statistics), or more sophisticated techniques such as clustering and principal components analysis.
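
A minimal sketch of one such technique, assuming scikit-learn and a synthetic data set: principal components analysis collapses many correlated variables into a few components that retain most of the information.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=500, n_features=50, n_informative=8,
                           random_state=0)

pca = PCA(n_components=8)
X_reduced = pca.fit_transform(X)

print("original variables:", X.shape[1])           # 50
print("components kept:   ", X_reduced.shape[1])   # 8
print("variance explained:", pca.explained_variance_ratio_.sum().round(2))
```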

Text Mining

While Data Mining is typically concerned with the detection of patterns in numeric data, very often important (even business-critical) information is stored in the form of text. Unlike numeric data, text is often amorphous and difficult to deal with. Text mining generally consists of the analysis of (multiple) text documents by extracting key phrases, concepts, etc., and the preparation of the text processed in that manner for further analyses with numeric data mining techniques (e.g., to determine co-occurrences of concepts, key phrases, names, addresses, product names, etc.).
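
A minimal sketch of turning free text into numeric data for further mining, assuming scikit-learn's CountVectorizer and a few made-up documents: each document becomes a row of term counts, and term co-occurrences can be read off the resulting matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "customer reported fraudulent card transaction",
    "customer praised the new card product",
    "fraudulent transaction flagged by the card issuer",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)  # documents x terms matrix

print(vectorizer.get_feature_names_out())
# Term co-occurrence matrix: how often two terms appear in the same document.
print((counts.T @ counts).toarray())
```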

Machine Learning

Machine learning, computational learning theory, and similar terms are often used in the context of Data Mining to denote the application of generic model-fitting or classification algorithms for Predictive Data Mining. Unlike traditional statistical data analysis, which is usually concerned with the estimation of population parameters by statistical inference, the emphasis in data mining (and machine learning) is usually on the accuracy of prediction (predicted classification), regardless of whether or not the models or techniques used to generate the predictions are interpretable or open to simple explanation. Good examples of this type of technique, often applied to predictive data mining, are neural networks and meta-learning techniques such as boosting. These methods usually involve the fitting of very complex generic models that are not related to any reasoning or theoretical understanding of underlying causal processes; instead, these techniques can be shown to generate accurate predictions or classifications in cross-validation samples.
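
A minimal sketch of this point, assuming scikit-learn and a synthetic data set: a generic, hard-to-interpret model is judged purely by its cross-validated predictive accuracy rather than by the interpretability of its parameters.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# A "black box" model: its parameters carry no simple substantive meaning.
black_box = MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=2000,
                          random_state=0)

# Accuracy in cross-validation samples is the criterion of interest.
print(cross_val_score(black_box, X, y, cv=5).mean().round(3))
```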

Data mining is often treated as the natural extension of the data warehousing concept.