Developing a Credit Scoring Model with Data Miner Recipes - Example

The purpose of this example is to explore the use of Statistica Data Miner Recipes for Credit Scoring applications. The example is based on the data file CreditScoring.sta, which contains observations on 18 variables for 1,000 past applicants for credit. Each applicant is rated as good credit (700 cases) or bad credit (300 cases). We want to develop a credit scoring model that can be used to determine if a new applicant is a good credit risk or a bad credit risk, based on the values of one or more of the predictor variables. An additional Train/Test indicator variable is also included in the data file for validation purposes.

Procedure

  1. Start Data Miner Recipes.
  2. On the Ribbon bar, select the Data Mining tab.
  3. In the Recipes group, click Data Miner Recipes to display the Data miner recipes dialog box.
  4. To create a new project, click the New button.

    The step-node panel is located in the upper-left area of the Steps tab.

    It contains four major nodes:
    • Data preparation
    • Data for analysis
    • Data redundancy
    • Target variable

Nodes (steps)

Each node (or step) can exist in one of at most three states (depending on whether its completion is optional). Each state is represented by a colored icon: red indicates a wait state, meaning the step cannot be started because it depends on a previous step that has not been completed; yellow indicates a ready state, meaning you can start the step because the previous steps have been completed; green indicates a completed step. To change a yellow (ready) state to green (completed), click the Next step button. The change is made only if the step completes successfully.

Preparing Data Step for Data Miner Recipe

Procedure

  1. On the Data preparation tab, click the Open/Connect data file button.
  2. In the Select Data Source dialog box, click the Files button, and then locate and open the CreditScoring.sta data file (located in the Examples/Datasets folder installed with Statistica; on most computers, C:\Program Files\Statistica\Statistica\Examples).
  3. Click the Select variables button. In the Select variables dialog box, select the Show appropriate variables only check box. Then, select:
    • Variable 1 (Credit Rating) as the Target, categorical variable
    • Variables 3, 6, and 14 as Input, continuous (continuous predictors)
    • Variables 2, 4-5, 7-13, and 15-18 as Input, categorical (categorical predictors)
    • Variable 19 (TrainTest) as the Testing sample (validation sample variable)
  4. Click the OK button in the variable selection dialog box.
  5. In the Data miner recipes dialog box, select the Advanced tab.
  6. Select the Use sample data check box. Select the Stratified random sampling option button as the sampling strategy to ensure that each class of the dependent variable Credit Rating is represented in approximately the same proportion in the training and validation samples.
  7. To display the Stratified sampling dialog box, click the More options button.
  8. Click the Strata variables button, select Credit Rating as the strata variable, and click OK. Then click OK in the Stratified sampling dialog box.
  9. To complete the Data preparation step, click the Next step button; in the step-node panel, the yellow icon next to Data preparation changes to green when the step has been successfully completed.
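The stratified sampling idea from step 6 can be sketched outside of Statistica. The sketch below uses pandas and scikit-learn as stand-ins (an assumption; Statistica's internal implementation is not shown here), with hypothetical column names: stratifying on the target keeps the Good/Bad proportions similar in the training and validation samples.

```python
# Minimal sketch of stratified random sampling, assuming scikit-learn.
# The DataFrame below is a toy stand-in for CreditScoring.sta.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "CreditRating": ["Good"] * 7 + ["Bad"] * 3,  # 70% / 30%, as in the example data
    "Amount": range(10),
})

# stratify=... preserves the Good/Bad proportions in both samples
train, test = train_test_split(
    df, test_size=0.3, stratify=df["CreditRating"], random_state=0
)
print(train["CreditRating"].value_counts().to_dict())
```

With a 70/30 class mix and a 30% holdout, both samples keep roughly the same Good-to-Bad ratio as the full file.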

Analyzing Data Step for Data Miner Recipe

After the Data preparation step is completed, the Data for analysis step is selected automatically.

Procedure

  1. On the Data for analysis tab, click the Select testing sample button.
  2. In the Testing Sample Specifications dialog box, select the Variable option button. Verify that the category (value) Train is selected in the Code for training sample field and Test is selected in the Code for testing sample field.
  3. Click the OK button.
    The models are fitted using the training sample and evaluated using the observations in the testing sample. Because these observations did not participate in the model-fitting computations, the goodness-of-fit statistics computed for the predictions of the different data mining models (algorithms) measure each model's predictive validity, and hence can be used to compare the models and to choose one or more over the others.
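The train/test principle described above can be illustrated in a few lines. This sketch uses scikit-learn and synthetic data (both assumptions, not Statistica output): the model is fitted on the training sample only, and its error rate is computed on held-out cases.

```python
# Sketch of the train/test principle: fit on the training sample,
# evaluate on observations that took no part in fitting.
# Data and model choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic good/bad target

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

model = LogisticRegression().fit(X_train, y_train)  # training sample only
test_error = 1 - model.score(X_test, y_test)        # held-out error rate
print(f"test error rate: {test_error:.2%}")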

Eliminating Data Redundancy Step for Data Miner Recipe

After the Data for analysis step is completed, the Data redundancy step is selected. The purpose of the Data redundancy step is to eliminate highly redundant predictors. For example, if the data set contained two measures for weight, one in kilograms and the other in pounds, those two measures would be redundant.

Procedure

  1. On the Data redundancy tab, select the Correlation coefficient option button.
  2. Specify the Criterion value as 0.8.
  3. To eliminate redundant predictors that are highly correlated (r ≥ 0.8), click the Next step button. Because there is no redundancy in the data set used in this example, a message is displayed stating that no redundant predictors were found.
  4. Click the OK button.
    The data cleaning and preprocessing for model building is now complete.

Target Variable Step for Data Miner Recipe

Next, we need to build predictive models for the target in this example. In the step-node panel, the Target variable node has a branching structure, with the parent node connecting to four child nodes: Important variables, Model building, Evaluation, and Deployment.

Selecting Important Variables for Target Variables Step

The Important variables node is selected automatically. In this step, the goal is to reduce the dimensionality of the prediction problem, to select a subset of inputs that is most likely related to the target variable (in this example, Credit rating) and, thus, is most likely to yield accurate and useful predictive models. This type of analytic strategy is also sometimes called feature selection.

Two strategies are available. If the Fast predictor screening option button is selected, the program screens through thousands of inputs and finds the ones that are strongly related to the dependent variable of interest. If the Advanced screening option button is selected, tree methods are used to detect important interactions among the predictors.
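The tree-based screening strategy can be sketched as follows, with scikit-learn standing in for Statistica's internal tree methods (an assumption). A tree ensemble can surface predictors whose effect comes through an interaction, which a one-variable-at-a-time screen would miss; the top k predictors by importance are then retained.

```python
# Sketch of tree-based feature screening: rank predictors by ensemble
# importance and keep the top k. Data and k are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # target driven by an interaction

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
k = 2
top_k = np.argsort(forest.feature_importances_)[::-1][:k]
print(sorted(top_k.tolist()))
```

The two interacting predictors dominate the importance ranking even though neither is related to the target on its own.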

Procedure

  1. Select the Advanced screening option button as the feature selection strategy.
  2. To display the Advanced screening dialog box, click the Advanced screening button. Enter 12 in the Number of predictors to extract field.
  3. Click the OK button in this dialog box, and then click the Next step button to complete this step.
  4. To review a summary of the analysis thus far, on the Steps tab, click the Report button, and from the drop-down list, select Summary report to display the Results workbook.

    These predictors are further examined using the various data mining and machine learning algorithms available in Data Miner Recipes (DMR).

Building Models for Target Variables Step

The Data miner recipes dialog box is minimized so that the Results workbook is visible. To display the dialog box again, click the Data miner recipes button located on the Analysis Bar at the bottom of the application window.

Next, the Model building node is selected. In this step, you can build a variety of models for the selected inputs.

On the Model building tab, the C&RT, Boosted tree, and Neural network check boxes are selected by default as the models (algorithms) that will automatically be tried against the data.

The computations for building predictive models are performed either locally (on your computer) or on the Statistica Enterprise Server. However, the latter option is available only if you have a valid Statistica Enterprise Server account and you are connected to the server installation at your site.

For this example, to perform the computations locally on your computer, click the Build model button. The computations take a few moments; when they finish, click the Next step button to complete this step.
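The Model building step amounts to fitting several candidate algorithms on the same training sample and scoring each on the same testing sample. The sketch below uses scikit-learn estimators as rough stand-ins for Statistica's C&RT, Boosted tree, and Neural network (an assumption), on synthetic data.

```python
# Sketch of trying several algorithms against the same train/test split.
# The estimators are scikit-learn analogues, not Statistica's implementations.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

candidates = {
    "C&RT": DecisionTreeClassifier(random_state=0),
    "Boosted tree": GradientBoostingClassifier(random_state=0),
    "Neural network": MLPClassifier(max_iter=1000, random_state=0),
}
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in candidates.items()
}
for name, acc in scores.items():
    print(f"{name}: {acc:.1%} test accuracy")
```

Because every candidate is judged on the same held-out cases, the accuracies are directly comparable, which is what the Evaluation step exploits.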

Evaluating and Selecting Models for Target Variables Step

Procedure

  1. Now, the Evaluation node is selected. To perform a competitive evaluation of the models and identify the one that performs best on the validation sample, on the Evaluation tab, click the Evaluate models button.

    Notice that the Neural network model has the lowest error rate, 35.75% (exact results may vary); in other words, 64.25% of the cases in the validation sample are correctly predicted by this model. Your results (the best model and the percentages) might vary because these data mining methods randomly split the data into subsets during training to produce reliable estimates of the error rates.

  2. On the Steps tab, click the Report button, and from the drop-down list, select Summary report to display the Results workbook.
    Review the Summary Frequency table (predictions) output for the best model.

    This spreadsheet shows the classification performance of the best model on the validation data set. The columns represent the class frequencies predicted by the Neural network model, and the rows represent the actual (observed) classes in the validation sample. In this matrix, you can see that the model predicted 145 of 197 bad credit risks correctly but misclassified the remaining 52. This information is usually much more informative than the overall misclassification rate, which simply tells us that the overall accuracy is 76.61%.

  3. Display the Data miner recipes dialog box again, and click the Next step button to complete this step.
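The arithmetic behind reading such a matrix can be made explicit. In the sketch below, the Bad row (145 correct, 52 misclassified) is taken from the text; the Good row is a hypothetical fill-in for illustration only, so the computed overall accuracy will not match the 76.61% quoted above.

```python
# Per-class and overall accuracy from a 2x2 confusion matrix.
# Bad row from the text; Good row is a hypothetical assumption.
import numpy as np

# rows = observed class, columns = predicted class, order: [Bad, Good]
confusion = np.array([
    [145, 52],   # observed Bad:  145 predicted Bad, 52 predicted Good
    [60, 240],   # observed Good: hypothetical counts
])

per_class_accuracy = confusion.diagonal() / confusion.sum(axis=1)
overall_accuracy = confusion.diagonal().sum() / confusion.sum()
print(per_class_accuracy.round(4), round(float(overall_accuracy), 4))
```

The per-class rates show why the matrix is more informative than the overall rate: a model can score well overall while missing a costly share of one class, here the bad credit risks.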

Deploying for Target Variables Step

The final Deployment step involves applying the best model to new data in order to predict the good or bad customers. In this case, we deploy the Neural network model, which gave the best predictive accuracy on the test sample compared to the other models. This step also provides the option of writing the scoring information (the classification probabilities computed by the best model and the predicted classifications) back to the original input data file or database, which is extremely useful when deploying models to score very large data sets or databases.
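The write-back idea can be sketched as follows, with scikit-learn and pandas standing in for Statistica (an assumption) and illustrative column names, not the CreditScoring.sta layout: the fitted model scores a new file, and the predicted class plus class probabilities are appended alongside the inputs.

```python
# Sketch of deployment: score new data with a fitted model and write the
# predicted class and probability back next to the inputs.
# Data, model, and column names are illustrative assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.DataFrame({"amount": [10, 20, 30, 40], "rating": [0, 0, 1, 1]})
model = LogisticRegression().fit(train[["amount"]], train["rating"])

new_data = pd.DataFrame({"amount": [15, 35]})
new_data["predicted_rating"] = model.predict(new_data[["amount"]])
new_data["prob_good"] = model.predict_proba(new_data[["amount"]])[:, 1]
print(new_data)
```

In practice the scored frame would then be saved back to the file or database, which is the batch-scoring pattern this step automates.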

Procedure

  1. On the Deployment tab, click the Data file for deployment button, and double-click the CreditScoring.sta data file (located in the Examples/Datasets folder installed with Statistica). For demonstration purposes, we are using the same data file for deployment of the best model.
  2. Click the Next step button to score this data file using the best model. The scored file with classifications and prediction probabilities (titled Summary of Deployment) is located in the Deployment folder of the project workbook.