Developing Credit Scoring Model for Data Miner Recipe - Example
The purpose of this example is to explore the use of Statistica Data Miner Recipes for Credit Scoring applications. The example is based on the data file CreditScoring.sta, which contains observations on 18 variables for 1,000 past applicants for credit. Each applicant is rated as good credit (700 cases) or bad credit (300 cases). We want to develop a credit scoring model that can be used to determine if a new applicant is a good credit risk or a bad credit risk, based on the values of one or more of the predictor variables. An additional Train/Test indicator variable is also included in the data file for validation purposes.
- Procedure
- Start Data Miner Recipes.
- On the ribbon bar, select the Data Mining tab.
- To display the Data miner recipes dialog box, click Data Miner Recipes in the Recipes group.
- To create a new project, click the New button.
The step-node panel is located in the upper-left area of the Steps tab.
It contains four major nodes:
- Data preparation
- Data for analysis
- Data redundancy
- Target variable
Preparing Data Step for Data Miner Recipe
- Procedure
- On the Data preparation tab, click the Open/Connect data file button.
- In the Select Data Source dialog box, click the Files button, then locate and open the CreditScoring.sta data file (located in the Examples/Datasets folder installed with Statistica - on most computers C/Program Files/Statistica/Statistica/Examples).
- Click the Select variables button. In the Select variables dialog box, select the Show appropriate variables only check box. Then, select:
  - Variable 1 (Credit Rating) as the Target, categorical variable
  - Variables 3, 6, and 14 as Input, continuous (continuous predictors)
  - Variables 2, 4-5, 7-13, and 15-18 as Input, categorical (categorical predictors)
  - Variable 19 (TrainTest) as the Testing sample (validation sample variable)
- Click the OK button in the variable selection dialog box.
- In the Data miner recipes dialog box, select the Advanced tab.
- Select the Use sample data check box. Select the Stratified random sampling option button as the sampling strategy to ensure that each class of the dependent variable Credit Rating is represented with approximately equal numbers of cases in train and validation sets.
- To display the Stratified sampling dialog box, click the More options button.
- Click the Strata variables button, select Credit Rating as the strata variable, and click OK. Then click OK in the Stratified sampling dialog box.
- Click the Next step button for the Data preparation step. To confirm that the step completed successfully, check the step-node panel: the yellow indicator next to Data preparation changes to green.
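The idea behind stratified random sampling can be sketched outside Statistica. The Python below is a minimal illustration, not the Data Miner Recipes implementation: the function name `stratified_sample`, the `fraction` and `seed` parameters, and the toy applicant rows are assumptions introduced for this example. It draws the same fraction of cases from each Credit Rating class, so the sample keeps the 700/300 class proportions of the full file.

```python
import random
from collections import defaultdict

def stratified_sample(rows, strata_key, fraction, seed=0):
    """Draw the same fraction of rows from every stratum (class)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[strata_key]].append(row)
    sample = []
    for label in sorted(groups):
        members = groups[label]
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))  # sample without replacement
    return sample

# Hypothetical stand-in for CreditScoring.sta: 700 good and 300 bad applicants.
applicants = [
    {"id": i, "credit_rating": "good" if i < 700 else "bad"}
    for i in range(1000)
]

sample = stratified_sample(applicants, "credit_rating", fraction=0.5)

counts = defaultdict(int)
for row in sample:
    counts[row["credit_rating"]] += 1
print(dict(counts))
```

With a plain random sample the class counts would fluctuate from draw to draw. Sampling within each stratum fixes them at the chosen fraction of each class.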
Analyzing Data Step for Data Miner Recipe
- Procedure
- On the Data for analysis tab, click the Select testing sample button.
- In the Testing Sample Specifications dialog box, select the Variable option button. Verify that the category (value) Train is selected in the Code for training sample field and Test is selected in the Code for testing sample field.
- Click the OK button.
The models are fitted using the training sample and evaluated using the observations in the testing sample. Because the testing observations did not participate in the model fitting computations, the goodness-of-fit statistics computed from their predicted values measure the predictive validity of each data mining model (algorithm). These statistics are therefore used to compare models and to choose one or more over others.
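As a rough illustration of the train/test logic described above, the following Python splits rows by an indicator column, fits a deliberately trivial model on the training rows, and measures its error on the held-out rows. The function names, the column names, and the five toy rows are assumptions made for this sketch. The majority-class "model" stands in for a real algorithm only to keep the example short.

```python
from collections import Counter

def split_by_indicator(rows, indicator="TrainTest",
                       train_code="Train", test_code="Test"):
    """Partition rows into training and testing samples by an indicator column."""
    train = [r for r in rows if r[indicator] == train_code]
    test = [r for r in rows if r[indicator] == test_code]
    return train, test

def fit_majority_class(train_rows, target):
    """Trivial baseline: always predict the most frequent training class."""
    return Counter(r[target] for r in train_rows).most_common(1)[0][0]

def error_rate(test_rows, target, predicted_label):
    """Share of held-out rows whose observed class differs from the prediction."""
    wrong = sum(1 for r in test_rows if r[target] != predicted_label)
    return wrong / len(test_rows)

# Hypothetical rows; the real file has 1,000 applicants.
rows = [
    {"credit_rating": "good", "TrainTest": "Train"},
    {"credit_rating": "good", "TrainTest": "Train"},
    {"credit_rating": "bad", "TrainTest": "Train"},
    {"credit_rating": "good", "TrainTest": "Test"},
    {"credit_rating": "bad", "TrainTest": "Test"},
]

train, test = split_by_indicator(rows)
model = fit_majority_class(train, "credit_rating")
holdout_error = error_rate(test, "credit_rating", model)
print(model, holdout_error)
```

The error is computed only on rows the model never saw during fitting, which is what makes it usable for comparing competing models.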
Eliminating Data Redundancy Step for Data Miner Recipe
After the Data for analysis step is completed, the Data redundancy step is selected. The purpose of the Data redundancy step is to eliminate highly redundant predictors. For example, if the data set contained two measures for weight, one in kilograms and the other in pounds, those two measures would be redundant.
- Procedure
- On the Data redundancy tab, select the Correlation coefficient option button.
- Specify the Criterion value as 0.8.
- To eliminate redundant predictors that are highly correlated (r≥0.8), click the Next step button. Because the data set used in this example contains no redundant predictors, a message dialog box stating this is displayed.
- Click the OK button.
The data cleaning and preprocessing for model building is now complete.
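The correlation criterion can be illustrated with a short Python sketch based on the kilograms/pounds example. This is an assumption-laden illustration rather than Statistica's algorithm: the helper names `pearson_r` and `drop_redundant`, the greedy keep-first rule, the use of the absolute value of r, and the toy columns are all introduced here.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def drop_redundant(columns, criterion=0.8):
    """Keep a predictor only if |r| with every already-kept predictor is below
    the criterion. Using |r| (treating strong negative correlation as
    redundant too) is an assumption of this sketch."""
    kept = []
    for name in columns:
        if all(abs(pearson_r(columns[name], columns[k])) < criterion for k in kept):
            kept.append(name)
    return kept

# Hypothetical data: the same weights in two units, plus an unrelated column.
weight_kg = [55.0, 62.5, 70.0, 81.0, 94.5]
columns = {
    "weight_kg": weight_kg,
    "weight_lb": [w * 2.20462 for w in weight_kg],
    "age": [41, 23, 35, 58, 29],
}

kept = drop_redundant(columns)
print(kept)
```

Here `weight_lb` is a rescaled copy of `weight_kg`, so their correlation is 1 and the second column is dropped, while `age` is only weakly correlated with weight and is kept.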
Target Variable Step for Data Miner Recipe
Next, we need to build predictive models for the target in this example. In the step-node panel, the Target variable node has a branching structure: the parent node connects to four child nodes, Important variables, Model building, Evaluation, and Deployment.
Selecting Important Variables for Target Variables Step
The Important variables node is selected automatically. In this step, the goal is to reduce the dimensionality of the prediction problem, to select a subset of inputs that is most likely related to the target variable (in this example, Credit rating) and, thus, is most likely to yield accurate and useful predictive models. This type of analytic strategy is also sometimes called feature selection.
Two strategies are available. If the Fast predictor screening option button is selected, the program screens through thousands of inputs and finds the ones that are strongly related to the dependent variable of interest. If the Advanced screening option button is selected, tree methods are used to detect important interactions among the predictors.
- Procedure
- Select the Advanced screening option button as the feature selection strategy.
- To display the Advanced screening dialog box, click the Advanced screening button. Enter 12 in the Number of predictors to extract field.
- Click the OK button in this dialog box, and then click the Next step button to complete this step.
- To review a summary of the analysis thus far, on the Steps tab, click the Report button, and from the drop-down list, select Summary report to display the Results workbook.
The selected predictors are examined further using the data mining and machine learning algorithms available in Data Miner Recipes (DMR).
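The general screening idea, ranking predictors by how strongly they are associated with the target and keeping the top few, can be sketched in Python. This is not the tree-based Advanced screening used in the procedure, nor necessarily the statistic Statistica computes for Fast predictor screening: the Pearson chi-square ranking, the function names, and the predictor names `has_savings` and `phone` are assumptions chosen for illustration.

```python
from collections import Counter

def chi_square(predictor, target):
    """Pearson chi-square statistic for the two-way table predictor x target."""
    n = len(predictor)
    row_totals = Counter(predictor)
    col_totals = Counter(target)
    observed = Counter(zip(predictor, target))
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat

def screen_predictors(predictors, target, top_k):
    """Return the names of the top_k predictors by chi-square statistic."""
    ranked = sorted(
        predictors,
        key=lambda name: chi_square(predictors[name], target),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical categorical data for eight applicants.
target = ["good", "good", "good", "good", "bad", "bad", "bad", "bad"]
predictors = {
    "phone": ["yes", "no", "yes", "no", "yes", "no", "yes", "no"],
    "has_savings": ["yes", "yes", "yes", "no", "no", "no", "no", "yes"],
}

selected = screen_predictors(predictors, target, top_k=1)
print(selected)
```

In this toy data `phone` is distributed identically across both classes (statistic 0), so `has_savings` is selected. Comparing raw chi-square values is only fair when predictors have the same number of categories; a production screen would compare p-values instead.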
Building Models for Target Variables Step
The Data miner recipes dialog box is minimized so that the Results workbook is visible. To display the dialog box again, click the Data miner recipes button located on the Analysis Bar at the bottom of the application.
Next, the Model building node is selected. In this step, you can build a variety of models for the selected inputs.
On the Model building tab, the C&RT, Boosted tree, and Neural network check boxes are selected by default as the models or algorithms that are automatically tried against the data.
The computations for building predictive models are performed either locally (on your computer) or on the Statistica Enterprise Server. However, the latter option is available only if you have a valid Statistica Enterprise Server account and you are connected to the server installation at your site.
For this example, to perform the computations locally on your computer, click the Build model button. This takes a few moments; when finished, click the Next step button to complete this step.
Evaluating and Selecting Models for Target Variables Step
- Procedure
- Now, the Evaluation node is selected. To run the competitive evaluation that identifies the best-performing model on the validation sample, on the Evaluation tab, click the Evaluate models button.
Notice that the Neural network model has the minimum error rate of 35.75%. In other words, 64.25% of the cases in the validation sample are correctly predicted by this model. Your results (the best model and the percentages) might vary because these advanced data mining methods randomly split the data into subsets during training to produce reliable estimates of the error rates.
- On the Steps tab, click the Report button, and from the drop-down list, select Summary report to display the Results workbook.
Review the Summary Frequency table (predictions) output for the best model.
This spreadsheet shows the classification performance of the best model on the validation data set. The columns represent the predicted class frequencies, as predicted by the Neural network model, and the rows represent the actual or observed classes in the validation sample. In the matrix for this example, the model classified 145 of the 197 bad credit risks correctly and misclassified the other 52 (exact results may vary). This breakdown is usually much more informative than the overall misclassification rate, which only summarizes the overall accuracy (76.61% in this matrix).
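The per-class reading of such a table can be reproduced with a few lines of Python. The sketch below is illustrative only: the function names and the five toy labels are assumptions, and the only figures taken from the text are the 145 correctly classified and 52 misclassified bad credit risks (assumed here to have been predicted as good, since there are two classes).

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs: rows are actual classes, columns predicted."""
    return Counter(zip(actual, predicted))

def per_class_accuracy(matrix, label):
    """Fraction of cases with this actual class that were predicted correctly."""
    row_total = sum(n for (a, _), n in matrix.items() if a == label)
    return matrix[(label, label)] / row_total

# Toy labels (hypothetical) to show how the matrix is built from predictions.
actual = ["bad", "bad", "good", "good", "good"]
predicted = ["bad", "good", "good", "good", "bad"]
toy = confusion_matrix(actual, predicted)
toy_bad = per_class_accuracy(toy, "bad")

# Bad-credit row using the counts quoted in the text.
quoted = Counter({("bad", "bad"): 145, ("bad", "good"): 52})
bad_accuracy = per_class_accuracy(quoted, "bad")
print(f"{bad_accuracy:.2%}")
```

The bad-credit row works out to 145 / 197, about 73.6% correct. For a lender this row often matters more than overall accuracy, because approving a bad risk and rejecting a good one typically carry different costs.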
- Display the Data miner recipes dialog box again, and click the Next step button to complete this step.
Deploying for Target Variables Step
- Procedure
- On the Deployment tab, click the Data file for deployment button and double-click the CreditScoring.sta data file (located in the Examples/Datasets folder installed with Statistica). For demonstration purposes, we are using the same data file for deployment of the best model.
- Click the Next step button to score this data file using the best model. The scored file with classifications and prediction probabilities (titled Summary of Deployment) is located in the Deployment folder in the project workbook.