Decision Tree Use Case

This use case demonstrates using a decision tree model to identify the client characteristics that lead to the purchase of a new product targeted in a direct marketing campaign.

Direct Marketing Campaign Success Classification

Datasets
In this case, the dependent variable is the y column, which records whether the client called during a direct marketing campaign subscribed to a term deposit (yes or no). The dataset comes from the UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/machine-learning-databases/00222/), and the exact version used for this case is bank_marketing_data.csv (3.7 MB).

The dataset contains 45,211 observations and 17 columns (16 attributes and 1 dependent variable). The attributes analyzed include the client's age, job type, marital status, education, account default status, average yearly balance, housing loan status, personal loan status, contact type, last contact day, last contact month, last contact duration (seconds), number of contacts during this campaign, number of days since the client was last contacted in a previous campaign, number of contacts before this campaign, and the outcome of the previous campaign.

The first few rows are shown below, with y as the dependent variable:


bank marketing data table
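As a sketch of how this structure might be worked with programmatically, the snippet below loads a small inline sample and splits off the dependent variable. The column names follow the UCI bank marketing dataset; the two sample rows are made up for illustration and are not taken from the real file.

```python
import csv
import io

# A tiny illustrative sample standing in for bank_marketing_data.csv.
# Column names follow the UCI bank marketing dataset; the rows are invented.
sample = io.StringIO(
    "age,job,marital,education,default,balance,housing,loan,contact,"
    "day,month,duration,campaign,pdays,previous,poutcome,y\n"
    "30,technician,married,secondary,no,1200,yes,no,cellular,"
    "5,may,180,2,-1,0,unknown,no\n"
    "45,management,single,tertiary,no,3500,no,no,cellular,"
    "12,jun,320,1,92,4,success,yes\n"
)

rows = list(csv.DictReader(sample))

# Separate the 16 attributes from the dependent variable y.
X = [{k: v for k, v in row.items() if k != "y"} for row in rows]
y = [row["y"] for row in rows]

print(len(X[0]), y)  # 16 attributes per row; y holds the class labels
```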

Workflow

The overall configuration for this Analytic Flow might be something like the following:


Analytic flow for decision tree

For purposes of this example, assume the modeler is trying to understand the question, "Which sub-group of clients should be targeted for the highest subscription rate of a term deposit product?"

If the modeler selects all of the variables as independent variables, the results are complicated and repetitive, and do not seem to reveal any immediate business insights.


Decision tree example

In this particular example, the month variable explodes the Decision Tree without adding much business decision value.

One approach to avoiding such over-fitting is to test the initial assumptions the modeler might have regarding the independent variables.

For example, perhaps clients are more likely to subscribe to a term deposit if they were well informed about the product during the campaign (a sufficiently long contact), or if they already subscribed in the past after a successful marketing campaign.

With this hypothesis in mind, the modeler might reduce the set of independent variables and select only contact, duration, campaign, pdays, previous, and poutcome for the analysis. The overall Decision Tree Operator configuration properties can be kept at their default values.
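To make the mechanics concrete, the sketch below shows the kind of Gini-impurity calculation a decision tree performs when choosing a split threshold on a numeric variable such as duration. The (duration, subscribed) pairs are made up for illustration, and this is a simplified single-split sketch, not the operator's actual implementation.

```python
def gini(labels):
    """Gini impurity of a list of yes/no class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_yes = labels.count("yes") / n
    return 1.0 - p_yes ** 2 - (1.0 - p_yes) ** 2

def best_duration_split(points):
    """Return the duration threshold minimizing weighted Gini impurity."""
    best = None
    for threshold, _ in points:
        left = [label for d, label in points if d <= threshold]
        right = [label for d, label in points if d > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(points)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Invented (duration_seconds, subscribed) pairs, not actual campaign data.
data = [(60, "no"), (90, "no"), (150, "yes"), (200, "yes"), (400, "yes"), (500, "no")]
threshold, score = best_duration_split(data)
print(threshold, score)  # splitting at 90 seconds best separates the classes
```

A real tree repeats this search over every candidate variable and threshold at each node, which is why a high-cardinality variable like month can balloon the tree.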

Results

The results in this case are more understandable, revealing the insight that clients for whom the last campaign was successful AND for whom the last call was between 132 and 410 seconds seem to be more likely to subscribe to the term deposit product.


Decision tree results

Including the Predictor Operator provides additional model results:


Use Case table

The predicted P(y) value is mapped to a class label (yes or no) using a threshold: yes is predicted when the model's confidence exceeds 50%.
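That cutoff can be expressed directly; the confidence values below are illustrative, not actual model output.

```python
# Map predicted P(y = yes) confidences to class labels at a 0.5 threshold.
def classify(p_yes, threshold=0.5):
    return "yes" if p_yes > threshold else "no"

confidences = [0.81, 0.34, 0.50, 0.67]  # illustrative values
predictions = [classify(p) for p in confidences]
print(predictions)  # a confidence of exactly 0.50 does not exceed the threshold
```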

Hooking in additional model diagnostic tools, such as the ROC graph, Confusion Matrix, and Goodness Of Fit Operators, provides additional feedback on the quality of the model.


model

The ROC (Receiver Operating Characteristic) curve plots the sensitivity, or true positive rate (the percentage of positive cases correctly classified), against the false positive rate. The larger the area under the curve (AUC), the more predictive the model. An AUC value above 0.80 is typically considered a "good" model, while a value of 0.5 means the model is no better than random guessing.


ROC curve
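The AUC has a useful direct interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up labels and scores.

```python
def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for label, s in zip(labels, scores) if label == "yes"]
    neg = [s for label, s in zip(labels, scores) if label == "no"]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented labels and model scores for illustration.
labels = ["yes", "yes", "yes", "no", "no", "no"]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
print(auc(labels, scores))  # 8 of 9 pairs ranked correctly
```

A model that assigns every example the same score gets credit for half of every pair, yielding the 0.5 "dumb model" baseline.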

The Goodness of Fit Operator provides model accuracy, error and other validation data.


Goodness of Fit
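Accuracy and error rate of the kind a goodness-of-fit check reports reduce to a simple comparison of actual and predicted labels; the labels below are made up for illustration.

```python
# Invented actual/predicted labels, standing in for model output.
actual    = ["no", "no", "yes", "no", "yes", "no", "no", "yes"]
predicted = ["no", "no", "yes", "yes", "no", "no", "no", "yes"]

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
error = 1.0 - accuracy
print(accuracy, error)  # 6 of 8 correct
```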

The following confusion matrix table summarizes model accuracy data, and the heat map visual summarizes the actual versus predicted counts of a classification model. These two summaries help assess the model's accuracy for each of the possible class values in a visually intuitive manner.


Confusion matrix summary table


Confusion matrix heat map
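The counts behind such a confusion matrix are just tallies of (actual, predicted) pairs. The sketch below builds them from made-up labels; the per-cell counts are what the heat map would color.

```python
from collections import Counter

# Invented actual/predicted labels for illustration.
actual    = ["no", "no", "yes", "no", "yes", "no", "no", "yes"]
predicted = ["no", "no", "yes", "yes", "no", "no", "no", "yes"]

# Tally actual-vs-predicted pairs into confusion matrix cells.
matrix = Counter(zip(actual, predicted))
for a in ("yes", "no"):
    for p in ("yes", "no"):
        print(f"actual={a} predicted={p}: {matrix[(a, p)]}")
```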