Confusion Matrix
Displays information about actual versus predicted counts of a classification model and helps assess the model's accuracy for each of the possible class values.
Information at a Glance
The Confusion Matrix operator is a classification model evaluation operator similar to the Goodness of Fit evaluator, but it is more graphical in nature.
Algorithm
The Confusion Matrix operator is used to evaluate the accuracy of the predicted classifications of any Team Studio classification modeling algorithm, including the results of the Logistic Regression, Alpine Forest, Naive Bayes, Decision Tree, or SVM operators.
The model performance is evaluated using the count of true positives, true negatives, false positives, and false negatives in a matrix. The following table shows the confusion matrix for a two-class classifier:
| Actual \ Predicted | Negative | Positive |
| --- | --- | --- |
| Negative | a | b |
| Positive | c | d |

Here, a is the number of correct predictions that an instance is negative, b is the number of incorrect predictions that an instance is positive, c is the number of incorrect predictions that an instance is negative, and d is the number of correct predictions that an instance is positive.
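As a plain-Python illustration of how these four cells are tallied from paired actual and predicted labels (the label values and variable names here are placeholders, not operator parameters):

```python
# Placeholder labels standing in for the two class values.
actual    = ["negative", "negative", "positive", "positive", "negative"]
predicted = ["negative", "positive", "positive", "negative", "negative"]

a = b = c = d = 0
for obs, pred in zip(actual, predicted):
    if obs == "negative":
        if pred == "negative":
            a += 1   # correct prediction of a negative instance
        else:
            b += 1   # incorrect prediction that the instance is positive
    else:
        if pred == "negative":
            c += 1   # incorrect prediction that the instance is negative
        else:
            d += 1   # correct prediction of a positive instance

print(a, b, c, d)    # 2 1 1 1
```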
For a two-class classification model such as the one above, the Confusion Matrix operator calculates several standard accuracy measures from the counts a, b, c, and d:

1. Accuracy: AC = (a + d) / (a + b + c + d)
2. Recall (true positive rate): TP = d / (c + d)
3. False positive rate: FP = b / (a + b)
4. True negative rate: TN = a / (a + b)
5. False negative rate: FN = c / (c + d)
6. Precision: P = d / (b + d)
7. g-mean1 = sqrt(TP × P)
8. g-mean2 = sqrt(TP × TN)
9. F-measure = ((β² + 1) × P × TP) / (β² × P + TP)
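Translated into code, these measures are simple ratios of the four counts. The following minimal Python sketch covers equations 1, 2, 6, 7, and 9 (the function and parameter names are illustrative); the zero-denominator guards handle degenerate classifiers:

```python
import math

def accuracy(a, b, c, d):                     # equation 1
    return (a + d) / (a + b + c + d)

def recall(c, d):                             # equation 2: true positive rate
    return d / (c + d) if c + d else 0.0

def precision(b, d):                          # equation 6
    return d / (b + d) if b + d else 0.0

def g_mean1(tp_rate, p):                      # equation 7
    return math.sqrt(tp_rate * p)

def f_measure(p, tp_rate, beta=1.0):          # equation 9
    denom = beta ** 2 * p + tp_rate
    return (beta ** 2 + 1) * p * tp_rate / denom if denom else 0.0
```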
The accuracy determined using equation 1 might not be an adequate performance measure when the number of negative cases is much greater than the number of positive cases (Kubat et al., 1998). Suppose there are 1000 cases, 995 of which are negative and 5 of which are positive. If the system classifies them all as negative, the accuracy would be 99.5%, even though the classifier missed all positive cases.
In equation 9, β has a value from 0 to infinity and is used to control the weight assigned to TP and P. Any classifier evaluated using equations 7, 8, or 9 has a measure value of 0 if all positive cases are classified incorrectly.
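The 995/5 example is easy to reproduce numerically. A minimal sketch, assuming a hypothetical degenerate classifier that labels every case negative:

```python
# Hypothetical degenerate classifier: all 1000 cases predicted negative.
a, b, c, d = 995, 0, 5, 0   # 995 true negatives, 5 false negatives

accuracy = (a + d) / (a + b + c + d)   # equation 1
recall   = d / (c + d)                 # equation 2 (true positive rate)

print(f"accuracy = {accuracy:.1%}")    # 99.5% -- looks excellent
print(f"recall   = {recall:.1%}")      # 0.0%  -- every positive case missed
```

Because the true positive rate is 0 here, the g-means and the F-measure all evaluate to 0 as well, which is exactly the behavior described above.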
Input
A classification model together with a data set to score, or (on a database) a single data set that already contains prediction columns.
Configuration
Confusion Matrix offers two possible configurations.
- Connect both a classification model operator and a data set. In this configuration, the model scores the samples in the data set, and the Confusion Matrix summarizes the results. On Hadoop, this is the only supported configuration, and the operator requires no further setup. On a database, set the Use Model parameter to true; with both a model and a data set connected, no other parameters are necessary.
- The second configuration, available only on a database, uses just an input table in which the prediction columns are already present. In this case, set Use Model to false and select the prediction columns to evaluate using the Prediction Columns parameter. With just a data set connected, the Use Model and Prediction Columns parameters described above apply, as the sketch following this list illustrates.
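In the second configuration, the operator effectively cross-tabulates two existing columns. As a plain-Python illustration (the column names observed and prediction are hypothetical, not the operator's actual parameters):

```python
from collections import Counter

# Hypothetical pre-scored rows, as the second configuration expects:
# the prediction column was produced earlier, so no model is attached.
rows = [
    {"observed": 1, "prediction": 1},
    {"observed": 1, "prediction": 2},
    {"observed": 2, "prediction": 2},
    {"observed": 2, "prediction": 1},
    {"observed": 2, "prediction": 2},
]

# Cross-tabulate observed vs. predicted values -- the same tallying
# the operator performs when Use Model is false.
matrix = Counter((r["observed"], r["prediction"]) for r in rows)
for (obs, pred), count in sorted(matrix.items()):
    print(f"Observed({obs}) / Predicted({pred}): {count}")
```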
Output
Visual Output
The Confusion Matrix produces both classification accuracy data and a graphical heat map.

Classification Accuracy Data Table
The data output provides the classification accuracy counts for every Observed/Predicted combination for each class.
In the following example, the intersection of the Observed(1) row and the Predicted(1) column indicates that 111,309 predictions of value 1 were correct, while the Observed(1)/Predicted(2) cell indicates that the model predicted 2 instead of 1 in 426 cases, for a class recall of 99.62% on value 1. However, the Observed(2)/Predicted(1) cell indicates 2,311 instances of the model incorrectly predicting 1 for actual values of 2, and the Observed(2)/Predicted(2) cell holds the count of correct predictions of value 2.
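The 99.62% class recall quoted above follows directly from the two cells in the Observed(1) row:

```python
correct_1 = 111_309   # Observed(1) / Predicted(1)
wrong_1   = 426       # Observed(1) / Predicted(2)

class_recall = correct_1 / (correct_1 + wrong_1)
print(f"{class_recall:.2%}")   # 99.62%
```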
Heat Map
A Confusion Matrix Heat Map displays information about actual versus predicted counts of a classification model.
The following example shows a Confusion Matrix Heat Map for a Logistic Regression model. In this case, it is evident that the model performs best when predicting the value 0, with 99% accuracy. However, accuracy drops when predicting the value 1, where the model is correct only 10% of the time.
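A comparable heat map can be drawn outside Team Studio with matplotlib. This sketch uses made-up counts chosen only to mirror the 99%/10% pattern described above, not the actual figures from the example:

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative counts only: rows are observed classes, columns are
# predicted classes.
cm = np.array([[9900, 100],
               [ 900, 100]])

# Row-normalize so each cell shows the share of its observed class.
cm_pct = cm / cm.sum(axis=1, keepdims=True)

fig, ax = plt.subplots()
im = ax.imshow(cm_pct, cmap="Blues", vmin=0.0, vmax=1.0)
ax.set_xticks([0, 1])
ax.set_xticklabels(["Predicted 0", "Predicted 1"])
ax.set_yticks([0, 1])
ax.set_yticklabels(["Observed 0", "Observed 1"])
for i in range(2):
    for j in range(2):
        ax.text(j, i, f"{cm_pct[i, j]:.0%}", ha="center", va="center")
fig.colorbar(im, ax=ax)
plt.show()
```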