Evaluating Classification Models

An overview of commonly used tools and metrics

T.J. Kyner
Geek Culture

--

Introduction

Classification models represent some of the most useful and practical algorithms in the machine learning world. From predicting whether it will rain to detecting fraudulent activity on credit cards, these models use the data available to them to classify their predicted outputs into two or more groups. A variety of strategies for tackling classification problems have arisen over the years, including logistic regression, decision trees, k-nearest neighbors, and many more.

Example of a complex decision tree. Image by author.

Given the different strategies available, a natural question arises: How can classification models be evaluated and compared to one another? While the actual process of building classification models will be saved for another time, this article will walk you through some of the most common evaluation tools and metrics available.

The Confusion Matrix

Before diving into the specific metrics commonly used to answer the above question, we must first cover the basics. We’ll only be looking at binary classification models in order to keep things simple, but know that the concepts that follow extend naturally to multi-class problems as well. A confusion matrix, despite its name, is fairly straightforward. It shows us how the predictions of a model stack up against the true values, also known as the ground-truth. Let’s see an example:

Example of a confusion matrix with 114 true negatives, 20 false positives, 23 false negatives, and 66 true positives. Image by author.

There are four key takeaways from this, one for each quadrant:

  1. True Negatives (TN) are the number of predictions where the predicted label was 0 and the ground-truth label was also 0. This can be found in the top left quadrant. Note: negative in this context does not necessarily mean a negative value but rather one part of a binary representing true/false, on/off, alive/dead, etc.
  2. True Positives (TP) are similar to true negatives but for the opposing label (in this case, 1). This can be found in the bottom right quadrant.
  3. False Negatives (FN) are scenarios in which the model predicted a negative value when the ground-truth was actually positive. This can be found in the bottom left quadrant.
  4. False Positives (FP) are scenarios in which the model predicted a positive value when the ground-truth was actually negative. This can be found in the top right quadrant.

These four pieces of information can be combined in various ways to describe the overall effectiveness of a model. Let’s now take a look at some of those specific metrics.
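
If you’re working in Python, scikit-learn can compute this matrix directly. Here’s a minimal sketch; the y_true and y_pred lists are hypothetical stand-ins for your ground-truth labels and your model’s predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth labels and model predictions
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# scikit-learn arranges the matrix with ground-truth as rows and
# predictions as columns: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```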

Metrics

Accuracy

The most intuitive of the metrics, accuracy measures the proportion of a model’s predictions that were correct, that is, aligned with the ground-truth.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

While it may seem that nothing more is needed beyond this metric, relying solely on accuracy to evaluate a classification model is a mistake. Consider the following commonly referenced scenario used to highlight this issue: 100 patients are being tested for a disease that occurs in only 1% of people. A model which predicts that no one has the disease at all would technically have a 99% accuracy rate yet be completely useless for actually finding patients that are infected!

Using the values found in the confusion matrix example, the accuracy would be approximately 80.72%.
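
To make the arithmetic concrete, here’s a quick sketch that recomputes that figure from the example confusion matrix (TN = 114, FP = 20, FN = 23, TP = 66) and then reproduces the disease scenario above.

```python
from sklearn.metrics import accuracy_score

# Values from the example confusion matrix
tn, fp, fn, tp = 114, 20, 23, 66
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"{accuracy:.4f}")  # 0.8072, i.e. roughly 80.72%

# The class-imbalance pitfall: predict "no disease" for all 100 patients
y_true = [1] + [0] * 99   # only 1 patient actually has the disease
y_pred = [0] * 100        # the model predicts negative for everyone
print(accuracy_score(y_true, y_pred))  # 0.99, despite finding no one
```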

Precision

The positive predictions of a model contain both true positives and false positives. Precision looks at all of the predicted positive values and determines what percentage of them were true positives.

Precision = TP / (TP + FP)

Using the values found in the confusion matrix example, the precision would be approximately 76.74%.
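
As a quick sanity check against the example matrix (TP = 66, FP = 20):

```python
tp, fp = 66, 20
precision = tp / (tp + fp)
print(f"{precision:.4f}")  # 0.7674, i.e. roughly 76.74%
```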

Recall

Recall measures the number of true positives predicted by the model as a percentage of the total number of ground-truth positives.

Recall = TP / (TP + FN)

Think about this metric in the context of the disease scenario mentioned in the Accuracy section. While the accuracy would be 99%, the recall would be 0% since there would be zero true positives and one false negative (the patient who actually had the disease but whom the model classified as not having it).

Using the values found in the confusion matrix example, the recall would be approximately 74.16%.
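
The same check for recall (TP = 66, FN = 23), along with the disease scenario from above:

```python
tp, fn = 66, 23
recall = tp / (tp + fn)
print(f"{recall:.4f}")  # 0.7416, i.e. roughly 74.16%

# Disease scenario: zero true positives and one false negative
print(0 / (0 + 1))  # recall is 0.0 despite the 99% accuracy
```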

F1 Score

Precision and recall tend to trade off against one another, and their relative importance depends entirely on the context of the data being worked with. In general, having high scores in both metrics is preferable. The F1 score is the harmonic mean of precision and recall and acts as a measure of the balance between the two.

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

As a result of using a harmonic mean instead of an arithmetic mean, the F1 score can only have a high value when both precision and recall have high values. A low value in either metric will significantly skew the F1 score lower in response.

Using the values found in the confusion matrix example, the F1 score would be approximately 75.43%.
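
Carrying the precision and recall values from the example matrix through the formula:

```python
precision = 66 / (66 + 20)  # ~0.7674
recall = 66 / (66 + 23)     # ~0.7416
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.4f}")  # 0.7543, i.e. roughly 75.43%
```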

ROC Curves

For those hoping for a more visual way to evaluate classification models, you’re in luck. The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate and provides a quickly interpretable visualization of model effectiveness.

Example of an ROC curve for a logistic regression model (AUC = 0.88). Image by author.

The diagonal dotted line represents a model that is randomly guessing. Useful models with some degree of predictive power have curves that bow towards the upper left. In this example, a logistic regression model’s performance is shown by the blue line. The ROC curves of multiple model iterations and types can be overlaid on the same graph for a quick comparison. Individual curves can also be evaluated and compared quantitatively by measuring the area under the curve (AUC). In the example above, this value is denoted by “AUC = 0.88”. A model that perfectly classifies all values would have an AUC of 1.00.
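
A plot like the one above can be generated with scikit-learn and matplotlib. The sketch below trains a logistic regression model on a synthetic dataset from make_classification as a stand-in for real data, so the exact AUC will differ from the 0.88 shown in the figure.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data as a stand-in for real data
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
y_scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, _ = roc_curve(y_test, y_scores)
auc = roc_auc_score(y_test, y_scores)

plt.plot(fpr, tpr, label=f"Logistic Regression (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random guessing")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```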

Conclusion

Understanding these tools and metrics and what differentiates them from one another is key to properly evaluating the performance of classification models. Different problems will require tuning models to optimize different metrics according to what makes the most sense for the domain. While additional evaluation methods exist, the topics covered in this article provide a solid foundation.

If you liked this article, make sure to let me know by leaving a 👏 or a comment with any feedback!

Github: https://github.com/tjkyner
Medium: https://tjkyner.medium.com/
LinkedIn: https://www.linkedin.com/in/tjkyner/
