The Delicacy of Accuracy: A Deep Dive on Classification Performance
The accuracy of Machine Learning models gives rise to one of the most confusing discussions in the world of Machine Learning. There are multiple reasons for that. First, many different performance metrics are used, which makes fair comparisons and transparent discussions hard. Secondly, expectations are often unrealistically high, caused by extremely overblown media coverage of AI “breakthroughs”, where exaggerated model performance aims to get the article to the front page. When digging into specific studies, mostly you’ll find that the reality is far from the startling achievements quoted in the headline, but these nuanced details and subtle interpretation of metrics get buried under the hype. Finally, the accuracy discussion is mostly held in isolation, detached from the model’s purpose, or the way its outcomes are actually used. In this article, which builds on the excellent overview given by Olfa Kharrat in her model performance blog, I will attempt to clarify the conversation, focusing on classification metrics specifically.
When predicting a binary outcome (e.g. an opportunity will close, a customer will churn, a consumer will engage, etc.), a machine learning model typically returns a probability between 0 and 1. This probability can be classified into a predicted positive or predicted negative using a classification threshold value or decision boundary (e.g. all customers with a churn likelihood above 55% are expected to churn).
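As a minimal sketch of this thresholding step, the conversion from a probability to a predicted class could look as follows. The function name `classify` and the example scores are hypothetical and only serve as illustration.

```python
def classify(probabilities, threshold=0.55):
    """Predict positive (True) when the probability is strictly above the threshold."""
    return [p > threshold for p in probabilities]

# Hypothetical churn likelihoods for five customers.
churn_scores = [0.10, 0.55, 0.56, 0.80, 0.30]
predicted_churn = classify(churn_scores)
print(predicted_churn)  # [False, False, True, True, False]
```

Whether a score exactly on the threshold counts as positive is a convention; this sketch treats it as negative.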
These positive and negative predictions can be either correct or incorrect, and therefore we can identify four classes of predictions: the True Positives, False Positives, True Negatives, and False Negatives. Given a validation set for which the real outcomes are known, the confusion matrix visualizes, in a 2×2 grid, how many predictions fall in each class (which quantifies how confused the model is during its classification task). The screenshot below shows the confusion matrix visualized in the Einstein Discovery model metrics page.
As an interpretation of model performance, this matrix has two shortcomings:
- the four numbers counted are with respect to a single classification threshold only, and
- humans don’t like expressing quality in four different values; we generally prefer the simplicity of a single score.
These limitations of the confusion matrix have resulted in a wealth of performance metrics for classification. The challenge with these metrics is that they all represent another aspect of model quality, simply because there is no single, optimal way to combine the four values of the confusion matrix into an overall score. The most straightforward metric, called Accuracy, simply counts the correct predictions with respect to the total number of predictions. This approach may work for some situations but typically falls short in cases of large class imbalances. Say that we’re predicting which ticket will win the lottery. We can achieve a stunning 99+% accuracy by predicting all tickets as losing. However, despite that seemingly high accuracy, we can agree that this model is pointless because it has failed its primary task of identifying the winning ticket – it didn’t even bother trying, because it simply predicted “losing” for every ticket.
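The lottery example can be reproduced in a few lines. This is a sketch with made-up numbers (1,000 tickets and a single winner); `confusion_counts` is a hypothetical helper, not part of any library.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, TN, FN) for two equally long lists of booleans."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    return tp, fp, tn, fn

# Hypothetical lottery: 1,000 tickets, exactly one winner.
actual = [True] + [False] * 999
predicted = [False] * 1000  # the "model" calls every ticket a loser

tp, fp, tn, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + fp + tn + fn)
winners_found = tp / (tp + fn)  # share of actual winners that were identified

print(accuracy)       # 0.999
print(winners_found)  # 0.0
```

The accuracy looks excellent, while the share of winners found shows that the model never does the one thing it was built for.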
To address different aspects of model quality, various other metrics have been developed. Einstein Discovery lists no fewer than eight of them under “Common Metrics” on the Model evaluation page.
Note: it is very difficult to provide target values for any of these metrics. The minimum or desired achievable model performance is heavily dependent on the complexity of the prediction task and the quality of the training data. Furthermore, the tolerance for prediction errors will depend on the use case. The consequences of predicting a Lead to be converting, while in reality it will not, are probably smaller than predicting that a customer won’t churn when in reality the customer will.
The first four metrics are the easiest to understand, because they all cover only a single quality aspect of the model, and they always have a score between 0 and 1. Let me briefly summarize them:
- Accuracy: number of correct predictions out of all predictions. A score of 1 means the prediction is always correct. (TP + TN) / (TP + TN + FP + FN)
- True positive rate (*): number of correct positive predictions out of all positives. A score of 1 means that every positive is identified as such. TP / (TP + FN)
- Precision (**): number of correct positive predictions out of all predicted positives. A score of 1 means that every predicted positive is indeed a positive. TP / (TP + FP)
- Negative predictive value: number of correct negative predictions out of all predicted negatives. A score of 1 means that every predicted negative is indeed a negative. TN / (TN + FN)
(*) sometimes called sensitivity, recall, or hit rate
(**) sometimes called positive predictive value
As you can see from their descriptions, the intuition behind these four measures is actually easier than their names and formulas may suggest.
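To make the four formulas concrete, here is a small sketch that computes them from the cells of a confusion matrix. The counts are invented for illustration, and the function assumes that none of the denominators is zero.

```python
def basic_metrics(tp, fp, tn, fn):
    """Compute the four single-aspect metrics from confusion matrix counts.

    Assumes every denominator is non-zero (at least one actual positive,
    one predicted positive, and one predicted negative).
    """
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "true_positive_rate": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
    }

# Hypothetical counts: 80 TP, 20 FP, 70 TN, 30 FN.
metrics = basic_metrics(tp=80, fp=20, tn=70, fn=30)
print(metrics["accuracy"])                   # 0.75
print(metrics["precision"])                  # 0.8
print(metrics["negative_predictive_value"])  # 0.7
```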
Reading the confusion matrix and those four metrics defined above will cover a large part of your everyday model quality assessment. Einstein Discovery surfaces more metrics for two additional reasons:
- additional metrics exist that provide more nuanced ways to summarize the four numbers from the confusion matrix, and
- these additional metrics are not based on a single classification threshold like the four metrics above, but instead, they quantify the model performance for a whole range of classification thresholds at once.
Single metrics that each cover multiple quality criteria
To continue, let’s first look at metrics that summarize the confusion matrix more effectively. These metrics are more complex because they cover multiple aspects of model quality in one score.
- F1-score combines the True Positive Rate (TPR) and the Precision in one metric. It quantifies both the model’s ability to identify a positive and the trustworthiness of all reported positives. Mathematically speaking, it is the harmonic mean of the TPR and the Precision, calculated as 2 × (Precision × TPR) / (Precision + TPR). It is, therefore, always between 0 and 1, where a score of 1 implies perfect TPR and perfect Precision. Intuitively, this makes a lot of sense. For example, when a very high classification threshold is used, we expect that the Precision (trustworthiness) of the predicted positives is high. However, only a few observations will be predicted positive, so many actual positives are missed and the TPR is low. The F1-score captures this trade-off because it is penalized by the low TPR. Conversely, it is penalized by the low Precision that results from an artificially high TPR.
- Informedness is another extension of the True Positive Rate (TPR), by also considering the model’s ability to find real negatives. To define the metric, we first need to identify the True Negative Rate (TNR), which is the counterpart of TPR. It is calculated as TN / (TN + FP) and examines how many of the real negatives are actually identified by the model as negatives. Informedness is then defined as TPR + TNR – 1. A perfect TPR and a perfect TNR, therefore, result in an Informedness of 1 + 1 – 1 = 1. However, when all real positives are predicted negatives, and vice versa, we have an Informedness of 0 + 0 – 1 = -1. The value of Informedness, therefore, ranges from -1 to 1. It is common, however, to look at the absolute value of Informedness, because a ‘perfectly wrong’ model, with an Informedness score of -1, can easily be turned into a ‘perfectly right’ model by classifying exactly the opposite as suggested by the model. Like TPR, Informedness reflects the completeness of the model in retrieving all relevant items.
- Markedness has many parallels with Informedness. Whereas Informedness extends the True Positive Rate with the TNR, we can say that Markedness complements Precision with the notion of the trustworthiness of the predicted negatives. That latter term is defined as the Negative Predictive Value (NPV), calculated as TN / (TN + FN). NPV examines how many of the predicted negatives are indeed negative. The Markedness is then defined as Precision + NPV – 1. Like Informedness, this score ranges from -1 to 1, where a score of 1 reflects a perfectly precise model (both for positives and negatives), and a score of -1 reflects a perfectly imprecise model. Once again, a perfectly imprecise model is easily turned into a precise model by simply flipping the classification outcome. Therefore, it is common to look at the absolute value of Markedness.
- Matthews Correlation Coefficient (MCC) is probably the most widely accepted and most complete summarization of the confusion matrix. There are multiple equivalent formulas to define MCC. I find that the most intuitive way to look at it is as a combination of Informedness and Markedness. To reiterate, Informedness defines how informed the model is in retrieving all positives and negatives, while Markedness defines how precise or trustworthy the returned predictions are. MCC can be defined as the geometric mean (***) of those two metrics, i.e. the square root of their product, carrying the sign that Informedness and Markedness always share. Like both of them, MCC therefore ranges from -1 to 1.
(***) the geometric mean is used instead of the ‘normal’ arithmetic average because Informedness and Markedness represent different magnitudes and are hence not of the same ‘unit of measure’, even if their absolute values both range between 0 and 1.
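The four composite metrics above can be sketched in code as follows. The counts are invented, the function names are hypothetical, and all denominators are assumed to be non-zero. The second function uses the commonly quoted direct MCC formula, included only to check that it agrees with the geometric-mean view.

```python
import math

def composite_metrics(tp, fp, tn, fn):
    """F1, Informedness, Markedness and MCC from confusion matrix counts."""
    tpr = tp / (tp + fn)        # True Positive Rate
    tnr = tn / (tn + fp)        # True Negative Rate
    precision = tp / (tp + fp)
    npv = tn / (tn + fn)        # Negative Predictive Value

    f1 = 2 * precision * tpr / (precision + tpr)
    informedness = tpr + tnr - 1
    markedness = precision + npv - 1
    # Informedness and Markedness have the same sign, so their product is
    # non-negative; max() only guards against tiny floating-point noise.
    magnitude = math.sqrt(max(0.0, informedness * markedness))
    mcc = math.copysign(magnitude, informedness)
    return {"f1": f1, "informedness": informedness,
            "markedness": markedness, "mcc": mcc}

def mcc_direct(tp, fp, tn, fn):
    """MCC computed straight from the four counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator

# Hypothetical counts: 80 TP, 20 FP, 70 TN, 30 FN.
scores = composite_metrics(80, 20, 70, 30)
print(round(scores["f1"], 3), round(scores["mcc"], 3))
print(math.isclose(scores["mcc"], mcc_direct(80, 20, 70, 30)))  # True
```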
Please note that some of the more advanced metrics, like Informedness, Markedness, and MCC, are specifically designed to cover scenarios with a large class imbalance. In CRM use cases for machine learning, it’s somewhat less likely to find extreme class imbalance than in, for example, science or medicine. Predicting a rare disease in a large population, for example, calls for these metrics. If only 1% of your opportunities close, you have a bigger problem to solve than finding the right validation metric, if you catch my drift. Still, it’s good to be aware of this, because a class imbalance of 70-30 or even 90-10 isn’t uncommon in CRM scenarios at all.
Now without a fixed threshold!
So far, we have looked at metrics that summarize the confusion matrix in all sorts of different ways. Some of these metrics even capture multiple quality aspects of the model, and some (like the Matthews Correlation Coefficient) are fairly complete in doing so. However, one remaining shortcoming is that they are based on a single instance of the confusion matrix, which means they are always subject to a specific classification threshold (e.g. as per the above example, where we defined all customers with a churn likelihood above 55% to be churning).
ROC Curve
Now, we will look at generalizations over multiple thresholds. Let’s start with an intuitive plot: the ROC plot in the middle of the screenshot above.
Notice the dots on this curve. These dots correspond to specific classification thresholds (0.001, 0.002, 0.003, …, 0.999). Yes, Einstein Discovery draws almost 1,000 dots here, and interpolates a line between them, although you actually see the line only at the extremes. For every classification threshold, two metrics are calculated (corresponding to the two axes of the plot): the True Positive Rate as TP / (TP + FN), and the False Positive Rate as FP / (FP + TN). Note that the False Positive Rate (on the x-axis) quantifies the false positives with respect to all real negatives (FP + TN), examining how many of the real negatives slip through as positive predictions. A point on the curve, therefore, expresses the trade-off between the TPR (how many positives are retrieved out of all positives) and the number of false alarms that are generated while doing so.
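The construction of these dots can be sketched as follows: for each threshold, count the true and false positives and convert them to the two rates. The data and the function name `roc_points` are hypothetical, and this is not a description of how Einstein Discovery implements it.

```python
def roc_points(actual, scores, thresholds):
    """Return one (False Positive Rate, True Positive Rate) pair per threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for a, s in zip(actual, scores) if a and s > t)
        fp = sum(1 for a, s in zip(actual, scores) if not a and s > t)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical validation set: real outcomes and predicted probabilities.
actual = [True, True, False, True, False, False]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

# Four thresholds, from high to low. A grid like the one described above
# would be [i / 1000 for i in range(1, 1000)].
points = roc_points(actual, scores, thresholds=[0.95, 0.75, 0.5, 0.1])
for fpr, tpr in points:
    print(round(fpr, 2), round(tpr, 2))
```

The first pair is (0, 0), where nothing is predicted positive, and the last is (1, 1), where everything is.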
This curve is very informative. It starts at the origin with a very high classification threshold, and moves up and bends to the right as the classification threshold is lowered. The goal is to climb as steeply as possible without moving to the right. As soon as the curve starts to bend to the right, ‘false alarms’ are raised. It is logical that some false alarms are raised, because as the threshold is lowered more observations are classified as positive, but we want to have as few false alarms as possible. So this curve tells you how good the model is at separating the classes over various thresholds. This was also very well explained by Olfa Kharrat in her ‘How good is my model, really?’ blog.
The curve is informative, but it is suitable only for visual inspection. To convert it into a single number, you can measure the area under this curve, therefore called AUC (area under the curve). As both axes run from 0 to 1, the maximum score is 1, when the curve only bends right at the very top of the y-axis. Therefore, the perfect model has an AUC of 1, and its value always ranges from 0 to 1.
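One simple way to approximate this area is the trapezoidal rule over the plotted points. The sketch below assumes the curve is given as a list of (False Positive Rate, True Positive Rate) pairs; the function name and the example points are hypothetical.

```python
def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points."""
    ordered = sorted(points)  # left to right along the x-axis
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# A perfect model climbs straight up the y-axis before moving right.
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(area_under_curve(perfect))  # 1.0

# A hypothetical, less-than-perfect curve made of four points.
example = [(0.0, 0.0), (0.0, 2 / 3), (1 / 3, 1.0), (1.0, 1.0)]
print(round(area_under_curve(example), 3))  # 0.944
```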
Now without any threshold at all!
It is even possible to inspect classification performance completely independently of any classification threshold. Remember that the prediction comes in the form of a predicted likelihood of being positive. To avoid having to draw a line in the sand, another possibility is to simply rank all predictions from most likely to least likely.
The question that begs an answer is then: do all the actual positive items end up on the higher side of this ranking? If so, the model is doing a fine job. The Cumulative Capture Chart (sometimes called Cumulative Gains Chart) visualizes this correctness of the ranking produced by the model. This chart is particularly useful in cases where you need to define a certain cut-off point in that ranking, e.g. deciding on the audience of a marketing campaign. If we can predict how likely a campaign member will respond to the campaign, and we rank all members accordingly, how far down the list do we go for the inclusion of the campaign?
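A single point on such a chart can be computed by ranking the predictions and counting how many of the actual positives land in the top fraction. The following is a sketch on invented data; `cumulative_capture` is a hypothetical helper name.

```python
def cumulative_capture(actual, scores, data_fraction):
    """Share of all actual positives found in the top `data_fraction` of the ranking."""
    # Rank the actual outcomes from most likely to least likely positive.
    ranked = [a for _, a in sorted(zip(scores, actual),
                                   key=lambda pair: pair[0], reverse=True)]
    top_n = round(data_fraction * len(ranked))
    return sum(ranked[:top_n]) / sum(ranked)

# Hypothetical campaign members: did they respond, and what was predicted?
actual = [True, True, False, True, False, False, True, False, False, False]
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]

print(cumulative_capture(actual, scores, 0.2))  # 0.5
print(cumulative_capture(actual, scores, 0.5))  # 0.75
print(cumulative_capture(actual, scores, 1.0))  # 1.0
```

Evaluating this function for every data fraction from 0 to 1 gives the full curve.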
Here is a screenshot of the Cumulative Capture Chart for the model shown previously on the Model Evaluation tab.
Two curves are drawn in the chart; one for the validation set (blue) and one for the training set (red). It is expected that the performance on the training set should be slightly better because the model has actually seen those data points during the training phase. The validation set contains some new data that the model may be less familiar with, so we expect performance to be a bit worse (and we do indeed see the curves diverge a bit).
Let’s see how we should read this chart.
The x-axis in this chart reflects the cumulative data fraction, i.e. what top-X % of the ranking do we take into consideration. The y-axis reflects the percentage of actual positives from the validation set that is captured in that data fraction. For example, let’s take the point (0.2, 0.53), where the blue and red curves still overlap. This means that in the top 20% of our ranking, we already have 53% of all the positives that are in the entire validation set! That already sounds good, but I bet you’re even more impressed after seeing the following calculation, using the numbers of positives and negatives from the confusion matrix that we saw earlier:
- There are 6,335 positives and 11,369 negatives, so in total 17,704 items to predict and rank.
- 20% of 17,704 is 3,541 items.
- According to the Cumulative Gains Chart, 53% of all positives are part of those 3,541 items.
- 53% of 6,335 positives is… 3,358 items!
- This means that, in the top 20% of the ranking, we find only 183 (3,541 – 3,358) actual negative items, which means that the top 20% of the ranking consists of 95% actual positives. The model couldn’t have done much better!
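For readers who want to check the arithmetic, the calculation above can be replayed in a few lines. The numbers come from the text; the variable names are only illustrative.

```python
positives = 6335
negatives = 11369
total = positives + negatives             # 17,704 items to rank

top_items = round(0.20 * total)           # size of the top 20%: 3,541
captured = round(0.53 * positives)        # positives in the top 20%: 3,358
negatives_in_top = top_items - captured   # 183
positive_share = captured / top_items     # share of actual positives in the top 20%

print(total, top_items, captured, negatives_in_top)  # 17704 3541 3358 183
print(round(positive_share, 2))                      # 0.95
```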
So, to get a real notion of how good the model is, we have to compare the Cumulative Capture Chart to the best possible result that can be achieved with the ratio in which positives and negatives are divided in the validation set. In this case, the validation set consists of almost 36% positives. So the perfect model puts all positives in the top 36% of the ranking, or the cumulative data fraction 0.36. The perfect model, therefore, corresponds to the straight line that we can draw from the origin to the point (0.36, 1), which then continues horizontally to the point (1, 1). As you can see, for a very long time, the blue and the red curves follow this optimal curve, and only from around the 0.3 data fraction do they start to bend down a bit until they hit the horizontal line again at (0.6, 1).
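The optimal curve described here is easy to write down: until all positives are exhausted, every item in the ranking is a positive. The sketch below uses the counts from the confusion matrix mentioned earlier; the function name `perfect_capture` is hypothetical.

```python
def perfect_capture(data_fraction, positive_rate):
    """Capture of an ideal ranking: a straight line up to (positive_rate, 1), then flat."""
    return min(1.0, data_fraction / positive_rate)

positive_rate = 6335 / 17704  # almost 36% of the validation set is positive

print(round(positive_rate, 3))                        # 0.358
print(round(perfect_capture(0.2, positive_rate), 3))  # 0.559
print(perfect_capture(0.36, positive_rate))           # 1.0
```

At a data fraction of 0.2 the ideal capture is about 0.56, so the 0.53 read from the chart is close to the best that any ranking could achieve.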
What to do with the model outcome in your business, and then what metric to look at?
I started by stating that the accuracy discussion is often very confusing or convoluted, because there are so many different metrics, and expectations are often unrealistic due to exaggerated or misinterpreted accuracy claims. After having learned about the different metrics, I trust that you have a far better understanding of how to use them for your models. But I also stated that, too often, the discussion is detached from the original purpose of the model, and the way in which the outcomes are used in practice. The most fundamental question to ask is “what purpose does the model have in your business?”.
Optimize your alert, find the right threshold
Does your model have an alerting function, e.g. raising a red flag when a customer is likely to churn? That’s a clear classification task, with probably a very low tolerance for False Negatives. You certainly don’t want to miss the alerts for that strategic, high-value account! However, this can come at the expense of getting too many alerts, even for customers that aren’t actually so much at risk. In this case, you need to select a classification threshold, and, in doing so, you should place more emphasis on True Positive Rate than on Precision. Einstein Discovery allows you to automatically find the optimal threshold for a given performance metric using the ‘Controls’ section in the model evaluation tab. Simply specify the metric for which you want to optimize (such as Informedness or F1 score), and Einstein Discovery will set the right threshold for you. You can also slide the threshold yourself across the range from 0 to 1 and watch the list of eight Common Metrics and the Confusion Matrix update automatically.
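Conceptually, optimizing a threshold for a metric amounts to trying many thresholds and keeping the one with the best score. The sketch below does this for Informedness on invented data. It is an illustration of the idea only, not a description of how Einstein Discovery searches for the threshold, and all names are hypothetical.

```python
def informedness_at(actual, scores, threshold):
    """Informedness (TPR + TNR - 1) when predicting positive above `threshold`."""
    tp = sum(1 for a, s in zip(actual, scores) if a and s > threshold)
    fn = sum(1 for a, s in zip(actual, scores) if a and s <= threshold)
    tn = sum(1 for a, s in zip(actual, scores) if not a and s <= threshold)
    fp = sum(1 for a, s in zip(actual, scores) if not a and s > threshold)
    return tp / (tp + fn) + tn / (tn + fp) - 1

def best_threshold(actual, scores, metric, thresholds):
    """Return the first threshold in the grid that maximizes the metric."""
    return max(thresholds, key=lambda t: metric(actual, scores, t))

# Hypothetical validation set with four churners and four non-churners.
actual = [True, True, False, True, True, False, False, False]
scores = [0.90, 0.80, 0.75, 0.70, 0.60, 0.40, 0.30, 0.10]

grid = [i / 100 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99
threshold = best_threshold(actual, scores, informedness_at, grid)
print(threshold, informedness_at(actual, scores, threshold))
```

Any other metric with the same signature, such as an F1 function, could be passed in place of `informedness_at`.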
There are other ways in which Einstein Discovery supports threshold optimization. Suppose you have a quantification of the cost of missing an alert (False Negative), as well as the cost of raising one alert too many (False Positive). Let’s say that every churned customer will cost you an average of $10,000, and you are offering retention offers worth $2,000 to customers at risk of churning. Put another way, a False Positive costs 0.2 times as much as a False Negative. In this case, you can enter that cost ratio of 0.2, and Einstein Discovery will automatically optimize the classification threshold for you, striking the right balance between the different classification errors!
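One plausible reading of cost-based optimization is to pick the threshold that minimizes the total cost of the errors. The sketch below applies the $2,000 and $10,000 figures from the example to invented data. How Einstein Discovery performs this optimization internally is not described here, so treat this purely as an illustration with hypothetical names.

```python
COST_FALSE_POSITIVE = 2_000   # an unnecessary retention offer
COST_FALSE_NEGATIVE = 10_000  # a churned customer we failed to flag

def error_cost(actual, scores, threshold):
    """Total cost of all misclassifications at the given threshold."""
    fp = sum(1 for a, s in zip(actual, scores) if not a and s > threshold)
    fn = sum(1 for a, s in zip(actual, scores) if a and s <= threshold)
    return fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE

# Hypothetical validation set: four churners among ten customers.
actual = [True, True, False, True, False, False, True, False, False, False]
scores = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]

grid = [i / 100 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99
cheapest = min(grid, key=lambda t: error_cost(actual, scores, t))
print(cheapest, error_cost(actual, scores, cheapest))
```

Because a missed churner is five times as expensive as a wasted offer, the cheapest threshold in this data ends up low enough to catch the churner scored at 0.30, at the price of three unnecessary offers.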
Optimize your ranking, find the right cut-off point
Does your model have a prioritization function, e.g. the ability to predict the propensity-to-buy for your premium product for each customer? Your sales team, which needs to go after these customers, has a limited capacity. You want them to focus on the low hanging fruit first, and get those deals in with the hot customers. You should not, in this case, care too much about the classification threshold (and all confusion matrix related metrics). Instead, you should be more concerned about what your Cumulative Capture Chart looks like. Is it able to put the high-propensity customers at the top? And is there a certain cut-off point below which your sales people don’t need to bother?
Take, for example, the Cumulative Capture Chart shown earlier. In the top 40%, we have covered more than 90% of the actual positives. This means you can safely tell your sales teams to simply disregard the bottom 60% of the propensity-to-buy ranking!
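Finding such a cut-off point can also be automated: walk down the ranking until the desired share of actual positives has been covered. The data and the helper name `cutoff_for_capture` below are hypothetical.

```python
def cutoff_for_capture(actual, scores, target_capture):
    """Smallest top fraction of the ranking that contains `target_capture` of all positives."""
    ranked = [a for _, a in sorted(zip(scores, actual),
                                   key=lambda pair: pair[0], reverse=True)]
    total_positives = sum(ranked)
    captured = 0
    for rank, is_positive in enumerate(ranked, start=1):
        captured += is_positive
        if captured / total_positives >= target_capture:
            return rank / len(ranked)
    return 1.0

# Hypothetical propensity-to-buy ranking for ten customers.
actual = [True, True, False, True, False, False, True, False, False, False]
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]

print(cutoff_for_capture(actual, scores, 0.75))  # 0.4
print(cutoff_for_capture(actual, scores, 1.0))   # 0.7
```

In this toy ranking, the sales team would need to work the top 40% of the list to reach three out of four buyers, and the top 70% to reach all of them.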
Just by looking at one chart, you can quickly find a very interpretable way in which to look at the model quality without needing to weigh and decipher which “accuracy” metric to use.
An organized discussion
As we make more and more use of machine learning applications in our business lives, the discussion around the quality of the predictions becomes increasingly important. For this discussion to be valuable, it needs to be well organized. Therefore, it’s important to hold it on the basis of the right metrics. The purpose of the model in the business process and the tolerance for prediction inaccuracy also need to be taken into account. Only then can we avoid drawing wrong or premature conclusions. After reading this article, how will you (re-)evaluate the accuracy of your prediction models?