As artificial intelligence, and specifically machine learning, conquers today’s business world, a new kind of human-machine interaction is entering our professional lives. We are increasingly guided by predictive models that prioritise our work and recommend ways to improve our business. Once widely adopted, this embedding of so-called augmented analytics can bring tremendous value, as it can steer us towards better decisions, especially when combined with human knowledge. But to achieve this adoption, one must build trust in the reliability of those predictive models, which starts with measuring model quality. In this blog, we will focus on understanding the main metrics that allow you, as a business analyst or Salesforce partner, to assess your Einstein Discovery model and avoid surprises when going into production or A/B testing.
The first thing to know is that the metrics differ depending on the type of output the model predicts. Discovery supports two types of predictions for different use cases: binary output prediction (classification) and continuous output prediction. Let’s first define these two prediction types and what makes a good model for each:
- Classification: A question that can be answered by “Yes” or “No”, e.g. “Is my opportunity going to close?”, is a classification or binary output prediction. Discovery answers it with a likelihood score for the answer being “Yes” or “No”. A good classification model can clearly tell “Yes” answers apart from “No” answers.
- Continuous output prediction: The goal of the model is to predict a continuous value (e.g. sales volumes, opportunity lifetime duration, discounts…). In this case the model is good if its predictions are as close as possible to the actual values.
Let’s start with the metrics used in binary output prediction (classification), then tackle the continuous output metrics later in the blog.
Binary Output Prediction (Classification)
A picture is worth a thousand words, so let me take a concrete example for classification problems. You just built a model to predict opportunity closure likelihood in your company. You are very proud because the accuracy of your model is at 99%! But is that enough to confidently bring the results to your business users? Let’s find out together.
Accuracy is a very widely used metric in classification problems. It is calculated using intermediate figures gathered in what is called the confusion matrix. The confusion matrix is the following:
It is built around four main counts that are used to derive other metrics. True Positives (TP) are the correct positive predictions: the model predicted a positive outcome and it turned out to be the case, i.e. opportunity wins that the model predicted as wins. False Positives (FP) are positive predictions made by mistake: the model predicted a positive outcome but the actual outcome was negative. The same logic applies to False Negatives (FN) and True Negatives (TN), but for the negative predictions made by the model, i.e. opportunity losses.
The accuracy is the fraction of correct predictions among all predictions. With this definition in mind, let’s go back to our use case and see what 99% accuracy means.
What if we have 100 opportunities of which only one will actually close, but the model predicts that none of them will close? Let’s fill in the confusion matrix for this case:
We have a single false negative: the only opportunity that is really going to close, which the model failed to detect. The other 99 opportunities are true negatives, so the accuracy is 99%. Is it a good model? I would say no! It missed the whole point and was not able to detect the small fraction of opportunities that will convert.
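To make the arithmetic concrete, here is a minimal Python sketch of this scenario. The helper names `accuracy` and `recall` are illustrative only and are not part of Discovery; recall (the share of actual wins the model catches) is shown next to accuracy to expose what accuracy hides.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of correct predictions among all predictions."""
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fn):
    """Fraction of actual positives that the model detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# 100 opportunities, one real win, and a model that predicts "no win" everywhere.
tp, fp, fn, tn = 0, 0, 1, 99

print(accuracy(tp, fp, fn, tn))  # 0.99
print(recall(tp, fn))            # 0.0
```

The accuracy looks excellent while the model detects none of the wins, which is exactly the trap described above.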
So what is a good metric for classification models? The confusion matrix depends on a chosen threshold score that separates positives from negatives. Here’s an example: with a threshold of 0.7, all the opportunities with a score below 0.7 are classified by the model as future losses (Negative) and the ones above 0.7 as future wins (Positive). But a proper assessment of model quality should not depend on one particular threshold.
A good model is able, regardless of the threshold, to increase the number of true positives without generating many false positives. If the model in the example above says that all the opportunities are future wins, this will of course maximise the number of true positives, but it will also raise many false alarms, as the model will classify all the actual lost opportunities as positives too.
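The effect of the threshold can be sketched in a few lines of Python. The scores and labels below are made up for illustration, `confusion_counts` is a hypothetical helper, and treating a score exactly equal to the threshold as positive is an assumption of this sketch rather than documented Discovery behaviour.

```python
def confusion_counts(scores, labels, threshold):
    """Return (TP, FP, FN, TN) when scores >= threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical model scores and actual outcomes (1 = won, 0 = lost).
scores = [0.95, 0.80, 0.72, 0.65, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

print(confusion_counts(scores, labels, 0.7))  # (2, 1, 1, 4)
print(confusion_counts(scores, labels, 0.0))  # (3, 5, 0, 0)
```

Lowering the threshold to 0 catches every win, but every lost opportunity becomes a false alarm.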
To capture this idea, we use a curve called ROC, which stands for Receiver Operating Characteristic. Every point on this curve shows the True Positive Rate (TPR = TP/(TP+FN)) versus the False Positive Rate (FPR = FP/(FP+TN)) for a particular threshold.
As mentioned above, the goal is to increase the true positive rate and decrease the false alarm rate at every point. By extension, the further each point lies above the identity line, the better. Thus the bigger the grey area is, the better the model. AUC (Area Under the Curve) is the metric that captures this notion: it is the blue area expressed as a fraction of the whole square bounded by the point (1,1). An AUC of 0.5 corresponds to the identity line (random guessing) and an AUC of 1 to a perfect model.
In conclusion, a good model is able to clearly tell positives apart from negatives. If we rank the points by the score the model gives them (like in the graph below), the zones where actual positives and negatives are mixed should be as small as possible.
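The ROC curve and its AUC can be reproduced by hand on a small example. This is a minimal sketch with made-up scores; `roc_points` and `auc` are hypothetical helpers using the trapezoid rule, not the code Discovery runs.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per distinct threshold, starting at (0, 0)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical scores and actual outcomes (1 = won, 0 = lost).
scores = [0.95, 0.80, 0.72, 0.65, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

print(round(auc(roc_points(scores, labels)), 3))  # 0.933
```

Only one lost opportunity (score 0.72) is ranked above a won one (score 0.65), so 14 of the 15 win/loss pairs are ordered correctly, which is where the AUC of 14/15 ≈ 0.933 comes from.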
Where to find those metrics in Discovery
When you have your story open, go to “Model” in the upper right corner and then choose “Model Evaluation” in the tabs area. You see here a light version of the confusion matrix and the ROC curve we already introduced above. You can also find the AUC in the Overview tab.
The confusion matrix shows the quality of the positive predictions (TP/(TP+FP)) and the error on the negative predictions (FN/(FN+TN)), both derived from the counts introduced above. The green cells are the TP in the top line and the TN in the second line. Similarly, the red cells represent the FN and FP. The matrix is updated to show the corresponding figures whenever you select a different threshold on the curve.
But there are a lot of other cool features on this page. For example, it offers several tools for choosing the right threshold depending on your business need. Suppose you have a very limited sales capacity for visiting customers who are likely to churn. In that case, you want to be very selective in choosing the ones to visit. Therefore, you need to reduce your error on positives as much as possible (if the model is maximising the churn probability). To do so, you would ask Discovery to optimise a specific metric called precision, also known as positive predictive value (on the left side of your screen), and it automatically gives you the corresponding threshold. That was just one example; many other metrics can be optimised, such as accuracy, true negative rate, etc.
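To illustrate what optimising the threshold for precision means, here is a hedged sketch that simply tries every score as a candidate threshold. The data is made up, `best_precision_threshold` is a hypothetical helper, and breaking ties in favour of the threshold that flags more customers is an assumption of this example; Discovery’s own optimisation may work differently.

```python
def best_precision_threshold(scores, labels):
    """Return (threshold, precision) with the highest precision.

    Ties are broken in favour of the threshold that flags more rows.
    """
    best = None  # (precision, flagged_count, threshold)
    for threshold in sorted(set(scores)):
        flagged = [y for s, y in zip(scores, labels) if s >= threshold]
        precision = sum(flagged) / len(flagged)
        candidate = (precision, len(flagged), threshold)
        if best is None or candidate[:2] > best[:2]:
            best = candidate
    return best[2], best[0]


# Hypothetical churn scores and actual outcomes (1 = churned, 0 = stayed).
scores = [0.95, 0.80, 0.72, 0.65, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

print(best_precision_threshold(scores, labels))  # (0.8, 1.0)
```

With a threshold of 0.8 only two customers are flagged and both actually churn, which suits a team that can afford very few visits.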
Let’s now go to another very interesting topic: the continuous output prediction.
Continuous output prediction
Before digging into the metrics, here is a quick refresher on how Einstein predicts continuous outcomes. It is done using linear regression by segment.
For every segment, the average is taken as the prediction. For example, suppose you are predicting the revenue of a company with the number of employees as input. If companies with between 40 and 45 employees have an average revenue of 10k in the training dataset, then 10k is the prediction for a company with 42 employees. Of course, this becomes more complex when using multiple input variables at the same time instead of just the number of employees (the segment values then also need to be weighted against each other), but the principle remains the same.
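The single-variable case above can be sketched as follows. This is only an illustration of the "average per segment" principle with made-up training rows and segment boundaries; the helper names are hypothetical and the sketch does not cover how Discovery weights several variables.

```python
from bisect import bisect_right
from statistics import mean


def fit_segment_means(rows, edges):
    """Average the outcome per segment; segment i starts at edges[i]."""
    buckets = {}
    for x, y in rows:
        buckets.setdefault(bisect_right(edges, x) - 1, []).append(y)
    return {segment: mean(values) for segment, values in buckets.items()}


def predict(segment_means, edges, x):
    """Return the training average of the segment containing x, or None if unseen."""
    return segment_means.get(bisect_right(edges, x) - 1)


# Hypothetical training data: (number of employees, revenue).
edges = [35, 40, 45, 50]
training = [(40, 9_000), (42, 10_000), (44, 11_000), (46, 14_000), (48, 16_000)]

segment_means = fit_segment_means(training, edges)
print(predict(segment_means, edges, 42))  # 10000
print(predict(segment_means, edges, 37))  # None (no training rows in that segment)
```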
For these linear regression models, an important metric to look at is the mean absolute error (MAE), the average absolute difference between the predicted and the actual values:
MAE looks at the error in absolute terms, which makes underestimation as important as overestimation and prevents the error terms from cancelling each other out. MAE is easy to understand from a business point of view. For example, if you are predicting the CSAT for service agents, an MAE of 0.5 means your predictions are on average 0.5 points away from the actual value. However, MAE has the disadvantage of not highlighting outliers.
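A short sketch shows why the absolute value matters. The CSAT figures are invented for illustration, and `mean_error` is included only to contrast with `mean_absolute_error`.

```python
def mean_error(actual, predicted):
    """Average signed error; positive and negative errors cancel out."""
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average absolute error; over- and underestimation count equally."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical CSAT scores for four service agents.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [4.5, 2.5, 4.5, 2.5]

print(mean_error(actual, predicted))           # 0.0
print(mean_absolute_error(actual, predicted))  # 0.5
```

The signed errors cancel to zero even though every prediction is off by half a point, while the MAE reports the 0.5 points directly.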
Another important metric is the R squared, which you can see below:
Please note that the above is an intuition, not the exact math. An R squared of 20% means that, using the model, you can predict 20% better than by taking the average as the prediction. In general, the higher the R squared, the better the model. However, the R squared is very sensitive to outliers.
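For readers who want something closer to the math, the usual textbook definition is R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their average. The sketch below implements that textbook formula on made-up numbers; it is an assumption that this matches what Discovery reports.

```python
def r_squared(actual, predicted):
    """Textbook R squared: 1 - (squared error of model / squared error of the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [1, 2, 3, 4, 5]
model_predictions = [1.5, 2, 3, 4, 4.5]
always_the_average = [3, 3, 3, 3, 3]

print(round(r_squared(actual, model_predictions), 2))   # 0.95
print(round(r_squared(actual, always_the_average), 2))  # 0.0
```

Predicting the average everywhere scores 0, which is the baseline the intuition above compares against. Because the errors are squared, a single large outlier can move the result a lot.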
So what should we look at on top of the MAE and R squared?
First, visualise the predictions against the real values. The closer the curve is to the identity line, the better the model. Second, investigate the residuals, which are the differences between the real and predicted values (the errors). The distribution of the residuals should be:
- Pretty symmetrically distributed on both sides of the 0-line
- Free of clear patterns (like lines)
In fact, if the errors show clear patterns, it means there are relations between the output and the input variables that the model did not include in its predictions.
So let’s look at an example. In the following, I am predicting daily sales quantities. My model had an R squared of 0.7, which is quite a good value, but until we investigate the residuals further we can’t be sure that the model performs well for all data segments.
First, I plotted the actual DailyQuantity on the “x” axis against the predicted daily quantity on the “y” axis (chart below). The curve shows that we are under-predicting for high values of daily quantity (more than 1.2K).
Second, I plotted the residuals (see chart below). Residuals are, as explained above, the errors on every prediction. We can clearly see that the bigger the predicted value, the larger the error, which is why the graph has a sort of funnel shape (funnel effect).
This phenomenon indicates the presence of heteroscedasticity, which means that the error variance differs from one output segment to another. Here it is small for small predicted values but gets very big for large ones.
This problem would be worse if we used the same linear function to predict the output for all data segments. In Discovery, however, the effect of heteroscedasticity is reduced because a different function is fitted per data segment. The problem can still persist if the data is sparse in some segments.
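A quick numeric check for the widening residual spread described above is to compare the spread of the residuals for low and high predictions. The sketch below uses made-up values and a hypothetical helper, `residual_spread_by_bucket`; it is a rough diagnostic under those assumptions, not a formal statistical test.

```python
from statistics import pstdev


def residual_spread_by_bucket(predicted, actual, n_buckets=2):
    """Sort rows by predicted value, split them into buckets,
    and return the standard deviation of the residuals in each bucket."""
    pairs = sorted(zip(predicted, actual))
    size = len(pairs) // n_buckets
    spreads = []
    for i in range(n_buckets):
        start = i * size
        end = (i + 1) * size if i < n_buckets - 1 else len(pairs)
        residuals = [a - p for p, a in pairs[start:end]]
        spreads.append(pstdev(residuals))
    return spreads


# Hypothetical daily quantities: small errors for small values, large for big ones.
predicted = [100, 200, 300, 400, 1000, 1200, 1400, 1600]
actual = [105, 195, 310, 390, 900, 1350, 1200, 1900]

low_spread, high_spread = residual_spread_by_bucket(predicted, actual)
print(low_spread < high_spread)  # True
```

A much larger spread in the high bucket than in the low one is the numeric counterpart of the shape seen in the residual chart.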
How to visualise errors and other metrics in Einstein Analytics
As Einstein Analytics unifies data visualisation with machine learning you can easily visualise multiple metrics and dig into them from different angles. Here is how I visualised the residuals from the above example.
Open the Data Manager and select or create a dataflow. We will need to create a new set of nodes as described below.
1. Use the Edgemart to load the dataset you want to use for predictions.
2. Use the Prediction node to generate predictions for the rows in the dataset: In the configuration of this node you will be asked to choose the model to predict with (prediction definition). The model will only appear as an option if it was deployed using the deployment wizard that you can find in the Story options. I personally just deploy to any object when I am in a testing phase.
3. Use a ComputeExpression Node to calculate the difference between the predicted value and the actual one for each row in your dataset.
4. Use a Register Node to store the dataset that now contains the predicted values and the corresponding prediction errors.
5. Remember to always add an ID to your rows to be able to visualise the error for each individual row.
The final Dataflow looks as follows:
Once you run the dataflow and open the resulting dataset, you can see the error field we just added in the dataflow. Using a scatter plot, we can draw the errors with the predictions on the X axis and the corresponding prediction errors on the Y axis, with one bubble per row key in the dataset:
By default Analytics only includes 2000 rows from your dataset in the chart, but you can change this limit in the UI or using SAQL to make sure that all the rows in your dataset are taken into account. Below you can see how it’s been done in SAQL.
Summing it up
We saw in this blog how to assess the quality of a model by leveraging the strengths of the Einstein Analytics platform. Combining the data visualisation tools of the platform with a theoretical understanding of the model metrics gives you a large toolset to explore your model’s performance from different angles. Remember that this analysis is key before deploying your model, but it also applies when iterating on its enhancements.
Next up we will look at some tips for model improvement (blog coming soon). Those tips will allow you to improve your model and hence increase the trust that is key for adoption.