You created a story in Einstein Discovery. You measured its model’s accuracy. Then you made some necessary improvements. The model you deployed is now bringing predictions and recommendations right to your users’ fingertips. You’re pretty satisfied with what you achieved.
Cool, you’re done. Your model is out in the wild now. Onto the next adventure, right?
Nope! This is only the beginning of your model’s lifecycle. Occasionally, I still hear the statement that “predictive models will get smarter over time”. Well, not by themselves, unfortunately. In fact, quite the opposite is true. The quality of your model will likely get worse over time – unless you monitor its performance and intervene when needed. We know that change is the only constant in life. Well, change causes your model to lose its ability to capture reality. Why? Because real-world changes actually violate your model’s underlying assumptions. And when your model decays, its decreasing accuracy damages user adoption. Your model needs love, too.
Don’t let your model drift!
What exactly causes a model to decay? One important phenomenon is called drift.
Drift means that something moves (slowly) as a result of outside forces, with no control over direction. In car racing, the term has been adopted to describe when a driver oversteers, with loss of traction, while maintaining control over the vehicle. For a driver, drifting is therefore a very deliberate action to let the car almost slip sideways through the entirety of the corner. In machine learning, however, drift is not deliberate at all; outside forces cause the model to slip sideways involuntarily – deteriorating the prediction accuracy.
In machine learning, the meaning is closer to the conventional definition. Data scientists distinguish two types of drift: data drift and concept drift.
- Data drift refers to changes in the input data. Suppose your model predicts the likelihood of winning an opportunity. The distribution of the underlying data may change. For example, your business expands but not equally among your business segments – some grow faster than others. Also, the way features are correlated with the outcome can shift. For example, one of your existing competitors launches a new product, and as a result, the win rates for your previously unchallenged offering start to go down. In both cases, the training data on which the model is based no longer represents the current reality. Prediction accuracy decreases because the model, built on historical data, cannot reliably find the road forward.
- Concept drift refers to changes to the target variable for essentially the same input data. Suppose your model predicts customer lifetime value. As a result of macroeconomic factors (an economic downturn resulting in less spending in your category), the overall expected lifetime value is shifting. What you know about your customer hasn’t changed, and your interpretation of that knowledge also hasn’t changed – a positive factor remains a positive factor. However, the whole expected outcome for this input data has shifted, independent of the input data itself. When that occurs gradually, we talk about gradual concept drift, but it can also happen more abruptly, e.g. due to an immediate crisis, in which case we talk about sudden concept drift.
Use the Accuracy Analytics App to detect drift
Probably the most informative method to detect drift is accuracy analytics. Accuracy analytics determines how well your model performed in the real world. For each observation, the predicted outcome (that the model produced) is compared with the actual, real-world outcome, then accuracy metrics are calculated based on this comparison. Remember those model metrics that you looked at before deploying your story? Those were calculated during the training phase, based on a validation set (during training, the data is split randomly into a training set and a validation set to quantify model performance). Ideally, we want to repeat that validation process periodically, each time with the most recent and complete validation set we have available. Einstein Discovery facilitates validation through the Accuracy Analytics App.
Because the model has been in use for a while, you have accumulated a lot more data than what you used for model validation during the training phase. Since then, a lot of observations have changed – and new ones have probably been created – and all predictions have been generated and updated. Many (but still not all) of those observations can now be used to validate the model quality. For example, when the model predicts opportunity win likelihood, we can’t use the Open opportunities for validation purposes. After all, we don’t know yet whether these are going to be won or lost. Therefore, only when the record has reached a terminal state can we use it to validate the prediction. The terminal state indicates that we expect that the data in a record will no longer change. For opportunity win likelihood, the terminal state is reached when Closed = true. You define this terminal state condition at deployment time in the Deployment wizard, or after the deployment in the Model Manager.
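To make the terminal-state idea concrete, here is a minimal Python sketch of validating predictions only on closed opportunities. It is an illustration, not how the Accuracy Analytics App is implemented: the `Opportunity` class, its field names, and the 0.5 decision threshold are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Opportunity:
    # Hypothetical record layout, invented for this sketch.
    closed: bool                # terminal-state flag (Closed = true)
    won: Optional[bool]         # actual outcome; unknown while still open
    predicted_win_prob: float   # the model's predicted likelihood of winning


def validation_accuracy(records, threshold=0.5):
    """Share of terminal-state records whose thresholded prediction matched the outcome."""
    # Open records are skipped: their outcome is not known yet.
    terminal = [r for r in records if r.closed]
    if not terminal:
        return None  # nothing can be validated yet
    hits = sum((r.predicted_win_prob >= threshold) == r.won for r in terminal)
    return hits / len(terminal)


records = [
    Opportunity(closed=True, won=True, predicted_win_prob=0.8),    # correct
    Opportunity(closed=True, won=False, predicted_win_prob=0.7),   # wrong
    Opportunity(closed=True, won=False, predicted_win_prob=0.2),   # correct
    Opportunity(closed=False, won=None, predicted_win_prob=0.9),   # still open, ignored
]
accuracy = validation_accuracy(records)
print(f"Accuracy on closed opportunities: {accuracy:.2f}")
```

The open opportunity does not count toward the result, no matter how confident its prediction is.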
The Accuracy Analytics App is a template that you can instantiate in your environment. For setup instructions, refer to the “Analyze Prediction Accuracy with the Einstein Accuracy Analytics App” article in help.
The Accuracy Analytics App comes with a dataset, dataflow, and dashboard. The dataflow pulls records from Salesforce that have reached the terminal state. The app then runs these records against the model to obtain the prediction. Because these records have reached terminal state, you know the real outcome, which you can compare with the generated prediction to determine how closely the model estimated the actual value. The App does this automatically for all records in terminal state, and calculates the corresponding model metrics. Those metrics then get stored in the App with a timestamp, reflecting the model accuracy at that specific moment based on all terminal state records. This accuracy over time is then used to populate dashboards like the one below.
So when you do nothing, chances are the quality of your model decreases over time. Some variation in model quality is expected. How much decrease should you see before getting worried? That’s hard to say in general and depends a lot on the context of your deployment, and what the predictions are used for.
In any case, you should definitely observe the trend as much as the actual values. When model quality sometimes improves and sometimes degrades, you may just be observing random noise. However, if there is a clear downward trend, approaching a monotonic decrease, you should think about taking action. More about that in the next section, but first some further clues that model decay is happening.
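One way to tell noise from a real downward trend is to look at both the slope of the metric over time and how often it drops from one snapshot to the next. The following Python sketch does that on made-up accuracy snapshots. The function names and sample numbers are invented for illustration, and what counts as a worrying slope remains a judgment call for your own deployment.

```python
def trend_slope(values):
    """Ordinary least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two snapshots")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def share_of_declines(values):
    """Fraction of consecutive snapshot pairs where the metric went down."""
    steps = list(zip(values, values[1:]))
    return sum(later < earlier for earlier, later in steps) / len(steps)


# Hypothetical accuracy snapshots, oldest first.
noisy = [0.80, 0.82, 0.79, 0.81, 0.80]      # up and down: probably just noise
decaying = [0.82, 0.80, 0.79, 0.76, 0.74]   # drops at every step: monotonic decrease

for name, series in (("noisy", noisy), ("decaying", decaying)):
    print(name, round(trend_slope(series), 4), share_of_declines(series))
```

A series that declines in every step and has a clearly negative slope is the pattern that should prompt action; a series that declines only about half the time with a slope near zero is more likely noise.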
Advanced: More tricks to detect drift
Go Out-of-Bounds. When you generate a prediction for a record, some values of that record may be out-of-bounds. This can be a categorical label that the model hasn’t seen during training (e.g. a new product type that wasn’t in the training data). For numeric fields, it can be a value that exceeds the largest value that was seen for that field during training (e.g. opportunities take longer to close, and now we have one with no fewer than 35 activities, whereas the previous maximum was 27 activities on an opportunity). When such an out-of-bounds value occurs, the field is simply not used in the prediction, so no harm is done. These situations are, however, good clues that your data is changing: the more frequently they occur, the more likely it is that you are experiencing data drift.
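If you want to track how often this happens, a rough Python sketch like the one below can count out-of-bounds records yourself. It is not an Einstein Discovery API: the field names and helper functions are hypothetical, and also flagging numeric values below the training minimum is an assumption that goes beyond the maximum-only case described above.

```python
def training_bounds(training_rows, categorical_fields, numeric_fields):
    """Record the labels (categorical) and the min/max range (numeric) seen in training."""
    bounds = {}
    for field in categorical_fields:
        bounds[field] = {row[field] for row in training_rows}
    for field in numeric_fields:
        values = [row[field] for row in training_rows]
        bounds[field] = (min(values), max(values))
    return bounds


def out_of_bounds_fields(row, bounds):
    """Names of the fields in this row whose value was never covered by the training data."""
    flagged = []
    for field, seen in bounds.items():
        value = row[field]
        if isinstance(seen, set):
            if value not in seen:          # unseen categorical label
                flagged.append(field)
        else:
            low, high = seen
            # Assumption: values below the training minimum are flagged too.
            if value < low or value > high:
                flagged.append(field)
    return flagged


def out_of_bounds_rate(rows, bounds):
    """Fraction of rows with at least one out-of-bounds field."""
    return sum(bool(out_of_bounds_fields(r, bounds)) for r in rows) / len(rows)


# Hypothetical training data and newly scored records.
training = [
    {"product_type": "Hardware", "activities": 3},
    {"product_type": "Software", "activities": 27},
]
scored = [
    {"product_type": "Software", "activities": 35},   # more activities than ever seen
    {"product_type": "Services", "activities": 10},   # new product type
    {"product_type": "Hardware", "activities": 12},   # within bounds
]
bounds = training_bounds(training, ["product_type"], ["activities"])
rate = out_of_bounds_rate(scored, bounds)
print(f"Out-of-bounds rate: {rate:.2f}")
```

Tracking this rate per week or month gives you an early signal: a rising rate suggests the incoming data is moving away from what the model was trained on.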
Screen the Story (and the Settings Screen). There are some very useful clues that you can get from the Story and the Story settings screen. See the below screenshot for an example. Firstly, this screen gives you the correlations all features have to the outcome variable. If you periodically ingest an updated set of records including the latest additions, you can compare these correlations. Have they changed? Have those data alerts changed? If you observe a lot of variance between different versions, you can be pretty sure that either the data or the concept is drifting. You don’t even need to actually create a Story for that; just inspecting the Story settings screen gives you this information upfront. From Summer ’20 onwards, you can even do a side-by-side comparison of specific data segments by (de)selecting field values in the story setup and recalculating the correlations. If you play around a bit on this screen, it can even give you an indication of which parts of the data are possibly drifting.
The Story itself can provide clues to drift too. If you create a Story using the updated dataset, from Summer ’20 onwards, you can compare Stories side by side in detail to see where drift may have occurred. Which large coefficients are no longer large coefficients? Which key insights have changed in the new version? Remember that the Story is a unique insight into the workings of the model; those model explanations can give a detailed cue to data drift as well.
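Outside the Story settings screen, the same "have the correlations changed?" question can be approximated on exported data. The Python sketch below compares feature-to-outcome correlations between an older and a newer dataset version. It uses the plain Pearson correlation as a stand-in, which is an assumption: it is not necessarily the measure Einstein Discovery displays, and the field names and data are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def correlation_shifts(old_rows, new_rows, features, target):
    """Change in each feature's correlation with the target (new minus old)."""
    shifts = {}
    for feature in features:
        old_r = pearson([r[feature] for r in old_rows], [r[target] for r in old_rows])
        new_r = pearson([r[feature] for r in new_rows], [r[target] for r in new_rows])
        shifts[feature] = new_r - old_r
    return shifts


# Hypothetical dataset versions; "won" is 1 for a won opportunity, 0 for lost.
old_version = [
    {"activities": 1, "amount": 10, "won": 0},
    {"activities": 2, "amount": 20, "won": 0},
    {"activities": 3, "amount": 30, "won": 1},
    {"activities": 4, "amount": 40, "won": 1},
]
new_version = [
    {"activities": 1, "amount": 40, "won": 1},
    {"activities": 2, "amount": 30, "won": 1},
    {"activities": 3, "amount": 20, "won": 0},
    {"activities": 4, "amount": 10, "won": 0},
]
shifts = correlation_shifts(old_version, new_version, ["activities", "amount"], "won")
for feature, shift in shifts.items():
    print(f"{feature}: correlation changed by {shift:+.2f}")
```

In this toy data, the relationship between `activities` and winning flips sign while `amount` stays stable, which points at where to look first.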
Check your Target. To check for concept drift specifically (a changing distribution of the outcome or target variable), you can put together a quick dashboard that illustrates the distribution of that variable, and take some periodic snapshots of it. Do you see some of its key summary statistics like the median, mean, and standard deviation changing? Do you see the histogram skewing in a certain direction? All these are indications that the target is changing, and you are hence at risk of concept drift. See the below screenshot for an illustration of such a dashboard.
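The periodic snapshots can be as simple as a few summary statistics per period. Here is a small Python sketch using the standard `statistics` module; the quarterly lifetime values are made up, and the helper names are hypothetical.

```python
from statistics import mean, median, stdev


def target_snapshot(values):
    """Key summary statistics of the target variable for one period."""
    return {"mean": mean(values), "median": median(values), "stdev": stdev(values)}


def relative_change(before, after):
    """Relative change of each statistic between two snapshots (assumes non-zero 'before')."""
    return {name: (after[name] - before[name]) / before[name] for name in before}


# Hypothetical customer lifetime values observed in two consecutive quarters.
last_quarter = [100, 120, 140, 160, 180]
this_quarter = [80, 90, 100, 110, 170]

before = target_snapshot(last_quarter)
after = target_snapshot(this_quarter)
change = relative_change(before, after)
for name, value in change.items():
    print(f"{name}: {value:+.1%}")
```

A mean and median that both move in the same direction from one snapshot to the next are the kind of shift in the target that this check is meant to surface.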
And then: what to do about it?
The basic answer is: retrain your model!
However, retraining is a broad term. How many of the story settings do you revisit? Fully back to the drawing board, or just run with exactly the same settings but just an updated training data set? Thankfully, in most cases just rerunning the same story but with new data will suffice.
But be careful! New data doesn’t always mean more data! To cope with concept drift, you may need to drop older historic data from your training set. Take again the example of opportunity win likelihood, where a new competitor has changed the picture for one of your key products. Although those old opportunities were in fact often won, their feature values are no longer indicative of winning in the new world. We won’t relabel them as ‘lost’ now, but to have a realistic training set, we need to drop them from the data.
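Dropping older data before retraining can be as simple as a date cutoff on the training extract. The Python sketch below assumes a hypothetical `close_date` field and a cutoff set to the competitor’s launch date; both are invented for illustration, and in practice you would apply the equivalent filter wherever you prepare the training dataset.

```python
from datetime import date


def recent_training_rows(rows, cutoff, date_field="close_date"):
    """Keep only rows dated on or after the cutoff, so that pre-change history is dropped."""
    return [row for row in rows if row[date_field] >= cutoff]


# Hypothetical closed opportunities; the competitor launched on 2020-03-01.
history = [
    {"close_date": date(2019, 11, 5), "won": True},    # old world: dropped
    {"close_date": date(2020, 2, 20), "won": True},    # old world: dropped
    {"close_date": date(2020, 4, 12), "won": False},   # new world: kept
    {"close_date": date(2020, 6, 30), "won": True},    # new world: kept
]
training_set = recent_training_rows(history, cutoff=date(2020, 3, 1))
print(f"Kept {len(training_set)} of {len(history)} rows for retraining")
```

The old rows are removed, not relabeled, so the labels that remain still reflect what really happened.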
In other cases where you are predicting more subjective labels, you may want to relabel your training data according to the new interpretation. If your model arbitrates for example between high-value and low-value customers and those training labels have been manually given, you may decide to relabel them when concept drift has occurred to that outcome variable.
To cope with data drift, you need to include additional training data that captures the new trends and patterns.
Changing the training data in this way probably solves most of the drift issues, but it may be that some further changes are needed. Thankfully, Einstein Discovery helps here too, because many of the more complex model settings are chosen automatically. Whereas a manually trained machine learning model may need a lot of changes now, Einstein Discovery will probably pick up on all the needed changes just by doing its magic on the new training data. It can even choose a completely different model type: in Summer ’20, new model types were added, like Gradient Boosting Machines (GBM) and XGBoost. These extend the already present GLM technique (Generalized Linear Models) with decision tree-based models – and Einstein Discovery can choose the best model type automatically during the training phase. All it takes is running a training phase with automatic model selection, and Einstein will try the various options and come up with the best model for the current training and validation datasets.
How often should you be retraining, actually? That depends of course. What performance is acceptable given the use case, and the way your users are basing decisions on those predictions? One key thing to consider for sure is the periodicity of your business process. Are there any cycles, like for example a new fiscal year or fiscal quarter when new information becomes available in larger amounts? Then it makes sense to align the retraining with that pattern and it’s not worthwhile to retrain more frequently.
Where is this going in the future? Can we expect more automation?
Wouldn’t that be nice 🙂 Can we not put some thresholds on the performance degradation, and trigger automatic retraining if drift is detected? This is certainly on the radar, so watch this space for future updates on that topic. But as always, even if some steps can perhaps be automated, it remains a priority to make it as simple as possible for users to keep ownership of this process – to make it as smooth as possible for the everyday analyst to do this the right way.