Einstein Discovery accelerates the embedding of augmented analytics in a business user’s work environment. The predictive models are created by the platform using a “No Code AI” approach, which means that the algorithms to train these models are part of the solution and don’t need to be coded by advanced developers or data scientists. To get the most out of the platform, you still need to develop an understanding of your data, interpret the model results and configure the training datasets in the right way. Your business knowledge and data analysis are therefore key both to choosing the right input data for relevant predictions and to iteratively reaching optimal model quality.
The purpose of this blog is to outline some practices for model enhancement and answer some related questions as listed below:
- Choosing the right features: Machine learning is no different from any other IT field: building the right solution always starts with a deep business understanding.
- Decomposition: Splitting the data into different models can help to improve the results. For example, if you are predicting churn and you know that your customer behaviour is completely different between France and the USA, then it makes sense to have a different model for each of those countries. But when should you do it?
- Bucketing: As we saw in the last blog, Discovery calculates an optimal prediction per input segment. But what is the optimal bucketing method for my use case?
- Multicollinearity: Linearly dependent inputs can decrease your model quality. How can you detect and resolve this dependency?
- Sizing: Discovery has a limit of 100 different values per predictor. What can you do when an input is more diverse?
- Outliers: Extreme, sparse observations can degrade model quality. How can you detect them, and when should you take them out?
- Adding variables with explicit interactions: Einstein Discovery (ED) only looks at second-order interactions. How can you overcome this limitation?
This article builds upon a good understanding of model quality metrics. In case you need to brush up that knowledge, please read my previous blog.
Choosing the right features
The first step towards building the right model is to ask the right business questions. I suggest that you ask the following:
Question 1: What are the major predictors of the output variable? For example, when predicting an opportunity’s probability to close, it might be interesting to look at activities. However, should we just include the number of activities as a feature, or also the details of what actions were performed? If a face-to-face meeting is more impactful than a call, it would make sense to have two input variables, ‘number of calls’ and ‘number of face-to-face meetings’, as two different columns in your dataset instead of a single column with the total number of activities.
You can also enrich the model with derived columns. Derived features are calculations or transformations of other (raw) columns. For example, in the win likelihood prediction, it can be interesting to calculate a score that reflects the number of won opportunities with the opportunity’s account, or even a calculated historic win rate of the product segment in the corresponding industry vertical. Strong predictors will give you good quality predictions and trustworthy explanations.
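Outside of ED, a derived column of this kind can be sketched with a couple of grouped aggregations. The dataframe and column names below are illustrative assumptions, not a real ED dataset:

```python
import pandas as pd

# Hypothetical opportunity history; column names are illustrative only.
df = pd.DataFrame({
    "AccountId": ["A1", "A1", "A1", "A2", "A2"],
    "Industry":  ["Pharma", "Pharma", "Pharma", "Retail", "Retail"],
    "IsWon":     [1, 0, 1, 0, 0],
})

# Derived feature 1: number of won opportunities per account.
df["AccountWonCount"] = df.groupby("AccountId")["IsWon"].transform("sum")

# Derived feature 2: historic win rate per industry vertical.
df["IndustryWinRate"] = df.groupby("Industry")["IsWon"].transform("mean")

print(df)
```

In practice you would compute such columns in the dataflow or recipe that prepares the training dataset, so that the model receives them as plain input features.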
Question 2: What business explanation is needed as an output of the model? It is important to include those factors as variables in the model. Let’s continue with the opportunity closing likelihood example. Suppose the business needs recommended guidance on add-ons to ensure the offer reflects their differentiators. Adding an add-on type as an input variable would allow us to come up with that recommendation: this column will be correlated with the leading product on the opportunity to come up with the best combinations.
Like any product, implementing ED requires a deep understanding of the company’s business.
Decomposition

As mentioned in the introduction, having different models for different data segments can improve prediction quality. But when should you explore this improvement possibility?
The question hides a dilemma. Having one model means more data, letting data segments learn from the behaviour of other segments; for example, knowing the sales in Germany for a certain article can help you estimate them in France if you are lacking data points there. However, very different behaviour between data segments may create unnecessary complexity.
Three steps are necessary to qualify segmentation as an improvement strategy.
Step 1 – The first step is conceptual: a business understanding of the dataset helps to separate segments with very different behaviour. For example, if you are predicting the win rate of your company’s opportunities, it may make sense to have one model for opportunities in the enterprise business segment and another for opportunities in the small/medium business segment. In fact, you know, thanks to your business knowledge, that those opportunities behave in totally different ways:
- Different input ranges for the same output value: an 80% probability to close typically requires 10 to 15 meetings for big accounts but only 1 to 3 for small/medium accounts.
- Different strong predictors per data segment: let’s assume that in your particular case, the quality of the relationship with the decision-maker is a good predictor in the small and medium business, but matters much less in the enterprise world, where the customer references you have in that industry matter a lot instead. When decomposing the models, ED is able to attribute the correct importance to those variables: ‘RelationWithDecisionMaker’ will be strongly correlated in the SMB model, and ‘NumberOfReferences’ will be strongly correlated in the enterprise model. In both cases above, splitting the model makes sense.
Step 2 – Now let’s assume you went through that thinking process but are still not sure you have unraveled all your data’s mysteries to decide how to decompose into multiple models. The second step is to understand the distribution of the input variables to detect the following:
- Columns with few predominant values: for example, if you are predicting lead conversion and the column ‘Lead Source’ has three predominant values (“Web”, “Event” and “Phone”) plus a long tail of small ones, then it makes sense to try splitting the model according to those three values.
- Different second-order interactions between variables (changing one column’s value drastically changes how other columns behave): let’s continue with the lead conversion example. Changing the lead source drastically changes the ratio of converted leads in some countries; Germany, for instance, converts web leads much better than France does. That is another good reason to split according to the lead source.
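Both checks in Step 2 can be run with the Einstein Analytics explorer, or sketched in pandas as below. The lead data and column names are invented for illustration:

```python
import pandas as pd

# Illustrative lead data; names are assumptions, not a real ED dataset.
leads = pd.DataFrame({
    "LeadSource": ["Web", "Web", "Web", "Event", "Event", "Phone", "Web", "Event"],
    "Country":    ["DE", "FR", "DE", "DE", "FR", "FR", "DE", "DE"],
    "Converted":  [1, 0, 1, 0, 1, 0, 1, 1],
})

# Check 1: which values dominate the column?
print(leads["LeadSource"].value_counts(normalize=True))

# Check 2: does the conversion ratio per country change with the source?
pivot = leads.pivot_table(index="Country", columns="LeadSource",
                          values="Converted", aggfunc="mean")
print(pivot)
```

If the conversion ratios in the pivot differ strongly across lead sources for the same country, that is the second-order interaction signal described above.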
Step 3 – The last step would be to build a first model and gradually end up with the right number of models. Every time you build a model you should look at:
- The ‘Why It Happened’ charts of the biggest predictors per category. The chart shows you the most important drivers for the chosen segment; if they differ between your data segments, it is worth trying to split the model. Example: by looking at ‘Why It Happened’ for lead source “Web”, you see that the country “Germany” is the biggest predictor and adds 29 to the score for “Web” leads. On the other hand, when looking at lead source “Event”, industry type is the biggest predictor and adds 40 to the score. It then makes sense to split the model, as the explanations for the different lead source categories are totally different.
- Whether, by segmenting into different models, the predictors’ correlations change or the R-squared of the model improves; if so, splitting may be a good idea. The predictors are ranked in the ‘Edit Story’ part of the model after the first model is created.
Segmentation can be a good way to quickly improve results, and it is most efficient when based on a solid business understanding. Segmentation is also sometimes needed due to ED sizing limits, which is what we will see in the next section.
Sizing

There are two important limits we have to consider when building models with ED:
- ED can take as input datasets with up to 50 variables/features (columns).
- Every feature can have up to 100 values. In fact, in ED every feature value is treated as a separate variable; ED creates one ‘dummy variable’ for every input column value. If a column happens to have more than 100 values, ED creates 99 dummy variables for the 99 most frequent values and groups all the rest into the 100th. That means the impact of the whole variable on the output will be less accurate, because the approximation for the “rest” category is poor when it lumps together many different column values.
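The grouping described in the second limit can be sketched as follows. This is my own illustration of the mechanism, using a limit of 3 instead of 100 to keep the example small:

```python
import pandas as pd

# Sketch of the "top values + rest" grouping, with a hypothetical
# limit of 3 instead of ED's 100.
LIMIT = 3
s = pd.Series(["A", "A", "B", "B", "C", "D", "E"])

# Keep the (LIMIT - 1) most frequent values, group the rest as "Other".
top = s.value_counts().nlargest(LIMIT - 1).index
grouped = s.where(s.isin(top), "Other")

dummies = pd.get_dummies(grouped)  # one dummy variable per kept value
print(dummies.columns.tolist())    # ['A', 'B', 'Other']
```

The values “C”, “D” and “E” all collapse into the single “Other” dummy, which is exactly why a high-cardinality column loses predictive accuracy.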
So what is the solution? Well, if there is a need to have all those different values, I would advise you to segment into different models based on the values of a variable at a lower granularity level than the problematic one, so that within each model the high-cardinality variable again contains 100 or fewer distinct values. Here is an example:
Use case goal: Predict the sellout value per SKU and store.
Dataset global description: For different SKU segments and their SKUs we have the sell-out per week and store.
There are 14 different SKU segments, each with around 20 different SKUs.
I first ran the model with all the SKUs. The result was a model with the following metrics:
27% of the rows went to the “others“ category. Also, as you can see in the orange box below, the correlation of the SKU variable with the output is 50%.
I then tried with the SKU family as input (without the SKU column), but the quality dropped, so going down to the SKU level is needed:
Then I built a model for a subset of the data that contains only one SKU family and 80 SKUs.
Even though that reduced the training dataset from 260k records to only 30k, the model metrics improved significantly:
- The R-squared improved from 0.67 to 0.82!
- The correlation of the SKU variable with the output went to 53%. This is because Einstein Discovery can predict the sell-out for those 80 SKUs accurately, whereas with 200 more SKUs inside a single model the results would only be accurate for the 99 largest SKUs, with no differentiation between the remaining ones.
The above example hopefully illustrates that sizing is an important consideration to understand. A good investigation of your dataset’s sizing allows you to make the right choices from the beginning on whether or not you need to split into different models to cater to this limit.
Multicollinearity

Multicollinearity occurs when input variables can be linearly predicted from each other. For example, suppose you are predicting opportunity closure likelihood with “Sales Rep” and “Country” as inputs, and that every sales rep only works in a single country. In this case, knowing the sales rep would always allow you to know the country as well. Using Einstein Analytics data visualisation capabilities, you can detect this by exploring the dataset, grouping by ‘Sales Rep’ and showing ‘unique of Country’ as a measure: you would always get one unique ‘Country’ per ‘Sales Rep’.
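The same check can be done outside the platform with a grouped unique count. The data below is a made-up illustration of the “one rep, one country” situation:

```python
import pandas as pd

# Hypothetical opportunity rows: each sales rep works in exactly one country.
opps = pd.DataFrame({
    "SalesRep": ["Ana", "Ana", "Ben", "Ben", "Chloe"],
    "Country":  ["FR",  "FR",  "DE",  "DE",  "US"],
})

# One unique country per rep means 'Country' is fully determined by 'SalesRep'.
uniques = opps.groupby("SalesRep")["Country"].nunique()
print(uniques.max())  # 1, so the two columns are redundant
```

A maximum of 1 unique country per rep is the red flag: one of the two columns adds no independent information.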
When multicollinearity happens, it causes a technical difficulty that makes the linear regression unsolvable analytically. Please see the article Analytical Solution of Linear Regression for more details. The closed-form solution requires inverting a matrix, but with multicollinearity we have linearly related columns, which makes that matrix non-invertible.
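To make the non-invertibility concrete, here is a small numpy sketch of the normal equation, beta = (X^T X)^(-1) X^T y, with a perfectly duplicated feature column (my own example, not taken from ED internals):

```python
import numpy as np

# Column 2 is an exact copy of column 1 (perfect multicollinearity),
# so X^T X is singular and cannot be inverted.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones(4), x1, x1])  # intercept + duplicated feature
XtX = X.T @ X

print(np.linalg.matrix_rank(XtX))  # 2, not 3: the matrix is rank-deficient
try:
    np.linalg.inv(XtX)
except np.linalg.LinAlgError:
    print("XtX is singular: the analytical solution does not exist")
```

This is exactly the situation the SKU/SKU_Name example below produces: two columns carrying the same information.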
The regression is then solved differently (using a lengthier process), but the results can, in this case, be strange and even decrease model quality. Discovery highlights this problem in the recommendation section under ‘DUPLICATES’.
In the following example I use the same dataset as above, but this time with two columns that have exactly the same meaning: SKU (number) and SKU_Name. That gives me the following ‘Why It Happened’ chart.
As you can see (in the blue rectangle), ED is trying to understand the correlation of those variables, but they are actually the same, and it no longer knows which one to attribute the effect to.
A side note: by looking at the first-level correlations for the model having both SKU and SKU_Name (chart below, taken from the ‘Edit Story’ of the model), you see that SKU_Name is 6.2% correlated to the output variable, so you would expect it to bring more accuracy, but it does not! In fact, the R-squared of the full model is not necessarily the sum of the correlation factors of each input variable.
So pay attention to removing multicollinearity. The following clues may highlight its existence:
- A story that takes far longer to run than expected
- Redundant insights
- Large conditional frequencies (100% of others in SKU_name fall into one segment)
So know that multicollinearity can degrade your model. An understanding of the business relations between input variables and a proper investigation of your dataset help to avoid it. ED will also guide you in detecting it in the improvement section.
Outliers

Outliers may degrade the results. Predicting an output from observations that are exceptionally far from the rest, or not easily put into categories, is not accurate. Removing them removes the noise on the output variable and consequently improves accuracy.
Discovery detects outliers and shows the most relevant ones in the “Recommendation” section as you can see below.
Let’s look at the following example: a dataset with sales volume (Daily Quantity) broken down by different factors (city, discount, promotion…). I first created a model with all the rows in the dataset and got the following metrics:
However, by looking at the frequency histogram that shows the distribution of the output variable, I clearly saw that beyond 2000 there are too few values. This is what we see in the chart below. To get the distribution, I did the following:
- Edited the story
- Selected the daily quantity variable
- Increased the number of bins used for visualisation (see the highlighted box), which allowed me to see the distribution in more detail.
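The same inspect-then-filter step can be sketched in pandas (the dataframe and the 2000 threshold are illustrative, mirroring the example above; in ED the filtering itself happens in the dataflow):

```python
import pandas as pd

# Illustrative sell-out rows; the threshold mirrors the example above.
sales = pd.DataFrame({"DailyQuantity": [120, 300, 250, 90, 5000, 410, 8000]})

# Inspect the tail of the distribution, as done with the histogram bins.
print(sales["DailyQuantity"].describe())

# Remove the rows above the agreed threshold before retraining.
THRESHOLD = 2000
clean = sales[sales["DailyQuantity"] <= THRESHOLD]
print(len(sales) - len(clean), "outlier rows removed")
```

The key point is that the threshold is a business decision, agreed before retraining, not a purely statistical one.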
Removing the high sell-out values was also Discovery’s suggestion, as we see in the recommendations:
The model with no outliers
The next step was then to remove, through the dataflow, the rows of my dataset where DailyQuantity was higher than 2000 (the outliers). The model trained on the resulting dataset has the following metrics:
When removing outliers we see two things:
- The R-squared dropped from 0.78 to 0.75, which is not surprising and does not mean that the quality of the model’s predictions has decreased. In fact, when there are outliers, the average model against which the R-squared definition compares our model is itself a bad predictor. Please review my previous blog for more insights regarding this observation.
- The MAE and MSE decreased when removing outliers, so the model’s predictions are globally getting closer to the real values.
In general, the model with outliers gives higher predictions, which is its way of accounting for the high sell-out values. This is what we see in the chart below: the predicted value using the model with outliers is on the Y-axis and the predicted value using the model without outliers is on the X-axis, with the identity line in red. We clearly see that the points lie above the identity line, which means that the predictions with outliers are generally higher. So the outliers are not only affecting the large quantities; they skew the model to overpredict the smaller quantities as well.
As mentioned in the previous blog, How good is my model really?, classical metrics like R-squared are not trustworthy when there are many outliers in the data. It is consequently very important to explore outliers and, if agreed with the business, take them out when building the model as part of model improvement.
Bucketing

Discovery splits all numerical variables into buckets. Observations belonging to the same bucket will all have the same prediction. The right bucketing choice is therefore key in determining the accuracy of the model.
- Bucket by Count: the same number of data points per bucket. This is the default method used by Discovery and works well if the variable is not uniformly distributed and has no clear clusters. For example, suppose you are predicting, for medical sales reps, their opportunities’ win likelihood depending on the number of visits they make to pharmacies. The number of visits is spread over many values, meaning we find multiple occurrences for every possible number of visits, and the chances of conversion increase accordingly up to a certain limit. Here it makes sense to split the number-of-visits variable by count, as there are no clear clusters.
- Manual Bucketing: you can specify your own bucket ranges. I would advise using it if you want to keep control over the categorisation of the variable and have a more understandable business explanation of the result. For example, if you are predicting the win rate of opportunities depending on their amount, and your business sees opportunities above 1 million as big and the rest as small, then splitting the explanatory amount variable into many small subcategories would make the explanation complex and meaningless for a salesperson. It is also good if you want to avoid buckets with decimals and would rather go for integer ones. Of course, you can only do that if it has a small impact on model quality.
- Bucket by Width: the same numeric range per bucket. This is efficient when the variable is uniformly distributed. As with manual bucketing, I would advise using it if you are willing to sacrifice some accuracy for clearer explainability.
- Recommended buckets by Einstein: a method that uses clustering techniques to recommend the right buckets. This method is good if you focus on your model metrics, but the resulting bucketing may be unclear for the business. Also, the recommendation can sometimes result in fewer buckets than you would want in order to force the model towards improved quality.
By default, Discovery splits the data by count into 10 buckets. Discovery will always bucket data as it performs linear regression per segment (per bucket). However, you can choose other bucketing options.
You can increase the number of buckets in the ‘Edit Story’ interface by selecting a variable; you then see its distribution, the number of bins and the selected bucketing method. However, increasing the number of buckets does not always improve the results: you can hit an overfitting problem when splitting the variable into very small segments, meaning that the results on data other than the training data will be poor.
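The three non-automatic strategies above map neatly onto pandas operations. ED does its bucketing internally; the pandas calls below are only an analogy, with made-up amounts:

```python
import pandas as pd

# Sketch of the three bucketing strategies on an illustrative amount column.
amounts = pd.Series([10, 20, 40, 80, 160, 320, 640, 1280])

by_count = pd.qcut(amounts, q=4)    # same number of rows per bucket
by_width = pd.cut(amounts, bins=4)  # same numeric range per bucket
manual   = pd.cut(amounts, bins=[0, 100, 2000],
                  labels=["Small", "Big"])  # business-defined ranges

print(by_count.value_counts().tolist())  # [2, 2, 2, 2]
print(manual.tolist())
```

Note how, on this skewed series, equal-width buckets would leave most rows in the first bucket, while equal-count buckets stay balanced; that is the trade-off the bullet list describes.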
As you can see in the screenshot below, the recommended bucketing is not available as a choice at first. In fact, this method is used to generate bucketing recommendations after a first run.
Let’s continue with the example from above, where we are predicting the sell-out. One of the strongest predictors is ‘MovingAverage’, the mean sales over the last periods. During the first run, Discovery bucketed it by count and the R2 was 0.82. However, in the recommendations it suggests bucketing it into 12 ranges.
When following the recommendation, the R2 moved to 0.84, so I gained model quality, but I now have a more complex explanation as there are more ranges. If this is not a problem from a business point of view, then going with the buckets recommended by Einstein is better for model quality.
Bucketing is key in model quality and explainability. The right balance between optimal metrics and a clear explainability is a decision to be taken with business involvement.
Adding explicit second-order variables
Discovery leverages the correlation between input variables and the output variable to come up with its predictions, and in doing so it also considers what we call second-order interactions: how two variables together explain the outcome. That means that an explicit regression coefficient is estimated for the co-occurrence of two values from different features/variables. For example, if you are predicting the win rate of an opportunity with “country”, “industry” and “product” as input variables, Discovery will maintain an explicit weight for each combination of two values (e.g. country x industry, or country x product) but not for the combination of the three at the same time. So, for example, ED will see that the pharmaceutical industry in Germany does better than average, but not that product A specifically does well in the German pharmaceutical industry because of a certain law in Germany in that field.
Sometimes, improving the accuracy of your model requires going to a third level of interaction between variables. In that case, you need to create another variable that represents the explosion of all combination possibilities of two of the three columns you want to combine. Going back to our example, you would need to create a column that represents all possibilities of Country x Industry.
The second-order interaction between the new column and an existing one now produces a third-order interaction. In the above case, combining Country x Industry with Product allows ED to explicitly estimate a regression coefficient, or weight, for the correlation between the three variables country, industry and product.
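Creating such a combination column is a simple string concatenation in the dataflow. Here is a sketch with invented data and column names (the separator is my own choice):

```python
import pandas as pd

# Creating an explicit combined column so that its second-order
# interaction with a third variable acts like a third-order interaction.
opps = pd.DataFrame({
    "Country":  ["DE", "DE", "FR"],
    "Industry": ["Pharma", "Retail", "Pharma"],
    "Product":  ["A", "B", "A"],
})

opps["Country_x_Industry"] = opps["Country"] + " | " + opps["Industry"]
print(opps["Country_x_Industry"].tolist())

# Guard against the 100-values-per-predictor sizing limit described above.
assert opps["Country_x_Industry"].nunique() <= 100
```

The assertion reflects the sizing caveat discussed next: the combined column itself must stay within ED’s 100-value limit.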
However, some considerations have to be taken into account when adding this type of variable:
- The combinations should not exceed 100 possibilities as then we hit the sizing limit per column explained above.
- The data should be varied enough to reflect all possible combinations in the newly created column. If we go back to the above example but change it slightly to the following:
As you can see, France x Product A is the only combination created for France and the only one that Discovery will correlate with the segment variable. The model will not be able to extrapolate the three-level correlation to another product, as it cannot create a France x Product C or France x Product B column by itself.
What did we learn?
In this blog, we walked through some of the improvement methods that you can use to enhance the accuracy of your model. The underlying rule behind all of them is the following: building, improving and accepting a machine learning model is joint work between a data analyst/data scientist and business users.
Business understanding of the data’s behaviour and fluctuations is key in deciding acceptable metric values (like the MAE), but also in finding ways to improve results and explainability, such as bucketing and model-split decisions. On the other hand, a data scientist’s deep understanding of the metrics’ underlying meaning (like that of R-squared) and of improvement methods helps to avoid jumping to quick conclusions on model quality while pushing for continuous model improvement.
So my advice to you is to include business users in the early stages of model development and assessment. Such collaboration helps to close the gap between two different competencies, and ultimately the one between machine and human.