Einstein Discovery: What is Cardinality, and why should I care?

Data cardinality is an important concept when it comes to Einstein Discovery, but what does it mean exactly? And what considerations should you have around cardinality? Check out this short video to get the answers.

Note: Find this video as well as other tips and tricks videos on Einstein Discovery here.

Note: The video illustrates important concepts regarding cardinality. To get the full understanding, watch the video rather than relying solely on the transcript.

Transcript

This bird is called the Northern Cardinal. I guess you can see why. In mathematics, cardinality refers to the size of a set. The set Y has cardinality 4. I guess you can see why. In this edition of Einstein Discovery Tips, Tricks & Best Practices: what is cardinality, and why should I care?

Suppose you are predicting Opportunity win rate. Suppose your training data set has more than half a million rows. Suppose you want to include a Story column called SKU, which is really the Product on the Opportunity.

As you can see, there are almost 500 different Products in your data. Some Products occur very often, like the most frequent one, which we call Product #1. It occurs 3600 times. Some Products occur very rarely, like the one we call Product #491. It occurs maybe only once.

What happens if we use this field as a column in our Story?

Einstein Discovery currently allows up to 100 unique values per column. It will take the 99 most frequently occurring Products. That is about here, so Product #1 through Product #99.

Why 99? Because all this other data, product #100 until #491, will go into 1 Bucket, called the “Other Category”. Here we see a screenshot from the Story configuration. So this here is Product #99, and this here is the other category for products 100 to 497.

There are 250,000 rows in the Other category. That is about 50% of our data! That means that Einstein Discovery is going to treat 50% of our data as being Opportunities for one and the same Product. It will see no difference in the Product dimension for half of our data. That will not be a good model.
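To make the bucketing concrete, here is a minimal Python sketch of the rule described above: keep the most frequent values up to the limit minus one, and merge everything else into Other. It is an illustration only, not Einstein Discovery's actual implementation. The function name other_share, the max_unique parameter, and the toy data are all made up for this example.

```python
from collections import Counter

def other_share(values, max_unique=100):
    """Return (kept_values, fraction_of_rows_in_other).

    Assumes the rule from the transcript: when a column has more than
    `max_unique` distinct values, the (max_unique - 1) most frequent ones
    are kept and all remaining rows are merged into one "Other" bucket.
    """
    counts = Counter(values)
    if len(counts) <= max_unique:
        # Low enough cardinality: nothing needs to be bucketed.
        return list(counts), 0.0
    kept = [value for value, _ in counts.most_common(max_unique - 1)]
    kept_rows = sum(counts[value] for value in kept)
    total = sum(counts.values())
    return kept, (total - kept_rows) / total

# Toy data: three frequent products and five rare ones, with a cap of 4.
skus = ["A"] * 50 + ["B"] * 30 + ["C"] * 10 + ["D", "E", "F", "G", "H"] * 2
kept, share = other_share(skus, max_unique=4)
print(kept, share)
```

In this toy column, 10 of the 100 rows belong to the five rare products, so a tenth of the data ends up in Other. Running the same function on your own SKU column with max_unique=100 gives a rough estimate of how much data would be treated as one and the same Product.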

Is there a solution? A little bit. Einstein Discovery allows one Story column to be of high cardinality. That means that instead of 100 unique values, it can take 200 unique values. Therefore, this is Product #199, and the Other category now contains Products 200-497. That is still almost 20% of our data. Remember, Einstein Discovery will treat all these Opportunities as being for one and the same Product, called ‘Other’.

Is there a different solution? Yes. There is probably a Product Hierarchy as well. Suppose that every Product belongs to a Product Family. Suppose there are 35 Product Families.

You can use this Product Family for two solutions:

Solution 1. Don’t use Product in your Story, but use Product Family instead. Then you have only 35 unique values, so that column has a much smaller cardinality.

Solution 2. If you do want to use Product as a column, then split your model into submodels, and combine only a few Product Families in each model. That way, you have fewer unique values for Product in each model. You have reduced the cardinality of the Product column.
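Both solutions are data preparation steps, and they can be sketched in a few lines of Python. This is a hedged illustration only: the PRODUCT_FAMILY lookup, the column names sku and family, and the group names are hypothetical, and in practice the mapping would come from your own Product Hierarchy.

```python
from collections import defaultdict

# Hypothetical Product -> Product Family lookup, for illustration only.
PRODUCT_FAMILY = {"P1": "Laptops", "P2": "Laptops", "P3": "Phones", "P4": "Tablets"}

def add_family(rows, lookup, unknown="Unknown"):
    """Solution 1: attach the lower-cardinality Product Family to every row."""
    return [{**row, "family": lookup.get(row["sku"], unknown)} for row in rows]

def split_by_family_group(rows, groups):
    """Solution 2: build one training set per group of Product Families.

    Rows whose family is not listed in any group are left out.
    """
    family_to_group = {fam: name for name, fams in groups.items() for fam in fams}
    subsets = defaultdict(list)
    for row in rows:
        group = family_to_group.get(row["family"])
        if group is not None:
            subsets[group].append(row)
    return dict(subsets)

rows = add_family([{"sku": "P1"}, {"sku": "P2"}, {"sku": "P3"}, {"sku": "P4"}], PRODUCT_FAMILY)
subsets = split_by_family_group(
    rows, {"computing": ["Laptops", "Tablets"], "mobile": ["Phones"]}
)
for name, subset in subsets.items():
    print(name, sorted(row["sku"] for row in subset))
```

Each subset contains only the Products of its own families, so the Product column in each submodel has fewer unique values than in the full data set.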

But how do you know that you have this issue? Check your Story configuration. If there are any columns with large Other categories, you need to fix them. In case you have missed any, Einstein Discovery warns you in the Recommended Updates after the Story is created.
Like this.
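If you want to check your data before you even create the Story, a small script can flag the risky columns. The sketch below assumes the same keep-the-top-values rule described earlier. The function name, the rows-as-dictionaries layout, and the 5% threshold are assumptions made for this example; the transcript does not say how large an Other category has to be before it becomes a problem.

```python
from collections import Counter

def columns_with_large_other(rows, max_unique=100, threshold=0.05):
    """Return {column: other_share} for columns whose "Other" bucket would
    hold more than `threshold` of the rows (threshold is an assumed value)."""
    flagged = {}
    if not rows:
        return flagged
    for column in rows[0]:
        counts = Counter(row[column] for row in rows)
        if len(counts) <= max_unique:
            continue  # cardinality is within the limit, nothing is bucketed
        kept_rows = sum(n for _, n in counts.most_common(max_unique - 1))
        share = (len(rows) - kept_rows) / len(rows)
        if share > threshold:
            flagged[column] = share
    return flagged

# Toy data: "stage" has 2 unique values, "sku" has 10, and the cap is 3.
rows = [{"stage": "Won" if i % 2 else "Lost", "sku": f"P{i % 10}"} for i in range(100)]
flagged = columns_with_large_other(rows, max_unique=3)
print(flagged)
```

Here only the sku column is flagged, because eight of its ten equally frequent values would be merged into Other.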

So what is cardinality? It’s really just the number of unique values in one column. Why should I care? Because if a column has more than 100 unique values, Einstein Discovery starts to put the small ones together in a large Other category, and treats them all the same. If that Other category becomes too large, your model will be poor and you need to fix it.
