Einstein Discovery: How much data for a good story?
Have you wondered how much data you need to create a good story in Einstein Discovery? Well, you are not the first one with that question. Check out this short video to understand what it takes to make a good story.
Note: Find this video as well as other tips and tricks videos on Einstein Discovery here.
Note: The video illustrates important concepts regarding how much data you need to create a good story. To fully understand them, it is recommended to watch the video rather than rely solely on the transcript.
When Leo Tolstoy wrote Anna Karenina, he used almost 350,000 words. A literary short story, however, contains only 5,000 to 10,000 words.
In this first edition of Einstein Discovery Tips, Tricks & Best Practices: how much data for a good Story?
How much data for a Story?
Well, that’s easy. This is about machine learning, so the more, the better, right?
Uh, no. Contrary to popular belief, it’s often a case of less = more!
What matters primarily is having the right data, rather than having a lot of it.
A Story or model based on a small amount of good data is much better than one based on a large amount of bad data.
I’ll cover the difference between good and bad data at a later stage, but first I want to give you some idea of what small and big mean in terms of Einstein Discovery.
Let’s go T-shirt size here.
When we speak about data, there are two dimensions: columns and rows.
Let’s look at columns first. Right now, a Story can contain at most 50 columns and must have at least 3. Let’s define XS to be between 3 and 5 columns, S to be under 10, M to be under 15, and L to be under 30.
Now for the rows. The max number of rows in a Story today is 20M, and you have to have at least 400 to obtain a predictive model.
Now, let’s define XS to be between 400 and 1000, S to be under 10k, M to be under 100k and L to be under 1M.
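The bucket boundaries above can be sketched as two small classifiers. This is a hypothetical illustration, not part of the Einstein Discovery product; the function names are mine, and where the transcript is vague about boundaries I have assumed the upper bounds are exclusive.

```python
def size_columns(n):
    """Classify a Story's column count into a T-shirt size (rule of thumb)."""
    if not 3 <= n <= 50:
        raise ValueError("a Story needs between 3 and 50 columns")
    if n <= 5:
        return "XS"
    if n < 10:
        return "S"
    if n < 15:
        return "M"
    if n < 30:
        return "L"
    return "XL"

def size_rows(n):
    """Classify a Story's row count into a T-shirt size (rule of thumb)."""
    if not 400 <= n <= 20_000_000:
        raise ValueError("a Story needs between 400 and 20M rows")
    if n <= 1_000:
        return "XS"
    if n < 10_000:
        return "S"
    if n < 100_000:
        return "M"
    if n < 1_000_000:
        return "L"
    return "XL"

print(size_columns(12), size_rows(50_000))  # prints: M M
```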
Are these hard rules? Nope, of course, they’re not, and as always it depends on the use case and many other things. But let’s consider them as broad rules of thumb.
Now, what you need to keep in mind is the balance between rows and columns. Suppose you have an extra-small number of rows. Does it make sense to have 50 columns? I don’t think so. With fewer than 1,000 rows, you will probably not have enough data to populate all those columns and create the necessary variety across them.
For XL columns, you also need XL rows. OK, maybe a Large number of rows will still do, and Medium is a borderline case.
If you have XS rows, then also be modest in the number of columns that you use. In fact, we can fill out this matrix with checkmarks all along the diagonal, and probably along the semi-diagonals as well.
As we saw, that top-right quadrant is a no-go area, or borderline. But what about the bottom?
A very small number of columns, but a huge number of rows? That could work, but I’m not quite sure what the appropriate use cases would be. Nothing stops you from doing it; just make sure that all that additional data is still adding information to the model, and that you’re not introducing noise or dirty data.
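Since the matrix itself is only visible in the video, here is one way to encode its checkmarks in code. This is a sketch under my own assumptions: "ok" for the diagonal and semi-diagonals, "no-go" when columns outrun rows by two or more sizes, and "maybe" for the many-rows/few-columns corner discussed above.

```python
SIZES = ["XS", "S", "M", "L", "XL"]

def balance(rows_size, cols_size):
    """Rule-of-thumb check for a (row size, column size) pairing."""
    gap = SIZES.index(cols_size) - SIZES.index(rows_size)
    if gap >= 2:
        return "no-go"  # many columns, few rows: not enough variety
    if gap >= -1:
        return "ok"     # diagonal or semi-diagonal: checkmark
    return "maybe"      # many rows, few columns: can work if the data adds signal

print(balance("M", "M"))    # prints: ok
print(balance("XS", "XL"))  # prints: no-go
print(balance("XL", "XS"))  # prints: maybe
```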
Again, are these hard rules? Nope, consider them broad rules of thumb.
That said, I haven’t seen many users be successful outside of the checkmarks in this matrix. So don’t worry if you have only a few rows; just be modest and carefully select a few good columns.