If you’re late to the party, Data Prep recipes have arrived as the next-generation data platform. With recipes, you can quickly blend, enrich, and export your data with an easy-to-use visual editor and real-time data preview so you can see the impact of your transformations on the data as you build.
In this post, we will dive into the Clustering transformation, which automatically categorizes your data based on its inherent numerical patterns. We’ll talk about what Clustering is, how it works, and how you can best use it.
What it is: Clustering Transformation at a Glance
The Clustering transform uses the input columns you’ve specified to group data together automatically. Common use cases include customer segmentation, product bundling patterns, white space analysis, and merchandising recommendations.
This is Clustering transform in Data Prep Recipe in 30 seconds:
Let’s have a closer look at the Clustering transform and exactly how to interact with it.
Yes, it is literally that simple. No-Code Clustering algorithm is three steps away from segmenting your data and uncovering new insights!
The transform can create between 2 and 50 clusters, supports datasets of up to 2 billion rows, and lets you choose up to 1000 input fields for the clustering calculation. The algorithm can determine the optimal number of clusters automatically, and you can guide it to a range of your choice or explicitly override it with your target cluster count. To ensure all of the data columns contribute properly to the K-Means algorithm, you can optionally apply automated measure scaling and generate new columns for the scaled measures.
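To see why scaling matters for a distance-based algorithm, here is a minimal sketch. The post does not say which scaling method the transform applies, so min-max scaling is an assumption chosen for illustration, and the `min_max_scale` helper and the sample columns are hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical account measures on very different scales.
annual_revenue = [50_000, 250_000, 1_000_000]
open_cases = [1, 3, 5]

# New "scaled" columns, kept alongside the originals.
scaled_revenue = min_max_scale(annual_revenue)
scaled_cases = min_max_scale(open_cases)
```

Without scaling, a difference of a few hundred thousand in revenue would swamp a difference of a few cases in any distance calculation. After scaling, both columns span the same range and contribute comparably.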
When you run the recipe, the job performs the following:
- Clustering applies the K-Means algorithm with smart initialization of the cluster centroids (more on this below)
- Using a distance calculation across all data rows and all centroids, each row is assigned to the closest centroid
- Centroid coordinates are then recalculated by simply averaging all of the data points assigned to each centroid
- The last two steps repeat until the centroids no longer change and data points no longer switch between centroids
- The cluster configuration and the finalized centroids are then persisted into Salesforce.
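The assign-and-average loop described in the steps above can be sketched in plain Python. This is a textbook K-Means under stated assumptions, not the transform's actual implementation: Euclidean distance is assumed, the function names are hypothetical, and an empty cluster simply keeps its previous centroid.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance, assumed)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mean(points):
    """Column-wise average of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def k_means(rows, initial_centroids, max_iter=100):
    """Repeat assignment and averaging until nothing changes."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Step 1: assign every row to its closest centroid.
        new_labels = [nearest(row, centroids) for row in rows]
        # Step 2: recompute each centroid as the average of its rows.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [r for r, lab in zip(rows, new_labels) if lab == i]
            new_centroids.append(mean(members) if members else list(old))
        # Converged: no row switched cluster and no centroid moved.
        if new_labels == labels and new_centroids == centroids:
            break
        labels, centroids = new_labels, new_centroids
    return labels, centroids


rows = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]]
labels, centroids = k_means(rows, initial_centroids=[[0.0, 0.0], [10.0, 10.0]])
```

The `initial_centroids` argument is where initialization plugs in: pass previously persisted centroids for a warm start, or randomly seeded ones for a fresh model.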
Yes, this is the old-school K-Means clustering algorithm wrapped in a point-and-click operation that can be directly embedded in your data integration process.
Smart Initialization in the Clustering transform works by initializing the K-Means algorithm using the centroid values the system has previously persisted.
To cluster data based on aggregated metrics (e.g., clustering Accounts based on Opportunities, Cases, etc.), you simply aggregate the data (e.g., aggregate Average Amount in Opportunities by Account, and aggregate Average Case Age in Cases by Account), and then join the results at the right grain (e.g., Account) so that you can cluster Accounts using the aggregated metrics.
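In a recipe you would build this with Aggregate and Join nodes; the sketch below only shows the shape of the resulting data in plain Python. The field names, sample records, and the choice to fill a missing metric with 0.0 are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def average_by(records, key, value):
    """Group records by `key` and average `value` within each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec[value])
    return {k: mean(v) for k, v in groups.items()}


# Hypothetical child records, each pointing at a parent Account.
opportunities = [
    {"account_id": "A1", "amount": 100.0},
    {"account_id": "A1", "amount": 300.0},
    {"account_id": "A2", "amount": 50.0},
]
cases = [
    {"account_id": "A1", "case_age": 4.0},
    {"account_id": "A2", "case_age": 10.0},
    {"account_id": "A2", "case_age": 20.0},
]
accounts = ["A1", "A2"]

# Aggregate each child object to the Account grain.
avg_amount = average_by(opportunities, "account_id", "amount")
avg_case_age = average_by(cases, "account_id", "case_age")

# Join both aggregates onto Account: one row of measures per account.
account_features = {
    acc: [avg_amount.get(acc, 0.0), avg_case_age.get(acc, 0.0)]
    for acc in accounts
}
```

Each value in `account_features` is one row of input measures, ready to be fed to the clustering step.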
Common use cases include account segmentation and product bundling preferences.
What We Do In the Shadows
Since the entire philosophy of the Smart Transforms is no-code, point-and-click ML, we manage the lifecycle of the clustering model behind the scenes. Transparency is the basis for trust in ML capabilities, so we are outlining the full model management lifecycle below so you know exactly how clustering works and how clustering models are retained and reused.
About Clustering Models
A clustering model consists of the clustering configuration (list of input fields, number of clusters, cluster centroids, etc).
Cluster centroids are arrays of the average values for each input field of the rows in the cluster.
Every clustering transform has its own stored model; the model is updated and persisted after the recipe runs. Each model is stored based on
- (a) the Recipe ID of the clustering transform,
- (b) the Clustering node API name, as defined in the recipe, and
- (c) the API names of the user-selected input fields for the clustering algorithm.
Before each recipe run, the system searches the model store for a model matching the recipe’s ID, the clustering node API name, and the list of input fields’ API names. If an exact match is found, that model is retrieved and sent to the transformation engine. If no exact match is available, the system searches using only the list of input field API names in the clustering transform; multiple models may meet these criteria, so we take the most recently updated one. This is considered a likely-match. If a matching model is found, its centroids are used to initialize the K-Means algorithm. If no matching model is found, the cluster centroids are initialized by random seeding and a new model is created.
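The lookup order described above (exact match, then likely-match, then a fresh model) can be sketched as follows. The `ClusterModel` class, the `find_model` function, and the in-memory list standing in for the model store are all hypothetical. Treating the input field list as order-insensitive is also an assumption; the post does not say whether field order matters.

```python
from dataclasses import dataclass


@dataclass
class ClusterModel:
    recipe_id: str
    node_name: str
    input_fields: tuple
    centroids: list
    updated_at: int  # any sortable timestamp


def field_key(fields):
    """Normalize a field list; order-insensitive matching is an assumption."""
    return tuple(sorted(fields))


def find_model(store, recipe_id, node_name, input_fields):
    """Return (model, match_kind) following the exact -> likely -> none order."""
    wanted = field_key(input_fields)
    # 1. Exact match on recipe ID, node API name, and input fields.
    for m in store:
        if (m.recipe_id, m.node_name, field_key(m.input_fields)) == (
            recipe_id, node_name, wanted
        ):
            return m, "exact"
    # 2. Likely-match on input fields only; most recently updated wins.
    candidates = [m for m in store if field_key(m.input_fields) == wanted]
    if candidates:
        return max(candidates, key=lambda m: m.updated_at), "likely"
    # 3. Nothing found: the caller random-seeds and creates a new model.
    return None, "none"


store = [
    ClusterModel("recipe_src", "CLUSTER0", ("Amount", "CaseAge"), [[1.0, 2.0]], 100),
    ClusterModel("recipe_old", "CLUSTER0", ("Amount", "CaseAge"), [[9.0, 9.0]], 50),
]
model, kind = find_model(store, "recipe_src", "CLUSTER0", ["Amount", "CaseAge"])
clone_model, clone_kind = find_model(store, "recipe_clone", "CLUSTER0", ["Amount", "CaseAge"])
```

Here the lookup for `recipe_clone` has no model of its own, so it falls through to the likely-match branch and picks up the most recently updated model with the same input fields.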
An example of the likely-match use case: You create a recipe with a clustering transform. You run it a few times, and you like the results. You clone the recipe into a new recipe, and you proceed to run the cloned recipe. Initially, the cloned recipe will have no associated clustering model. When the recipe is run, the system will retrieve the matching model. In this case, it will find the source recipe’s clustering model, and it will use that model as the starting point for the cloned recipe, generating the same clustering results as the source recipe. The rationale behind this approach is to ensure recipes clustering the same data using the same input fields will yield the same results.
During clustering, the system initializes the K-Means model using the matched centroids (or the random-seeded centroids) and then trains the model on the data. When the algorithm converges, such that rows no longer move between clusters and the cluster centroids (which, as you know from above, consist of average values of the input fields across all of the data rows in the cluster) no longer change, the resulting centroids are finalized and the cluster model is updated with the latest values.
If you set the clustering algorithm to Use Optimal Number of Clusters, the system uses a silhouette score to identify the optimal cluster count, and the resulting data is clustered based on that number. If you also specify a cluster count, the algorithm scans a range of cluster counts from -25% to +25% of your cluster count (i.e., if you specified 20 clusters, the algorithm scans between 15 and 25 to find the optimal cluster count in that range). If you do not specify a number, the algorithm uses the default range between 2 and 50. Once the optimal number of clusters has been determined, it is included as part of the cluster model and re-used in subsequent recipe runs instead of being recalculated.
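As a rough sketch of the two ideas in this paragraph, the code below computes the candidate range for a target cluster count and a standard mean silhouette score for one labeling. The rounding rule (floor for the lower bound, ceiling for the upper) and the clamping to the 2 to 50 limits are assumptions, since the post only gives the 20-cluster example. The function names are hypothetical, and the silhouette here uses Euclidean distance, which may differ from what the system actually computes.

```python
import math


def candidate_range(target=None, spread=0.25, lo=2, hi=50):
    """Range of cluster counts to scan; rounding and clamping are assumptions."""
    if target is None:
        return lo, hi
    low = max(lo, math.floor(target * (1 - spread)))
    high = min(hi, math.ceil(target * (1 + spread)))
    return low, high


def silhouette_score(rows, labels):
    """Mean silhouette over all rows; higher means better-separated clusters."""
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for row, label in zip(rows, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-row clusters
            continue
        # a: average distance to the other rows in the same cluster.
        a = sum(math.dist(row, other) for other in own) / (len(own) - 1)
        # b: average distance to the rows of the nearest other cluster.
        b = min(
            sum(math.dist(row, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


scan_range = candidate_range(20)
rows = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = silhouette_score(rows, [0, 0, 1, 1])
bad = silhouette_score(rows, [0, 1, 0, 1])
```

Selecting the optimal count then amounts to running K-Means for each count in the range and keeping the one with the highest score. In the example, the labeling that keeps neighboring rows together scores higher than the one that splits them.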
After the recipe runs, you can examine the business meaning of each cluster via a dashboard visualization or by looking at the aggregated data for each cluster, and decide if you want to apply further processing downstream using the generated cluster labels for each row.
By updating and storing the model for smart initialization in subsequent runs, the model can respond to incremental changes in the data (since it is recalculated in each recipe run) and produce consistent clustering results across multiple runs of the same recipe.
Of course, if the data being clustered somehow changes drastically, such that the centroid values in the model are no longer relevant or applicable to that data, those centroids will be random-seeded again; this means the data rows can end up with different cluster labels than in previous runs, and you will need to re-examine the clustered data to assess its business meaning. Some examples of such drastic data changes would be:
- Introducing a large amount of new data with values very different from the existing data
- Updating the existing data and changing the values significantly (e.g., changing Amount > $1,000,000 to $0)
- Adding filter node(s) ahead of the clustering transform, which significantly reduces the total data fed into the clustering transform
Keep in mind that if you add or remove Input Fields in the Clustering transform, or you change the API names of the fields feeding the clustering transform, the clustering definition in the recipe will not match the saved model’s signature, and therefore a new model may be initialized (see above).
Model Storage / Clean-Up
After every recipe run, the cluster model for that clustering node is persisted into Salesforce, based on Recipe ID, Clustering Node API name, and the list of input fields’ API names. You can store up to 10,000 models in the system. When you reach the max model limit, you will not be able to run recipes with new clustering transforms.
When you delete a recipe that contains a clustering transform, the corresponding models are also deleted.
This means that, if you run clustering and you do not like the results, you can download the recipe JSON, delete the recipe, and re-create the recipe using the downloaded JSON. Assuming there are no likely-matched models in the system, your recipe will end up with a new clustering model starting from scratch.
Get Started, Cluster Something!
Please, go create a recipe right now, and add clustering to it. You will need to select a measure column first in order to see the clustering icon. Add the clustering transform to Accounts, to Opportunities, to Cases, to anything really, and share your feedback with me directly on the Trailblazer Community or join us on Slack at DataTribe! I want to put this power at your fingertips but I need your help to make it work for you – please share your use cases and feedback with me so we can make it better!