Smart ELT with Data Prep: Clustering in Three Steps

Jim Pan 09. February 2021 Connect and Prepare Data, Spring 21

We are now expanding our Smart Transform offering in Tableau CRM! You can now participate in a Smart Transform Pilot Program and explore the upcoming features firsthand!

As part of the pilot program in Spring ’21, we are introducing Clustering, Time Series Forecasting, and Pivot in Aggregate in Data Prep to make your recipes even more powerful. They will become generally available in the Summer ’21 release. In this blog, I will focus on the clustering transform and demonstrate how to conduct white space analysis using clustering in Data Prep recipes.

What is Clustering Transform in Data Prep?

The clustering transform allows you to discover relationships across data rows using the existing data in text or numerical formats. Practically, that means you can apply clustering to segment customers based on their behaviors so that you can analyze white space, predict churn, etc. Of course, clustering is not only applicable to accounts and contacts; we have built a generic capability that you can apply to any data of your choosing.

Behind the scenes, when you apply a clustering transform to your data as part of a Data Prep recipe, the backend data processing platform applies a k-means algorithm to your chosen columns to calculate characteristics for each row, and then classifies the rows into “clusters”.
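To make the idea concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It is not the platform's implementation: the function name `kmeans`, the sample rows, the number of clusters, the fixed iteration count, the random initialization, and the use of Euclidean distance are all assumptions made for illustration.

```python
import math
import random

def kmeans(rows, k, iterations=20, seed=0):
    """Illustrative k-means: returns (labels, centroids) for numeric rows."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen rows as the initial centroids.
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each row joins the cluster with the nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(row, centroids[c]))
            for row in rows
        ]
        # Update step: each centroid moves to the mean of its member rows.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical account rows: [amount in Central, amount in West].
accounts = [[100, 5], [110, 0], [95, 10], [5, 200], [0, 210], [10, 190]]
labels, centroids = kmeans(accounts, k=2)
print(labels)  # the first three rows share one label, the last three the other
```

In the recipe you never write this loop yourself; the transform only asks you to pick the predictor columns.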

Clustering Data in Practice

Suppose I have opportunity data with Account Region and Product information, and I want to group accounts by their buying patterns to perform white space analysis. Here is how I might do it in a simplified example:

1. Aggregate your data to the grain you want to cluster. Because I want to segment Accounts, I will aggregate the Opportunity Sum of Amount by Account. And since I’m curious about sales performance based on Region and Product, I will split them out in the aggregate node using the “Group Columns” capability (also known as “Pivot in Aggregate”). I do this by applying an Aggregate node to the Opportunity dataset, aggregating “Sum of Amount”, then grouping rows by “AccountID”, and finally grouping columns by “Account Region” and then “Product”. This gives me the opportunity amount data in a tabular format at the account level, with one row per account and one column per Region and Product combination. Check out the steps in the gif below.

Note: Not all values are displayed in the pivot column selection at this time. You can manually type in a value if the value you are looking for is not displayed.

Note: When we make pivot generally available, you will have the option to group the remaining values into an “Other” column.

Note: In the pilot version, the API names of the generated columns are simplistic and do not reflect the labels; we are updating the API name generation to be more intuitive and consistent.
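For readers who think in code, the aggregate-and-pivot step above can be sketched in plain Python. This only mirrors the logic, not the recipe engine: the `pivot_sum` helper, the sample opportunity rows, the `Region | Product` column naming, and filling missing combinations with 0.0 are all assumptions of this sketch.

```python
from collections import defaultdict

# Hypothetical opportunity rows; field names follow the example in the text.
opportunities = [
    {"AccountID": "A1", "Account Region": "Central", "Product": "Laptop", "Amount": 100.0},
    {"AccountID": "A1", "Account Region": "Central", "Product": "Laptop", "Amount": 50.0},
    {"AccountID": "A1", "Account Region": "West", "Product": "Phone", "Amount": 30.0},
    {"AccountID": "A2", "Account Region": "West", "Product": "Phone", "Amount": 70.0},
]

def pivot_sum(rows, group_row, group_cols, measure):
    """Sum `measure` per `group_row` value, one column per `group_cols` combination."""
    table = defaultdict(lambda: defaultdict(float))
    for r in rows:
        # Assumption: name each pivoted column by joining its group values.
        column = " | ".join(str(r[c]) for c in group_cols)
        table[r[group_row]][column] += r[measure]
    columns = sorted({c for cols in table.values() for c in cols})
    # Assumption: fill missing combinations with 0.0 so every row has every column.
    return {key: {c: cols.get(c, 0.0) for c in columns} for key, cols in table.items()}

pivoted = pivot_sum(opportunities, "AccountID", ["Account Region", "Product"], "Amount")
for account_id, amounts in pivoted.items():
    print(account_id, amounts)
```

Each resulting row is one account, which is the grain the clustering step works on next.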

2. Apply the Cluster transform. Add a Transform node, add a clustering step from the toolbar, and choose your predictor columns.

You will observe that the Cluster column is generated with the label “Cluster TBD”. When you run the recipe, the field will be populated with actual cluster labels (e.g., Cluster 1, Cluster 2, Cluster 3). When we make clustering generally available, the label will have a realistic sampling of the cluster labels so that you can perform downstream transformations such as renaming the cluster labels.

Now, you have segmented your accounts based on opportunity. What’s next?

3. Build a segment benchmark and compare your Account data against it to identify sales opportunities. A segment benchmark captures the general performance metrics for each segment.

To build a benchmark, you aggregate the metrics you want to track (e.g., max, min, and average values of the predictor columns) and group by the cluster label.

Note: During the pilot phase, there is only one row in the preview. When you actually run this recipe, this step will produce one row for each of the clusters.
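As a rough illustration of what the benchmark aggregate computes, here is a small Python sketch that groups clustered account rows by their cluster label and derives min, max, and average for each metric column. The `benchmark` helper, the sample rows, and the `min_`/`max_`/`avg_` column prefixes are hypothetical names chosen for this sketch, not names generated by the recipe.

```python
from collections import defaultdict

# Hypothetical output of the clustering step: one row per account.
clustered = [
    {"AccountID": "A1", "Cluster": "Cluster 1", "Central": 300.0},
    {"AccountID": "A2", "Cluster": "Cluster 1", "Central": 100.0},
    {"AccountID": "A3", "Cluster": "Cluster 2", "Central": 20.0},
]

def benchmark(rows, cluster_col, metric_cols):
    """Return {cluster label: {min/max/avg per metric column}}."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[cluster_col]].append(r)
    result = {}
    for label, members in groups.items():
        stats = {}
        for col in metric_cols:
            values = [m[col] for m in members]
            stats[f"min_{col}"] = min(values)
            stats[f"max_{col}"] = max(values)
            stats[f"avg_{col}"] = sum(values) / len(values)
        result[label] = stats
    return result

benchmarks = benchmark(clustered, "Cluster", ["Central"])
for label, stats in benchmarks.items():
    print(label, stats)
```

The result has one entry per cluster, matching the one row per cluster that the aggregate node produces when the recipe runs.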

Next, join the aggregate node (your benchmark) right back to the main branch by dragging the plus sign next to the aggregate node back onto a transform node. This lets you join the benchmark data to the source data, using the cluster column as the join key.

From here, you have the benchmark data on the same row as the aggregated account rows. You can apply calculations against the benchmark using the same data that was used for clustering.

Here’s one example: to determine whether an account’s total spend in a region is above or below average, I can take the account’s Sum of Amount for the Central region and subtract the segment benchmark’s Average Amount for the Central region. A positive result means the account spends more than the average account in its segment; a negative result points to potential white space.
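The join and the comparison can be sketched together in a few lines of Python. Again, this is only an illustration under assumed names: `compare_to_benchmark`, the sample accounts, the `avg_Central` benchmark column, and the `_vs_avg` suffix are all hypothetical, and in a recipe you would express the subtraction with a formula instead.

```python
# Hypothetical clustered accounts and a per-cluster benchmark.
accounts = [
    {"AccountID": "A1", "Cluster": "Cluster 1", "Central": 300.0},
    {"AccountID": "A2", "Cluster": "Cluster 1", "Central": 100.0},
    {"AccountID": "A3", "Cluster": "Cluster 2", "Central": 20.0},
]
benchmarks = {
    "Cluster 1": {"avg_Central": 200.0},
    "Cluster 2": {"avg_Central": 20.0},
}

def compare_to_benchmark(rows, benchmarks, cluster_col, amount_col, avg_col):
    """Join each row to its cluster benchmark and add amount minus average."""
    out = []
    for r in rows:
        # Join step: look up the benchmark by the row's cluster label.
        joined = {**r, **benchmarks[r[cluster_col]]}
        # Formula step: positive means above the segment average.
        joined[f"{amount_col}_vs_avg"] = joined[amount_col] - joined[avg_col]
        out.append(joined)
    return out

compared = compare_to_benchmark(accounts, benchmarks, "Cluster", "Central", "avg_Central")
for row in compared:
    print(row["AccountID"], row["Central_vs_avg"])
```

Here A2 comes out 100.0 below its segment average, which is the kind of gap a white space analysis would flag.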

There you have it.

Using the custom formula transformation, you can apply any number of calculations using multiple columns based on your business needs. Then, you simply create an output dataset with your segmented-accounts-with-benchmark-comparison data.

By now, we have shown you how to segment accounts by opportunities using clustering, create a segmentation benchmark based on the generated clusters, and compare each account record to the benchmark, which you can then write to a dataset. The next natural step is to analyze account sales performance relative to benchmarks in your dashboards to determine how an account might be underperforming compared to other accounts in the same segment.

Like Detect Sentiment, the Cluster transform simplifies the process of applying ML-driven transformations to your data, allowing you to quickly deliver traditionally time-consuming projects such as white space analysis or churn analysis.

You can still participate in the Smart Transform Pilot Program by reaching out to Salesforce and asking to join “Data Prep Recipe – Smart Transform Pilot Program”.

Forward-looking statement

This content contains forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions prove incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make.

Any unreleased services or features referenced in this document or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.

Written by

Jim Pan

Director of Product Management
