Clean up Data Manager Assets

Darshna Sharma 16. September 2021 Connect and Prepare Data, Winter 22 3

Like any other tool out there, Tableau CRM is bound by limits and limitations. The license type you purchase will introduce you to an ocean of possibilities and resources when you’re just starting with TCRM. However, if you’re not careful enough you may end up with a dirty TCRM closet, one where you have a lot of assets you are not using.

In this article, I’ll cover how you can clean up your Data Manager assets – dataflows, recipes, datasets, and connected objects – which ultimately will help in avoiding hitting limits and help optimize performance. This is going to be a long, but valuable one. So let’s get started…

Master Game Plan

I have divided this blog into two broad categories:

Datasets, Dataflows, and Data Prep
Connected Objects

For each category, I’ve followed a simple approach mentioned below:

Identify existing limits and their current consumption
Identify the asset(s) to clean – dataflow/data prep/dataset/connected objects
Outline the process of cleanup

By the end of this blog, you should be in a position to identify what to clean and how to clean.

Datasets, Dataflows, and Data Prep

This section deals with datasets, dataflows, and data prep. Here, I have highlighted how to analyze, identify and deleted/update unused/redundant datasets, dataflows, and data prep aka recipes.

1. Limits and Current Consumption

Let’s say you’re close to hitting a limit in TCRM. Step 1 would be to identify the current consumption of limit against the prescribed limits. This is what this section is all about.

Note: You can find mention of all global limits here.

In this section, I have discussed limits related to datasets, dataflows, and data prep. You can use the table below to better visualize your limits. I usually define criteria for what is “critical” and highlight them in red. Keeping a close eye on limits’ consumption is definitely a good practice.

In case you’re wondering how you can get the limits available in your org, you can use REST APIs for it.

/services/data/v52.0/limits
/services/data/v52.0/wave/limits

Once you have identified the limits, your org’s capacity and current consumption, you’re ready to figure out which limit you should tackle first.

Note: A more comprehensive view of the limits and their consumption would be available OOB sometime in the future (safe harbor). Check out this article by Antonio for more details on it.

2. Analysis

So now you know which limits you may potentially hit in the near future. Be it a limit on dataflows, datasets, data prep, or even connected objects, cleaning up the datasets will help them all.

Let’s first identify datasets that are not being used at all. For this, I have used “Most Recently Used” (MRU) parameter of the dataset. You can get this value either through REST API or can easily access it through the Analytics Adoption application. I have divided this analysis into two parts:

Identify and analyze datasets that are not in use currently
Perform field usage analysis to identify and analyze unused fields registered in datasets.

Step 1: Identify Datasets not used recently

Define time period for “Recently Used Dataset” or “RUD”
Identify datasets that are not RUD:
- Using Rest API
  - Use the ‘lastAccessedDate’ parameter if you’re using REST API
  - /services/data/v52.0/wave/datasets/<datasetId>
- Using Analytics Adoption application:
  - Explore ‘DatasetLinkageSFDC’ dataset
  - Execute the query below and keep the results handy:

q = load "DatasetLinkageSFDC";
--remember to update the timeframe in the filter
q = filter q by date('LastAccessedDate_Year', 'LastAccessedDate_Month', 'LastAccessedDate_Day') in ["Define Period"];
q = group q by ('DatasetName','DatasetId', 'LastAccessedDate_Year', 'LastAccessedDate_Month', 'LastAccessedDate_Day');
q = foreach q generate 'DatasetName', 'DatasetId', 'LastAccessedDate_Year' + "~~~" + 'LastAccessedDate_Month' + "~~~" + 'LastAccessedDate_Day' as 'LastAccessedDate_Year~~~LastAccessedDate_Month~~~LastAccessedDate_Day', sum('RowCount') as 'sum_RowCount';
q = order q by 'sum_RowCount' desc;

Identify the sources of these datasets. If you do not see the source of a given dataset, then it is either generated through CSV, external data API, or the source dataflow/recipe has been deleted. Keep this handy as well.
Now generate a list of dependent components. There are a couple of options to do that:
- Analytics Adoption application – You can get the list of dependent dashboards, and lenses. However, it doesn’t tell you usage in dataflows, recipes, and stories. This is where you can leverage REST API for more accurate results.
- REST API –
  - Use dependency API for TCRM dependent assets: /services/data/v52.0/wave/dependencies
  - Use discovery API for ED dependent assets: /services/data/v52.0/smartdatadiscovery/stories

Step 2: Perform Field Usage Analysis

You’d want to run field utilization analysis for two reasons:

Removing redundancies will speed up your existing dataflows
Field usage can open insights about fields and connected objects which indeed are being synced, but are not being used at all. Removing them will speed up your replication process, and will open up space for more sync objects

In TCRM, you can create datasets through external APIs, dataflows, data prep, and CSV files. Currently, Mohan’s plugin supports only dataflows in his field utilization analysis. But you can run the same for individual datasets. So, I’ve split this analysis into two parts:

Field Usage Analysis for dataflow generated datasets
Field Usage Analysis for other datasets

You can run your own scripts to figure out field usage analysis. At the end of the analysis, you should have a list of all the fields registered in a dataset and their usability. I like to use Mohan’s plugin here to do the same. You’ll find it’s mention in ‘How to optimize long running dataflows’ as well.
The steps to get Mohan’s plugin are here. His plugin checks for usage in dashboards, lenses, security predicates. It doesn’t consider fields used in transformations within dataflow, or in other dataflows/recipes, stories. So an expert’s check is required here.

For datasets created using dataflows, run the commands below:

Authorize your org:
- If you’re using production:
  sfdx force:auth:web:login
- If you’re using sandbox:
  sfdx force:auth:web:login -r https://test.salesforce.com
Get the job list along with job id:
sfdx mohanc:ea:dataflow:jobs:list -u <username>
The command for field usage analysis:
sfdx mohanc:ea:dataflow:jobs:analyze -u <username> -j <jobId> -d <dataflowId>

For other datasets, run the commands below:

Authorize your org –
- If you’re using production:
  sfdx force:auth:web:login
- If you’re using sandbox:
  sfdx force:auth:web:login -r https://test.salesforce.com
The command for field usage analysis:
sfdx mohanc:ea:dataset:fieldUsage -u <username> -d <datasetname>

Look at the results, analyze the usability of the fields mentioned, and make a decision on which fields you think can be removed from the final dataset and connected data sources.

By end of this exercise, you should have two lists with you:

Lists of all datasets you think are not being used and can be deleted – make sure to have their sources and dependencies listed down as well
Lists of fields within datasets that you think can be deregistered from dataflows, recipes, and connected objects.

3. Cleanup

You now know what has to be cleaned up, be it deletion or removal of columns from registered datasets. Your company may have its own rules when it comes to modifying metadata in the org. However, I have captured steps that you may follow:

Step 1: Backup your metadata – Take a backup of all the datasets you are going to delete:

Metadata backup
Data backup, if applicable
Take a backup of dependent assets as well

Step 2: Deletion Preparation – Prepare and sign-off on a release plan and communication strategy for this move. You can keep the following points in mind while coming up with the strategy.

In the first step, revoke access of end-users to the datasets to be deleted on a communicated date.
Make sure respective users and stakeholders are informed about access revoke. Wait for people to revert through your company preferred communication channel. Give sufficient time, driven by company policies, to people to respond. Send multiple reminders if necessary.
Address scenarios where people revert, analyze, and manage the ask. Update the list of datasets to delete, if necessary.
Revoke access to datasets on the communicated date.
Wait again for sufficient time and see if your users revert back. Address these situations, analyze and manage the ask.

Step 3: Dataset deletion in a lower environment – The ideal way to proceed with this step would be to perform this action first in ~~production~~ sandbox where the TCRM, ED and embedded dashboard’s metadata is the same as in production. These are the steps you can follow to delete datasets.

You can use metadata API for this
You can invoke CRUD call for this
You can delete datasets manually.

Step 4: Testing, and Communication – Push these changes to the QA environment, have stakeholders validate, and do the smoke testing.

Make sure stakeholders sign off the release changes
Communicate and send reminders on the hard deletion date of datasets.

Step 5: Dataset Deletion – Delete the dataset in production on the day of Go-Live.

And voila! Your org now has no unused datasets, dataflows, data prep and the field utilization percentage of your datasets is also high. What next? Connected Objects…

Connected Objects:

It’s a good move to clean up the first section of this blog and then jump to cleaning up connected objects. Once you clean them up, you’d know which object and fields you do not need to replicate.

In this section as well, I’ve followed the same strategy as the section above…

1. Limits and Current Consumption

Link to global limits

I have recreated the table:

2. Analysis as per best practices

I have assumed that the analysis of which datasets can be deleted, field utilization analysis, and modification of dataflows, recipes, and data dictionaries is already done.

The data dictionary is quite important here. A typical format of data dictionary looks like this:

In case you do not have a data dictionary, you can leverage tools like “Einstein Analytics Data Catalog” or “Mohan’s plugin” to create one. Neither of these tools provides fully formed data dictionaries but can be a good starting point. In case you use Mohan’s plugin, use the commands below to get the list:

Authorize your org:
- If you’re using production
  sfdx force:auth:web:login
- If you’re using sandbox
  sfdx force:auth:web:login -r https://test.salesforce.com
Identify dataflow id
Run the field source command:
- sfdx mohanc:ea:dataflow:fieldSource -u <username> -d <dataflowid>

The data dictionary will provide the necessary information on:

Connected objects which are not used in any dataflow or recipe
Synced fields that are not used in any dataflow or recipe.

3. Cleanup

As of today, you cannot deploy connected datasets through metadata API. But you can use REST API instead. By this point, you already have the information on which objects and fields can be removed from the sync. The next step would be to do the clean-up.

Follow the steps below for the cleanup:

Step 1: Removal of fields from connected object:

Through the UI:
1. Locate the connected object in “Connect” tab. Make sure you’re in the right connector space.
2. Click on the object
3. Deselect unwanted fields
4. Save
5. Run replication
Through REST API:
1. PATCH the unnecessary fields as "skipped" : true
2. Follow this document for more information.

Step 2: Deletion of connected object:

Through the UI:
1. Locate the object in “Connect” Tab. Make sure you’re in the right connector space.
2. Disconnect the object
Through REST API:
1. DELETE the unnecessary connected object
2. Follow this document for more information.

Note: Disconnecting/Deleting a connected object also removes the associated connected dataset.

Final thoughts…

Tableau CRM is all about data. Cleaning up the data manager assets will not only prevent hitting the limit(s) but will also speed up your ETL processes – your dataflows and data prep will run faster, your replications jobs will run faster too. Your datasets will become lighter and hence your queries can see runtime improvement as well (to some extent).

After all the efforts, if you’re still nearing the limits, please do reach out to salesforce support. Not all limits are hardcoded and you may just get your limits extended with the right justification.

In case there are more cases you’d want me to cover, or have any questions, please do feel free to drop a comment below. Thanks!

Written by

Darshna Sharma

See author's posts

3 thoughts on “Clean up Data Manager Assets”

1

PRANIT on September 16, 2021 Reply

Thanks Darshana , really helpful
2

Evan Emerson on November 19, 2021 Reply

Thank you! Absolutely helpful and extremely relevant my new role’s scope of responsibilities.
3

Viktar on March 1, 2024 Reply

Do you know why I get only 24 limit of datasets?

Clean up Data Manager Assets

Master Game Plan

Datasets, Dataflows, and Data Prep

1. Limits and Current Consumption

2. Analysis

Step 1: Identify Datasets not used recently

Step 2: Perform Field Usage Analysis

3. Cleanup

Connected Objects:

1. Limits and Current Consumption

2. Analysis as per best practices

3. Cleanup

Final thoughts…

Written by

Darshna Sharma

Related

3 thoughts on “Clean up Data Manager Assets”

Leave a Reply Cancel reply

Master Game Plan

Datasets, Dataflows, and Data Prep

1. Limits and Current Consumption

2. Analysis

Step 1: Identify Datasets not used recently

Step 2: Perform Field Usage Analysis

3. Cleanup

Connected Objects:

1. Limits and Current Consumption

2. Analysis as per best practices

3. Cleanup

Final thoughts…

Written by

Darshna Sharma

Share this:

Related

3 thoughts on “Clean up Data Manager Assets”

Leave a Reply Cancel reply