Optimizing long-running dataflows

Do you have a dataflow in your org which used to run for a couple of minutes once upon a time, but now it takes far too long to be called efficient? Are you also in a situation where you need to run more dataflows daily, but are stuck with a mammoth and can’t see a way around? Or do you have business users who want to see their data refreshed more frequently, but you have no scope to squeeze in this ask? If yes, then this is the right blog for you. 

In this blog, I’ll talk about how you can optimize your dataflow so that it takes less time to run. The approach mentioned here is based on the contributions and experience of Rikke Hovgaard, Terence Wilson, Siva Teja Ghattepally, and my own. I’ll be using Mohan Chinnappan’s plugin to run an analysis on the dataflow.

Note: You can find out more about installing the plugin in this blog by Rikke.

In order to run this plugin, you need the following user permissions:

  • Manage Analytics user permission, granted via a permission set or profile
  • API Enabled user permission (or a similar one)

You also need to know a bit about the Salesforce CLI. However, the extent of this analysis is limited to the specific commands I’ll mention in this blog.

Before I jump into the details of analyzing a dataflow, I’d like to share this excellent webinar on TCRM asset performance. It covers the best practices to follow while developing TCRM assets. There’s also an excellent webinar on how you can improve dataflow performance.

Datasets and how they’re stored

You are probably wondering why knowing about data storage in TCRM is important for this blog. Further on, you’ll see that the way data is stored in the database matters for performance – not only the performance of your dataflows, but also that of the queries you’ll eventually write in your lenses and dashboards.

TCRM is an ELTL type of platform. ELTL stands for Extract, Load, Transform, and Load. Let me explain the EL part here. Every single time you run a replication, the data from the various sources is extracted and stored in a cached layer. At this stage itself, the data is indexed, and we call this data “Connected Data”. TCRM uses inverted indexes to store data for faster reads. Separate files are created for each dimension and measure, and only the dimensions are indexed. The greater the number of fields you pull, the more files get generated, and hence the longer the replication takes.

Consider the image below:

This connected data is then pulled into your dataflows/recipes, where you transform your data and finally register (load) it as datasets.

The image below provides the full picture of what happens in the background:

(Image: ELTL + Orchestration)

Dataflow Best Practices:

You now know about the ELTL nature of TCRM; this is the foundation I have built this blog on. I’ve divided the details of this analysis into 4 sections. At the end of the blog, I have also listed the quick steps you can take to identify the low-hanging fruit.

Let’s take a look at each section now:

Data Extraction and what you need to keep in mind

This is the Extract and Load part of TCRM. You can keep the following points in mind while dealing with this layer:

  1. Pull only the data you need into your cached layer. This will not only reduce the replication time but will also bring down the processing time when the data is used in your dataflows. For example, you can pull in only the pipeline plus the opportunities closed in the last 2 years (current year vs last year). This ensures only the necessary data is ingested for processing. Please remember that you can apply a filter on extracted data using the complex filter in the sfdcDigest node of the dataflow (a sketch of such a digest filter appears after this list). However, if there is more than one digest node across all the dataflows with different filter conditions, then only one filter condition will be applied. Therefore, as a best practice, put the filter in the Connect tab.

Please note that not all data connectors support this. You can go through this doc to understand which connectors support a filter.

  2. Pull only the fields you will eventually need in your visualizations or computations. This will again bring down the replication time.
  3. More often than not, your visualizations will sort the data in ascending or descending order; for example, to track your sales data accurately you may traverse the most recent opportunities sorted in descending order by Amount. See if you can identify such patterns and, while pulling data from external connectors, try to pull data that is already ordered by the metric you have identified.
  4. Data sync matters – TCRM gives you the option to run Incremental, Periodic, and Full sync for native Salesforce data. Clever usage of sync will help decrease the replication time. For instance, you may not want to run a full replication on your “Event” data; instead, pull only the incremental changes. However, there are some limits and limitations around it. To understand more, please refer to this doc.
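
To make point 1 more concrete, here is a minimal sketch of what a filtered sfdcDigest node could look like in the dataflow JSON. The node name, the field list, and the filter condition are illustrative assumptions, not a copy of any real dataflow; adapt them to your own objects and date logic.

```json
{
  "Extract_Opportunity": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "Opportunity",
      "fields": [
        { "name": "Id" },
        { "name": "Name" },
        { "name": "StageName" },
        { "name": "IsClosed" },
        { "name": "Amount" },
        { "name": "CloseDate" },
        { "name": "CreatedDate" },
        { "name": "OwnerId" },
        { "name": "AccountId" }
      ],
      "complexFilterConditions": "CloseDate >= LAST_N_YEARS:2 OR IsClosed = false"
    }
  }
}
```

Remember the caveat from point 1: this filter effectively lands on the synced object, so if several dataflows digest the same object with different conditions, only one condition is applied. Placing the filter on the Connect tab avoids that ambiguity.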

Data Transformation

  1. The first step in analyzing dataflow runtime is to look at the nodes that take the longest to run. You can leverage the Monitor window of the Data Manager to get info for each node. I personally use Mohan’s plugin to run this analysis. The plugin gives me the runtime details of each node in the dataflow as a CSV, which I simply pull into TCRM to start my analysis. Here’s the blog you can follow for more information about it. Make sure to run this on your production org, or in a sandbox where you can simulate a similar runtime for each node. The commands to run are:
    • Authorize your org – 
      1. If you’re using production – sfdx force:auth:web:login
      2. If you’re using sandbox – sfdx force:auth:web:login -r https://test.salesforce.com
    • Get the job list along with job id – sfdx mohanc:ea:dataflow:jobs:list -u <username>
    • The command for node analysis – sfdx mohanc:ea:dataflow:jobs:timing -u <username> -j <jobId>
    • This will create a CSV with the job id as its name. Locate this CSV and load it into TCRM. Now you can play around with the data and see which nodes are taking long to run.
  2. Data ingestion –
    • The complex filters in the digest node are actually applied to your data source sync (Connect), so be careful while using them. This is a powerful way to limit the amount of data you’re pulling into TCRM and can be instrumental in bringing down both the replication and the digest runtimes.
    • Having more than one digest node for the same object in the dataflow introduces redundancy by ingesting the same object multiple times. If you have a similar situation, merge those nodes (see the sketch after the note below).
    • If you’re using edgemarts in your dataflow, avoid duplicate edgemart nodes.
    • The same applies to data from external sources.

Note: As TCRM is built on the Lightning Platform, it follows Lightning limits as well. Please keep the limit of 100,000 characters in mind while merging nodes. You can find more information here.
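
As a sketch of the “merge duplicate digests” idea (node names and SAQL conditions are illustrative), a single sfdcDigest node such as the Extract_Opportunity node sketched earlier can feed multiple downstream branches instead of being digested twice; each branch then narrows the data with its own filter node:

```json
{
  "Filter_Open_Pipeline": {
    "action": "filter",
    "parameters": {
      "source": "Extract_Opportunity",
      "saqlFilter": "'IsClosed' == \"false\""
    }
  },
  "Filter_Closed_Won": {
    "action": "filter",
    "parameters": {
      "source": "Extract_Opportunity",
      "saqlFilter": "'StageName' == \"Closed Won\""
    }
  }
}
```

The object is ingested once, and the branch-specific logic happens after the digest, which keeps ingestion time down and avoids the redundancy described above.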

  3. Grain matters – A dataset’s grain is the lowest-level entity upon which you can report. Build your datasets from child to parent, and not the other way around, unless it is absolutely unavoidable.
  4. Wider dataset vs multiple datasets – Try to go for wider datasets instead of multiple datasets, as long as the sanctity of the grain is not compromised. TCRM gives you the flexibility to have as many as 5,000 fields, of which 1,000 can be date fields. A wide dataset reduces the complexity of queries, improves query execution performance, and eases maintenance of queries and dashboards. Mixing grains is where this backfires: for example, a dataset that holds all cases and all opportunities together will prove rather expensive when you use it in your lenses, dashboards, etc.
  5. Multi-value fields – Multi-value fields may prove to be expensive in augments, especially when they’re in the right select of the augment node. Having many multi-value fields in the registered dataset can also impact the final dataset register time. So, these fields should be handled with care.
    • If a multi-value lookup cannot be avoided, perform it at the right grain. For example, say you’re augmenting users onto opportunities and also want all the teams the opportunity owner is part of. In most orgs, the number of users is far smaller than the number of opportunities, so perform your multi-value lookup at the user level first, and then augment the result onto opportunities (see the sketch after the note below). In this manner, you can reduce long-running augments for multi-value fields.
    • You can either create multi-value fields using the flatten transformation, or the multi-value field can be present in the data source itself.
    • Reusability of the fields should be kept in mind.
    • Pull only the needed fields into the ‘right select’ parameter of a lookup, especially in multi-value lookups. Dimensions that are multi-value fields, when in the right select, can increase the runtime of the augment. They can also affect the register time of the dataset.

Note: In case you wish to read more about the limits and limitations around the multi-value fields, please find them here.
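
Here is a minimal sketch of the “multi-value lookup at the right grain” pattern described above. It assumes hypothetical digest nodes Extract_User and Extract_TeamMember exist upstream; the node, field, and relationship names are illustrative only. The multi-value lookup runs at the user grain (few rows), and only its result is augmented onto the opportunity grain (many rows):

```json
{
  "Augment_User_Teams": {
    "action": "augment",
    "parameters": {
      "left": "Extract_User",
      "left_key": [ "Id" ],
      "right": "Extract_TeamMember",
      "right_key": [ "UserId" ],
      "right_select": [ "TeamName" ],
      "relationship": "Teams",
      "operation": "LookupMultiValue"
    }
  },
  "Augment_Opp_Owner": {
    "action": "augment",
    "parameters": {
      "left": "Extract_Opportunity",
      "left_key": [ "OwnerId" ],
      "right": "Augment_User_Teams",
      "right_key": [ "Id" ],
      "right_select": [ "Name", "Teams.TeamName" ],
      "relationship": "Owner",
      "operation": "LookupSingleValue"
    }
  }
}
```

Doing the expensive LookupMultiValue on the smaller user table first, and keeping the right select as short as possible, is what keeps the augment (and the eventual register) time in check.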

  6. Calculated fields in datasets vs on the fly – This is an age-old question that doesn’t always have a very straightforward answer. I think the answer can be a multi-fold one:
    • What’s the final cardinality of your result – If the final cardinality isn’t very high, a calculation on the fly may do the trick. But if the cardinality is high, then you may want to consider moving the calculation into the dataset. For example, a split between late and non-late opportunities as a percentage is a reasonable on-the-fly calculation, whereas determining the age of a case is a good example of a calculation to be done in the dataset.
    • What is your benchmark EPT (Experienced Page Time) – If you realize that you’re missing your EPT target because of an on-the-fly calculation that isn’t exactly on grain but is still contributing to the increased load time of the page, then you may want to consider moving this calculation to the dataset (and, of course, I am assuming the query is written as per best practices).
    • Push complex logic to the dataset, as it may take longer to run on the fly.
    • The final call is always about where you can tolerate the increased runtime – dashboard or lens EPT, or dataflow runtime.
  7. Filters –
    • Remove redundant filters – obvious but true. For example, a filter like 'Country' == "Germany" && 'City' == "Munich" can be redundant: if Munich already implies Germany in your data, the country condition adds nothing.
    • Remove filters that do not return any data. After all, what’s the point of such a filter condition, right? For example, 'Stage' == "XYZ" may not return any records if the stage XYZ doesn’t exist in the ingested data. See if you can remove it. To spot these easily, I keep an eye on the input and output row counts of each node, which I get from Mohan’s plugin.
    • Your dimensions are indexed, measures are not. So a filter on a dimension will run much faster than one on a measure. Also, indexing takes place once TCRM is done with the “Load”, so the dimensions in your connected data and registered datasets are indexed. A filter on a dimension computed within the same dataflow will not have the advantage of indexed filter behavior; however, you’ll benefit from indexing when you use the resulting dataset in your queries or as a data source in another dataflow or recipe.
    • Filtering on null values is time-consuming, so see if you can replace the nulls with a default value. You can do so in the sfdcDigest node or by using a compute expression; I prefer the former.
  8. If you’ve got null values in your data, be mindful of them when you use them in augments. Check out this very illustrative article about it. After running Mohan’s plugin, check if there are any warnings for augments. If an augment isn’t returning any matches, then you may have an issue with the augment key or even with your data. See if this situation can be avoided. For me, the golden rule is to avoid redundancy wherever possible.
  9. Compute expressions may be the culprit – More often than not, compute expressions prove to be the culprit behind an increased dataflow runtime. This node, however helpful, runs the computation on each row for the given columns in the expression. (A consolidated sketch covering this point and the next ones appears after this list.)
    • See if you have run these calculations at the correct grain. For example, say you want to group countries into zones and have decided to use a compute expression to do this. Pull in your countries table, create a compute expression on it to calculate zones, and then augment this node to the other grains.
    • A clever usage of filter nodes right before compute expression nodes can bring down the runtime. I have come across multiple scenarios where a rigid and long case-when statement is written in a compute expression, and a couple of conditions in the statement are common across multiple fields. See if you can split the complex expression into a combination of filter + compute expression nodes and then augment the fields back to the original source.
    • I cannot stress enough the reusability of fields, especially the ones created to support augments, filters, and compute relative expressions downstream. Again, obvious but true.
    • Avoid creating “measure” fields in a compute expression if you’re going to use those fields only in filters in your lens/dashboard.
    • Duplicating columns using a compute expression? – I have seen people create fields with a compute expression whose sole purpose is to change the “name”/“label” of a field. This was perhaps done when dataset XMDs were not deployable, or because an upstream change altered field references and broke existing dashboards/lenses. This is redundant; see if you can avoid it.
    • Do you have a string of compute expression nodes that can actually be merged into one? If yes, then do it. This reduces dataflow runtime.
  10. Compute relative node – Compute relative nodes are very powerful. They partition your data by a field, sort it by another field, and then give you the flexibility to calculate your business metric. Compute relatives work within grouped and ordered rows, a lot like a window function. This is another node that may take a while to run, so make sure to run it at the right grain, and keep the points about compute expressions in mind while calculating computed fields within a compute relative.
  11. Get rid of nodes that do not lead to a register node; otherwise you’re running those transformations unnecessarily.
  12. Slice node to the rescue – In the process of data transformation, you may create a couple of fields that are no longer needed downstream. Before you register your dataset, slice away those fields using a slice node (see the sketch below).
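
To tie points 9, 10, and 12 together, here is a minimal sketch of a consolidated compute chain: one computeExpression node carrying several computed fields instead of a string of separate nodes, a computeRelative node partitioned and ordered at the right grain, and a sliceDataset node dropping a field that is no longer needed before registration. The source node (Augment_Opp_Owner, from the earlier sketch), the field names, the 90-day “late” rule, and the assumption that AccountId is only needed as a partition key are all illustrative, not a prescription:

```json
{
  "Compute_Opp_Flags": {
    "action": "computeExpression",
    "parameters": {
      "source": "Augment_Opp_Owner",
      "mergeWithSource": true,
      "computedFields": [
        {
          "name": "Is_Late",
          "type": "Text",
          "saqlExpression": "case when 'CloseDate_day_epoch' - 'CreatedDate_day_epoch' > 90 then \"Late\" else \"On Time\" end"
        },
        {
          "name": "Close_Quarter",
          "type": "Text",
          "saqlExpression": "'CloseDate_Year' + \"-Q\" + 'CloseDate_Quarter'"
        }
      ]
    }
  },
  "Compute_Prior_Amount": {
    "action": "computeRelative",
    "parameters": {
      "source": "Compute_Opp_Flags",
      "partitionBy": [ "AccountId" ],
      "orderBy": [ { "name": "CloseDate", "direction": "asc" } ],
      "computedFields": [
        {
          "name": "Previous_Amount",
          "expression": {
            "sourceField": "Amount",
            "offset": "previous(Amount)",
            "default": "0"
          }
        }
      ]
    }
  },
  "Slice_Helper_Fields": {
    "action": "sliceDataset",
    "parameters": {
      "source": "Compute_Prior_Amount",
      "mode": "drop",
      "fields": [ { "name": "AccountId" } ]
    }
  }
}
```

If the computed fields only make sense for a subset of rows, a filter node placed before Compute_Opp_Flags (as suggested in point 9) keeps the row-by-row computation from running on data you will discard anyway.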

Dataset Register

  1. How many datasets should you create from a dataflow? I have come across dataflows that generate 10+ datasets, and dataflows that create as few as one. It is no brain-twister that a higher number of datasets will result in a higher dataflow runtime. So, how many datasets should one create from one dataflow? There’s no straightforward answer to this question; it is rather driven by a couple of factors:
    • The data refresh frequency and schedule
    • How long can you let your dataflow run
    • Number of different projects the dataflow caters to
  2. Do you have sharing inheritance and security predicates in the dataflow’s register node? Please note that while you can create predicates using the register node, you cannot update them through the register node; you’d have to update the predicate through the dataset edit window. (A sketch of such a register node appears after this list.)
  3. Are you registering your edgemart dataset without any transformation? If yes, then why? See if this can be avoided as it simply takes up space in your org.
  4. You have got a couple of fields in your dataset, but are you using them all? Or are you even using the datasets you created once upon a time? Keep your datasets clean. You can use the “Analytics Adoption” application to figure out which datasets are being used, but it doesn’t tell you which fields within a dataset are unused. This is again where I like to use Mohan’s plugin to run dataset and field usage analysis. The plugin checks for usage in dashboards, lenses, and security predicates. The commands I follow are:
    1. Authorize your org – 
      • If you’re using production – sfdx force:auth:web:login
      • If you’re using sandbox – sfdx force:auth:web:login -r https://test.salesforce.com
    2. Get the job list along with job id – sfdx mohanc:ea:dataflow:jobs:list -u <username>
    3. Command for field usage analysis – sfdx mohanc:ea:dataflow:jobs:analyze -u <username> -j <jobId> -d <dataflowId>
    4. This also generates a CSV with the dataflow id as its name. Import this CSV into TCRM and start analyzing the field usage. If a field’s usage is low, you may want to drop it from your dataflow, either in the digest nodes or by using a slice node. Please note that the size of a dataset is a function of the number of rows, the number of columns, and the type of columns. Keeping unnecessary fields out will bring down the dataset size, thus reducing the register node runtime.
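
Continuing the same illustrative fragment, here is a minimal sketch of an sfdcRegister node with sharing inheritance and a backup security predicate (point 2 above). The source node, alias, dataset name, and the ownership-based predicate are assumptions for the example; remember that a predicate created here cannot later be updated through the register node itself:

```json
{
  "Register_Opportunities": {
    "action": "sfdcRegister",
    "parameters": {
      "source": "Slice_Helper_Fields",
      "alias": "Opportunities_With_Owner",
      "name": "Opportunities With Owner",
      "rowLevelSharingSource": "Opportunity",
      "rowLevelSecurityFilter": "'OwnerId' == \"$User.Id\""
    }
  }
}
```

Note that the ownership-based predicate relies on 'OwnerId' still being present in the registered data, so don’t slice away fields your predicate or sharing setup needs.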

Orchestration:

It’s all about data. We have so far covered what to keep in mind when you pull this data, transform it, and eventually create datasets. But keeping a close tab on when to pull this data is also important. Here are some of the scenarios to keep in mind:

  1. The connected data – Schedule the sync of your connected data. For Salesforce data, this sync can be incremental, periodic, or full in nature. Schedule your sync to run before your dataflows/data prep recipes consume this data. This will ensure your data isn’t stale when dataflows and data prep process it.
  2. Want to run different kinds of syncs for sfdc objects? Clever usage of the sfdc_local connector can prove to be a smart move in running data sync at the right frequency and with the right data volume. For example, say you have got millions of opportunities in your org and a couple of thousand users. As per your business process, you need the opportunity data refreshed every 6 hours, whereas the user data can be refreshed once a week. In this scenario, leverage the sfdc_local connector to split the objects with different data refresh frequency needs.
  3. External on-prem data – Say you have got external data that you’re bringing into TCRM using APIs. Again, make sure to run the relevant batch class, for instance, at the right time to ensure the data being transformed further downstream is fresh.
  4. Dataflow/data prep schedule –
    • Make sure you run your dataflow/data prep after the sync finishes. As of today, you can kick off a dataflow when the data sync finishes.
    • If your dataflow is dependent upon the output of a data prep recipe, or vice versa, make sure you run the former first and then the latter. Look out for this OOB functionality in the Summer ’21 release (safe harbor).

Summary and a quick look at the analysis:

While I have tried to capture multiple scenarios above, it may not be possible for you to implement everything. A change as simple as renaming a field, removing it, or placing a filter on the data requires analysis, and the implementation and testing can prove to be time-consuming. Trust me, this is normal. That is why I have listed below a couple of steps you can use to identify the low-hanging fruit that may bring down the dataflow runtime. Here are the quick wins you are looking for:

  1. Figure out which nodes are taking the longest to run. Obvious, right? Check out the section Data Transformation, point 1.
  2. Remove unused/unnecessary fields from your registered datasets. Identify these fields using Mohan’s plugin. This will bring down the register time, and hence the dataflow runtime. Check out the section ‘Dataset Register’.
  3. Pull only the data you need into TCRM. If there are any filters that can be moved to the cached layer instead, do it and remove the filter node from the respective dataflow(s). Check out the section ‘Data Extraction and what you need to keep in mind’, point 1.
  4. In your dataflow, identify duplicate nodes, for example multiple sfdcDigest nodes for one object, or multiple edgemarts for the same dataset. Merge these.
  5. Remove redundant filter nodes.
  6. Remove redundant computed fields.
  7. Remove nodes that are not leading to a register node.

I hope you found this blog helpful. In case there are scenarios that you think I should add, please do let me know in the comments section. I’ll try to suggest how you can optimize further. Thank you!

3 thoughts on “Optimizing long-running dataflows”

  • Pranit on June 3, 2021

    It’s a very informative blog Darshana, thanks a lot. Some of my dataflows were running for almost an hour, and now I know exactly where I was going wrong. It helped a lot.

  • Darshna Sharma on June 7, 2021

    I am glad you found it helpful 🙂

  • Jakob Laursen on August 6, 2021

    Hi Darshna,

    Thanks for the article.

    In our org we have some very long running dataflows, which I am trying to reduce in running-time.
    A large part of the time is used in the “optimize-” step running before registering the dataset.

    The problem is that we have some very large datasets with transactional data going back to 2017 (25,000,000+ rows), and the datasets will be used in different apps with different security, hence we need to register up to 5 different versions of almost the same dataset, and each optimize step runs for 15+ minutes.

    Do you have any inputs for what we potentially can do?

    Br
    Jakob
