This post is part of the Data Orchestration blog series. In this part, we discuss in detail how to work with dependent dataflows and recipes, including a practical approach.
Dependent Dataflows and Recipes
Let’s quickly review the “tale of three dataflows” from part 6. We had three dataflows, Astro, Codey, and Einstein, that were scheduled to run at the same time. Because only two dataflows can run concurrently, one dataflow will be queued. In the example below, Einstein must wait for one of the other dataflows to finish.
Taking this example further, there could be instances where Einstein and Astro run first and Codey has to wait for either of them to finish. Hence it becomes difficult to predict the sequence of these dataflows as well as the runtime of each.
This example illustrates that we cannot be certain which of the dataflows will be queued or finish first, which makes some situations challenging.
But what are the implications?
To answer this question, let’s take our example further to understand why it can be important to know the runtime of a dataflow. Again we will ask Astro, Codey, and Einstein to run, but this time it is not a race; it is a relay.
We want Einstein to run first, followed by Astro and then Codey. How can we achieve this?
Let the Relay Begin
We already know who should be running first and who will have to wait. So let’s see how this pans out.
The above image clearly illustrates the issue: it is not clear when to schedule Astro and Codey. Still, we know we want to ensure that a specific sequence is maintained.
Why is the Sequence of Dataflows so Critical?
There are scenarios where we want to use an existing dataset as the starting point for a dataflow. Let’s say that we are using Dataset 1 from Einstein and Dataset 2 from Astro as data sources for Codey by leveraging the edgemart transformation.
Note: Check the Salesforce help documentation for more information on Edgemart Transformation.
In this scenario, if the sequence is not maintained, the outcome of the datasets will vary and most likely result in incorrect data. Hence the sequence of the dataflows is crucial and has to be exactly as we planned it to be.
Enhanced Event-Based Scheduling
Coming in the Summer ’21 release, event-based scheduling is being enhanced to allow you to schedule a recipe or dataflow to run after another recipe or dataflow (up to five). Please note that this feature is being added on a rolling basis during the Summer ’21 release.
Control Your Data with the REST API
An important feature we have in Tableau CRM is the ability to make API calls. You can access Analytics features such as datasets, dashboards, and lenses programmatically using the Analytics REST API.
Note: Check the Salesforce help documentation for an overview of the Analytics REST API.
Using the Analytics REST API, we can also start, stop, and schedule our data syncs, dataflows, and recipes. This is helpful when we need to maintain a certain sequence. Note that all jobs started via the REST API, as well as those set up in the UI, can be monitored from the Dataflow Monitor tab in the Tableau CRM data manager.
Note: It is strongly recommended to check the complete list of REST API requests in the Salesforce help documentation.
Approach to Dependent Jobs
As mentioned, we can make REST API calls to start a dataflow as well as to check the status of the dataflow job. We will use this feature to control the sequence of the dataflows.
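To make those two calls concrete, here is a minimal Python sketch that only builds the requests and does not send them. It assumes the `wave/dataflowjobs` resource, a `{"dataflowId": ..., "command": "start"}` body, and API version v52.0; please verify all three against the Analytics REST API documentation for your org. The function names, the instance URL, and the token are hypothetical placeholders.

```python
import json
import urllib.request

API_VERSION = "v52.0"  # assumption: adjust to the API version of your org


def start_request(instance_url, token, dataflow_id):
    """Build (but do not send) the POST that starts a dataflow job."""
    # Assumed request body; check it against the Analytics REST API reference.
    body = json.dumps({"dataflowId": dataflow_id, "command": "start"}).encode("utf-8")
    return urllib.request.Request(
        f"{instance_url}/services/data/{API_VERSION}/wave/dataflowjobs",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def status_request(instance_url, token, job_id):
    """Build (but do not send) the GET that reads a single dataflow job."""
    return urllib.request.Request(
        f"{instance_url}/services/data/{API_VERSION}/wave/dataflowjobs/{job_id}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
```

Either request can then be passed to `urllib.request.urlopen` and the JSON response parsed; the job status is what we read to decide whether the next dataflow may start.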
In the use case we have discussed, the business requirement is that Dataset 3 is created from Dataset 1 and Dataset 2. As established, we cannot control the sequence of the dataflows based on runtime alone, and setting a schedule for dependent dataflows is a real challenge. So let’s look at how we can use the Analytics REST API to meet the business requirement in five steps:
- POST: Using the Analytics REST API, we make a POST request to start the Einstein dataflow.
- GET: We then monitor the status of the Einstein dataflow by repeatedly checking the dataflow jobs in the monitor with a GET call using the Analytics REST API.
- POST: Once the Einstein dataflow job status is “Success”, Dataset 1 has been created or updated, so we can make a POST request using the Analytics REST API to start the next dataflow, i.e., Astro.
- GET: We then monitor the status of the Astro dataflow by checking the dataflow jobs in the monitor. Once the status is “Success”, Dataset 2 has been created or updated.
- POST: Start the last dataflow, Codey. Codey uses Dataset 1 and Dataset 2, so it is important to wait for those datasets to be available before we create or update Dataset 3.
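The five steps above boil down to a start-then-poll loop. The sketch below shows that loop in Python under a few assumptions: `start_job` and `get_status` are hypothetical callables that wrap the POST and GET calls, and “Failure” and “Warning” are assumed to be the statuses a job can end in other than “Success”. It is one possible shape for the relay, not a definitive implementation.

```python
import time


def run_relay(dataflow_ids, start_job, get_status,
              poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Start each dataflow only after the previous one reports Success.

    start_job(dataflow_id) -> job_id  (hypothetical wrapper around the POST)
    get_status(job_id) -> str         (hypothetical wrapper around the GET)
    """
    finished = []
    for dataflow_id in dataflow_ids:
        job_id = start_job(dataflow_id)
        waited = 0
        while True:
            status = get_status(job_id)
            if status == "Success":
                break  # the dataset is ready, move on to the next runner
            if status in ("Failure", "Warning"):  # assumed status names
                raise RuntimeError(f"{dataflow_id} ended with status {status}")
            if waited >= timeout_seconds:
                raise TimeoutError(f"{dataflow_id} still {status} after {waited}s")
            sleep(poll_seconds)
            waited += poll_seconds
        finished.append(dataflow_id)
    return finished
```

Calling `run_relay(["Einstein", "Astro", "Codey"], start_job, get_status)` would run the relay in the order we want. Stopping on anything other than “Success” is a deliberate choice here: Codey should not run against a Dataset 1 or Dataset 2 that may be stale.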
Do You Need to Be a Geek?
Now that we know it is possible to control the sequence of dataflows, or orchestrate them, using the Analytics REST API, you are probably wondering whether you need to be a geek to do it. The answer: not really.
There are various ways to orchestrate the dataflows; it really comes down to what you are comfortable using. Below are some of the ways you can orchestrate your dataflows and recipes.
Platforms like Postman have a simple UI and make it easy to send API calls. To make it even simpler, we have created a scenario-based sample API collection that you can download, tweak, and use to orchestrate the dataflows, which I have linked to as we walk through the use case.
Note: Check out the steps to set up Postman.
Yes, we can also use Python to make these REST API calls. We have put some scenario-based code together, which I have linked to as we walk through the use case.
Different Orchestration Scenarios
We have covered what data orchestration is and why we need to orchestrate the dataflows and recipes. Let’s see some common scenarios you may very well encounter in your business cases:
- Run dataflows in sequence
- Run recipes in sequence
- Run dataflow after a recipe in sequence
- Run recipe after a dataflow in sequence
Let me walk you through these scenarios and introduce some sample code you can benefit from.
Run Dataflows in Sequence
We used this scenario earlier to understand why and how orchestration is important. In this case, a dataset created by one dataflow is used as the source for the next dataflow.
Here is the sample code to address this use case:
Run Recipes in Sequence
This is similar to running dataflows in sequence. One thing to note here is that we have to use the targetDataflowId, which we can get from Workbench. Workbench is a web-based tool that helps administrators and developers interact with Salesforce to insert, update, upsert, delete, and export data. It also supports undelete, deploy, retrieve, REST Explorer, and Apex Execute actions.
Here is the sample code to address this use-case.
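The following is a minimal sketch of the recipe case rather than a complete sample. It assumes the recipe JSON (for example, as displayed in Workbench’s REST Explorer) exposes a `targetDataflowId` field, and that a recipe is started with the same request body as a dataflow. The function names and the IDs in the usage note are made up for illustration.

```python
def startable_id_for_recipe(recipe):
    """Return the dataflow ID that is used to start a recipe.

    `recipe` is the parsed JSON of one recipe. The field name
    targetDataflowId is an assumption taken from the text above.
    """
    target = recipe.get("targetDataflowId")
    if not target:
        name = recipe.get("name", "?")
        raise ValueError(f"recipe {name} has no targetDataflowId")
    return target


def start_payload_for_recipe(recipe):
    """Build the start body for a recipe (assumed same shape as a dataflow)."""
    return {"dataflowId": startable_id_for_recipe(recipe), "command": "start"}
```

For a recipe such as `{"name": "Accounts", "targetDataflowId": "02Kxx0000000001"}`, the payload carries `02Kxx0000000001` as the `dataflowId`, so the rest of the sequence logic can stay the same as for dataflows.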
Run Dataflow after a Recipe in Sequence
In this case, some datasets come from a recipe and others from a dataflow. We start the recipe, check its status, and then start the dataflow.
Here is the sample code to address this use case.
Run Recipe after a Dataflow in Sequence
In this case, some datasets come from a dataflow and others from a recipe. We start the dataflow, check its status, and then start the recipe.
Here is the sample code to address this use case.
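Both mixed scenarios differ only in the order of the steps, so one way to think about them is as an ordered plan of IDs to start. The sketch below assumes, as in the recipe scenario, that a recipe is started through its `targetDataflowId`; the `build_plan` helper, the step tuples, and all IDs are hypothetical.

```python
def build_plan(steps, recipes_by_id):
    """Turn ordered (kind, id) steps into the IDs to start, in that order.

    steps: list of ("dataflow", dataflow_id) or ("recipe", recipe_id) tuples
    recipes_by_id: parsed recipe JSON keyed by recipe ID (hypothetical lookup)
    """
    plan = []
    for kind, item_id in steps:
        if kind == "dataflow":
            plan.append(item_id)
        elif kind == "recipe":
            # Assumption: a recipe is started via its targetDataflowId.
            plan.append(recipes_by_id[item_id]["targetDataflowId"])
        else:
            raise ValueError(f"unknown step kind: {kind}")
    return plan
```

A plan for “dataflow after recipe” is then `build_plan([("recipe", "recipeA"), ("dataflow", "df1")], recipes)`, and swapping the two tuples gives “recipe after dataflow”. The resulting list can be handed to whatever start-and-poll routine you already use.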
Hopefully this comprehensive blog series has clarified and simplified the concept of data orchestration, including the many considerations an analytics developer should keep in mind when managing data in Tableau CRM. If you wish to review any of the other blogs in this series, head back to the overview page.