Data Sources and Landscape – Data Orchestration Part 1
This blog is part of the Data Orchestration blog series. A key part of Data Orchestration is of course data sources and the landscape of data you are creating. This blog will take a closer look at these topics.
What is Data Orchestration?
First of all, let’s define what Data Orchestration is. Data Orchestration is the process carried out by a piece of software that takes siloed data from multiple sources, combines it, and makes the data available for data analysis.
In Tableau CRM (Einstein Analytics), the Data Manager is the key tool that helps us get data from multiple sources, combine and transform it, and store it. Once the stored data is available, it can be analyzed to get meaningful insights.
Remember Governance Limits
Before we discuss how to get data into Tableau CRM, we need to remember that there are a few governance limits to consider, including but not limited to:
- Connector limits
- Data Sync limits
- Dataflow and Recipe limits
We will be discussing these limits in more detail in the appropriate sections in this blog series, but let’s first have a look at why governance is important.
Why is Governance Important?
Salesforce provides services to customers in a multitenant cloud. These services are shared and it is important to be aware of the limits for each of the services to ensure that every customer gets the same computing power, data storage, and core features.
Note: to learn more about the Salesforce Architecture check out the module Understand the Salesforce Architecture on Trailhead.
The below image from Trailhead is a good example to understand the multitenant concept.
Intro to a Dataset
When we get data into Tableau CRM, the data is stored as a Dataset. Dataflows and Recipes are the data prep tools in Tableau CRM. They transform the data coming from the data sources and store it as a Dataset for analytics exploration. We will discuss Dataflows and Recipes more as we move along in this blog series.
How to Get Data from Various Systems into Tableau CRM?
In most cases, an organization's data lives in several different systems. Data connectors let us connect to these data sources and get the data into Tableau CRM. Once a data connection is set up, we can start to sync the data from the data source into Tableau CRM. There are various types of external data connectors, which we will talk about as we move along.
Note: We are covering the different types of external connectors in part 4 of this blog series [LINK].
Why do you need to get Data into Tableau CRM?
A dataset is a collection of related data that can be viewed in a tabular format. The data in a dataset can come from multiple sources like Salesforce objects, external data sources, or even other datasets. The dataset can be crafted to be friendly for self-service exploration by the end-user at scale, without the deep understanding of the data domain that most data repositories require.
Note: See more information on Tableau CRM datasets in the Salesforce help pages.
Dataset structure and Advantages
The dataset is a data repository (storage) similar to file system storage, but with a proprietary format and algorithms that flatten the data and use inverted indexes (for volume and speed).
The below image shows how the Tableau CRM datasets are different from the traditional table data structure.
Inverted Index Example
Tableau CRM datasets have all their dimensions indexed. Dimensions comprise the inverted index, in that dimension values point to one or more records. Measures are scalar and by their nature are individual to each record. Date types are decomposed into dimension values as date parts and measures (epochs).
The example below shows different dimensions for Airline Carrier, Destination Airport, and Origin Airport. Each dimension points to values – for a Carrier: AA (American Airlines), DL (Delta), and for Origin: SFO (San Francisco), SLC (Salt Lake City). Each dimension is held in a file and the dimension value points to the records that reference this value. Hence the value is stored once in the file but exists on many records. The miles measure however is not indexed as there are likely to be too many distinct values and is stored for each record.
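The idea can be sketched in a few lines of Python. This is only an illustration of the inverted index concept, not the proprietary Tableau CRM storage format. The flight records, the mileage figures, and the function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical flight records; the values are illustrative only.
flights = [
    {"Carrier": "AA", "Origin": "SFO", "Destination": "SLC", "Miles": 599},
    {"Carrier": "DL", "Origin": "SLC", "Destination": "SFO", "Miles": 599},
    {"Carrier": "AA", "Origin": "SFO", "Destination": "JFK", "Miles": 2586},
    {"Carrier": "DL", "Origin": "SFO", "Destination": "SLC", "Miles": 599},
]

DIMENSIONS = ("Carrier", "Origin", "Destination")


def build_inverted_index(records, dimensions):
    """Map each dimension -> value -> list of record ids that hold the value."""
    index = {dim: defaultdict(list) for dim in dimensions}
    for record_id, record in enumerate(records):
        for dim in dimensions:
            index[dim][record[dim]].append(record_id)
    return index


def filter_records(index, **criteria):
    """Return the ids of records that match every dimension=value pair."""
    matching = None
    for dim, value in criteria.items():
        ids = set(index[dim].get(value, []))
        matching = ids if matching is None else matching & ids
    return sorted(matching or [])


index = build_inverted_index(flights, DIMENSIONS)

# Measures are not indexed: one value is kept per record.
miles = [record["Miles"] for record in flights]

# Filtering only touches the small per-value id lists, never the full rows.
aa_from_sfo = filter_records(index, Carrier="AA", Origin="SFO")
total_miles = sum(miles[i] for i in aa_from_sfo)
print(aa_from_sfo, total_miles)
```

Each dimension value such as "AA" is stored once and points to every record that uses it, which is why filtering and grouping on dimensions stays fast as the row count grows, while the Miles measure is simply read per record.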
Need for Live Data
There could be scenarios where we want to look at certain KPIs in real time. To address such cases, we can make use of Salesforce Direct, live datasets, and Apex queries.
Using Salesforce Direct we can query the data directly from the Salesforce object itself. There is no data sync or dataflow required.
Note: See more information on Salesforce Direct.
Live datasets are similar to Tableau CRM datasets, except that the data remains in an external data source.
Note: See more information on creating a Snowflake live dataset.
There can be cases where one might want to get live data from a website into a Tableau CRM Dashboard. This can be done using an Apex step.
Note: See more information on how to create an Apex step.
Limitations of Live Data
We need to be aware of certain limitations when working with live data. Because every query goes to the data source itself, retrieving the data can be slow. The ability to query live data is also not available for all data sources. Live data is fetched through API calls made from Salesforce to the data source, and the number of API calls is limited both by Salesforce and by the data source. These limits are hard to keep track of, so building an entire dashboard on Salesforce Direct does not scale. Finally, because the data comes directly from the object, we cannot set row-level security the way we can for datasets.
Governance Limits of Live Data (Salesforce Direct)
Be aware of the following governance limits when using Salesforce Direct:
- The maximum number of concurrent Analytics API calls per org is 100.
- The maximum number of Analytics API calls per user per hour is 10,000.
- The maximum number of concurrent queries per organization is 50 per platform.
- The maximum number of concurrent queries per user is 10.
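To get a feel for what the concurrent query limits mean for dashboard design, here is a rough back-of-the-envelope sketch. It assumes, for illustration only, that every Salesforce Direct widget issues one query and that all queries of a dashboard fire at the same moment when it loads. The real platform may queue or schedule queries differently, and the function name is hypothetical.

```python
# Concurrency limits taken from the list above.
MAX_CONCURRENT_QUERIES_PER_ORG = 50
MAX_CONCURRENT_QUERIES_PER_USER = 10


def simultaneous_viewers(direct_queries_per_dashboard):
    """Estimate how many users could load the dashboard at the same moment.

    Assumption (not platform behavior): one query per direct widget,
    all fired concurrently on load. Returns 0 when a single load
    already exceeds the per-user limit.
    """
    if direct_queries_per_dashboard <= 0:
        raise ValueError("a dashboard needs at least one direct query")
    if direct_queries_per_dashboard > MAX_CONCURRENT_QUERIES_PER_USER:
        return 0
    return MAX_CONCURRENT_QUERIES_PER_ORG // direct_queries_per_dashboard


for widgets in (3, 10, 12):
    print(widgets, "direct queries ->", simultaneous_viewers(widgets), "viewers")
```

Under these assumptions, a dashboard with three live KPIs leaves room for 16 simultaneous loads, while one with ten live widgets leaves room for only five, which is one reason to keep live queries to a handful of KPIs.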
Get Live Data
As of Winter ’21, the ability to query live data is available for the SFDC_Local connection and Snowflake. We can query live data for a few KPIs that the business wants to look at in real time and include them in the dashboard. As mentioned previously, the data comes directly from the data source itself. You can pick the Salesforce Direct option when you are creating the query, as illustrated below.
Note: See more information on Salesforce Direct queries.
Note: Direct Data widgets aren’t automatically refreshed but are updated when the dashboard is refreshed.
Note: As of Winter ’21 we can query Live Data from Snowflake. See more information on how to explore data directly in Snowflake.
We now know that we would need to connect Tableau CRM to the data sources to make data available. So let’s now take a look at the data landscape.
Below is an image showing the entire Data Landscape for Tableau CRM. Before you panic, we will go over each aspect of the image as we go on in this blog series. The image below gives you a holistic picture of how we connect to various data sources and how the data finally gets stored as a dataset in Tableau CRM.
In short, the image shows the various data sources and the data connections. The data from the data sources is synced (cached) into the Tableau CRM Data Connect (cached data). Note that these connections have two arrows, because Tableau CRM initiates the fetch of data into the connected objects. A single arrow indicates that data is pushed to a Tableau CRM dataset from an external source. The data from the cache and from existing datasets can then be transformed and prepped further using dataflows or recipes, which are run to create datasets. Finally, the datasets can be used to create lenses or dashboards, which help in analyzing and getting insights from your data.
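The flow described above can also be written down as a toy model. This is a conceptual sketch only: the DataManager class, its method names, and the sample rows are hypothetical and do not correspond to any Tableau CRM API. It simply mirrors the steps of pulling data into the cache, pushing data straight into a dataset, and running a recipe that reads from both.

```python
from dataclasses import dataclass, field


@dataclass
class DataManager:
    """Toy stand-in for the landscape: a sync cache plus stored datasets."""
    cache: dict = field(default_factory=dict)     # synced (connected) objects
    datasets: dict = field(default_factory=dict)  # stored datasets

    def sync(self, name, fetch):
        """Pull (two arrows): the platform initiates the fetch and caches the rows."""
        self.cache[name] = list(fetch())

    def push(self, name, rows):
        """Push (single arrow): an external source writes straight to a dataset."""
        self.datasets[name] = list(rows)

    def run_recipe(self, output, transform):
        """Read from the cache and existing datasets, store the result as a dataset."""
        self.datasets[output] = transform(self.cache, self.datasets)


def fetch_opportunities():
    # Hypothetical source rows.
    return [
        {"account": "Acme", "amount": 100},
        {"account": "Acme", "amount": 250},
        {"account": "Globex", "amount": 75},
    ]


def pipeline_vs_target(cache, datasets):
    """Combine synced opportunities with a pushed targets dataset."""
    totals = {}
    for row in cache["Opportunity"]:
        totals[row["account"]] = totals.get(row["account"], 0) + row["amount"]
    return [
        {"account": t["account"],
         "pipeline": totals.get(t["account"], 0),
         "target": t["target"]}
        for t in datasets["Targets"]
    ]


manager = DataManager()
manager.sync("Opportunity", fetch_opportunities)
manager.push("Targets", [{"account": "Acme", "target": 300},
                         {"account": "Globex", "target": 100}])
manager.run_recipe("PipelineVsTarget", pipeline_vs_target)
print(manager.datasets["PipelineVsTarget"])
```

The resulting PipelineVsTarget entry plays the role of the dataset that lenses and dashboards would then be built on.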
In the next part of the Data Orchestration blog series, we take a closer look at the concepts of data sync. You can also head back to review the other blogs in the Data Orchestration blog series.