Working with Einstein Analytics, I often hear that people find the data manager the most challenging part to learn. Perhaps that is because when we work with dashboards in core Salesforce we don’t have to shape our data the same way; we simply need to understand which objects we want to use in our report or dashboard. There are more considerations with Einstein Analytics, the Data Manager is a crucial part of that, and it is what I want to try to uncover in this blog.
What is the data manager?
The data manager is the place where everything is centered around bringing data into Einstein Analytics from Salesforce or external sources and transforming it into datasets that can be used in your dashboards. Remember, a dataset is simply a flat file where each row represents a record of the grain you have defined, including information from related objects; as an example, it could be a dataset where each row is a case with information about the contact that submitted the case. Anyway, back to the data manager. Simply put, the data manager gives you a 360-degree view of the data in Einstein Analytics including:
- The monitor
- Data sync and connections
- Dataflows and recipes
- (and soon data prep will be transforming the data landscape)
You get to the data manager from Analytics Studio by clicking on the gear icon and choosing “Data Manager”. From the home page, you can also see a link to the Data Manager in the sidebar to the left.
Once you get to the data manager you see the monitor, which gives you an overview of the flow of data in Einstein Analytics. The monitor is a log of every data activity, including data syncs, data uploads, dataflow runs, etc., and their status. You will see both successful and failed attempts as well as warnings, but remember this is a log: a dataflow might be logged as failed yesterday but have run successfully today, in which case your data is still up to date thanks to the later successful run. You can always see the error or warning message in the message column, which is helpful when you need to fix the issue.
Einstein Analytics comes with a list of out-of-the-box connectors that give you access to a range of data sources. Note that the connect option in the menu is only available if the setting “Enable Data Sync and Connections” is enabled. You will find this setting under Analytics Settings in the core Salesforce setup.
With the setting enabled, you will by default have the Salesforce Local Connector, which allows you to bring over data from your local Salesforce objects. By local I simply mean the environment Einstein Analytics is enabled on.
It’s important to note that just because you have created a connector, it doesn’t mean you have data available when you run your dataflow or recipe. You need to make sure that your connector is scheduled to sync data over, or else you will not have any data, or any new data, to add to your dataset. Each object is synced by setting a schedule for the connection, in this case SFDC_LOCAL.
Each object in the local connector can be synced incrementally (based on the last modified date), with a full sync, or with a periodic full sync. The option to change the sync mode can be found under “Connection Mode”.
As you may be aware, formula fields in Salesforce are calculated when you view the record. If a formula field is based on other objects, the record’s last modified date might not change when the underlying data does, so an incremental sync in Einstein Analytics will not pick up the new value. The full sync option makes sure every record is synced over at the scheduled time, and periodic full sync is incremental except once a week, when it runs a full sync. The latter two options will make sure that your formula fields are up to date.
When you create a dataset from the dataset builder (click “Create Dataset” in the Analytics Studio) you have the option to select any objects and fields; once you click create, these objects and fields are automatically added to the local Salesforce connection. If you want to add more fields manually, you can always click on the object in the overview.
If an object has not yet been added, you can add it manually by clicking “Connect to Data” in the top right corner and selecting the local connector. This lets you browse the objects from that connection, select one, and add it to Einstein Analytics.
You don’t have to rely only on local Salesforce data; you can bring in data from anywhere. The easiest way is to use the default connectors. The list is ever-increasing, so check the documentation for the latest list of connectors.
Just be aware that external connectors run a full data sync, not an incremental one. Some of the connectors allow you to apply a filter to the data so you can bring in smaller amounts. Another thing to be aware of is that the connectors have their own limits on how much data you can bring in. My recommendation is to evaluate these limitations in the documentation as part of your design process; you may have to do some work on the source to cope with the limits if you are dealing with large data volumes.
And remember you can always bring data in using middleware like Mulesoft (in fact there are Mulesoft connectors in the connection option if you have Mulesoft), Jitterbit, Informatica etc. In fact, most of these middleware solutions make it very easy to push data to Einstein Analytics. Just note this would not be part of the data sync, this instead would create a dataset directly in Einstein Analytics.
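Under the hood, most middleware pushes data into Einstein Analytics through the External Data API, which takes a CSV file plus a metadata JSON describing the columns. As a rough sketch (the dataset and field names below are made up for illustration, so check the External Data Format Reference for the exact options), the metadata file looks something like this:

```json
{
  "fileFormat": {
    "charsetName": "UTF-8",
    "fieldsDelimitedBy": ",",
    "linesTerminatedBy": "\n"
  },
  "objects": [
    {
      "connector": "CSV",
      "fullyQualifiedName": "CaseData",
      "name": "CaseData",
      "label": "Case Data",
      "fields": [
        { "fullyQualifiedName": "CaseNumber", "name": "CaseNumber", "label": "Case Number", "type": "Text" },
        { "fullyQualifiedName": "CreatedDate", "name": "CreatedDate", "label": "Created Date", "type": "Date", "format": "yyyy-MM-dd" },
        { "fullyQualifiedName": "Amount", "name": "Amount", "label": "Amount", "type": "Numeric", "precision": 10, "scale": 2, "defaultValue": "0" }
      ]
    }
  ]
}
```

Whichever tool does the pushing, it is this kind of upload that creates the dataset directly, bypassing data sync and the dataflow.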
With the connectors in place you get the option to sync your data; this all comes down to the “Enable Data Sync and Connections” setting mentioned above. If you choose not to enable it, you will not be able to leverage the connectors, and you can only get data in from the local Salesforce environment (unless you use middleware). In that case, when your dataflow runs it will extract all your data from Salesforce, which adds time to the dataflow run. This scenario is illustrated below.
When the setting is enabled, we are able to sync our data over so it’s ready to be grabbed and used in our dataflow or recipe. This means the dataflow doesn’t have to extract any data, as it’s already there ready to be used, which has a positive effect on your dataflow run time.
Note that connected data does not count against your Einstein Analytics row count limitations, only data that has been registered in a dataset counts against this limitation.
The recipe is one way of transforming your connected data or existing datasets into a new dataset. New users tend to gravitate toward recipes, as the tool’s UI makes adoption easy.
You find it in the left menu under “Dataflows & Recipes” and select the “Recipes” tab.
A recipe can be scheduled to run by clicking the dropdown menu.
To create a new recipe click on the create button. You first need to select the source data and you can pick any existing dataset or choose from the connected data when selecting that tab.
Once you are in the editor you see:
- Preview of your data
- Transformation panel of all the transformations you are doing to your data
- Column profile where you get an overview of data completeness as well as the attributes of the column.
- Options to transform the data as a whole, including filtering, adding data, or aggregating the data.
To transform columns, simply highlight a column and click on the dropdown to see the options. Depending on the field type (measure, dimension, date), different options will appear. Also, notice the “Einstein Suggestions” that can recommend machine learning (ML) transformations like clustering or completing missing values, which are options unique to recipes.
One of the great options you don’t have in the dataflow is the ability to create true joins. By clicking the “Add Data” button you can add new columns using a lookup as well as true joins (right, left, inner, full).
Dataflows are what Einstein Analytics was “born with” and hence what a lot of legacy setups are leveraging. A dataflow used to be just a JSON file you had to edit manually to make any changes. We have since gotten the visual dataflow editor, though I should say it still comes with some challenges for easy adoption, and a lot of new users find dataflows harder to pick up than recipes.
To find the dataflows you go to the “Dataflows & Recipes” menu option and make sure that the “Dataflows” tab is selected, which it should be by default.
If the “Enable Data Sync and Connections” setting is enabled, you can have multiple dataflows (just like recipes), but one dataflow can result in many datasets (unlike recipes). So you always have the choice of modifying an existing dataflow or creating a brand new one by clicking the create button.
Each dataflow can be scheduled to run by clicking on the dropdown next to the name.
Looking closer at a dataflow, you may find it a little bit more confusing than the recipe. But it’s really just a flow of different actions that are tied together. If we first look at the different components on the screen we see the following (see numbered image below):
- Download and upload of the JSON file. I recommend taking a backup before making any changes; there is no versioning, and if you mess something up, the easiest way to recover is to upload a backup.
- The dataset editor, a UI-friendly way of choosing a data source and its fields and adding related objects – it’s the same tool you get when creating a dataset from Analytics Studio.
- All the different transformations (aka nodes) you can leverage in your dataflow.
- How the different nodes are tied together. Each node symbolizes a transformation in the flow. It will always start with some kind of extract (sfdcDigest, digest, edgemart) and end with a register.
Alright, let’s take a closer look at the nodes.
- sfdcDigest: define a local Salesforce object and fields to include in the dataflow.
- digest: define a table or object and fields from a connected source to include in the dataflow.
- edgemart: define an existing dataset to include in the dataflow.
- append: combine two or more sources by appending them, meaning rows from each source are stacked into one output.
- augment: join two sources together as a lookup (not a true join) by adding more columns. A left and a right source must be defined, as well as whether you want to look up a single value or multiple values – make sure to check out the documentation for the latter.
- computeExpression: derive new fields (similar to a formula field in Salesforce) by using different values from the same row.
- computeRelative: derive new fields by using different values from the same column. This is done by grouping your data, sorting it and calculate on previous values, next values etc.
- dim2mea: convert a dimension (text) to a measure (number).
- flatten: take a hierarchy and flatten the output; this is typically relevant when setting up a security predicate for datasets. Check out this recorded webinar for more information.
- predictions: score your records based on a model from Einstein Discovery. You would need to create a story and deploy it before being able to use it in this node.
- filter: limit the result of your data.
- sliceDataset: define which columns to keep or drop from the dataflow.
- update: define columns to update in your dataflow.
- sfdcRegister: register a dataset with all the previous nodes/transformations in the dataflow.
- export: export data to Einstein Discovery. This is only relevant for legacy users who do not have access to Einstein Discovery stories within Einstein Analytics.
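To make the node list more concrete, here is a minimal sketch of what a dataflow JSON could look like (the object, field, and node names are illustrative, not a prescription). It extracts cases with an sfdcDigest, filters to closed ones, derives a new field with a computeExpression, and registers the result:

```json
{
  "Extract_Cases": {
    "action": "sfdcDigest",
    "parameters": {
      "object": "Case",
      "fields": [
        { "name": "Id" },
        { "name": "Status" },
        { "name": "CreatedDate" },
        { "name": "ContactId" }
      ]
    }
  },
  "Filter_Closed": {
    "action": "filter",
    "parameters": {
      "source": "Extract_Cases",
      "filter": "Status:EQ:Closed"
    }
  },
  "Add_Contact_Flag": {
    "action": "computeExpression",
    "parameters": {
      "source": "Filter_Closed",
      "mergeWithSource": true,
      "computedFields": [
        {
          "name": "HasContact",
          "type": "Text",
          "saqlExpression": "case when ContactId is not null then \"Has Contact\" else \"No Contact\" end"
        }
      ]
    }
  },
  "Register_Closed_Cases": {
    "action": "sfdcRegister",
    "parameters": {
      "source": "Add_Contact_Flag",
      "alias": "ClosedCases",
      "name": "Closed Cases"
    }
  }
}
```

Notice how each node names its `source`, which is how the nodes are tied together into the flow you see in the visual editor.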
The Einstein Analytics product team is working on transforming the data platform and will be introducing the new Data Prep starting with a beta in the Summer 20 release. Read more about that here.
How is data structured?
If you are used to reports in core Salesforce then you need to get used to a different way of organizing your data in Einstein Analytics. In core Salesforce report types are used to define what data you can combine in a report. A few things apply here:
- You start from the top and move down in the hierarchy
- The hierarchy is a straight and direct relation
- Only three levels can be added.
Hence a report type could look similar to the below image.
In Einstein Analytics this is turned upside down: you start from the bottom of the hierarchy and work upwards. The first object is often referred to as the “root” or “grain”. By doing this you can also add multiple objects from the same level, and you can go as many levels out as you need to. Below you can see an example of how the report type example from above would be joined together in Einstein Analytics using a dataflow or recipe to create a dataset.
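In dataflow terms, this bottom-up approach is typically expressed as a chain of augment lookups starting from the grain. A sketch under illustrative assumptions (Case as the grain, looking up Contact and then Account one level at a time):

```json
{
  "Extract_Case": {
    "action": "sfdcDigest",
    "parameters": { "object": "Case", "fields": [{ "name": "Id" }, { "name": "Subject" }, { "name": "ContactId" }] }
  },
  "Extract_Contact": {
    "action": "sfdcDigest",
    "parameters": { "object": "Contact", "fields": [{ "name": "Id" }, { "name": "Name" }, { "name": "AccountId" }] }
  },
  "Extract_Account": {
    "action": "sfdcDigest",
    "parameters": { "object": "Account", "fields": [{ "name": "Id" }, { "name": "Name" }] }
  },
  "Augment_Contact": {
    "action": "augment",
    "parameters": {
      "left": "Extract_Case",
      "left_key": ["ContactId"],
      "right": "Extract_Contact",
      "right_key": ["Id"],
      "right_select": ["Name", "AccountId"],
      "relationship": "Contact"
    }
  },
  "Augment_Account": {
    "action": "augment",
    "parameters": {
      "left": "Augment_Contact",
      "left_key": ["Contact.AccountId"],
      "right": "Extract_Account",
      "right_key": ["Id"],
      "right_select": ["Name"],
      "relationship": "Account"
    }
  },
  "Register_Cases": {
    "action": "sfdcRegister",
    "parameters": { "source": "Augment_Account", "alias": "CasesWithContacts", "name": "Cases with Contacts" }
  }
}
```

The `relationship` name becomes a prefix on the looked-up columns (e.g. `Contact.AccountId`), which is how the second augment can key off a field brought in by the first.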
Einstein Analytics uses an inverted index when storing the dataset. The essence of this is that when you bring data into Einstein Analytics all your dimensions are indexed, which is why any data can be quickly queried in your lenses and dashboards regardless of whether you have a thousand rows in your dataset or a billion. You don’t have to do anything to enable this, but it is the reason why Einstein Analytics does not query data directly from an object the way core Salesforce reports do.
Flow of data and the schedule
Finally, let’s talk about the end-to-end flow of your data. Each of the components in the data manager is related, which means you need to make sure they are updated in the correct order. If your data sync is set to run after your dataflow, or is perhaps not scheduled at all, you will not have the latest data in your dataset. So make sure that your data sync is scheduled to run before your dataflow or recipe.
As mentioned, when you create a dataflow or a recipe you can reuse existing datasets as a source. If you do, you want to make sure that the source dataset has been updated before you apply new transformations to it. Hence, make sure you know how your dataflows and recipes are connected and schedule them accordingly. Below you can see a simple flow where the dataflow or recipe just uses connected data: the data sync runs at 1am and the dataflow runs at 2am, assuming the data sync completes in less than an hour. You can always run your data sync and check the monitor to see how long it takes to complete.
Note you can also schedule a dataflow to start when the data sync has completed so you don’t have to estimate the time the sync takes to run.
As mentioned above, it can become more complex when dataflows and recipes are dependent on each other, and you do need to consider this when setting up the scheduling. Let’s say you have a dataset that is based on connected data, but its output is used in another dataflow, and the output of that is used in a recipe. Suddenly you have several levels of dependency, and you need to make sure your scheduling reflects that, as illustrated in the image below.
Finally, I just want to remind you that the data manager is changing with the new data platform, so please keep up to date with the release notes and check out the blog series that I will be writing starting with Welcome to the new data platform.