If you are new to Salesforce or Einstein Analytics and you haven’t been in the trenches curating data for Einstein Analytics, you might not have heard the term “Data Sync”; even if you have heard of it, you probably don’t fully understand it. Don’t fret: I am the Product Manager for this feature, and there are times even I am not sure what magic is happening. It’s not that the concept is foreign; we have been staging data for analytics for as long as data warehouses have been around. The mystery is in how it is executed and what sort of methods can be used to make it more efficient.
Ok, let’s start with the basics. Data Sync is a mechanism for establishing a link to an object in a data source (we call these links connections, facilitated by connectors in Einstein Analytics) that can be used to copy some or all of the data into Einstein Analytics. We have different connectors that can be used to set up connections to applications and databases such as Snowflake, BigQuery, Redshift, Salesforce, Marketing Cloud, and numerous others. Not all connectors are created equal; they have different features and limits, so be sure to read the documentation for the ones you plan to leverage.
Salesforce Local (aka SFDC_Local) connections (to the Salesforce org that hosts the Einstein Analytics you are using) have special capabilities when it comes to Data Sync. So it is important to consider which connection you are using when deciding how to put it to work. Check out the documentation for more detailed information.
One of the more powerful capabilities of SFDC_Local is something called “Connection Mode”, which allows incremental loads for objects coming from the local Salesforce org. In the future, we’ll dive into this topic in more detail, but for now, just understand that objects capable of supporting incremental loads will load faster, decreasing the overall processing time and the load on the system (win-win).
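To make the full-versus-incremental distinction concrete, here is a minimal sketch in plain Python (hypothetical record data and function names, not the actual Data Sync implementation): a full load copies everything on every run, while an incremental load copies only what changed since the last successful sync.

```python
from datetime import datetime

def full_sync(source_records):
    """Full mode: copy every record from the source on every run."""
    return list(source_records)

def incremental_sync(source_records, last_sync_time):
    """Incremental mode: copy only records created or modified
    since the previous successful sync."""
    return [r for r in source_records if r["LastModifiedDate"] > last_sync_time]

records = [
    {"Id": "001A", "LastModifiedDate": datetime(2021, 3, 1)},
    {"Id": "001B", "LastModifiedDate": datetime(2021, 3, 5)},
    {"Id": "001C", "LastModifiedDate": datetime(2021, 3, 9)},
]

# A full sync moves all 3 records; an incremental sync after March 4
# moves only the 2 that changed, so there is less data to process.
assert len(full_sync(records)) == 3
assert len(incremental_sync(records, datetime(2021, 3, 4))) == 2
```

The smaller the changed-record set relative to the whole object, the bigger the win from incremental mode.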
For Data Sync, regardless of the connection, a single object or table is set up at a time. The process uses the connection to query the metadata of the database or application API to obtain field names and data types. We also pull back sample data to help you evaluate whether you are selecting what you really want.
Use this step as a means of bringing in only the relevant and needed fields, to keep things simple and clean.
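Conceptually, the setup step boils down to this: take the field names and data types discovered from the source’s metadata, and keep only the fields your analytics actually need. A small sketch in plain Python (the object and field names here are hypothetical, and this is an illustration of the idea, not how the product is implemented):

```python
# Hypothetical metadata as discovered by querying the source:
# field name -> data type.
opportunity_fields = {
    "Id": "string",
    "Amount": "currency",
    "StageName": "picklist",
    "CloseDate": "date",
    "SystemModstamp": "datetime",
    "Legacy_Notes__c": "textarea",  # large field nobody reports on
}

# Keep only the fields the analytics use case actually needs.
selected = {"Id", "Amount", "StageName", "CloseDate"}
sync_definition = {name: dtype for name, dtype in opportunity_fields.items()
                   if name in selected}

# The bulky, irrelevant field never makes it into the sync.
assert "Legacy_Notes__c" not in sync_definition
assert len(sync_definition) == 4
```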
Recently we’ve also added Filters to the connectors that can support them, so you can limit the data coming in rather than limiting it afterward, when of course it makes sense to do so.
If a connector supports filtering, we add a section to its documentation page identifying it as such, and you will see the filter icon on the Preview Source Data page as well.
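The value of filtering at the connector is that excluded rows never cross the wire at all, instead of being hauled in and discarded later. A minimal sketch in plain Python, with a hypothetical predicate (not the actual filter syntax any connector uses):

```python
def sync_with_filter(source_records, predicate):
    """Filter applied at the connector: only matching rows are
    copied into Einstein Analytics; the rest never leave the source."""
    return [r for r in source_records if predicate(r)]

rows = [
    {"Id": 1, "IsDeleted": False, "Region": "EMEA"},
    {"Id": 2, "IsDeleted": True,  "Region": "EMEA"},
    {"Id": 3, "IsDeleted": False, "Region": "APAC"},
]

# For example, exclude soft-deleted rows at the source.
synced = sync_with_filter(rows, lambda r: not r["IsDeleted"])
assert [r["Id"] for r in synced] == [1, 3]
```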
Ok, you have a connection and an object set up for Data Sync. Now it just magically stays in sync all the time … right? No, it’s not that magical. We have to schedule Data Sync to run, and that is done at the connection level. So you will want to pick the right time of day for refreshing the data in Einstein Analytics for each of the connections you have. With our scheduling capabilities, you can schedule syncs as often as every 15 minutes. But hold your horses, bucko: just because we can schedule every 15 minutes doesn’t mean you should. If your load takes an hour to run, we will be canceling jobs more often than running them. This is what I meant about stuff behind the scenes getting “resolved” without you seeing it unless something goes wrong.
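The rule of thumb above can be written down as a one-line sanity check: the schedule only makes sense if a typical run finishes before the next run is due. A plain-Python sketch (hypothetical helper, not a product feature):

```python
def schedule_is_sane(interval_minutes, typical_run_minutes):
    """A sync schedule only makes sense if each run can finish
    before the next one is due; otherwise jobs pile up and get canceled."""
    return typical_run_minutes < interval_minutes

# A sync that takes about 60 minutes should not be scheduled every 15...
assert schedule_is_sane(15, 60) is False
# ...but a 4-hour interval gives it plenty of room.
assert schedule_is_sane(240, 60) is True
```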
A cool thing about the scheduling of Data Sync is that you can use it to drive your Recipes in Data Prep. If you make a Recipe’s schedule dependent on Data Sync, the Recipe runs as soon as Data Sync finishes, and you have the latest and greatest data.
Oh, and while we’re on the topic of recent enhancements: in the past, you could have only one connection for Salesforce Local, so everything ran every day or every hour … a bit painful. Well, we have since added Connection Groups (up to 10 of them), so you can schedule 1 object every 15 minutes, 5 every hour, and 20 once a week if you need to.
The way to do the same thing for non-Salesforce-Local scenarios is simply to create new connections, since scheduling is done at that level.
Why do we even need to do this sync stuff; can’t we just pull from the source when we need it? Well, in some cases yes, you can, but in others your operational teams will be quite irritated that your dashboards are impeding the operations of the business. Data Sync facilitates the cases where your data does not need to be up-to-the-minute fresh. For the cases where that is the requirement, we have Salesforce Operational Reporting and also some new Direct Connectors for Salesforce and Snowflake that can be leveraged in Einstein Analytics.
One more topic that sometimes comes up related to Data Sync is “limits”. Salesforce, as a multi-tenant environment, uses limits to manage and maintain expected service levels. For Data Sync, we have some base limits, such as the 100-object limit: no more than 100 objects can be set up for sync by default. If you hit this limit, it is something we can discuss extending, and we are examining whether we might increase it globally. The other primary consideration is that we allow 3 concurrent syncs to run at a time, so you need a scheduling strategy that lets things complete in a reasonable time without competing. See the documentation for additional limits.
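One way to reason about a scheduling strategy against the 3-concurrent-sync limit is to check the peak number of syncs your schedule would have running at once. A plain-Python sketch (the job tuples and helper are hypothetical, just to illustrate the idea of staggering start times):

```python
def max_concurrent(jobs):
    """jobs: list of (start_minute, duration_minutes) within one day.
    Returns the peak number of syncs running at the same time."""
    events = []
    for start, duration in jobs:
        events.append((start, 1))            # sync starts
        events.append((start + duration, -1))  # sync ends
    peak = current = 0
    for _, delta in sorted(events):          # ends sort before starts at ties
        current += delta
        peak = max(peak, current)
    return peak

# Five 30-minute syncs all kicked off at midnight blow past a 3-sync limit...
assert max_concurrent([(0, 30)] * 5) == 5
# ...while staggering their start times keeps concurrency at 1.
staggered = [(0, 30), (30, 30), (60, 30), (90, 30), (120, 30)]
assert max_concurrent(staggered) == 1
```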
Hope that helps illuminate the whole Data Sync topic for you all! Stay tuned, as there is more to come on this front.