Unlocking Peak Performance in Zero-Copy Federation: A Guide to Data Locality


The ability to access and analyze information from various sources without creating multiple copies is a fundamental goal of modern data architectures. This is the promise of zero-copy federation, a powerful capability that allows you to work with your data where it lives. To unlock its full, game-changing potential, you only need to master a few key principles, the most critical of which is data locality.

Data Federation and Data Sharing in Data 360

Data 360 offers two primary modes for achieving a zero-copy architecture: Data Federation and Data Sharing.

  • Data Federation allows you to access data from external systems (like a data warehouse in Google BigQuery) directly in applications in Data 360.
  • Data Sharing is the reverse, enabling external systems (like Databricks) to securely access data that is managed within Data 360.

The core idea behind both is to virtualize your data and act on it without having to move it. While these concepts apply to data of all sizes, the performance considerations discussed here become most critical when dealing with large datasets. This blog post will focus primarily on the challenges and nuances of Data Federation, but the principles of data locality and performance apply equally when external systems access your data via Data Sharing; they will simply face the same issues on their end.

Types of Data Federation

Data 360 offers three primary data federation methods. File Federation allows direct access to the underlying data files in a remote system. In contrast, Live Query Federation sends a specific, tailored query to the remote system for each access to minimize data transfer. Finally, Accelerated Query Federation pulls the entire raw table from the remote system to be cached and reused efficiently within Data 360.

Let’s first focus on the simple case, i.e. you have only a single table. For single-table use cases, the following table summarizes the various considerations; each approach has its own set of benefits and trade-offs:

  • File Federation: Based on the zero-copy initiative, this approach uses open standards like Apache Iceberg, Iceberg REST Catalog, and Parquet to enable direct access to data files. It is currently supported for Snowflake, Databricks, IBM and generic Iceberg catalogs with Amazon S3 or Microsoft Azure storage. With this method, queries are executed directly in Data 360’s query engine on the remote files. This means governance is also defined and enforced within Data 360, as the external system’s query engine is bypassed. From a cost perspective, compute happens in Data 360, with external costs limited to potential data egress fees. The main downside is a more involved, multi-step setup process.
  • Query Federation (Live): In contrast to File Federation’s direct file access, this approach sends a query to the remote data source. The external system computes the query, applies its own governance, and returns a result set. Data 360 automatically identifies parts of the overall operation that can be pushed down as a query to the external source (via JDBC). This method is supported for Snowflake, Databricks, Google BigQuery, and Amazon Redshift and features a simple, single-step setup. Because the work is shared, compute and governance are responsibilities of both platforms. Consequently, costs are incurred in both places: compute costs on the external system, plus compute and per-row data transfer costs within Data 360. The primary drawback is that performance bottlenecks and high transfer costs can arise when querying large tables. Additionally, Query Federation (Live) does not support change data feeds. 
  • Query Federation (Acceleration): This is a hybrid approach where, instead of querying the source directly every time, data is cached (including support for incremental updates) into Data 360 as needed. It is ideal for data that changes infrequently but is used multiple times, or for Data 360 features that require Change Data Capture (CDC), as it provides the necessary snapshots to calculate deltas. The same platforms as Live Query Federation are supported, and the setup is identical. Governance is a shared responsibility between the data warehouse and Data 360. For the initial caching and customer-controlled full refreshes, the external system’s compute is used, incurring costs there. After that, all queries run on the cached data using Data 360’s compute engines. The main trade-off, however, is that the cached data is subject to a window of staleness, which is defined by the customer’s configured refresh frequency.

The Impact of Data Locality

So far, we have focused on use cases with a single table. As soon as we have more than one table, there is a new concept to take into consideration: data locality. It’s important to note that if you’re working with small amounts of data (say, under 100k rows), none of this makes a significant difference; the convenience of federation often outweighs minor performance hits. This blog post is about navigating scenarios involving large amounts of data, where understanding data locality becomes critical.

The physical location of your data relative to the compute resources has a massive impact on query performance. Mastering this principle is what ensures your queries return in milliseconds instead of timing out, even across massive datasets.

The guiding principle is: the closer the data is to the compute, the faster the processing. This is due to the physical limitations of network latency and bandwidth. Moving large volumes of data across regions or different cloud environments introduces delays and potential bottlenecks, which directly impacts query performance.
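The scale of these physical limits is easy to estimate. The sketch below uses purely illustrative assumptions (about 100 bytes per row, a 50 ms cross-region round trip, 1 Gbps of effective bandwidth), not measured Data 360 figures, to show how transfer time grows with row count:

```python
def transfer_seconds(rows, bytes_per_row=100, bandwidth_gbps=1.0, latency_s=0.05):
    """Rough wire time for moving a result set between cloud environments.

    All parameters are illustrative assumptions: ~100 bytes per row,
    ~50 ms cross-region round trip, ~1 Gbps effective bandwidth.
    """
    bits = rows * bytes_per_row * 8
    return latency_s + bits / (bandwidth_gbps * 1e9)

# A small, filtered result set crosses the wire almost instantly...
print(f"50k rows: {transfer_seconds(50_000):.2f} s")   # well under a second
# ...while shipping a large intermediate table is dominated by transfer time.
print(f"1B rows : {transfer_seconds(1_000_000_000) / 60:.0f} min")  # minutes
```

Even with generous bandwidth assumptions, moving a billion intermediate rows costs minutes of pure wire time, before the query engine has done any work at all.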

Here’s a breakdown of how data locality affects performance in different scenarios:

As the table shows, the fastest scenarios are those where the compute and data are co-located, minimizing data transfer. This is achieved when data is only in Data 360, accessed via File Federation within the same cloud region, or queried via Acceleration (which essentially means that the data is located in Data 360). A live query where all the data is remote can also be extremely fast, but only if the entire query’s compute can be successfully pushed down to the remote system, which then executes the operation and returns only the small, final result set.

The performance of live query federation on large tables depends mainly on where the data resides. If all the data for a query is remote and the entire query can be pushed down, performance is as optimal as it gets. However, if the data is spread across both the remote system and Data 360, pushdown is limited. When full query pushdown isn’t possible (for example, if a join requires a local table), large amounts of intermediate data must be transferred between the remote data source and Data 360, and this becomes the slowest option. This is not just a theoretical concern; let’s walk through a practical example of how a seemingly simple optimization can have strong effects.

Case Study: From Efficient Pushdown to Performance Bottleneck

This example series shows the effects of data locality on live query federation when segmenting customers.

Step 1: The Efficient Starting Point
  • Initial Setup: We begin with our customer data living in an external data warehouse, connected to Data 360 through a live Zero Copy Data Stream. Our goal is simple: create a segment of customers with a ‘Gold’ loyalty status.
  • Query Execution: Because the data and the filter attribute (loyalty status) both reside in an external data warehouse, Data 360 can push the entire operation down. It sends a simple query to the remote system that filters the data and returns only the ‘Gold’ members. Assuming only 100,000 customers have reached this status, the data transfer is minimal and the process is fast and efficient.
Step 2: Adding Complexity, Improving Efficiency
  • Changing Requirements: We now want to refine the segment to only include ‘Gold’ members who have purchased at least one ‘Dishwasher’. To do this, we configure a new live BYOL data stream for an “orders” table in an external data warehouse and map the relationship to our customer data.
  • Query Execution: Since both the customer and orders tables are live BYOL streams from the same connection, Data 360 is smart enough to formulate a single query. This query joins the customer and orders tables within the external data warehouse, applying both the ‘Gold’ status and ‘Dishwasher’ filters directly in the source. This is an ideal scenario where the entire query is pushed down to the remote. The result is an even smaller, more targeted dataset being returned (e.g., 50,000 rows), making the job even faster.
Step 3: Evolving the Strategy for a New Requirement
  • The Change: In an attempt to optimize performance further, we decide to accelerate the customer data stream, creating a cached copy within Data 360. Crucially, we do not accelerate the large “orders” data stream.
  • The Impact: The query execution plan now falls apart. Data 360 can quickly find the ‘Gold’ status customers from its local cache. However, to check for dishwasher purchases, it must join that local data with the remote “orders” table. Since the customer data is no longer in an external data warehouse, the join cannot be pushed down. The engine’s only option is to pull all relevant data from the remote source: it is forced to fetch every single dishwasher order record, for all customers, from the remote system. The number of transferred rows explodes from 50,000 to potentially 1 billion. The query time is now dominated by data transfer, and the job becomes significantly slower or potentially runs into query timeouts and never finishes.
Step 4: Restoring Performance with a Unified Strategy

The issue in Step 3 wasn’t the choice to accelerate, but the creation of a mismatched federation strategy. The fix is to re-align the strategy for both tables. There are three excellent ways to do this:

  • 4a (Preferred): Use File Federation for Both Tables. Given that the source is the external data warehouse, the most robust solution is to switch both the customer and orders tables to use File Federation. This allows Data 360’s query engine to access the underlying files directly. The data remains in the remote warehouse, but the compute is co-located in Data 360, completely eliminating the data transfer bottleneck and restoring high performance.
  • 4b: Keep Both Tables as Live Queries. A simpler fix is to revert the customer data stream back to a live query. By ensuring both tables use the same Live Query Federation method, we return to the scenario from Step 2, where the entire join can be executed efficiently within the external data warehouse.
  • 4c: Accelerate Both Tables. A third option is to fully commit to the acceleration strategy. By accelerating not only the customer table but also the orders table, both datasets are co-located as cached copies inside Data 360. This makes the join a local operation, ensuring it runs extremely fast, with the only trade-off being potential data staleness.
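The whole case study can be simulated in miniature. The sketch below uses two in-memory SQLite databases to stand in for the remote warehouse and Data 360’s local store (the table and column names are invented for illustration): with full pushdown the “remote” side returns only the final matches, while the mixed-location plan of Step 3 forces every dishwasher order across the boundary before the join can run.

```python
import sqlite3

# Two in-memory databases stand in for the remote warehouse and Data 360.
remote = sqlite3.connect(":memory:")
local = sqlite3.connect(":memory:")

remote.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, loyalty TEXT);
    CREATE TABLE orders (customer_id INTEGER, product TEXT);
""")
# Tiny illustrative data: 10 customers, 3 of them 'Gold', 100 orders.
remote.executemany("INSERT INTO customers VALUES (?, ?)",
                   [(i, "Gold" if i < 3 else "Silver") for i in range(10)])
remote.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(i % 10, "Dishwasher" if i % 4 == 0 else "Toaster")
                    for i in range(100)])

# Step 2: full pushdown -- the join runs remotely, only matches come back.
pushed = remote.execute("""
    SELECT DISTINCT c.id FROM customers c
    JOIN orders o ON o.customer_id = c.id
    WHERE c.loyalty = 'Gold' AND o.product = 'Dishwasher'
""").fetchall()

# Step 3: customers are cached locally, so the join can no longer be pushed
# down; every matching order row must cross the boundary before joining.
local.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, loyalty TEXT)")
local.executemany("INSERT INTO customers VALUES (?, ?)",
                  remote.execute("SELECT id, loyalty FROM customers"))
fetched_orders = remote.execute(
    "SELECT customer_id FROM orders WHERE product = 'Dishwasher'").fetchall()
local.execute("CREATE TABLE orders (customer_id INTEGER)")
local.executemany("INSERT INTO orders VALUES (?)", fetched_orders)
mixed = local.execute("""
    SELECT DISTINCT c.id FROM customers c
    JOIN orders o ON o.customer_id = c.id
    WHERE c.loyalty = 'Gold'
""").fetchall()

print("rows returned by full pushdown:", len(pushed))
print("rows pulled by the mixed plan :", len(fetched_orders))
assert sorted(pushed) == sorted(mixed)  # same answer, very different cost
```

Both plans produce the identical segment; the only difference is how many rows travel between the two systems, which is exactly the gap that explodes at billion-row scale.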

You Might Not Even Know You’re Mixing Data Locations

The performance trap described above is subtle because you might not realize you’re mixing local and remote data sources. Certain Data 360 platform features introduce local data dependencies by design. For example, enforcing GDPR consent means that any query involving an Individual is automatically joined against the GDPR consent table stored in Data 360. This single, implicit join can break the pushdown for a query running entirely on remote data. Similarly, a Data Model Object (DMO) in Data 360 might be composed of data from different locations. Further, if you use the output of a Batch Transform on your remote data (which is materialized in Data 360 by design) in a new Segment that also uses remote data, you’ve inadvertently created a mixed-location query that forces a data pull. These are just a few examples, but they highlight how the details matter, especially when working with large datasets where understanding the location of all underlying data is crucial to maintaining performance.

Guiding Principles for High-Performance Federation

This example highlights a critical principle for Query Federation: Always consider where the data for your operation lives. For small datasets, these considerations are often negligible, as the convenience of federation outweighs any minor performance hits. However, as soon as your data gets big, a small change in data location can turn an efficient, zero-copy query into a massive, slow, and expensive data-copying operation.

The Composability Challenge: Strategically Combining Local and Remote Data

Query federation is incredibly powerful, but its effectiveness hinges on one critical condition: the ability to push most, if not all, of the query down to the remote system for processing. The entire model can break the moment a single part of the query depends on data that exists locally in Data 360. Because that one table is local, the entire join must now be processed in Data 360, potentially forcing the system to pull billions of rows from the remote system.

This behavior is the root of the composability challenge. Certain features, by design, create materialized data in Data 360. Operations like Harmonization, Batch Transforms, Calculated Insights, and Segmentation all produce outputs that are stored within Data 360. For example, creating a complicated segment on remote-only data is fine, as most, if not all, of the segmentation query can be pushed down to the remote system. However, the output of that process is a so-called Segment membership table that is created and stored locally in Data 360. A subsequent Activation on this segment requires this local membership table as an input, which immediately breaks the pushdown model for the activation query.

Using any of these materialized datasets in a subsequent query complicates the pushdown model. It’s not necessarily an all-or-nothing scenario; parts of a query that don’t rely on local data may still be pushed down to the remote system. However, any part of the operation that requires the materialized local input, such as a join, cannot be pushed down. This prevents a full query pushdown, which is where the most significant performance benefits lie. The engine is forced to pull all the required remote data to perform the join locally, which breaks the zero-copy paradigm and creates performance bottlenecks.
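This “not all-or-nothing” behavior can be pictured as a tiny planner that walks the tables a query touches: anything referencing only remote tables is a pushdown candidate, but the first operator that needs a locally materialized input (such as a segment membership table) pins the join to local compute. The following is a hypothetical sketch of that reasoning, not Data 360’s actual planner:

```python
def plan_query(tables, local_tables):
    """Split a query's table references into pushable vs. local work.

    The join can only be pushed down if *every* table it touches lives
    remotely; a single local table forces a local join and a bulk pull
    of the remote side's rows.
    """
    remote = [t for t in tables if t not in local_tables]
    local = [t for t in tables if t in local_tables]
    if not local:
        return {"pushdown": "full", "pulled_tables": []}
    # Partial pushdown: remote tables can still be filtered remotely,
    # but their (possibly huge) results must be pulled for the join.
    return {"pushdown": "partial" if remote else "none",
            "pulled_tables": remote}

# Remote-only segmentation: fully pushed down, nothing pulled.
print(plan_query({"customers", "orders"}, local_tables=set()))
# Activation joining the local segment membership table: orders get pulled.
print(plan_query({"segment_members", "orders"},
                 local_tables={"segment_members"}))
```

The key observation is that adding one local table does not merely add work; it changes the category of the plan from "full pushdown" to "pull and join locally" for everything joined against it.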

Choosing the Right Federation Strategy

Navigating these options can seem complex, but a good rule of thumb is to follow a progressive approach. Start with Live Query Federation, as it is the simplest to set up and ideal for many use cases. If you find that performance isn’t meeting your needs (often due to large data transfers or a missing pushdown), try Accelerated Query Federation next. This is a great fit if your data does not change too frequently and can tolerate minor staleness. Finally, if neither of those options fits your requirements, or if you need the highest possible performance on remote data that remains in your data lake, File Federation offers the most powerful and robust solution.
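The rule of thumb above can be written down as a small decision helper. This is just a codification of the guidance in this post, not an official API:

```python
def choose_federation(performance_ok=True, data_changes_rarely=False,
                      tolerates_staleness=False):
    """Progressive strategy: start live, escalate only when needed."""
    if performance_ok:
        return "Live Query Federation"         # simplest setup, try it first
    if data_changes_rarely and tolerates_staleness:
        return "Accelerated Query Federation"  # cached copy inside Data 360
    return "File Federation"                   # direct file access, most robust

print(choose_federation())                      # Live Query Federation
print(choose_federation(performance_ok=False,
                        data_changes_rarely=True,
                        tolerates_staleness=True))
print(choose_federation(performance_ok=False))  # File Federation
```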

With that in mind, let’s look at the strategic design patterns that emerge from these options:

  1. The Co-location Strategy (Accelerate Everything): If your workflows frequently use features that materialize data (like Harmonization or Batch Transforms) or depend on previously ingested data, the most performant approach is to co-locate all the necessary data. By accelerating all relevant datasets into Data 360, you ensure that all subsequent operations are performed on data that is local to the compute resources, guaranteeing the highest possible performance.
  2. The Direct Access Strategy (File Federation): If your priority is to keep data in the source system while still enabling complex transformations, File Federation offers a highly efficient alternative. Since this pattern accesses the underlying data files directly, it can often bypass the pushdown limitations encountered when materialized data is introduced, providing a more robust path for zero-copy operations.
  3. The Hybrid Strategy (Filter and Co-locate): This pattern perfectly illustrates how to use Live Query Federation’s strengths—like its simple setup and fast time-to-insight—while consciously avoiding its common pitfalls. It’s a powerful design where you start with live query federation on a large remote dataset, then create a transform to filter it down to a smaller, more relevant subset. This transform, by design, materializes a much smaller, more relevant result in Data 360. You can then use this smaller, local dataset in subsequent, high-performance joins with other accelerated or local data. This approach gives you the best of both worlds: you get the immediate access of live federation for rapid exploration and initial filtering, but deliberately materialize only the small subset you need, ensuring all heavy-lifting joins are fast and local.
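The hybrid pattern boils down to three moves: filter remotely once, materialize only the small result, then join locally. A minimal sqlite3 sketch with invented table names shows the shape of it; the “remote” database stands in for the warehouse and the “local” one for Data 360:

```python
import sqlite3

remote = sqlite3.connect(":memory:")   # stands in for the remote warehouse
local = sqlite3.connect(":memory:")    # stands in for Data 360

remote.execute("CREATE TABLE events (user_id INTEGER, kind TEXT)")
remote.executemany("INSERT INTO events VALUES (?, ?)",
                   [(i % 50, "purchase" if i % 25 == 0 else "view")
                    for i in range(1000)])

# Move 1+2: the live query pushes the filter down; only the small subset
# (purchases) crosses the boundary and is materialized locally.
subset = remote.execute(
    "SELECT user_id, kind FROM events WHERE kind = 'purchase'").fetchall()
local.execute("CREATE TABLE purchases (user_id INTEGER, kind TEXT)")
local.executemany("INSERT INTO purchases VALUES (?, ?)", subset)

# Move 3: heavy-lifting joins now run entirely on local, co-located data.
local.execute("CREATE TABLE profiles (user_id INTEGER, tier TEXT)")
local.executemany("INSERT INTO profiles VALUES (?, ?)",
                  [(i, "Gold" if i % 2 == 0 else "Silver") for i in range(50)])
gold_buyers = local.execute("""
    SELECT DISTINCT p.user_id FROM purchases pu
    JOIN profiles p ON p.user_id = pu.user_id
    WHERE p.tier = 'Gold'
""").fetchall()

print("rows materialized locally:", len(subset))  # 40 rows, not all 1000
print("gold buyers:", len(gold_buyers))
```

Only the filtered subset is ever copied, so the deliberate materialization stays small while every subsequent join runs at local speed.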

Conclusion

Zero-copy federation is more than just a buzzword; it’s a powerful architectural approach for building a truly unified and agile data landscape. The principles of data locality and composability aren’t limitations—they are the strategic levers that turn this concept into a high-performance reality. By proactively choosing the right federation pattern for your use case and embracing data locality as a core design principle, you move from simply connecting data to intelligently orchestrating it. This is how you unlock the true speed, efficiency, and agility promised by a zero-copy future.
