Optimizing Data Processing in Azure Data Factory: The Power of Change Data Capture

Tahar Chanane
Towards Dev


In the fast-moving world of data, efficiency is essential. Azure Data Factory (ADF), Microsoft's cloud data-integration service, is a vital tool for extracting data; the challenge is making those extractions faster and cheaper. In this piece, we compare two strategies: one that makes many API calls and costs more, and one that uses Change Data Capture (CDC) in ADF to save time and money. We'll explain how the two methods differ and how adopting CDC can improve data handling.

Using API Calls with Azure Data Factory

Azure Data Factory is a powerful cloud service for building and managing data pipelines. Here, we use ADF to set up a data retrieval process that is both efficient and inexpensive.

In ADF, the Copy Data activity supports dynamic API calls within these pipelines. We start with a single API call that retrieves key data, such as details about ongoing projects or teams. That first call then drives a series of further API calls, configured from the data it returned. The retrieval process is therefore dynamic and conditional, as the case studies below illustrate.

The Traditional Method: A Case Study

Our first case study shows a method that is traditional but expensive. A single API call retrieves data on all projects and then feeds a loop: each project ID returned is used to build a unique URL for an individual API call that fetches that project's details. Because the loop does not distinguish new data from old, it makes many unnecessary API calls and drives up costs, as the sketch below makes plain.
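
For illustration, here is a minimal Python sketch of this traditional pattern. The `BASE_URL` and the `/projects` and `/projects/{id}` endpoints are hypothetical stand-ins for whatever API your pipeline calls; in ADF itself, this loop would typically be a Lookup activity feeding a ForEach of Copy Data activities.

```python
import requests

# Hypothetical API endpoint -- substitute your own service.
BASE_URL = "https://api.example.com"

def fetch_all_project_details():
    """Traditional approach: fetch details for EVERY project on every run."""
    # One initial call returns metadata for all projects.
    projects = requests.get(f"{BASE_URL}/projects", timeout=30).json()

    details = []
    # The loop does not distinguish changed from unchanged projects,
    # so every run pays for one API call per project.
    for project in projects:
        url = f"{BASE_URL}/projects/{project['id']}"
        details.append(requests.get(url, timeout=30).json())
    return details
```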

A Smarter, Cost-Saving Method with CDC: A Case Study

The second method in our study, by contrast, takes a smarter, cheaper approach: it only makes further calls for data sets that have changed since the last run. Here's how it works:

  1. An initial API call gathers data on all projects.
  2. A filter keeps only the project IDs that have changed since the last run time.
  3. Subsequent API calls target only those changed data sets, reducing both the number of calls and the cost.

This way, the process not only saves money but also touches only changed data sets, saving time and computing power. A minimal sketch of this filtering logic follows.
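
Here is the same loop with change filtering added. The `last_modified` field and the local watermark file are assumptions about your API and environment; in a real ADF pipeline, the watermark would more likely live in a control table or a pipeline variable.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

import requests

BASE_URL = "https://api.example.com"      # hypothetical endpoint
WATERMARK_FILE = Path("last_run.json")    # stand-in for a control table

def load_watermark() -> str:
    """Return the timestamp of the last successful run."""
    if WATERMARK_FILE.exists():
        return json.loads(WATERMARK_FILE.read_text())["last_run"]
    return "1970-01-01T00:00:00+00:00"    # first run: take everything

def fetch_changed_project_details():
    last_run = load_watermark()
    run_started = datetime.now(timezone.utc).isoformat()

    projects = requests.get(f"{BASE_URL}/projects", timeout=30).json()

    # Filter step: keep only projects changed since the last run.
    # Assumes consistently formatted ISO-8601 timestamps, so string
    # comparison is safe.
    changed = [p for p in projects if p["last_modified"] > last_run]

    details = [
        requests.get(f"{BASE_URL}/projects/{p['id']}", timeout=30).json()
        for p in changed
    ]

    # Persist the new watermark only after a successful run.
    WATERMARK_FILE.write_text(json.dumps({"last_run": run_started}))
    return details
```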

[Figure: API-Driven Data Retrieval Workflow, by Tahar Chanane]

A Note on Azure Data Factory’s Inbuilt CDC Features

Before we continue, it’s essential to note that Azure Data Factory offers built-in CDC features that can further enhance the efficiency of data retrieval processes. According to the official documentation by Microsoft, there are several avenues within ADF to implement CDC effectively. Here are some key highlights:

  • Change Data Capture Factory Resource: This is a straightforward way to get started with CDC in ADF, where you can select your sources and destinations, apply optional transformations, and start your data capture without designing extensive pipelines or data flow activities.
  • Native Change Data Capture in Mapping Data Flow: This feature allows ADF to automatically detect and extract changed data (including inserted, updated, and deleted rows) from source databases without needing timestamp or ID columns for change identification.
  • Customer-Managed Delta Data Extraction in Pipeline: This is a more flexible approach in which you build custom delta-extraction pipelines for any ADF-supported data store, for example using a Lookup activity to read a watermark value stored in an external control table (see the sketch after this list).
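
To make the watermark pattern concrete, here is a sketch of the customer-managed approach in Python with pyodbc. The table names (`dbo.Orders`, `dbo.watermark_control`), the `modified_at` column, and the connection string are all hypothetical; in ADF the same three steps map to a Lookup activity (read the watermark), a Copy activity (delta query), and a Stored Procedure activity (advance the watermark).

```python
import pyodbc

# Illustrative connection string -- supply your own server and database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;"
    "UID=user;PWD=password"
)
cursor = conn.cursor()

# Step 1 (ADF: Lookup activity) -- read the stored watermark.
cursor.execute(
    "SELECT watermark_value FROM dbo.watermark_control WHERE table_name = ?",
    "dbo.Orders",
)
last_watermark = cursor.fetchone()[0]

# Step 2 (ADF: Copy activity) -- extract only rows changed since then.
cursor.execute(
    "SELECT * FROM dbo.Orders WHERE modified_at > ?", last_watermark
)
changed_rows = cursor.fetchall()

# Step 3 (ADF: Stored Procedure activity) -- advance the watermark.
cursor.execute(
    "UPDATE dbo.watermark_control "
    "SET watermark_value = (SELECT MAX(modified_at) FROM dbo.Orders) "
    "WHERE table_name = ?",
    "dbo.Orders",
)
conn.commit()
```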

For a deeper understanding and to explore tutorials and templates, refer to the official Microsoft Azure documentation.

Comparing Both Methods

Now, let’s compare the cost-saving method to the traditional method to understand its benefits fully:

  • Saving money: The cost-saving method significantly reduces expenses, potentially bringing them down from thousands of dollars to a fraction of that, thanks to its filtering step.
  • Efficiency: By focusing only on changed data, this method avoids repeatedly retrieving the same unchanged data, making pipelines faster and more efficient.
  • Lean processing: This method follows lean-processing principles, promoting efficiency and avoiding waste; the traditional method does not, and wastes resources and money as a result.

Integrating Change Data Capture (CDC) in ADF

Now, let's look more closely at how Change Data Capture (CDC) can be used in Azure Data Factory to improve data workflows. CDC identifies and captures changes to data within a database or data store, which makes it a powerful optimization for workflows in ADF.

Understanding Change Data Capture (CDC)

Rather than querying and processing the entire dataset on every run, CDC reads and processes only the data that has changed since the last pipeline run. The goal is to save time and reduce operational costs.

Using CDC in Azure Data Factory

Azure Data Factory offers several resources and techniques to use CDC effectively, including:

  • Change Data Capture Factory Resource: This resource makes CDC in ADF easy to adopt, walking you through choosing sources and destinations, applying transformations, and starting data capture. Complex pipeline designs are not needed, and you pay only for the processing you use.
  • Native Change Data Capture in Mapping Data Flow: ADF's mapping data flow has built-in CDC support, automatically detecting and extracting changed data (inserted, updated, and deleted rows) from source databases without requiring timestamp or ID columns for change detection.
  • Auto-Incremental Extraction in Mapping Data Flow: When delta retrieval is needed from source stores, ADF's mapping data flow can automatically identify and extract updated rows or files based on an incremental column or a file's last-modified time.

The support for CDC extends across a variety of connectors including, but not limited to, Azure SQL Database, SQL Server, Snowflake, and various Azure storage options.
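
As a peek under the hood: when ADF's native CDC reads from SQL Server or Azure SQL Database, it relies on the database's own change tables. The sketch below assumes a `dbo.Projects` table with CDC already enabled (capture instance `dbo_Projects`) and shows the kind of query involved. ADF performs this for you, so this is purely illustrative.

```python
import pyodbc

conn = pyodbc.connect("DSN=mydb")  # illustrative connection
cursor = conn.cursor()

# SQL Server's CDC feature exposes changes through generated functions.
# No timestamp or ID column on dbo.Projects is required: inserts, updates,
# and deletes are all captured from the transaction log.
cursor.execute("""
    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Projects');
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();
    SELECT *  -- includes __$operation: 1=delete, 2=insert, 3/4=update
    FROM cdc.fn_cdc_get_all_changes_dbo_Projects(@from_lsn, @to_lsn, N'all');
""")
for row in cursor.fetchall():
    print(row)
```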

Implementing Best Practices

To maximize the benefits of using CDC in ADF, certain best practices are recommended:

  • Utilize native CDC: Leveraging native CDC functionality is the simplest way to retrieve change data, and it places less load on the source database during extraction.
  • Opt for auto-incremental extraction: If your database is not among the ADF connectors that support native CDC, auto-incremental extraction is a wise fallback; you only need to nominate the incremental columns (or rely on file last-modified times) to capture changes, as the sketch below shows.
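
For the file-based variant of auto-incremental extraction, here is a sketch using the azure-storage-blob SDK. The account URL, container name, credential, and locally held watermark are assumptions; when you enable incremental loading by last-modified time, ADF's mapping data flow does this filtering for you.

```python
from datetime import datetime, timezone

from azure.storage.blob import ContainerClient

# Hypothetical storage account and container.
container = ContainerClient(
    account_url="https://myaccount.blob.core.windows.net",
    container_name="raw-data",
    credential="<account-key-or-sas>",
)

# Watermark from the previous run (would normally be persisted somewhere).
last_run = datetime(2024, 1, 1, tzinfo=timezone.utc)

# Each blob carries a last_modified property; only newer files are ingested.
new_blobs = [
    b.name for b in container.list_blobs()
    if b.last_modified > last_run
]
print(f"{len(new_blobs)} file(s) changed since {last_run:%Y-%m-%d}")
```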

For a more detailed understanding and insights into using CDC in ADF, you can explore the official Microsoft documentation on the subject.

In conclusion, using Change Data Capture (CDC) in Azure Data Factory can lead to more efficient and cost-effective data processing. By focusing on only the changed data since the last run, businesses can save time, reduce costs, and optimize their data workflows. Integrating CDC into your ADF setup is straightforward, with several tools and resources available to aid in the process.

By employing smarter data processing techniques such as CDC, organizations can ensure a leaner, more cost-effective approach to handling data in the cloud, thus fostering growth and innovation in their operations.

Remember, making smart choices in data processing can lead to significant savings and efficiency improvements. Choose CDC in Azure Data Factory for a smarter approach to data management.

Seamless Data Ingestion from Azure Blob Storage to Snowflake: A Detailed Walkthrough featuring Snowpipe

In your data pipeline journey, you may need to ingest data into Snowflake seamlessly from various sources. A highly efficient way to do this is with Snowpipe, ingesting data from Azure Data Lake Storage (ADLS) directly into Snowflake. The process is known for its scalability and speed, making it a strong choice for modern data pipelines.

For a comprehensive guide on setting up and using Snowpipe for data ingestion from ADLS to Snowflake, we invite you to refer to our detailed article on this topic, where you will find step-by-step instructions to implement this process: Seamless Data Ingestion from Azure Blob Storage to Snowflake: A Detailed Walkthrough featuring Snowpipe

By familiarizing yourself with this method, you can further enhance your data pipeline’s effectiveness, ensuring a smoother, more integrated approach to data management and analytics.
