Dynamic Data Mapping and ingestion with Snowflake

Manisha Jaiswal
Nov 24, 2021
6 min read

Whereas syncing data from static files into a database is straightforward, the situation is rather different when handling dynamic files (where file attributes change for a given file). Due to the changing structure of rows and columns during ETL loading and handling, issues may arise. In this writeup, we will illustrate how to use native Snowflake and Azure technology in handling inconsistencies encountered when loading data files.

Potential Scenarios for Dynamic Data Handling

There are several scenarios of dynamic data handling that could result in ETL error. The below table summarizes four of such situations, each of which will be discussed in detail in the subsequent section titled “Handling Scenarios with Dynamic Data Handling”.

Of the four scenarios, only the first one has an equal number of columns in the file (source input) and table (destination) and is straightforward. As a result, no ETL error will be generated, as no structure update is necessary to insert the source data into the target table.

However, structural adjustments will be necessary for each of the other three scenarios that when unhandled, may result in ETL failure. With Snowflake's data management capabilities, it will be feasible to handle all four issue scenarios

How Dynamic Data is Handled with Snowflake?

The process flow diagram below illustrates how the Snowflake architecture initiates the data mapping and ingestion process when a JSON file is uploaded to blob. All actions are performed progressively utilizing the Azure cloud architecture, and at the final step, data is fed into Snowflake's loading zone using Snowpipe.

The Snowflake Stored Procedure is the backbone of this solution and is an accelerator developed by indigoChart.

Technologies We Use

To implement this data management architecture efficiently, we have coupled Azure services with Snowflake components.

Azure Services

We use Microsoft's Azure cloud computing technologies to build a data pipeline for the automated feeding of files from Azure storage into a snowflake table, with the commonly used Azure tools listed below:

Blob Storage: Among the many benefits of utilizing blob to manage vast volumes of raw data are its high availability, cost efficiency, and data consistency. Additionally, it is very competent in data recovery in the event of a disaster.

Event Grid: Webbooks, Azure Functions, Event Hubs, Service Bus, Relay Hybrid Connections, and Storage Queues are just a few of the Event Grid's widely utilized event handlers. It is very beneficial for developing a server less cloud architecture.

Queue Storage: Azure's Queue Storage enables us to significantly increase efficiency via throttling and to be very robust to any system faults.

Snowflake Components

Notification Integration feature enables us to create a connection between Snowflake and Azure.

Storage Integration enables single storage to accommodate many external stages and, concurrently, allows users to withhold credentials during data loading or unloading.

External Stage enables storing data files in a non-Snowflake environment and natively supports Azure containers for smooth communication.

Snowpipe gives real-time insights to all users cost-effectively and flexibly without causing task conflict. It helps us to load data instantly.

Stream maintains a history of data manipulation language (DML) alterations, executing particular operations on the updated data. For example, the following figure illustrates how the steam keeps track of DML modifications in a previously created table.

Tasks enable the execution of continuous ELT processes in combination with table streams and the scheduling of data transformation. It may be used in a stand-alone environment to generate periodic reports by inserting or by combining data. Additionally, we make the greatest use of Snowflake's operational rules while conducting overlapping jobs by doing a suitable SQL statement analysis. Proper usage of these integrated Snowflake technologies greatly saves our users' operating costs.

Set up and Scenarios

We will now set up our scenarios to demonstrate smooth handling of all four scenarios using Azure portal and Snowflake environment.

On the Azure side

First, we have created a Storage Account and Event Grid in our Resource Group on the Azure side

We also have our Container to load files and Queue storage for messages created.

On the Snowflake side

Here are some of the key tables that we will use for this load:

Data Dictionary is used for deriving new columns, especially the ones that are available in files but missing in target tables.

Landing Table is to upload files data.

Operational Data Store Table is used for parsing and storing LZ data.

Structure of ODS Table.

In the next section, we will upload files and show how data flows into target tables in all four scenarios.

Handling Scenarios with Dynamic Data Handling

Scenario 1: File Columns = Target Table Columns

The source JSON file has identical number of columns as the destination table and traverses the Snowflake landing zone. Following that, a Stream from the top of the table monitors DML changes and launches a Task to start the Stored Proc that inserts data into the ODS layer.

Thus, each attribute in the Source File has been represented by Column in Target table, as expected.

Scenario 2: File Columns < Target Table Columns

As seen in the accompanying figure, the source JSON file has three columns, but the target table expects five. Data for Columns “Description” and “Leave” are missing in the file.

Despite this, we were able to correctly enter all data by identifying the missing ones as NULL.

Scenario 3: Dynamic Columns\Attributes within the file

The number of attributes in the source JSON file are different for each row in this scenario. For instance, the 1st record has four attributes, the 2nd record has three attributes, and the 4th record has only one attribute as seen in the highlighted area.

Now let’s locate the data in the ODS table and see how it is represented once integrated in Snowflake.

We scanned the data for each row in the target table and ensured that the columns specified in the File were inserted. Columns that are not present are denoted as NULL.

For example, Record containing single column – “SerialNumber” in the file, has been inserted in target with value in SerialNumber column, rest all columns are marked as NULL.

Scenario 4: File Columns > Target Table Columns

In this case, the source JSON file includes two more columns (“Comments” and “PendingIssues”) compared to the target table.

Referring to the Data Dictionary, we can see the column “Comments” in this table, the Data type for which is “String”. Likewise, the data type for the column “PendingIssues” is also derived by scanning the data dictionary.

Notice the change in table structure once the Task gets executed for loading the above JSON data. Two new columns “Comments” and “PendingIssues” gets added at the end of the table. For prior entries, NULL was added for these additional columns.

Key Advantage of the proposed Snowflake based solution

Listed below are the primary advantages of using our proposed Snowflake solution:

Users receive the detailed insights instantly, as Snowpipe delivers real-time business data from all areas without any issues related to workload.
Duplication of data can be fully avoided because Snowpipe uses file loading metadata tagged with each pipe object.
The implementation is highly cost effective because you will only need to pay for the actual computational time for data loading. And, it is measured by seconds, not hours!
The system is very easy to implement because data can be loaded asynchronously in a smooth and continuous flow with streams and tasks.
Snowpipe eliminates the worry for server management as it adjusts itself for the required capacity of loaded data automatically.
ETL failure can easily be prevented with Snowflake even in dynamic scenarios, as we already demonstrated in this post.

And some minor drawbacks

While using Snowflake is an excellent complement to your organization's data management strategy, there are a few small drawbacks, including the following:

Tasks cannot be monitored or troubleshooted through web-based features.
Access to the source table may be denied due to stream staleness.
Estimating the latency of a Snowpipe is often challenging.

With our extensive knowledge and relationship with Snowflake, we can assist your business in managing the mapping and ingestion of dynamic data flawlessly. We execute on all areas of Data and Analytics by embracing and organically expanding your current skills, whether it's developing new solutions or supporting your operations. Contact us right off the bat and we'll take it from there.

Visit us at www.indigoChart.com or drop us a line at hello@indigochart.com

indigoChart Menu