This article provides recommendations on a Snowflake powered Data Lake scenario in light of challenges and advantages offered by a Data Lake solution.
In general, modern enterprises must deal with data sources available in varied degrees of volume and variety. Connecting to multiple data sources for analytics becomes cumbersome, while ensuring consistent performance and speed for converting data to insights. Data and thus Information does not provide tangible value unless managed strategically for the purposes of analytics.
What is a Data Lake?
Data Lake is a well known concept and we'd rather not spend too much time explaining it. However for the uninitiated, a Data Lake is a repository to store all your raw data for analytics.
For modern analytical requirements, it becomes imperative to harness data in different types and format coming from different sources and at varied velocities and thus a Data Lake can act as central repository to store all your structured data coming from your on-premises applications, databases, and from cloud data sources. It must also provide capabilities to store data coming in semi-structured (JSON, AVRO, PARQUET, XML, RAW) and unstructured formats (Audio, Video, Binary) with hundreds of millions of data streams continuously ingesting data into data lake storage.
Why Data Lake?
Every organization deals with SaaS data in one way or the other. SaaS applications being easy to configure and ready to use, organizations must not be limited on maximizing usage of its data stored in the cloud. Exponential amount of data being generated by the second, makes it difficult to secure and store the data as soon as it is made available along with the potential risk of losing value derived from it, if left untouched.
Data Lake enables rapid analytical capabilities over big or small data and facilitates maintaining a single source of truth. While Data Lake platforms are expected to be robust, elastic, and scalable to serve analytics requirements, they must also cater to a variety of users and use cases.
Challenges in Implementing Data Lake
5. Lack of Governance and Security
6. Limited Automation capabilities
7. Native SQL support to query your structured and semi-structured data.
8. Complex data pipelines built using multiple applications.
9. Data silos even with data lake implementation
An ideal Data Lake solution must
1. Store data easily in RAW format
2. Enable analytics as soon as data is available in Data Lake.
3. Make data availability consistent and reliable.
4. Facilitate different use case for analytical needs, OLAP, and Data Science.
5. Enable Governance, Security and Granular Access Control (GDPR, HIPAA compliance)
6. Provide Self-service for all data consumers
7. Support Performance and Scalability
9. Support Data-Ops
10. Foster Agility
11. Support Pay as you use (OPEX) models.
An Ideal Data Lake Solution
Snowflake Cloud Data Platform with Data Lake workload
Snowflake enables rapid data access, query performance and transformation while ensuring native support to SQL and built in Data Access control and Role Based Access Control (RBAC) data access security.
Snowflake provides flexible, elastic, robust, massive performance architecture for data lake workloads with abilities to join variety of data formats in single SQL queries.
(**Note: Support for Unstructured data currently in Private Preview)
Snowflake Data Lake workloads can help achieve the following use cases:
1. Enhance existing Cloud data lake capabilities.
Enterprises expend considerable amount of efforts in building, designing and implementing Data Lakes built on S3, Blob, or GCP storage. To achieve simplicity and ease of use of Snowflake Data Lake workloads, enterprises augment existing data lakes instead of replacing them.
Data from Snowflake external tables can be queried without being loaded into Snowflake. To enhance query performance, materialized views with option of portioning can be created on top of external tables.
2. Ad-hoc analytics using SQLs.
Users can run ad-hoc queries on external table without loading data into Snowflake. This helps to determine if relevant data is required for analytics and if required to be kept in data lake.
Data in Materialized views can be refreshed automatically on the basis of Cloud Notification service which notifies Snowflake on arrival of new data files in external stage.
Data Lake with Snowflake Cloud Data Platform can be implemented in two ways:
1. External Tables (Cloud Data Storage)
Snowflake Data Lake with External Tables leverages Snowflake Storage Integration feature to integrate Cloud storage for AWS S3, Azure BLOB or GCP Storage as external tables. By creating storage integration and file format as per file types required to read, SQLs from Snowflake can directly select data using simple SELECT statement from cloud location files.
Steps to create external table
Step 1: Create Storage Integration
Step 2: Create File format
Step 3: Create External Table
Advantages with External Tables
Query data using SELECT statement on top of your Data Lake
Snowflake supports execution of concurrent queries against data lake without impacting performance.
Use external tables to directly query data in your data lake without having to move data.
Increase query performance by using materialized views over external tables
Synchronize external tables with Apache Hive metastore
Automatically register new files from your data lake with partition auto-refresh
Accelerate data exploration with Snowsight, the built-in visualization UI for Snowflake
2. Snowflake Tables (RAW Data Lake + Automation on Data Pipelines)
With Snowflake as Data Lake workload, ingest all your data in Snowflake data lake tables as is from sources. Perform SQL operations on structured and semi structured data at the same time.
The recent GA feature on vectorized Parquet scanner, has improved performance 8x for parquet files.
Leveraging snowflake as Data Lake helps in making data storage cheaper, compressed and secured.
Automated Data Pipelines using external tables
With availability of Streams, Data Sharing and enhanced performance on Parquet file, you can build automated data pipelines for your external data in cloud storage.
Steps in Automation
1. New files landing into Cloud Storage
2. Notification Triggers Stream on External Table
3. Streams and Tasks performs Load (with Transformations) into Snowflake Tables
4. Data Sharing via External Table (now possible) and Internal table
5. Unload data from Internal tables (enriched data) to Cloud storage
6. No Manual actions required for this load/transformation/unload process.
Advantages using Snowflake Data Lake Workloads
1. Augment existing data lakes with Snowflake’s capabilities of automations with zero maintenance.
2. Auto ingest data and enable change data capture (CDC) with continuous data pipelines, using Cloud Notifications, Snowpipe, and Streams & Tasks.
3. Transform data efficiently with ANSI SQL.
4. Transform and load enriched data using Scale up or down instantly to optimize pipeline performance.
5. Build and Orchestrate Data Pipelines using Streams and Tasks on variant data types and different ingestion styles.
6. Implement granular access control using RBAC models.
7. Implement Data Masking, Row and Column level filtering access easily.
8. Use extended capabilities on external data sharing, marketplace for external tables, data lake.
9. Leverage Snowflake compression data storage to reduce data volumes from external data lakes.
10. Use single platform for all your data needs with faster data ingestion, transformation, and consumption.
11. Provide one copy of your data – a single source of truth – to all your data users.
12. Minimize tasks on External Data Lake management, tuning and optimization, bring all your data to Snowflake.
Start using Snowflake Data Lake workloads with efficient data ingestion, transformation, and orchestration of data pipelines using Automation accelerators from indigoChart.
We can help augment, or migrate your existing data lake or build new data lake on Snowflake.
Snowflake Documentation https://docs.snowflake.com/en/
Automated Data Pipelines https://www.snowflake.com/blog/external-tables-are-now-generally-available-on-snowflake/