Building and training Document AI models in Snowflake involves setting up roles, databases, and workflows that enable automated document processing. This guide walks you through the essential steps, from configuring the environment to creating and training your Document AI Build, ensuring a fully operational pipeline for document processing and analysis.
The article comprises three sections:
Pre "Document AI Build" Snowflake SQL.
Create Document AI Build in Snowflake, Step by Step procedure.
Post "Document AI Build" Snowflake SQL.
The SQL statements can be copy-pasted into your Snowflake environment as-is.
Pre "Document AI Build" Snowflake SQL
--Step 1: Create a custom role, and a database and schema to hold the Document AI model build:
USE ROLE SECURITYADMIN;
CREATE ROLE IC_DOCUMENT_AI_MANAGER;
USE ROLE SYSADMIN;
CREATE DATABASE DB_DOCUMENTS; -- Optional: you may use an existing RAW database instead
CREATE SCHEMA DB_DOCUMENTS.DOCUMENTS; -- If reusing a database, create the schema there instead
--Step 2: Grant the IC_DOCUMENT_AI_MANAGER role to SYSADMIN.
USE ROLE SECURITYADMIN;
GRANT ROLE IC_DOCUMENT_AI_MANAGER TO ROLE SYSADMIN;
--Step 3: Grant the SNOWFLAKE.DOCUMENT_INTELLIGENCE_CREATOR database role to the IC_DOCUMENT_AI_MANAGER role:
USE ROLE ACCOUNTADMIN;
GRANT DATABASE ROLE SNOWFLAKE.DOCUMENT_INTELLIGENCE_CREATOR TO ROLE IC_DOCUMENT_AI_MANAGER;
--Step 4: Grant warehouse usage and operating privileges to the IC_DOCUMENT_AI_MANAGER role:
USE ROLE SYSADMIN;
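-- This tutorial assumes a warehouse named DOCUMENT_AI_WH already exists. If it
-- does not, a small warehouse is enough to start; the size and settings below
-- are suggestions, not requirements:
CREATE WAREHOUSE IF NOT EXISTS DOCUMENT_AI_WH
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;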
GRANT USAGE, OPERATE ON WAREHOUSE DOCUMENT_AI_WH TO ROLE IC_DOCUMENT_AI_MANAGER;
--Step 5: Grant the privileges to use the database and schema you created to the IC_DOCUMENT_AI_MANAGER role:
USE ROLE SECURITYADMIN;
GRANT USAGE ON DATABASE DB_DOCUMENTS TO ROLE IC_DOCUMENT_AI_MANAGER;
GRANT USAGE ON SCHEMA DB_DOCUMENTS.DOCUMENTS TO ROLE IC_DOCUMENT_AI_MANAGER;
--Step 6: Grant the IC_DOCUMENT_AI_MANAGER role the privileges to create stages (to store the documents for extraction), dynamic tables, procedures, and functions on the schema:
USE ROLE SECURITYADMIN;
GRANT CREATE STAGE ON SCHEMA DB_DOCUMENTS.DOCUMENTS TO ROLE IC_DOCUMENT_AI_MANAGER;
GRANT CREATE DYNAMIC TABLE ON SCHEMA DB_DOCUMENTS.DOCUMENTS TO ROLE IC_DOCUMENT_AI_MANAGER;
GRANT CREATE PROCEDURE ON SCHEMA DB_DOCUMENTS.DOCUMENTS TO ROLE IC_DOCUMENT_AI_MANAGER;
GRANT CREATE FUNCTION ON SCHEMA DB_DOCUMENTS.DOCUMENTS TO ROLE IC_DOCUMENT_AI_MANAGER;
--Step 7: Grant the privilege to create model builds (instances of the DOCUMENT_INTELLIGENCE class) to the IC_DOCUMENT_AI_MANAGER role:
USE ROLE SECURITYADMIN;
GRANT CREATE SNOWFLAKE.ML.DOCUMENT_INTELLIGENCE ON SCHEMA DB_DOCUMENTS.DOCUMENTS TO ROLE IC_DOCUMENT_AI_MANAGER;
--Step 8: Grant the privileges required to create a processing pipeline using streams and tasks to the IC_DOCUMENT_AI_MANAGER role:
USE ROLE SECURITYADMIN;
GRANT CREATE STREAM, CREATE TABLE, CREATE TASK, CREATE VIEW ON SCHEMA DB_DOCUMENTS.DOCUMENTS TO ROLE IC_DOCUMENT_AI_MANAGER;
USE ROLE ACCOUNTADMIN;
GRANT EXECUTE TASK ON ACCOUNT TO ROLE IC_DOCUMENT_AI_MANAGER;
--Step 9: Grant the IC_DOCUMENT_AI_MANAGER role to your user for use in the next steps of the tutorial:
USE ROLE SECURITYADMIN;
GRANT ROLE IC_DOCUMENT_AI_MANAGER TO USER your_username;
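-- Optional sanity check: confirm the role now holds the expected grants:
SHOW GRANTS TO ROLE IC_DOCUMENT_AI_MANAGER;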
--Step 10: Create an internal stage to hold the documents
USE ROLE IC_DOCUMENT_AI_MANAGER;
USE SCHEMA DB_DOCUMENTS.DOCUMENTS;
CREATE STAGE STG_INT_DOCUMENTS
DIRECTORY = (ENABLE = TRUE)
ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');
--Step 11: Create a stream on the stage to detect newly uploaded files
CREATE STREAM STREAM_NEW_INVOICE ON STAGE STG_INT_DOCUMENTS;
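-- The stream sees new files through the stage's directory table. After uploading
-- documents (via Snowsight or PUT), you can refresh the directory table manually
-- and verify the files are visible; depending on the upload path, the explicit
-- refresh may not always be required:
ALTER STAGE STG_INT_DOCUMENTS REFRESH;
SELECT * FROM DIRECTORY(@STG_INT_DOCUMENTS);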
2. Create Document AI Build in Snowflake, Step by Step
Once the Snowflake environment is set up using the SQL in Section 1, create and train the Document AI Build in the Snowflake UI:
1. Using the role IC_DOCUMENT_AI_MANAGER created earlier, navigate to AI & ML → Document AI. Select the database and schema you previously set up, along with the virtual warehouse, then click + Build to initiate the process.
2. Provide a unique Build Name and specify the Database.Schema where the build will be created.
3. Upload one or more sample documents for testing. This step allows you to extract information and evaluate the model's output accuracy.
4. Define the specific values you want the model to extract as JSON output under Values to Extract. For example:
● Expected values: Training Cost and Total Cost.
● When the model extracts these values accurately, verify the results by clicking the check mark. If correct, the OCR score will be set to 1. This process also contributes to training the model for improved future extractions.
Click Accept and Close to finalize the build.
5. Publish the build to make it usable; a sample smoke-test query follows this list. Once published, tasks configured in Step 13 of Section 3 below (Post "Document AI Build" Snowflake SQL) will populate the first table with the JSON output.
6. Create a Dynamic Table to convert the JSON data into a readable and structured format. This table will automatically update with any new data inserted into the base table.
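A published build can be smoke-tested directly from a worksheet before the pipeline is wired up. A minimal sketch, assuming the build is named IC_DOCUMENT_AI_BUILD_1 (the name used in Step 13 below) and that invoice_001.pdf is a placeholder for a document already uploaded to the stage:

SELECT IC_DOCUMENT_AI_BUILD_1!PREDICT(GET_PRESIGNED_URL('@STG_INT_DOCUMENTS', 'invoice_001.pdf'), 1);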
Once the Document AI Build is published, tasks set up in Section 3 below (Post "Document AI Build" Snowflake SQL) populate the first table (invoice_raw) with raw JSON data extracted from documents. This data is then processed into a dynamic table (invoices_from_documents) for structured analysis, with fields like dues, invoice numbers, and OCR scores. These steps automate the entire pipeline from document ingestion to structured data extraction, enabling scalable and efficient document processing workflows.
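For reference, the JSON stored in invoice_raw and parsed in Step 14 has roughly the following shape; the top-level keys match the values defined in the build, and the numbers shown are purely illustrative:

{
  "__documentMetadata": { "ocrScore": 0.97 },
  "Dues": [ { "score": 0.95, "value": "1,250.00" } ],
  "INVOICE_NUMBER": [ { "score": 0.99, "value": "INV-2024-0042" } ]
}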
3. Post "Document AI Build" Snowflake SQL
-- Step 12: Create a raw table for extracted document data. The following table stores the extracted information in a JSON column, populated by a task that runs every minute but only executes when the stream detects a new file in the internal stage.
CREATE OR REPLACE TABLE invoice_raw (
file_name VARCHAR,
file_size VARIANT,
last_modified VARCHAR,
snowflake_file_url VARCHAR,
json_content VARCHAR
);
-- Step 13: Create a task to process new invoices and load data into the raw table. The task runs every minute and inserts the information extracted from each document as raw JSON.
CREATE OR REPLACE TASK load_new_invoice_data
WAREHOUSE = DOCUMENT_AI_WH
SCHEDULE = '1 minute'
COMMENT = 'Process new invoice in the stage and insert data into the invoice_raw table.'
WHEN SYSTEM$STREAM_HAS_DATA('STREAM_NEW_INVOICE')
AS
INSERT INTO invoice_raw (
SELECT
RELATIVE_PATH AS file_name,
size AS file_size,
last_modified,
file_url AS snowflake_file_url,
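-- The build name below must match the Build Name you chose in Section 2: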
IC_DOCUMENT_AI_BUILD_1!PREDICT(GET_PRESIGNED_URL('@STG_INT_DOCUMENTS', RELATIVE_PATH), 1) AS json_content
FROM STREAM_NEW_INVOICE
WHERE METADATA$ACTION = 'INSERT'
);
ALTER TASK load_new_invoice_data RESUME;
-- To pause processing later:
-- ALTER TASK load_new_invoice_data SUSPEND;
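-- Check recent task runs and any errors (TASK_HISTORY is a standard Snowflake table function):
SELECT name, state, scheduled_time, error_message
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => 'LOAD_NEW_INVOICE_DATA'))
ORDER BY scheduled_time DESC
LIMIT 10;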
-- Optional: TRUNCATE TABLE invoice_raw;
SELECT * FROM invoice_raw;
-- Step 14: Create a dynamic table for structured document data. In this example it lives in the same database and schema; it parses the raw JSON into a structured table, exposing the dues and invoice number extracted from each document along with the OCR scores that indicate extraction accuracy.
CREATE DYNAMIC TABLE DB_DOCUMENTS.DOCUMENTS.invoices_from_documents
(
File_Name COMMENT 'Document Name',
File_size COMMENT 'Size of the file',
last_modified COMMENT 'Last Modified Date',
snowflake_file_url COMMENT 'File location in Snowflake internal Stage',
ocr_score COMMENT 'Overall score',
dues_score COMMENT 'Score for Dues value extraction',
dues COMMENT 'Dues',
invoice_number_score COMMENT 'Score for Invoice number extraction',
invoice_number COMMENT 'Invoice Number'
)
TARGET_LAG = '1 minutes'
WAREHOUSE = DOCUMENT_AI_WH
AS
SELECT
file_name,
file_size,
last_modified,
snowflake_file_url,
PARSE_JSON(json_content):__documentMetadata.ocrScore::FLOAT AS ocrScore,
f.value:score::FLOAT AS Dues_score,
f.value:value::STRING AS Dues_value,
g.value:score::FLOAT AS INVOICE_NUMBER_score,
g.value:value::STRING AS INVOICE_NUMBER_value
FROM invoice_raw,
LATERAL FLATTEN(INPUT => PARSE_JSON(json_content):Dues) f,
LATERAL FLATTEN(INPUT => PARSE_JSON(json_content):INVOICE_NUMBER) g
GROUP BY ALL;
SELECT * FROM invoices_from_documents;
-- Resume the dynamic table if it was suspended:
ALTER DYNAMIC TABLE invoices_from_documents RESUME;
-- To suspend automatic refreshes later:
-- ALTER DYNAMIC TABLE invoices_from_documents SUSPEND;
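-- Inspect recent refreshes of the dynamic table
-- (DYNAMIC_TABLE_REFRESH_HISTORY is a standard Snowflake table function):
SELECT name, state, refresh_start_time, refresh_end_time
FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY(NAME => 'INVOICES_FROM_DOCUMENTS'))
ORDER BY refresh_start_time DESC
LIMIT 10;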
Bringing it All Together
By following these steps, you’ll build a fully operational Snowflake Document AI pipeline. From foundational setup to creating and training a model, this tutorial equips you to automate document processing and derive actionable insights efficiently. Before moving to production, complete any additional configuration and validation your environment requires.
Visit us at www.indigoChart.com or drop us a line at hello@indigochart.com