Creating External Table AS (CETAS) in Parquet format on Azure Synapse Analytics

This is an extension of a prior post that shows how to create an external table in CSV format with Azure Synapse Analytics. Most of the steps equivalent, with the exception of the file format.

This post will show how to create an external table in Parquet format using Azure Synapse Analytics. The creation of an external table will place data inside a folder in the Data Lake Store, that has to be specified. The format can be plain text or parquet. Additional information can be found in the Microsoft documentation link.

There are 3 steps to be accomplished:

  1. Create an external data source
  2. Create an external file format (specific for Parquet)
  3. Create an external table (a schema can also be created as a prior step).

Creating the external references

-- Create an External Data Source
CREATE EXTERNAL DATA SOURCE TestData
WITH ( LOCATION = 'https://account.dfs.core.windows.net/container/');

-- Create an External File Format - specific for Parquet
CREATE EXTERNAL FILE FORMAT parquet
WITH (  
    FORMAT_TYPE = PARQUET, 
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
    ); 

Creating the schema and external table

-- Create Schema (optional)
CREATE SCHEMA ext;

-- Create External Table as (CETAS)

CREATE EXTERNAL TABLE ext.onpecombinadoparquet
WITH (
    LOCATION = 'onpecombinadoparquet/',
    DATA_SOURCE = TestData,  
    FILE_FORMAT = parquet
)  
AS
SELECT * FROM ONPE_ASSUMPTION;

In the steps above, we are using the object ONPE_ASSUMPTION that points to a view created on our prior post.

Review the data

SELECT TOP 100 * FROM ext.onpecombinadoparquet;

Visualize Objects on the Data Lake Store

Behind the scenes, Azure Synapse is creating a series of files within a folder (as specified). You can visualize right from the Synapse Workspace:

You cannot preview form parquet, but you can right click and create a Query to read from it instead:

Bear in mind that the query above will only visualize elements of the single parquet file. In order to retrieve all the data, the folder hierarchy has to be specified instead:

SELECT
    COUNT(*)
FROM
    OPENROWSET(
        BULK 'https://account.dfs.core.windows.net/container/onpecombinadoparquet/*.parquet',
        FORMAT='PARQUET'
    ) AS [result]

Query above resembles exactly what the external table ext.onpecombinadoparquet will visualize.

The learnings

A few steps are required to land the resultset into the Data Lake Store, and to be leveraged by any query, downloaded or used as a source for Analytics/dashboard products like PowerBI. The key elements are data source, file format (as parquet) and the external table.