Create External Tables on Azure Synapse Analytics

As we continue working with Azure Synapse Analytics on Peruvian voting data, as I have posted before, we are extending to creating an external table from the ground up, that points to existing data in the Data Lake Store.

This post will show how to create an external table that will point to a folder hierarchy that contains data in plain text (CSV format). All files inside the folder will be considered for this example.

There are 2 steps to be accomplished:

  1. Create an external table with defined datatypes
  2. Create statistics for attributes that will be used for querying (suggested)

Review Data in the Data Lake Store

This step can be accomplished within the Azure Synapse Workspace.

More information can be found in the Microsoft Documentation Link.

All files inside the folder
Preview the data

Creating the External Table with defined datatypes

The next step have to be accomplished if you are not continuing from our prior blog posts

CREATE EXTERNAL DATA SOURCE TestData
WITH ( LOCATION = 'https://account.dfs.core.windows.net/container/');

CREATE EXTERNAL FILE FORMAT txt  
WITH (  
    FORMAT_TYPE = DELIMITEDTEXT, 
    FORMAT_OPTIONS ( 
            FIELD_TERMINATOR = '|',
            STRING_DELIMITER = '"',
             )
    ); 

-- Create Schema if needed'
CREATE SCHEMA ext;

External Table Creation:

CREATE EXTERNAL TABLE [ext].[onpecombinado_defined]
(
	[DEPARTAMENTO] [varchar](100),
	[PROVINCIA] [varchar](100),
	[DISTRITO] [varchar](100),
	[MESA_DE_VOTACION] [integer],
	[DESCRIP_ESTADO_ACTA_1ra] [varchar](100),
	[DESCRIP_ESTADO_ACTA_2da] [varchar](100),
	[N_CVAS_1ra] [smallint],
	[N_CVAS_2da] [smallint],
	[VOTOS_P1] [smallint],
	[VOTOS_P2] [smallint],
	[VOTOS_P3] [smallint],
	[VOTOS_P4] [smallint],
	[VOTOS_P5] [smallint],
	[VOTOS_P6] [smallint],
	[VOTOS_P7] [smallint],
	[VOTOS_P8] [smallint],
	[VOTOS_P9] [smallint],
	[VOTOS_P10] [smallint],
	[VOTOS_P12] [smallint],
	[VOTOS_P13] [smallint],
	[VOTOS_P14] [smallint],
	[VOTOS_P15] [smallint],
	[VOTOS_P17] [smallint],
	[VOTOS_P18] [smallint],
	[FP_1ra] [smallint],
	[PL_1ra] [smallint],
	[VOTOS_VB_1ra] [smallint],
	[VOTOS_VN_1ra] [smallint],
	[FP_2da] [smallint],
	[PL_2da] [smallint],
	[VOTOS_VB_2da] [smallint],
	[VOTOS_VN_2da] [smallint],
	[VOTOS_IZQ_1ra] [smallint],
	[VOTOS_DER_1ra] [smallint]
)
WITH (DATA_SOURCE = [TestData],LOCATION = N'onpecombinado/',FILE_FORMAT = [txt])
GO

Create Statistics

More information can be found in the Microsoft Documentation link.

CREATE STATISTICS onpecombinado_defined_DEPARTAMENTO 
    on [ext].[onpecombinado_defined] (DEPARTAMENTO) WITH FULLSCAN, NORECOMPUTE;
CREATE STATISTICS onpecombinado_defined_PROVINCIA 
    on [ext].[onpecombinado_defined] (PROVINCIA) WITH FULLSCAN, NORECOMPUTE;
CREATE STATISTICS onpecombinado_defined_DISTRITO 
    on [ext].[onpecombinado_defined] (DISTRITO) WITH FULLSCAN, NORECOMPUTE;
CREATE STATISTICS onpecombinado_defined_MESA_DE_VOTACION 
    on [ext].[onpecombinado_defined] (MESA_DE_VOTACION) WITH FULLSCAN, NORECOMPUTE;
CREATE STATISTICS onpecombinado_defined_DESCRIP_ESTADO_ACTA_1ra 
    on [ext].[onpecombinado_defined] (DESCRIP_ESTADO_ACTA_1ra) WITH FULLSCAN, NORECOMPUTE;
CREATE STATISTICS onpecombinado_defined_DESCRIP_ESTADO_ACTA_2da 
    on [ext].[onpecombinado_defined] (DESCRIP_ESTADO_ACTA_2da) WITH FULLSCAN, NORECOMPUTE;

Review Objects on Workspace

We should see the external table with the defined data types on left pane.

Query the Data

SELECT 
    PROVINCIA, 
    DISTRITO, 
    SUM(FP_2da) AS Total_FP_2da,
    SUM(PL_2da) AS Total_PL_2da,
    SUM(VOTOS_DER_1ra) AS VOTOS_DER_1ra,
    SUM(VOTOS_IZQ_1ra) AS Total_VOTOS_IZQ_1ra,
    CAST(SUM(FP_2da) / SUM(CAST(FP_2da AS DECIMAL) + PL_2da) * 100 AS DECIMAL (5,2)) AS FP_Ratio_2da, -- Casting to force decimals
    CAST(SUM(VOTOS_DER_1ra) / SUM(CAST(VOTOS_DER_1ra AS DECIMAL) + VOTOS_IZQ_1ra) * 100 AS DECIMAL (5,2)) AS FP_Der_1ra -- Casting to force decimals
FROM [ext].[onpecombinado_defined]
WHERE DEPARTAMENTO = 'LIMA'
GROUP BY PROVINCIA, DISTRITO
ORDER BY 7 DESC;

Should provide an output like below

The Chart

If we switch to Chart, and consider DISTRITO as Category column and add FP_Ratio_2da, and DP_Der_1ra as Legend Columns, to visualize differences of votes in Lima, for Keiko Fujimori on 2nd round, vs. Assumed Right votes on 1st round, ratio-wise.

The learnings

A few steps are required to point to existing data in the Data Lake Store, with the suggestion of creation of statistics to help the optimizer prune data as needed.