Continuing our work with Azure Synapse Analytics on Peruvian voting data from previous posts, this time we will create an external table from the ground up that points to existing data in the Data Lake Store.
This post shows how to create an external table that points to a folder hierarchy containing plain-text data (CSV format). All files inside the folder are considered in this example.
There are two steps to accomplish:
- Create an external table with defined data types
- Create statistics for attributes that will be used in queries (recommended)
Review Data in the Data Lake Store
This step can be accomplished within the Azure Synapse Workspace.
More information can be found in the Microsoft documentation.
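Before committing to a table definition, it can be useful to preview the raw files directly from the serverless SQL pool with OPENROWSET. This is a minimal sketch, assuming the same storage account, container, and folder used in the table definition below; adjust the path to your environment.

```sql
-- Preview the first rows of the delimited files before defining a schema
-- (account, container, and folder names are assumptions; replace with yours)
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://account.dfs.core.windows.net/container/onpecombinado/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    FIELDTERMINATOR = '|',
    FIELDQUOTE = '"'
) AS preview;
```

Previewing first helps confirm the field terminator and spot columns whose data types need widening before the external table is created.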
Creating the External Table with defined datatypes
The next steps only need to be completed if you are not continuing from our prior blog posts:
CREATE EXTERNAL DATA SOURCE TestData
WITH ( LOCATION = 'https://account.dfs.core.windows.net/container/');
CREATE EXTERNAL FILE FORMAT txt
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = '|',
STRING_DELIMITER = '"'
)
);
-- Create schema if needed
CREATE SCHEMA ext;
External Table Creation:
CREATE EXTERNAL TABLE [ext].[onpecombinado_defined]
(
[DEPARTAMENTO] [varchar](100),
[PROVINCIA] [varchar](100),
[DISTRITO] [varchar](100),
[MESA_DE_VOTACION] [int],
[DESCRIP_ESTADO_ACTA_1ra] [varchar](100),
[DESCRIP_ESTADO_ACTA_2da] [varchar](100),
[N_CVAS_1ra] [smallint],
[N_CVAS_2da] [smallint],
[VOTOS_P1] [smallint],
[VOTOS_P2] [smallint],
[VOTOS_P3] [smallint],
[VOTOS_P4] [smallint],
[VOTOS_P5] [smallint],
[VOTOS_P6] [smallint],
[VOTOS_P7] [smallint],
[VOTOS_P8] [smallint],
[VOTOS_P9] [smallint],
[VOTOS_P10] [smallint],
[VOTOS_P12] [smallint],
[VOTOS_P13] [smallint],
[VOTOS_P14] [smallint],
[VOTOS_P15] [smallint],
[VOTOS_P17] [smallint],
[VOTOS_P18] [smallint],
[FP_1ra] [smallint],
[PL_1ra] [smallint],
[VOTOS_VB_1ra] [smallint],
[VOTOS_VN_1ra] [smallint],
[FP_2da] [smallint],
[PL_2da] [smallint],
[VOTOS_VB_2da] [smallint],
[VOTOS_VN_2da] [smallint],
[VOTOS_IZQ_1ra] [smallint],
[VOTOS_DER_1ra] [smallint]
)
WITH (
    DATA_SOURCE = [TestData],
    LOCATION = N'onpecombinado/',
    FILE_FORMAT = [txt]
)
GO
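Once the table exists, a quick sanity check confirms that the files in the folder are readable with the defined data types:

```sql
-- Verify the external table can read the underlying files
SELECT TOP 10 * FROM [ext].[onpecombinado_defined];

-- Confirm the row count across all files in the folder
SELECT COUNT(*) AS total_rows FROM [ext].[onpecombinado_defined];
```

If a data type is too narrow for the underlying data, these queries will surface the conversion error early, before any downstream reporting is built.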
Create Statistics
More information can be found in the Microsoft documentation.
CREATE STATISTICS onpecombinado_defined_DEPARTAMENTO
on [ext].[onpecombinado_defined] (DEPARTAMENTO) WITH FULLSCAN, NORECOMPUTE;
CREATE STATISTICS onpecombinado_defined_PROVINCIA
on [ext].[onpecombinado_defined] (PROVINCIA) WITH FULLSCAN, NORECOMPUTE;
CREATE STATISTICS onpecombinado_defined_DISTRITO
on [ext].[onpecombinado_defined] (DISTRITO) WITH FULLSCAN, NORECOMPUTE;
CREATE STATISTICS onpecombinado_defined_MESA_DE_VOTACION
on [ext].[onpecombinado_defined] (MESA_DE_VOTACION) WITH FULLSCAN, NORECOMPUTE;
CREATE STATISTICS onpecombinado_defined_DESCRIP_ESTADO_ACTA_1ra
on [ext].[onpecombinado_defined] (DESCRIP_ESTADO_ACTA_1ra) WITH FULLSCAN, NORECOMPUTE;
CREATE STATISTICS onpecombinado_defined_DESCRIP_ESTADO_ACTA_2da
on [ext].[onpecombinado_defined] (DESCRIP_ESTADO_ACTA_2da) WITH FULLSCAN, NORECOMPUTE;
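To confirm the statistics were created, the catalog views can be queried. A quick check, assuming the table and statistics names used above:

```sql
-- List the statistics defined on the external table and their columns
SELECT s.name AS stats_name, c.name AS column_name
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
    ON s.object_id = sc.object_id AND s.stats_id = sc.stats_id
JOIN sys.columns AS c
    ON sc.object_id = c.object_id AND sc.column_id = c.column_id
WHERE s.object_id = OBJECT_ID('ext.onpecombinado_defined');
```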
Review Objects on Workspace
We should see the external table with the defined data types in the left pane.
Query the Data
SELECT
PROVINCIA,
DISTRITO,
SUM(FP_2da) AS Total_FP_2da,
SUM(PL_2da) AS Total_PL_2da,
SUM(VOTOS_DER_1ra) AS VOTOS_DER_1ra,
SUM(VOTOS_IZQ_1ra) AS Total_VOTOS_IZQ_1ra,
CAST(SUM(FP_2da) / SUM(CAST(FP_2da AS DECIMAL) + PL_2da) * 100 AS DECIMAL (5,2)) AS FP_Ratio_2da, -- Casting to force decimals
CAST(SUM(VOTOS_DER_1ra) / SUM(CAST(VOTOS_DER_1ra AS DECIMAL) + VOTOS_IZQ_1ra) * 100 AS DECIMAL (5,2)) AS FP_Der_1ra -- Casting to force decimals
FROM [ext].[onpecombinado_defined]
WHERE DEPARTAMENTO = 'LIMA'
GROUP BY PROVINCIA, DISTRITO
ORDER BY 7 DESC;
This should produce output similar to the one below.
The Chart
If we switch to Chart, select DISTRITO as the Category column, and add FP_Ratio_2da and FP_Der_1ra as Legend columns, we can visualize, ratio-wise, the differences in Lima between votes for Keiko Fujimori in the 2nd round and assumed right-wing votes in the 1st round.
The learnings
Only a few steps are required to point to existing data in the Data Lake Store; creating statistics on the attributes used for querying is recommended to help the optimizer prune data as needed.