Creating External Table AS (CETAS) in Parquet format on Azure Synapse Analytics

This is an extension of a prior post that shows how to create an external table in CSV format with Azure Synapse Analytics. Most of the steps equivalent, with the exception of the file format.

This post will show how to create an external table in Parquet format using Azure Synapse Analytics. The creation of an external table will place data inside a folder in the Data Lake Store, that has to be specified. The format can be plain text or parquet. Additional information can be found in the Microsoft documentation link.

Continue reading

Creating External Table AS (CETAS) on Azure Synapse Analytics

This is a continuation from a prior post that allows users leverage Peruvian voting data with Azure Synapse Serverless.

This post will show how to create an external table using Azure Synapse Analytics. The creation of an external table will place data inside a folder in the Data Lake Store, that has to be specified. The format can be plain text or parquet. Additional information can be found in the Microsoft documentation link.

Continue reading

Taking Assumptions with Peruvian Presidential Voting Data with Azure Synapse Serverless

Last week I published a post that allows users use Peruvian voting data to Azure Synapse Serverless.

This post will allow us take assumptions to visualize votes on the 1st round that aligned to either left or right, and compare them to the 2nd round, where Pedro Castillo (Perú Libre) represents the left, and Keiko Fujimori (Fuerza Popular) represents the right.

Continue reading

Peruvian Presidential Voting Data with Azure Synapse Serverless

It has been a controversial voting scenario for Peruvians, on the second round of voting Between Pedro Castillo (far left) and Keiko Fujimori (far right). There have been allegations of fraud, though no tangible evidence has been provided yet.

Continue reading

Windowed Functions empowering analytics [#TSQL2sday]

T-SQL Tuesday #16

T-SQL Tuesday #16

This blog entry is participating in the T-SQL Tuesday #16 for the month of February, hosted by Jes Schultz Borland (Blog|Twitter).

Back in 2001 I was involved in a project migrating an e-commerce site to a new platform. We were mapping and moving data from an Oracle 8.1.5 to an Oracle 8.1.7 instance. The framework was the beloved Dynamo running on Sun Solaris. The Dynamo engineers decided to use sequencers on each entity (e.g. product, category, etc). I was trying to figure out the best reusable PL/SQL block to move the data creating the correct sequence until I was pointed out that Oracle 8.1.6 supported Analytical Functions. After reading and practicing I discovered the power of those functions.

For the purpose of this blog I will refer the Analytical Functions as Windowed Functions instead. They are very useful as it allows the user to crunch measures of subsets of data maintaining the “raw” detail level data. For example, it allows us to get the average price of a product category while also displaying the unit price of each product. It can also help us rank a product by its price and “window” it by category.

Examples

Basic Aggregation:

1
2
3
4
5
-- Qty of products and average price
SELECT
    COUNT(productKey) AS [Quantity]
   ,AVG(StandardCost) AS [AverageCost]
FROM [dbo].[DimProduct]

Distribution of Data:

1
2
3
4
5
6
-- Tier by Product Cost (4 tiers)
SELECT
    NTILE(4) OVER(ORDER BY [StandardCost]) AS [Tier]
   ,productKey
   ,StandardCost
FROM [dbo].[DimProduct]

Average per Category with raw data:

1
2
3
4
5
6
7
-- Take the Average for each Product Subcategory and attach to the detail level data
SELECT
    AVG(StandardCost) OVER(PARTITION BY [ProductSubcategoryKey]) AS [AverageCostSubPerCat]
   ,productKey
   ,[ProductSubcategoryKey]
   ,StandardCost
FROM [dbo].[DimProduct]

Ranking:

1
2
3
4
5
6
7
8
-- Rank Products based on the minimum and maximum cost per Product Subcategory
SELECT [ProductSubcategoryKey]
   ,RANK() OVER(PARTITION BY [ProductSubcategoryKey] ORDER BY StandardCost, [ProductKey]) AS sequence_min
   ,RANK() OVER(PARTITION BY [ProductSubcategoryKey] ORDER BY StandardCost DESC, [ProductKey] DESC) AS sequence_max
   ,[ProductKey]
   ,StandardCost
FROM [dbo].[DimProduct]
WHERE StandardCost IS NOT NULL

Averages excluding certain ranks:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
-- Get the Average and Qty. of the products per category excluding the minimum and maximum cost per Product Subcategory (above)
WITH CTE_sequence AS (
  SELECT [ProductSubcategoryKey]
     ,RANK() OVER(PARTITION BY [ProductSubcategoryKey] ORDER BY StandardCost, [ProductKey]) AS sequence_min
     ,RANK() OVER(PARTITION BY [ProductSubcategoryKey] ORDER BY StandardCost DESC, [ProductKey] DESC) AS sequence_max
     ,[ProductKey]
     ,StandardCost
  FROM [dbo].[DimProduct]
  WHERE StandardCost IS NOT NULL
)
SELECT [ProductSubcategoryKey]
  ,COUNT([ProductKey]) AS Quantity
  ,AVG(StandardCost) AS Average_Cost
FROM CTE_sequence
WHERE sequence_min > 0
  AND sequence_max > 0
GROUP BY [ProductSubcategoryKey]

There are more functions that can be used and the user even has the option to dice the data further by using HAVING and slicing with GROUP BY CUBE | ROLLUP which plots the data in “pivotable” format.