Making Assumptions with Peruvian Presidential Voting Data in Azure Synapse Serverless

Last week I published a post that shows users how to load Peruvian voting data into Azure Synapse Serverless.

This post will allow us to make assumptions in order to visualize first-round votes aligned with either the left or the right, and compare them to the second round, where Pedro Castillo (Perú Libre) represents the left and Keiko Fujimori (Fuerza Popular) represents the right.
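As a rough sketch of how that classification could look in Synapse Serverless T-SQL (the storage URL, the column names, and the party-to-side mapping below are all assumptions for illustration, not the actual dataset schema), the idea is to map each first-round party to a side and aggregate the votes:

-- Hypothetical layout: first-round results stored as Parquet files in a data lake container.
-- The path, the [PARTIDO] / [VOTOS] columns, and the party mapping are placeholders.
SELECT
    [Alignment]
   ,SUM([VOTOS]) AS [Votes]
FROM (
    SELECT
        CASE
            WHEN [PARTIDO] = N'PARTIDO POLITICO NACIONAL PERU LIBRE' THEN 'Left'
            WHEN [PARTIDO] = N'FUERZA POPULAR'                       THEN 'Right'
            ELSE 'Other' -- every remaining party needs its own (assumed) Left/Right call
        END AS [Alignment]
       ,[VOTOS]
    FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/voting/first_round/*.parquet',
        FORMAT = 'PARQUET'
    ) AS [FirstRound]
) AS [Classified]
GROUP BY [Alignment]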


Peruvian Presidential Voting Data with Azure Synapse Serverless

It has been a controversial election for Peruvians, with the second round of voting between Pedro Castillo (far left) and Keiko Fujimori (far right). There have been allegations of fraud, though no tangible evidence has been provided yet.


Windowed Functions empowering analytics [#TSQL2sday]

T-SQL Tuesday #16

This blog entry is participating in the T-SQL Tuesday #16 for the month of February, hosted by Jes Schultz Borland (Blog|Twitter).

Back in 2001 I was involved in a project migrating an e-commerce site to a new platform. We were mapping and moving data from an Oracle 8.1.5 to an Oracle 8.1.7 instance. The framework was the beloved Dynamo running on Sun Solaris. The Dynamo engineers decided to use sequencers on each entity (e.g. product, category, etc.). I was trying to figure out the best reusable PL/SQL block to move the data while creating the correct sequences, until someone pointed out to me that Oracle 8.1.6 had introduced Analytical Functions. After reading and practicing, I discovered the power of those functions.

For the purpose of this blog I will refer to Analytical Functions as Windowed Functions instead. They are very useful, as they allow the user to compute measures over subsets of data while keeping the “raw” detail-level data. For example, they allow us to get the average price of a product category while also displaying the unit price of each product. They can also help us rank a product by its price and “window” it by category.

Examples

Basic Aggregation:

-- Qty of products and average price
SELECT
    COUNT([ProductKey]) AS [Quantity]
   ,AVG(StandardCost) AS [AverageCost]
FROM [dbo].[DimProduct]

Distribution of Data:

-- Tier by Product Cost (4 tiers)
SELECT
    NTILE(4) OVER(ORDER BY [StandardCost]) AS [Tier]
   ,[ProductKey]
   ,StandardCost
FROM [dbo].[DimProduct]

Average per Subcategory with raw data:

-- Take the Average for each Product Subcategory and attach it to the detail-level data
SELECT
    AVG(StandardCost) OVER(PARTITION BY [ProductSubcategoryKey]) AS [AverageCostSubPerCat]
   ,[ProductKey]
   ,[ProductSubcategoryKey]
   ,StandardCost
FROM [dbo].[DimProduct]

Ranking:

-- Rank Products based on the minimum and maximum cost per Product Subcategory
SELECT [ProductSubcategoryKey]
   ,RANK() OVER(PARTITION BY [ProductSubcategoryKey] ORDER BY StandardCost, [ProductKey]) AS sequence_min
   ,RANK() OVER(PARTITION BY [ProductSubcategoryKey] ORDER BY StandardCost DESC, [ProductKey] DESC) AS sequence_max
   ,[ProductKey]
   ,StandardCost
FROM [dbo].[DimProduct]
WHERE StandardCost IS NOT NULL

Averages excluding certain ranks:

-- Get the Average and Qty. of the products per category excluding the minimum and maximum cost per Product Subcategory (above)
WITH CTE_sequence AS (
  SELECT [ProductSubcategoryKey]
     ,RANK() OVER(PARTITION BY [ProductSubcategoryKey] ORDER BY StandardCost, [ProductKey]) AS sequence_min
     ,RANK() OVER(PARTITION BY [ProductSubcategoryKey] ORDER BY StandardCost DESC, [ProductKey] DESC) AS sequence_max
     ,[ProductKey]
     ,StandardCost
  FROM [dbo].[DimProduct]
  WHERE StandardCost IS NOT NULL
)
SELECT [ProductSubcategoryKey]
  ,COUNT([ProductKey]) AS Quantity
  ,AVG(StandardCost) AS Average_Cost
FROM CTE_sequence
WHERE sequence_min > 1
  AND sequence_max > 1
GROUP BY [ProductSubcategoryKey]

There are more functions that can be used, and the user even has the option to dice the data further with HAVING and to slice it with GROUP BY CUBE | ROLLUP, which presents the data in a “pivotable” format.
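As a quick sketch of that last point (reusing the same [dbo].[DimProduct] table from the examples above), GROUP BY ROLLUP adds a grand-total row on top of the regular groups, and HAVING lets us dice the groups further:

-- Average cost per Product Subcategory, plus a grand-total row added by ROLLUP (shown with a NULL key)
SELECT
    [ProductSubcategoryKey]
   ,COUNT([ProductKey]) AS [Quantity]
   ,AVG(StandardCost) AS [AverageCost]
FROM [dbo].[DimProduct]
WHERE StandardCost IS NOT NULL
GROUP BY ROLLUP([ProductSubcategoryKey])
HAVING COUNT([ProductKey]) > 1 -- dice further: keep only groups with more than one product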

Allow me to sleep and focus on number crunching [#TSQL2sday]

T-SQL Tuesday #014

It’s 2011 already and the first T-SQL Tuesday for the year has arrived.

This time the event is hosted by Jen McCown (Blog|Twitter), who just received an MVP nomination (congratulations Jen).

I have come up with a list that I want covered by my team this year, but I have two main goals: allowing me to sleep properly and investing time in number crunching.

What does “allow me to sleep” mean?

Last year we started our project to migrate all production database servers to High Availability solutions. We shall continue this year and finalize it. By the same token, we will homogenize our environment so every single business unit runs on SQL Server 2008 R2. We spent real time working around processes that cannot run in SQL Server 2005 (e.g. change data capture), and I need to make sure that does not happen again.
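For context, change data capture only exists in SQL Server 2008 and later (and requires an Enterprise-class edition), which is exactly the kind of feature gap we ran into. A minimal sketch of enabling it, with placeholder database and table names:

-- Change data capture is a SQL Server 2008+ feature; it is not available on SQL Server 2005.
-- [MyDatabase] and dbo.Orders are placeholders.
USE [MyDatabase];
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo'
    ,@source_name   = N'Orders'
    ,@role_name     = NULL;  -- NULL = do not gate access to the change data behind a role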

On the storage side, there are two projects running in parallel. One is to allow growth on a scalable solution that is already in place (so we stop worrying about hitting volume limits), and the second is to reduce our data file footprint. The first project is trivial, as we can hook up additional trays and expand volumes. The second is more complex, as we need to identify tables that hold historical data, partition them (if applicable), phase out old data, and archive it without impacting production applications.
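A minimal sketch of the phase-out idea, assuming a hypothetical [dbo].[SalesHistory] table partitioned by year (all object names and boundary dates here are placeholders):

-- Yearly partitions let us switch an old year out as a metadata-only operation.
CREATE PARTITION FUNCTION [pfYearly] (DATE)
    AS RANGE RIGHT FOR VALUES ('2009-01-01', '2010-01-01', '2011-01-01');

CREATE PARTITION SCHEME [psYearly]
    AS PARTITION [pfYearly] ALL TO ([PRIMARY]);

-- Assumes [dbo].[SalesHistory] was created on [psYearly]. The archive table must be empty,
-- have an identical structure, and live on the same filegroup as the partition being switched out.
ALTER TABLE [dbo].[SalesHistory]
    SWITCH PARTITION 1 TO [dbo].[SalesHistoryArchive];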

On the monitoring side, we are creating a new project to decide whether we shall stick with Management Data Warehouse or get a 3rd-party tool to give us a bird's-eye view of our systems at the database level, define KPIs, and report based on them. We already have proper hardware and service-level monitoring in place.

What does “focus on number crunching” mean?

This has 2 phases.

The first one is to work on a Master Metadata repository project and figure out the relationships between different business entities, then work with Business Owners and Data Analysts to figure out what we should report on, how far back we should go, and what KPIs should be defined. This will allow us to continue building our Data Warehouse, extending ETLs, and letting business unit A correlate its data with business unit B.

This project is also tied to the data file footprint reduction, as we will be able to roll up information that will be available in our DSS without needing all the transactional data.

The second phase is analytics. This can be tied with data mining and statistics and will definitely involve some of the information we have stored in our cubes. We are still figuring this one out.

That is it for me on the technical side. The managerial and personal ones? I will write about them in future blog posts.

What are your resolutions for 2011?

What’s happening in 2011

2010 came by pretty quickly. It has been a very busy year with all kinds of expansions, migrations and consolidations. We installed HP Blade Servers, VMware, SQL Server 2008 R2, Windows Server 2008 R2, CentOS 4.2 X64, F5s, Dell R710s/610s, Dell Equallogic Storage, TFS 2010, Windows 7 X64 for all DBAs and BI developers, etc. The list continues.

So what is happening next year? Many exciting things, with a core focus on HA, DR, Analytics, SharePoint 2010, and VMware on production databases.

On the Database side:

  • Homogeneous environment. All instances running under SQL Server 2008 R2 (some edition downgrades in the mix)
  • Isolated Development, QA, and UAT environments with the proper editions (Developer, Standard, Enterprise). VMware and MSDN come into play.
  • Management Data Warehouse deployment to collect health information from all production instances. Bird's-eye view of all instances in one location.
  • Central Management Servers Implementation
  • Policy Based Management Implementation
  • Data-Tier Application Implementation
  • RML utilities for pre-production load testing
  • PowerShell

On the Analytics side:

  • All legacy cubes migrated to SQL Server Analysis Services 2008 R2
  • Deployment of Star Schemas for the remaining business units
  • Dedicated instance for transformations (SSIS)
  • Fire up the 3rd and 4th VMs for our scaled-out deployment of Reporting Services
  • Mastering DAX. PowerPivot for Excel 2010
  • GUI implementation for browsing our cubes. We are in the process of evaluating products.

High Availability & Disaster Recovery:

  • All instances to be moved to HA (2-, 3-, and 4-node clusters). No more single points of failure.
  • Zero-Downtime database migration with Mirroring
  • Block Level Backup and restore offsite
  • Full, log, and daily differential backups across the board (see the sketch after this list)
  • Backup file test by restoring to our UAT environments
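A minimal sketch of that backup schedule, with a placeholder database name and paths:

-- Weekly full backup
BACKUP DATABASE [SalesDB] TO DISK = N'X:\Backups\SalesDB_full.bak' WITH INIT, CHECKSUM;

-- Daily differential backup (everything changed since the last full)
BACKUP DATABASE [SalesDB] TO DISK = N'X:\Backups\SalesDB_diff.bak' WITH DIFFERENTIAL, INIT, CHECKSUM;

-- Frequent log backups (the database must be in the FULL recovery model)
BACKUP LOG [SalesDB] TO DISK = N'X:\Backups\SalesDB_log.trn' WITH INIT, CHECKSUM;

-- Quick integrity check before the real restore test in UAT
RESTORE VERIFYONLY FROM DISK = N'X:\Backups\SalesDB_full.bak' WITH CHECKSUM;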

Storage:

  • Footprint Reduction by archiving historical data that can be rolled up
  • Storage Expansion to benefit from the spindle count
  • SSD for the most demanding applications (over iSCSI)

Training:

  • I am encouraging my team members to use Twitter, to follow knowledgeable folks, and to read tweets from the #sqlserver and #sqlhelp hashtags.
  • To Read Blog Posts and subscribe to RSS feeds
  • To attend SQL Saturdays
  • To buy books (and e-books)
  • To get trained online and, if necessary, bring someone in to train us.
  • To attend the SQL Rally 2011 or SQL Pass Summit 2011
  • To read about SQL Server Denali and install it on a VM to test.

I see 2011 being an excellent year for us.

Happy Holidays everyone.
