Peruvian Presidential Voting Data with Azure Synapse Serverless

It has been a controversial voting scenario for Peruvians, on the second round of voting Between Pedro Castillo (far left) and Keiko Fujimori (far right). There have been allegations of fraud, though no tangible evidence has been provided yet.

ONPE, the National Office of Electoral Processes of Peru, has made the voting dataset public, for both 1st and 2nd rounds. That shows the level of transparency the entity is providing, to try mitigate disinformation.

I took the data and loaded into Azure Synapse Serverless, to get a few graphs. One in particular that has skeptics pointing to “curation” for the irregular curve of voting tables with low votes for candidate Keiko Fujimori.

Loading the Dataset into Azure Data Lake Store

This process is rather trivial, and requires an active access to your Azure resources to load data into the Data Lake Store. Steps can be performed by either using Azure Storage Explorer, or leveraging your Azure Synapse Analytics Workspace.

You can then opt to preview the data by right clicking on the file:

Selecting Data from the Data Lake Store

We just need to ensure that the proper delimiters and options are in place:

SELECT 
    TOP 100 * 
FROM
    OPENROWSET(
        BULK 'https://account.dfs.core.windows.net/container/ONPE/Resultados_1ra_vuelta_Version_PCM.csv',
        FORMAT = 'CSV', -- File Type
        PARSER_VERSION='2.0', -- Faster Parsing with CSV files
        FIRSTROW = 1, -- First Row starts here
        FIELDTERMINATOR = ';', -- Field Terminator
        FIELDQUOTE ='"', -- Field Quote
        HEADER_ROW = TRUE -- It has a header
    ) AS [result]
;

And the results like below:

Consolidate Data Creating a View

Make sure that you have create a new database under Synapse Serverless.

CREATE DATABASE Onpe;
ALTER DATABASE Onpe COLLATE Latin1_General_100_BIN2_UTF8;
USE Onpe;
CREATE VIEW ONPE AS 
SELECT
    COALESCE(r1.DEPARTAMENTO, r2.DEPARTAMENTO) AS DEPARTAMENTO,
    COALESCE(r1.PROVINCIA, r2.PROVINCIA) AS PROVINCIA,
    COALESCE(r1.DISTRITO, r2.DISTRITO) AS DISTRITO,
    COALESCE(r1.MESA_DE_VOTACION, r2.MESA_DE_VOTACION) AS MESA_DE_VOTACION,
    r1.DESCRIP_ESTADO_ACTA AS DESCRIP_ESTADO_ACTA_1ra,
    r2.DESCRIP_ESTADO_ACTA AS DESCRIP_ESTADO_ACTA_2da,
    r1.N_CVAS AS N_CVAS_1ra,
    r2.N_CVAS AS N_CVAS_2da,
    COALESCE(r1.VOTOS_P1, 0) AS VOTOS_P1,
    COALESCE(r1.VOTOS_P2, 0) AS VOTOS_P2,
    COALESCE(r1.VOTOS_P3, 0) AS VOTOS_P3,
    COALESCE(r1.VOTOS_P4, 0) AS VOTOS_P4,
    COALESCE(r1.VOTOS_P5, 0) AS VOTOS_P5,
    COALESCE(r1.VOTOS_P6, 0) AS VOTOS_P6,
    COALESCE(r1.VOTOS_P7, 0) AS VOTOS_P7,
    COALESCE(r1.VOTOS_P8, 0) AS VOTOS_P8,
    COALESCE(r1.VOTOS_P9, 0) AS VOTOS_P9,
    COALESCE(r1.VOTOS_P10, 0) AS VOTOS_P10,
    COALESCE(r1.VOTOS_P12, 0) AS VOTOS_P12,
    COALESCE(r1.VOTOS_P13, 0) AS VOTOS_P13,
    COALESCE(r1.VOTOS_P14, 0) AS VOTOS_P14,
    COALESCE(r1.VOTOS_P15, 0) AS VOTOS_P15,
    COALESCE(r1.VOTOS_P17, 0) AS VOTOS_P17,
    COALESCE(r1.VOTOS_P18, 0) AS VOTOS_P18,
    COALESCE(r1.VOTOS_P11, 0) AS FP_1ra,
    COALESCE(r1.VOTOS_P16, 0) AS PL_1ra,
    COALESCE(r1.VOTOS_VB, 0) AS VOTOS_VB_1ra,
    COALESCE(r1.VOTOS_VN, 0) AS VOTOS_VN_1ra,
    COALESCE(r2.VOTOS_P2, 0) AS FP_2da,
    COALESCE(r2.VOTOS_P1, 0) AS PL_2da,
    COALESCE(r2.VOTOS_VB, 0) AS VOTOS_VB_2da,
    COALESCE(r2.VOTOS_VN, 0) AS VOTOS_VN_2da
FROM
    OPENROWSET(
        BULK 'https://account.dfs.core.windows.net/container/ONPE/Resultados_2da_vuelta_Version_PCM.csv',
        FORMAT = 'CSV',
        PARSER_VERSION='2.0',
        FIRSTROW = 1,
        FIELDTERMINATOR = ';',
        FIELDQUOTE ='"',
        HEADER_ROW = TRUE
    ) AS [r2]
FULL OUTER JOIN
    OPENROWSET(
        BULK 'https://account.dfs.core.windows.net/container/ONPE/Resultados_1ra_vuelta_Version_PCM.csv',
        FORMAT = 'CSV',
        PARSER_VERSION='2.0',
        FIRSTROW = 1,
        FIELDTERMINATOR = ';',
        FIELDQUOTE ='"',
        HEADER_ROW = TRUE
    ) AS [r1] ON r2.MESA_DE_VOTACION = r1.MESA_DE_VOTACION
;

Select a few rows from the view:

SELECT TOP 100 * FROM ONPE;

Recreate the curve of votes for each candidate

This is the graph that folks are pointing as anomaly for candidate Keiko Fujimori. They expect a bell curve for both candidates, and are instead getting an irregular shape for Fujimori.

-- Curva de Cantidad de mesas por numero de votos por ambos candidatos
WITH CTE_FP AS (
    SELECT
        FP_2da AS NUMERO_DE_VOTOS,
        COUNT(MESA_DE_VOTACION) AS CANTIDAD_DEMESAS
    FROM ONPE
    WHERE DESCRIP_ESTADO_ACTA_2da = 'CONTABILIZADA'
    GROUP BY FP_2da
), CTE_PL AS (
    SELECT
        PL_2da AS NUMERO_DE_VOTOS,
        COUNT(MESA_DE_VOTACION) AS CANTIDAD_DEMESAS
    FROM ONPE
    WHERE DESCRIP_ESTADO_ACTA_2da = 'CONTABILIZADA'
    GROUP BY PL_2da
)
SELECT 
    COALESCE(FP.NUMERO_DE_VOTOS, PL.NUMERO_DE_VOTOS) AS NUMERO_DE_VOTOS,
    COALESCE(FP.CANTIDAD_DEMESAS, 0) AS CANTIDAD_DEMESAS_FP,
    COALESCE(PL.CANTIDAD_DEMESAS, 0) AS CANTIDAD_DEMESAS_PL
FROM CTE_FP AS FP
    FULL OUTER JOIN CTE_PL PL ON (FP.NUMERO_DE_VOTOS = PL.NUMERO_DE_VOTOS)
ORDER BY COALESCE(FP.NUMERO_DE_VOTOS, PL.NUMERO_DE_VOTOS);

If you switch to chart, then select “Area” under Chart Type, and exclude “NUMERO_DE_VOTOS” under Legend, you end up with the graph below:

The learnings

2 main things we have accomplished. The first one, on very easy steps, how to load and query data with Azure Synapse Serverless. The Second one, we can plot graphs based on query result sets that interact with data in the Data Lake Store.

Do I see an anomaly? Yes. Should it indicate fraud? Not necessarily. Will let the investigations continue.