Pivoting reports into tables

Pivoting data in ClickHouse is fun and everyone should do it

A very long time ago my mentor at the time took a peek at a spreadsheet full of data I was tasked to read into QlikView. "That data is in a reporting format.", he said hinting at the fact that there was one column for each year of metrics. "You will have to transpose those columns into rows so you get the year in one column and the metric in another, like transactions."

In QlikView this was a fairly common practice at the time since many data sources came in spreadsheet format. There was a function named CrossTable that took care of that effortlessly.

In ClickHouse the same can be achieved by grouping the year columns into an array and then doing an ARRAY JOIN which will produce a new table by iterating through the array creating a separate record for each item.

Let's pretend that we have a copy of The Economist's Democracy Index list by country ingested into ClickHouse. It would look something like this when queried:

RegionCountryRegime type202220212020
North AmericaCanadaFull democracy8.888.879.24
North AmericaUnited StatesFlawed democracy7.857.857.92
Western EuropeAustriaFull democracy8.28.078.16

To get this table into a transactional state where we can perform aggregation functions like Min(), Max() and Avg() we can create a dbt model like this:

{{
    config(
        materialized='table', 
        engine='MergeTree()', 
        order_by='(country, year)'
    )
}}

WITH pivoted AS (

    SELECT
        Region AS region,
        Country AS country,
        `Regime type` AS regime_type,
        year,
        democracy_index
    FROM ingest.democracy_index_by_country_2022
    LEFT ARRAY JOIN
        splitByString(',', '2022,2021,2020') AS year,
        [`2022`,`2021`,`2020`] AS democracy_index

), final AS (

    SELECT
        toLowCardinality(region) AS region,
        toLowCardinality(assumeNotNull(country)) AS country,
        toLowCardinality(regime_type) AS regime_type,
        toUInt16(year) AS year,
        round(democracy_index, 2) as democracy_index
    FROM pivoted

)
SELECT * FROM final

There is a lot of stuff going on in here! The double curly brackets at the top tell dbt that I want the result of this SQL query to be materialized as a new table in the database. Where and how this table is created is defined by how we've set up our dbt project.

I always recommend typing out SQL in a CTE (Common Table Expression) format. It is so much easier to read by both myself a week from now and by my colleagues.

The real magic happens in the LEFT ARRAY JOIN clause. The first statement after is a trick to create an array of year names by splitting a string into individual values by the splitByString() function. In the second statement, we treat all our year columns (escaped by backtick since column names are only numbers) as an array. When LEFT ARRAY JOIN is executed the year and democracy_index columns are created by iterating through the two arrays.

The final bit is some best practices to get the data in a healthy shape before running analytic queries. Since sorting columns shouldn't contain NULL and my table will be sorted by country and year we apply assumeNotNull() to the country column.

Did you find this article valuable?

Support Jesper Bagge by becoming a sponsor. Any amount is appreciated!