Peek(ing) through windows

It's something everyone should do

Setting the stage

When analysing data you'll sooner or later end up with the task of calculating capacity utilization rates from some raw, murky machine log. Well, at least I have. On multiple occasions. The data usually comes in the form of an event log where each record holds a timestamp, a state that has been reached and, typically, which entity reached said state. Let's pretend that the states in question are start and stop, and that we need to figure out how long an entity remained in a state before a new state was reached. Since each record only carries the time at which its state began, this task is not entirely trivial.

Back in the day when the BI platform Qlik Sense was the bee's knees (and everyone thought it was a great tool to transform data in - yes, this was before we realised we needed observability, data contracts and data unit tests) the solution to this problem was spelled peek(). This was a function for inter-record processing: by sorting the data in descending order we could use peek() to read values from records that had already been loaded, like the timestamp on the previous record, from which we could then calculate the duration between two timestamps.

Enter the protagonists

So how do we solve this using the dynamic super duo dbt Core and ClickHouse? Well, I thought we could give it a try with window functions. These tools give us democratised, version-controlled data transformations, observability through the orchestration of our dbt models, and we'll be doing it all at light speed thanks to the powerhouse that is spelled ClickHouse. Although the observability part falls outside the scope of this article, I cannot stress enough the importance of keeping books on how your data warehouse is performing.

Add props

Before diving headfirst into data we need to have... well, data. ClickHouse lets us create tables that live entirely in memory via the Memory engine. Neat, right?! So let's start by creating an imaginary log table and filling it with some data. Two machines report when they start and stop running:

CREATE OR REPLACE TABLE machine_log (
    `machine` String,
    `start` DateTime32,
    `state` String
)
ENGINE = Memory;

INSERT INTO machine_log VALUES
    ('M1', '2024-07-01 12:31:44', 'Start'),
    ('M1', '2024-07-01 18:15:32', 'Stop'),
    ('M1', '2024-07-02 05:14:22', 'Start'),
    ('M1', '2024-07-02 10:33:01', 'Stop'),
    ('M2', '2024-07-01 12:51:04', 'Start'),
    ('M2', '2024-07-02 03:21:17', 'Stop'),
    ('M2', '2024-07-02 15:44:15', 'Start'),
    ('M2', '2024-07-02 22:43:29', 'Stop');
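
If you want to peek at what we just loaded before moving on, a plain ordered SELECT does the trick:

-- Quick sanity check: list the events per machine in chronological order
SELECT machine, start, state
FROM machine_log
ORDER BY machine, start;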

Windows are for peeking

Now that we have data to work with, it's time to let ClickHouse shine. By declaring a WINDOW named above_me - along with how to partition it, its sort order and its frame size - we get access to some special functions that operate on this window. The function lagInFrame() is exactly what we are after: in this context it does almost exactly what the old peek() function did in Qlik Sense. Because each partition is sorted by start in descending order, the "previous" row in the frame is actually the chronologically next event, which is exactly the timestamp we want to compare against. The difference is that in ClickHouse the PARTITION BY clause "resets" the window at each partition boundary, making our work easier since we don't have to keep track of whether we've reached data for a new machine. With peek() we needed extra control statements for that.

SELECT
    machine,
    start,
    lagInFrame(start) OVER above_me AS next_event,
    state
FROM machine_log
WINDOW above_me AS (
    PARTITION BY machine
    ORDER BY machine, start DESCENDING
    ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
)
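
A side note: if sorting the log backwards feels upside down, ClickHouse also ships leadInFrame(), which looks forwards instead of backwards. A roughly equivalent sketch, this time sorted ascending:

SELECT
    machine,
    start,
    -- leadInFrame() reads from the row after the current one,
    -- so the frame has to extend forwards instead of backwards
    leadInFrame(start) OVER (
        PARTITION BY machine
        ORDER BY start ASC
        ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING
    ) AS next_event,
    state
FROM machine_log;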

Now that we've placed an additional timestamp on each record - the next event within the same machine, or the default zero timestamp if we've reached the last event in the partition - we can use timestampDiff() to calculate the duration of the current state. To make sure the last state is also calculated correctly we replace that zero with now(), since we assume the machine has remained in this state ever since. We also subtract a second from the next_event value so that the intervals don't overlap.

The recommended way to write transformations in dbt is as a chain of CTEs. The reason is that SQL is a language many other humans will read, and CTEs are much easier to parse than arbitrarily nested SELECT statements. For humans, at least...

WITH
    raw AS (

        SELECT
            machine,
            start,
            lagInFrame(start) OVER above_me AS next_event,
            state
        FROM machine_log
        WINDOW above_me AS (
            PARTITION BY machine
            ORDER BY machine, start DESCENDING
            ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
        )

    ),
    duration AS (

        SELECT
            machine,
            start,
            -- the last event per machine gets now() as its end,
            -- and we shave off one second so intervals don't overlap
            if(
                next_event = 0, now(),
                next_event - 1
            ) AS `end`,
            timestampDiff(MINUTE, start, `end`) AS duration_minutes,
            state
        FROM raw

    )

SELECT * FROM duration;
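
For completeness: in a real dbt project the statement above would live in a model file, and the hard-coded table name would be swapped for a source reference so that dbt can track lineage. A minimal sketch with hypothetical source and file names:

-- models/staging/stg_machine_log.sql (hypothetical file name)
-- Wraps the raw table in a source() call; the duration model above
-- would then read FROM {{ ref('stg_machine_log') }} instead of machine_log.
SELECT
    machine,
    start,
    state
FROM {{ source('machines', 'machine_log') }}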

And the results are in!

Now that we have a duration for each state and each machine, the analytics part is a nice stroll in the park. This data could easily be visualised in any modern BI tool. My personal favourite is Apache Superset because it sits so nicely on top of ClickHouse. If you're a cloud person, dbt, ClickHouse and Preset all offer hosted versions. If you just want to quickly summarise the durations and move on to the next task, then this aggregation finisher might be just for you. Just replace the final SELECT statement in the model above:

SELECT
    machine,
    state,
    sum(duration_minutes) AS total_running_minutes
FROM duration
GROUP BY machine, state
ORDER BY machine, state;
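
And since this whole exercise started with capacity utilization rates: once the durations exist, the rate is simply running time divided by total tracked time. A rough sketch, assuming that the Start state is what counts as running, which could replace the final SELECT instead:

SELECT
    machine,
    -- share of tracked time spent in the Start state
    sumIf(duration_minutes, state = 'Start')
        / sum(duration_minutes) AS utilization_rate
FROM duration
GROUP BY machine
ORDER BY machine;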

Hands up if you didn't know data engineering could be this fun!
