Let your table have the same structure as your JSON

Use Arrays and Tuples to keep nested data nested

I love dbt Core and how sources enable me to test the freshness of hopefully recently ingested data by running

dbt source freshness

to test that my ingestion pipelines are delivering. However, when running dbt Core the output is just a JSON file placed in the target folder at the root level of your project. Since I’m running my dbt jobs with GitHub Actions that file is lost as soon as the job is finished.

Unless I HTTP-post the JSON data to ClickHouse, that is…

So I’ll start by creating a table that will hold parts of the sources.json data that I want to analyze and take action on as they happen. The JSON file has a structure looking something like this:

{
  "metadata": {
    "dbt_schema_version": "https://schemas.getdbt.com/dbt/sources/v3.json",
    "dbt_version": "1.4.6",
    "generated_at": "2023-06-27T08:31:30.583005Z",
    ....
  },
  "results": [
    {
      "unique_id": "arbitrary.model.name",
      "max_loaded_at": "2023-03-31T11:36:29.701000+00:00",
      "max_loaded_at_time_ago_in_s": 7599300.299,
      "status": "error",
      ...
    },
    ...
  ],
  ...
}

I’d like to recreate this structure in a table in ClickHouse so that I can do an HTTP-post using cURL with the contents from sources.json in my payload, with zero transformations. I will create a column for metadata that will be a named tuple and a column for result that will be an array of named tuples. And finish off with a timestamp for ingestion time. But first I need to shut off a ClickHouse-feature that flattens tuples. The docs only mention this for the Nested data type but this is also necessary for the Tuple data type.

SET flatten_nested=0; 

CREATE TABLE ingest.dbt_source_freshness (
     metadata Tuple(dbt_schema_version String, dbt_version String, generated_at String),
     results Array(Tuple(unique_id String, status String, max_loaded_at_time_ago_in_s Float64)),
     inserted_utc DateTime64(3) DEFAULT now64(3, 'UTC')
)
ENGINE = MergeTree()
ORDER BY (inserted_utc,)
PARTITION BY toYYYYMM(inserted_utc);

Now I’m able to post the results of my dbt freshness test directly into this table by a cURL call similar to this

curl -X POST -H "Content-Type: application/json" -H "X-ClickHouse-User: username" -H "X-ClickHouse-Key: password" -d @./target/sources.json https://my.clickhouse.server:8443/?query=INSERT%20INTO%20ingest.dbt_source_freshness%20FORMAT%20JSONEachRow

Did you find this article valuable?

Support Jesper Bagge by becoming a sponsor. Any amount is appreciated!