Automagic schema inference killed the data engineer
Todays lesson in working with legacy output is: data contracts, data contracts and data contracts.
The great thing about shipping structured data as text files is it is so darn easy! Just store it as a CSV and boom: minimum overhead and just the right amount of structure to not risking to be called out a liar when saying that the data is structured.
But it does not come entirely without risk. There's zero context to this data so you're leaving the receiving end with a lot of guesswork. Does this column contain text? Integers or floats? Are there alphanumeric keys in this column aptly named 'id'? Perhaps I should parse everything as text just to be safe?
I spent a hefty portion of two working days, struggling to parse tab separated data from MySQL dumps. There were errors. ClickHouse couldn't parse these files because it expected tabs where there apparently was none. However, I could easily spot the tabs in the raw files and without difficulty read the same data into a pandas DataFrame. I even did a manual count just to make sure that the data wasn't dirty.
And then I spotted it! Not in the files, no. And not in the error message, no absolutely not! It was over on ClickHouses excellent documentation pages. There I found the setting:
input_format_tsv_use_best_effort_in_schema_inference=1
It might be fun, at first. Cool, some might say. Leaving the kitchen door wide open to ghouls, I would argue. In my case, a column that had its schema inferred as Int64
suddenly contained alphanumeric caracters.
The lesson I take with me from this experience is to keep telling the parser exactly what data types to expect. You configure this as a long string after telling the enginge what format to parse. Yes, the parameters for the s3 table engine will be a short novel, but at least the explicitly typed out columns and data types serves as a minimum data contract. Those lines of code in dbt, version controlled in all its glory will be the 1:st line of contracts, describing what type of data is expected in that pipeline.