A (ware)House Divided
A dimensional modeling technique we should bring back.
In the world of data analytics, it seems like dimensional modeling is always on trial. There are several talks (Coalesce 2020, Coalesce 2022) and articles debating whether it should still be the go-to process for organizing data for analytics.
One idea from the Data Warehouse Toolkit (DWT), the book that introduced dimensional modeling, that I see discussed less often is the back room/front room split. In my experience, it’s one of the more effective methodologies introduced in the DWT. In this article, I’ll briefly describe how the concept fits into data analytics today.
The Back Room and Front Room
The DWT describes separating your analytics system into two parts. The first is the back room where Extract-Transform-Load (ETL) happens, and the second is the front room where business intelligence happens.
The primary metaphor from the book is a restaurant. The back room is the restaurant’s kitchen, where food (data) is first prepared. This preparation means abstracting away the nuances of the source data and how it’s collected. It includes steps like doing lookups, renaming fields, doing complex calculations, applying defaults, dropping malformed data, adding keys, converting data types, and much more. These are done here once so that analysts won’t have to.
The front room is the dining area where the finished product is brought out to customers. In dimensional modeling, the format of that prepared data is facts and conformed dimensions. Essentially the front room provides a consistent interface to the data for analysts to access using dependable patterns.
This approach might look familiar to some because the back room/front room paradigm fits the ETL architecture exactly. After extraction, transformation happens in the back room, and the results are loaded into the front room. This setup contrasts starkly with Extract-Load-Transform (ELT), another popular way to build an analytics system today.
Where does it fit today?
In ELT, data is extracted from source systems, loaded into a warehouse such as Snowflake or BigQuery, and transformed as needed at, or just ahead of, analytics time. This approach is appealing in terms of flexibility but raises a few questions when you try to square it with the back room/front room architecture:
1. The transformed, ready data in the front room effectively has a fixed shape (facts and dimensions if you’re doing dimensional modeling). How do you reconcile this with the idea of on-demand and flexible table transformations?
2. If you’re using a specific warehouse for transformation, you’ll probably also use it for business intelligence. How do you separate transformed or ready data from non-transformed data when there’s no longer a physical loading step?
I’ll give my opinion on question one in future articles. For question two, my answer would be to simulate the load step in the warehouse. In other words, create a step in the data pipeline that signals that data past this point is ready for analytics.
This step is effectively a hook for adding SLAs, documentation, tests, and quality checks. It places the responsibility on data producers, whether data scientists or analytics engineers, to verify that the data is ready for consumption and that the pipelines producing the data are robust. You might even do what the DWT suggests and restrict access to this back room data or, in our case, data that hasn’t gone through this loading step.
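One way to picture that simulated load step is as a small gate that runs quality checks before a table is published to the front room. This is only a sketch, not a real warehouse API: the `FRONT_ROOM` dict stands in for an analytics schema, and the check functions are illustrative.

```python
# Stand-in for a front-room schema; in practice this would be a schema
# or database that business intelligence tools are pointed at.
FRONT_ROOM: dict[str, list[dict]] = {}

def not_null(field: str):
    """Illustrative quality check: no row may have a null in `field`."""
    def check(rows: list[dict]) -> bool:
        return all(row.get(field) is not None for row in rows)
    return check

def publish(table: str, rows: list[dict], checks) -> bool:
    """Simulated load step: data reaches the front room only if every
    check passes; otherwise it stays in the back room for repair."""
    if all(check(rows) for check in checks):
        FRONT_ROOM[table] = rows
        return True
    return False
```

The gate is where SLAs, documentation, and tests attach: a data producer can only call `publish` once those obligations are met, which is the hook the loading step used to provide physically.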
Data modeling on the modern data stack is much more about information architecture requirements than physical requirements. Treating the front room as a question of information architecture means organizing tables, schemas, databases, access control, and other mechanisms to clarify the difference between ready and not ready for consumption.
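As a minimal sketch of that information architecture, a naming convention plus generated grants can make the ready/not-ready boundary explicit. The schema prefixes and role name below are hypothetical, not a standard:

```python
# Illustrative convention: "stg_" schemas are back room (not ready for
# consumption), "mart_" schemas are front room (ready for consumption).

def is_front_room(schema: str) -> bool:
    return schema.startswith("mart_")

def analyst_grants(schemas: list[str], role: str = "analyst") -> list[str]:
    """Generate GRANT statements only for front-room schemas, so analysts
    never stumble into back-room data by accident."""
    return [
        f"GRANT USAGE ON SCHEMA {s} TO ROLE {role}"
        for s in schemas
        if is_front_room(s)
    ]
```

Whatever the exact mechanism, the idea is the same: access control and naming do the work the physical load step used to do.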
Additionally, while the DWT is opinionated about what the front room should look like, the back room/front room architecture makes sense regardless of whether you have dimensional models. The key is making a judgment call about what ready-for-consumption means and adding a step, or multiple steps, in the pipeline to reify that decision.