Grains of Data
When you mention data modelling, people typically imagine drawn-out meetings with too many participants and endless Google docs. They'll then find a reason to end the conversation shortly after. This reaction is understandable; data modelling requires getting many people to agree on definitions, which can be tedious. A metric layer, for example, requires agreed-upon definitions for those metrics. Similarly, a dimensional model requires definitions for dimensions and facts.
In a previous article, I talked about the front room of a warehouse as a consistent interface to your data. One powerful way dimensional modelling implements consistency is by focusing on data grains. Fortunately, grains also have plenty of benefits without implementing dimensional modelling.
The grain (also known as the level of detail or granularity) of a tabular dataset is a description of what each row is meant to represent. For example, as an urban planner, you might have access to census data at the grain of households and residents. For privacy reasons, a third party might only be able to access data at the metropolitan or neighbourhood grain - a summarized version of the more granular data.
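To make the idea concrete, here is a minimal sketch (with made-up numbers) of coarsening household-grain rows into a neighbourhood grain. Once summarized, the household-level detail is gone:

```python
from collections import defaultdict

# Hypothetical household-grain rows: (neighbourhood, household_id, residents)
households = [
    ("Riverside", 1, 3),
    ("Riverside", 2, 5),
    ("Old Town", 3, 2),
    ("Old Town", 4, 4),
    ("Old Town", 5, 1),
]

# Coarsen the grain: one row per neighbourhood instead of one per household.
neighbourhoods = defaultdict(lambda: {"households": 0, "residents": 0})
for hood, _, residents in households:
    neighbourhoods[hood]["households"] += 1
    neighbourhoods[hood]["residents"] += residents

# From the summary alone, you can no longer tell which household
# had how many residents -- exactly the privacy property the census
# example relies on.
```

The aggregation is one-way: you can always roll a fine grain up to a coarser one, but not the reverse.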
The grain of the available data determines which analyses are possible and which are easy. With only metropolitan and neighbourhood summaries, you can answer few questions about individual households and residents. At the same time, some questions are quicker and easier to answer from summary data.
Analytics engineering is about managing this balance at the scale of vast amounts of data. A lot of my job can be boiled down to figuring out the kind of analysis that benefits the business, then designing datasets with the appropriate grain to enable that analysis. Sometimes that means meetings and documents!
Useful data grains
A common misconception in the data world is that table grain is a synonym for primary or composite key. While this might be true operationally, it misses the descriptive power of grains. It’s possible to have different tables with the same unique identifiers but different grains.
Consider the following sales data from two retail stores:
These tables might both have a `Date` primary key, but they raise several questions about missing data, particularly in Table 2. Describing the table grains might clarify a few of these at analysis time:
Table 1: One row per calendar day. Days are filled in with Sales = 0 when no sales are made.
Table 2: One row per sale day. Days are recorded when the first sale of the day is made.
These descriptions are much more informative than just `Date`. You know immediately that Table 2 needs some padding before it can be comparable to Table 1.
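That padding step can be sketched in a few lines. This is a minimal illustration with hypothetical dates and sales figures, not the real tables: it fills every calendar day in the range, defaulting to zero sales, so the sale-day grain matches the calendar-day grain.

```python
from datetime import date, timedelta

# Hypothetical Table 2: one row per *sale day*; quiet days are simply absent.
sale_days = {
    date(2024, 3, 1): 120,
    date(2024, 3, 4): 80,
}

# Pad to Table 1's grain: one row per calendar day, Sales = 0 on days
# with no recorded sale.
start, end = date(2024, 3, 1), date(2024, 3, 5)
padded = {}
d = start
while d <= end:
    padded[d] = sale_days.get(d, 0)
    d += timedelta(days=1)
```

In a warehouse you would do the same thing with a date-spine table and a left join, but the logic is identical: enumerate the target grain, then attach whatever data exists.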
There might also be a mountain of other concerns that reveal themselves only through hundreds of lines of SQL, documentation, and, yes, Thursday afternoon meetings. For example, the tables might also differ by how the stores:
1. Handle online vs. brick-and-mortar sales.
2. Record digital vs. physical product sales.
3. Modify data based on returns.
All of this domain knowledge can be captured in a verbal grain declaration.
Data grain declarations are pretty great, but alas, trust can be fragile. Like most documentation, grain declarations can become outdated and are not easily operationalized. While you can add a primary key constraint to a table, you can only express semantic constraints directly through words and indirectly through the actual data recording and transformation processes.
Despite those drawbacks, data grains are a powerful piece of metadata that drastically improve table creation and usage, whether you're going full Kimball or not.