Knowing Your Data
Hey folks, since my recent switch to Analytics Engineering, I’ve been exploring more varied data than usual. This article is, in part, a reflection of that — enjoy!
Data skills vs. Data context
Organizations in many industries work with data, and they tend to use the same or similar techniques and technologies to do this work. That makes data skills pretty transferable. Knowing how to plan, run, and interpret online controlled experiments, for example, is a skill that applies equally to travel, streaming entertainment, or social media. The same goes for skills like data-driven marketing, exploratory analysis, data visualization, and maintaining data infrastructure.
While data skills might be transferable, each organization has its own history, culture, and processes surrounding data. Experiments at two different organizations can look very different in their intended goals, analysis units, randomization units, power calculations, metrics, and so on. Context determines how a skill applies, whether you're using data directly or building a platform for others to use it. In this article, I want to explore this context more deeply.
Data all the way down
The TLDR is that gathering data context means collecting data about your data. There's a word for data about data: metadata! Having enough relevant metadata, like having enough relevant data, enables informed decisions. You can decide whether to trust a particular dataset, use it as input to an algorithm, or base a strategic decision on it, among other things.
Like data, metadata is abundant. So much so that Google uses the same techniques it uses for data warehousing in BigQuery to efficiently store and make use of metadata [1].
Here’s a laundry list of things that resemble metadata. Some are more qualitative than quantitative, and others are a step or two removed from any particular dataset you might be working with, but I think they all help build data context. This list is also far from comprehensive, and the items can vary in importance depending on the situation. I’d love to hear which you find most important or what you think is missing. Please feel free to drop me a note!
Laundry.List
Descriptive statistics
Calculating statistics about, or visualizing a sample of, the data is a pretty good introduction. The usual starting point is summary statistics: measures of central tendency (mean, median, mode), standard deviation, and category frequencies. These can be kept handy in a dashboard, query, or notebook for validating new or unexpected results.
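As a minimal sketch of what "kept handy in a notebook" can look like, here's a pandas version with a made-up orders table standing in for your data:

```python
import pandas as pd

# Hypothetical orders dataset; swap in your own table.
orders = pd.DataFrame({
    "order_value": [12.5, 40.0, 7.25, 150.0, 33.1],
    "country": ["CA", "US", "US", "DE", "CA"],
})

# Mean, standard deviation, and quartiles for numeric columns.
print(orders["order_value"].describe())
print("median:", orders["order_value"].median())
print("mode:", orders["order_value"].mode().tolist())

# Category frequencies for a categorical column.
print(orders["country"].value_counts())
```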
Definitions
These are especially important if they differ from existing industry-standard definitions or are hidden away in code, processes, or tacit knowledge. Some examples:
What columns, tables, and schemas/datasets represent.
The definitions of KPIs and other business metrics.
Artificial/arbitrary cutoffs - In some applications, it might be reasonable to consider a “session” paused or ended if there has been no activity for three minutes, thirty minutes, or even three hundred minutes. Similarly, the time before a user “churns” could be weeks, months, or years. (A sketch of one such cutoff in code follows this list.)
The data definitions behind everyday vocabulary - I’ve found it helpful to know the technical details behind the terms seemingly everyone uses. For example, what exactly is a user? What uniquely identifies a user, and when is that identifier created, destroyed, archived, and reused? This knowledge can sometimes explain strange analysis results.
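To make the cutoff idea concrete, here's a minimal sessionization sketch in pandas. The 30-minute threshold and the events table are assumptions for illustration, not a recommendation:

```python
import pandas as pd

# Hypothetical clickstream events for a single user.
events = pd.DataFrame({
    "ts": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 09:10",
        "2024-01-01 10:30", "2024-01-01 10:35",
    ]),
})

SESSION_TIMEOUT = pd.Timedelta(minutes=30)  # the arbitrary cutoff

events = events.sort_values("ts")
# A new session starts whenever the gap since the previous event
# exceeds the timeout; cumsum turns those breaks into session ids.
gap = events["ts"].diff()
events["session_id"] = (gap > SESSION_TIMEOUT).cumsum()
print(events)
```

Changing the one constant redefines every downstream session metric, which is exactly why cutoffs like this deserve to be written down.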
Expected analysis confounders
Some ways of categorizing data can hide or emphasize a result. These might be related to time, geography, or other potential groupings. Building up a list of likely confounding variables can save time.
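Simpson's paradox is the classic version of this. In the tiny, fabricated example below, variant B wins within every platform but loses in the pooled numbers, purely because of how traffic is mixed:

```python
import pandas as pd

# Fabricated numbers chosen to flip the aggregate result.
df = pd.DataFrame({
    "variant":     ["A", "A", "B", "B"],
    "platform":    ["mobile", "desktop", "mobile", "desktop"],
    "conversions": [10, 200, 50, 60],
    "visits":      [100, 400, 400, 100],
})

# Per-platform conversion rates: B beats A on both platforms...
per_group = df.assign(rate=df["conversions"] / df["visits"])
print(per_group)

# ...but pooled across platforms, A looks better.
pooled = df.groupby("variant")[["conversions", "visits"]].sum()
print(pooled["conversions"] / pooled["visits"])
```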
Variation over time
Historical data can be used for benchmarking and forecasting, especially around notable events. For example, when sizing an A/B test: if previous A/B tests have averaged a 0.1% lift on a particular metric, powering the next test to detect a 0.01% lift is probably not worth the time. It's also helpful to know the start date of time-series data, whether there have been outages or incidents, and the points at which a dataset was modified, i.e., when values, categories, or columns were added.
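Here's a hedged sketch of how that historical lift feeds into a sample-size calculation, using statsmodels. The baseline rate is made up, and I'm treating the lifts as relative:

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

BASELINE = 0.10  # assumed baseline conversion rate

def users_per_variant(relative_lift: float) -> float:
    """Sample size per variant for 80% power at alpha = 0.05."""
    effect = proportion_effectsize(BASELINE * (1 + relative_lift), BASELINE)
    return NormalIndPower().solve_power(
        effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
    )

# The 0.01% lift needs roughly 100x the sample of the 0.1% lift,
# which is the "probably not worth the time" point in numbers.
print(f"{users_per_variant(0.001):,.0f} users per variant for 0.1% lift")
print(f"{users_per_variant(0.0001):,.0f} users per variant for 0.01% lift")
```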
Data volume
Volume has implications for latency, cost, and occasionally technical capabilities. Querying a large dataset for results from the last ten years will likely be prohibitively slow and expensive. On the other hand, a minimum amount of data is required for things like statistical significance testing and some machine learning approaches.
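One practical habit: many warehouses will estimate a query's scan size before you pay for it. A sketch using BigQuery's dry-run mode, assuming the google-cloud-bigquery client and a hypothetical events table:

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT user_id, event_ts
    FROM `my_project.analytics.events`        -- hypothetical table
    WHERE event_ts >= TIMESTAMP '2015-01-01'  -- ten years of data
"""

# Dry run: validates the query and estimates bytes scanned
# without actually running (or being billed for) it.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)
print(f"Estimated scan: {job.total_bytes_processed / 1e9:.1f} GB")
```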
Service Level Agreements (SLAs)
How late can this data be before it presents a problem? From a platform perspective, the risk here could be data loss due to retention issues. From a data user perspective, the danger could be not having essential data for a publication or meeting.
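A freshness check is the simplest way to make an SLA concrete. A minimal sketch, assuming the six-hour limit and that the latest event timestamp has already been fetched from the warehouse:

```python
from datetime import datetime, timedelta, timezone

SLA = timedelta(hours=6)  # assumed: data may be at most 6 hours late

# In practice this would come from something like
# SELECT MAX(event_ts) FROM analytics.events
latest_event = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)

lag = datetime.now(timezone.utc) - latest_event
if lag > SLA:
    print(f"SLA breached: data is {lag} behind (limit {SLA})")
else:
    print(f"Within SLA: data is {lag} behind")
```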
The source of your data
At any point after actual data collection, it's reasonable to ask where the data came from. This applies to slide decks, dashboards, streams, tables, machine learning models, etc. The answer depends heavily on how your analytics system is put together.
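Answering "where did this come from?" is often just a graph walk. A toy sketch, with a hand-written dependency map standing in for whatever your orchestrator or catalog actually exposes:

```python
# Hypothetical lineage: each artifact maps to its direct upstream sources.
upstream = {
    "revenue_dashboard": ["fct_orders"],
    "fct_orders": ["stg_orders", "stg_payments"],
    "stg_orders": ["raw.orders"],
    "stg_payments": ["raw.payments"],
}

def sources(artifact: str) -> set[str]:
    """Walk the graph to find the raw sources feeding an artifact."""
    deps = upstream.get(artifact)
    if not deps:  # no recorded upstream: treat as a raw source
        return {artifact}
    found = set()
    for dep in deps:
        found |= sources(dep)
    return found

print(sources("revenue_dashboard"))  # {'raw.orders', 'raw.payments'}
```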
Suitability
It’s worth asking whether a dataset is suitable rather than simply available for a particular use case. Consider a static website landing page that redirects to an application. Data collected on landing might be relevant to marketing and conversion tracking but not particularly relevant to, or suitable for, in-app performance. If the application also doesn’t load reliably on lower-end devices, the data is unsuitable for decisions that affect those devices.
The access patterns of the data
Who or what (a scheduled process, for example) is accessing this data?
Why is this data generally accessed?
How is the data usually accessed? Does it tend to be via dashboard, queries, slide decks, or something else?
How frequently is this data accessed?
What are the access restrictions?
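These questions can often be answered, at least roughly, from the warehouse's own audit logs. A sketch, assuming a hypothetical log extract already pulled into pandas:

```python
import pandas as pd

# Hypothetical audit-log extract: one row per query against a table.
log = pd.DataFrame({
    "principal": ["airflow", "dashboards", "alice", "airflow"],
    "table":     ["fct_orders"] * 4,
    "ts": pd.to_datetime([
        "2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02",
    ]),
})

# Who (or what) accesses the table, and how often?
print(log.groupby("principal").size().sort_values(ascending=False))

# How frequently overall? Accesses per day.
print(log.set_index("ts").resample("D").size())
```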
Wrap-up
I’m a believer in just-in-time learning. I don’t see this context as a checklist or even a getting-started guide, but rather a toolkit for navigating uncertainty. This could be uncertainty about the data or uncertainty about how the data will be used.
Not discussed here (but might be in the future) is data culture more generally. Data serves different purposes in different organizations, and understanding its role is a huge part of building context.
I’ll also note that nothing I’ve described here is immutable. Both the data itself and the organization’s culture and processes around data can change. If that’s the case for your situation, I hope you’ll consult this article for a refresher on what to reevaluate 🙂
Thanks to Serena Peruzzo for early feedback and discussion and thank you for reading!
‘til next time
— Alex
[1] Granted, most of the metadata described in the linked article is for internal bookkeeping of the storage system, but I think the point still stands.