Three data analysis mistakes
Hey folks, most of this piece is from a lightning talk script I wrote a while back but never actually gave (maybe one day), so it might read more conversational than usual. Enjoy!
A lot of data work looks like a series of abridged research cycles, whether that's making a case for a result or exploring potential hypotheses. This means lots of analyzing data and creating data stories. In this short article, I'll go through some easy-to-make data analysis mistakes and how to avoid them.
Mistake #1
After doing the difficult work of actually getting some data, you might start with exploratory analysis. You do some aggregation, calculate some summary statistics, look at some correlations, and take it from there, right? Maybe, but only if you don't make Mistake #1 - not (re)acquainting yourself with the source of the data.
The source might be a composition of other datasets. What were those datasets, and what limitations do they have? The source might be manual data collection, like a survey or labeling task. How well does the collection reflect the intended use case(s)? Questions like these add context and can help you address issues before spending time analyzing a dataset that isn't appropriate.
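If your dataset is stitched together from several sources, one quick way to start asking those questions is to profile each source separately before looking at the whole. Here's a rough sketch using only the standard library. The field names (`source`, `age`, `region`) and the two sources are made up for illustration, and it assumes each row records where it came from.

```python
from collections import defaultdict


def profile_by_source(rows, source_key="source"):
    """Count rows and missing values per field for each upstream source."""
    profile = defaultdict(lambda: {"rows": 0, "missing": defaultdict(int)})
    for row in rows:
        entry = profile[row.get(source_key, "unknown")]
        entry["rows"] += 1
        for field, value in row.items():
            # Treat None and empty strings as missing (an assumption;
            # your data may encode missing values differently).
            if field != source_key and value in (None, ""):
                entry["missing"][field] += 1
    return {
        source: {"rows": e["rows"], "missing": dict(e["missing"])}
        for source, e in profile.items()
    }


# Hypothetical combined dataset: survey responses merged with an app export.
rows = [
    {"source": "survey_2021", "age": 34, "region": "north"},
    {"source": "survey_2021", "age": None, "region": "south"},
    {"source": "app_export", "age": 27, "region": ""},
    {"source": "app_export", "age": 41, "region": ""},
]

for source, stats in profile_by_source(rows).items():
    print(source, stats)
```

In this toy example, the app export never fills in `region`, which is exactly the kind of limitation you'd want to know about before comparing regions across the combined data.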
Mistake #2
Okay, so you've drafted some visualizations for one or two compelling ideas. That's great, but you shouldn't like them too much, or you'll make Mistake #2 - Over-polishing your drafts.
In the same way that you might not focus on grammar or perfect word choice in an early writing draft, it's not that important to focus on visual grammar or perfectly written code for intermediate findings. It's more important to focus on correctness and consistency.
This is the moment to double-check that your visualizations align with your interpretation and that you're graphing precisely what you intend! In other words, it's okay if your intermediate charts don't look quite how you'd like yet, as long as they show the right thing. Polish comes later, during the editing process (which you should definitely have).
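One cheap way to check that you're graphing what you intend is to compare the numbers going into the chart against the raw data. The sketch below is just one possible check, assuming the chart shows per-group totals. The function names and the `region`/`sales` fields are hypothetical.

```python
from collections import defaultdict
import math


def aggregate_for_chart(records, group_key, value_key):
    """Sum value_key per group; this is the data a bar chart would show."""
    totals = defaultdict(float)
    for record in records:
        totals[record[group_key]] += record[value_key]
    return dict(totals)


def check_chart_data(records, chart_data, group_key, value_key):
    """Raise ValueError if the chart data disagrees with the raw records."""
    raw_groups = {record[group_key] for record in records}
    if set(chart_data) != raw_groups:
        raise ValueError("chart groups do not match the groups in the raw data")
    raw_total = sum(record[value_key] for record in records)
    if not math.isclose(sum(chart_data.values()), raw_total):
        raise ValueError("chart total does not match the raw total")


# Hypothetical raw records.
records = [
    {"region": "north", "sales": 10.0},
    {"region": "south", "sales": 5.5},
    {"region": "north", "sales": 4.5},
]

chart_data = aggregate_for_chart(records, "region", "sales")
check_chart_data(records, chart_data, "region", "sales")  # passes silently
```

A check like this won't tell you whether the chart is a good one, only that a filter or join along the way hasn't quietly dropped a group or changed the totals.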
Mistake #3
So now you have your results. They're all statistically significant and significantly statistical. How do you fit them into your story? Well, not by making Mistake #3 - Including everything.
Your story should be coherent in a visual and a narrative sense. You might have spent a lot of time on a result that only ends up as a footnote (probably after making Mistake #2), and that's fine. Spending a lot of time on a result doesn't earn it a proportionally large role in your data story! Some ideas end up as minor caveats, additional context, or the foundation for another, more specific analysis.
"Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away." — Antoine de Saint-Exupéry, Airman's Odyssey.
Thank you for reading! I haven't done much data visualization recently, but I do miss it! If you have thoughts on other data analysis mistakes or want to share your favourite data viz, please feel free to respond to this e-mail 🙂
'til next time
— Alex