Published on March 05, 2016 by Dr. Randal S. Olson
Last year, the vaccination debate was all the rage again. "Pro-vaxxers" were loudly proclaiming that everyone should get vaccinated and discussing the science behind it, and "anti-vaxxers" were casting their doubts and still refusing to get vaccinated for personal reasons. Around that time, The Wall Street Journal released a brilliant series of heat maps showing infection rates for various diseases over time, broken down by state. These heat maps easily demonstrated one of the most important facts in the vaccination debate: Time and time again, vaccines work.
Today, I would like to revisit the WSJ's heat maps through the lens of a data visualization practitioner. In particular, I would like to show how these heat maps could be improved by reviewing some basic rules of data visualization and trying out some other methods for displaying the data. Below, I'm going to walk through four major criticisms and show how addressing them can improve the original work.
For the curious, I've released my notebook with the Python code used to generate the new visualizations.
Perhaps one of the most glaring issues with the original WSJ heat maps was their use of a custom categorical color palette to display the infection rates. The palette runs through most of the colors of the rainbow at seemingly random intervals. It's possible that they calculated quantiles to determine the ranges for the color bins (as they should!), but that wasn't indicated in their methodology.
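To make the quantile idea concrete, here is a minimal sketch of how quantile-based color bins could be computed with Python's standard library. The rates are made-up numbers for illustration, and the helper names `quantile_bin_edges` and `assign_bin` are hypothetical; this is not the WSJ's method, which wasn't published.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_bin_edges(values, n_bins):
    """Return the n_bins - 1 cut points that split values into equal-count bins."""
    return quantiles(values, n=n_bins, method="inclusive")


def assign_bin(value, edges):
    """Return the 0-based index of the color bin that value falls into."""
    return bisect_right(edges, value)


# Hypothetical infection rates (cases per 100,000 people), not real data.
rates = [0.0, 0.5, 1.2, 3.4, 7.8, 15.0, 22.1, 40.3, 88.0, 150.0]

edges = quantile_bin_edges(rates, n_bins=5)
bins = [assign_bin(rate, edges) for rate in rates]
```

With quantile edges, each of the five bins ends up holding the same number of observations, so no single color dominates the chart just because the data is skewed.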
In any case, it's rarely a good idea to use multiple colors to display a single continuous variable. Here, all we want to do is use color to show the infection rates for each year. If we use more than one color, our readers have to constantly refer back to the legend to figure out what each color means, which is an unnecessary cognitive strain on our reader. Instead, we should use a single-color sequential palette, where lighter shades indicate lower values and darker shades indicate higher values. I've reworked the Polio heat map to do just that below.
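As a rough sketch of what a single-color sequential palette does under the hood, the function below maps a value onto a shade between a light and a dark red. The function name `sequential_color` and the two endpoint colors are assumptions picked for illustration. Plain linear interpolation in RGB is not perceptually uniform, so in practice a plotting library's built-in sequential palette is usually the better choice.

```python
def sequential_color(value, vmin, vmax, light=(255, 245, 240), dark=(103, 0, 13)):
    """Map value to a hex color between a light shade (low) and a dark shade (high)."""
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    # Clamp so out-of-range values map to the ends of the palette.
    t = min(max((value - vmin) / (vmax - vmin), 0.0), 1.0)
    rgb = tuple(round(lo + t * (hi - lo)) for lo, hi in zip(light, dark))
    return "#{:02x}{:02x}{:02x}".format(*rgb)


# Hypothetical range of infection rates: 0 to 100 cases per 100,000 people.
lowest = sequential_color(0, 0, 100)     # lightest shade
highest = sequential_color(100, 0, 100)  # darkest shade
```

Because there is only one hue, the reader only has to remember a single rule: darker means more infections.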
One exception to this "rule," of course, is diverging color palettes. If there is a clear divide in our continuous variable -- for example, if we're displaying gains and losses for a company -- then it could be appropriate to use a diverging color palette with one color to represent gains (values >= $0) and another to represent losses (values <$0).
Just for fun, I recreated the same chart above for Measles so we can compare it to the originals on WSJ.
Color blindness is probably one of the most overlooked issues in data visualization, and the WSJ heat maps are a great example. I ran the WSJ heat map above through a color blindness simulator for red-green color blindness -- the most common form of color blindness -- and below is the result.
Disastrous! Much of the color gradient is lost in some yellow/grey abyss, and the dark purple colors represent low values whereas the lighter yellow and dark grey colors represent higher values. This color palette survives better than most and the main message is still (mostly) communicated, but the WSJ color palette is certainly far from ideal here.
For comparison, I ran my rework from above through the same red-green color blindness simulator. As we can see, the simple sequential color palette is practically unaffected by this form of color blindness. Problem solved!
The main lesson here is that we should always run our color palette through a color blindness simulator before committing to it. Roughly 5% of our audience will experience our data visualizations through that lens.
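For a quick sanity check in code, a crude simulation can be sketched with a 3x3 matrix applied to each RGB color. The matrix below is a widely circulated approximation for deuteranopia (one form of red-green color blindness) applied directly to sRGB values. It is an assumption on my part rather than what any particular simulator uses, and dedicated tools rely on more careful color models, so treat this as a rough first pass only. The helper names are hypothetical.

```python
# Commonly quoted approximation for deuteranopia; an assumption, not a standard.
DEUTERANOPIA = (
    (0.625, 0.375, 0.0),
    (0.700, 0.300, 0.0),
    (0.000, 0.300, 0.700),
)


def simulate_deuteranopia(rgb):
    """Return a rough deuteranopia rendering of an (r, g, b) color in 0-255."""
    r, g, b = rgb
    return tuple(round(row[0] * r + row[1] * g + row[2] * b) for row in DEUTERANOPIA)


def color_distance(a, b):
    """Euclidean distance between two RGB colors (a crude similarity measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


red, green = (255, 0, 0), (0, 255, 0)
before = color_distance(red, green)
after = color_distance(simulate_deuteranopia(red), simulate_deuteranopia(green))
```

Comparing `before` and `after` for pairs of palette colors gives a rough idea of which colors collapse toward each other. Pairs that end up close together are the ones our readers may not be able to tell apart.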
One of the major drawbacks of heat maps is that they rely on color to communicate the specific values in each cell. While it's not always important to display a precise value, there can sometimes be important trends hiding in these small differences. For that reason, I reworked the Polio heat map into a simple line chart below, where each light line is a state and the dark line is the median value across all the states for each year.
The above chart isn't too useful, and the data is too messy to make much sense of the state-by-state trends. However, the decline in infection rates after the introduction of the vaccine is abundantly clear even in this case.
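The median line in a chart like this is simple to compute. Here is a minimal sketch, assuming the data arrives as `(state, year, rate)` tuples; that layout, the function name `median_by_year`, and the numbers are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_by_year(records):
    """Return {year: median rate across states} from (state, year, rate) tuples."""
    rates_by_year = defaultdict(list)
    for _state, year, rate in records:
        rates_by_year[year].append(rate)
    return {year: median(rates) for year, rates in sorted(rates_by_year.items())}


# Hypothetical rates (cases per 100,000 people) for three made-up states.
records = [
    ("A", 1950, 10.0), ("B", 1950, 30.0), ("C", 1950, 20.0),
    ("A", 1960, 1.0), ("B", 1960, 3.0), ("C", 1960, 2.0),
]

median_line = median_by_year(records)
```

Using the median rather than the mean keeps a single state with an unusually bad year from dragging the summary line upward.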
No post of mine is complete without small multiples, so let's give that a try. Below, each state has its own chart, and all 50 states (+ D.C.) are put on the same time axis.
Each line tells its own story, and these are stories that were masked in the heat maps. Small multiples allow us to see specific state-by-state trends. For example, Polio outbreaks were already on the decline in South Dakota even before the introduction of the Polio vaccine. Meanwhile, Polio outbreaks were at their worst in New Hampshire just prior to the introduction of the Polio vaccine, which made short work of Polio immediately thereafter.
We should always ask ourselves when designing data visualizations: Do we care about the broader story, or the smaller stories? In this case we could go either way, but the direction we go depends on the story we want to tell.
Another fair criticism of all the data visualizations shown so far is that they show too much data. After all, the main message of the WSJ heat maps was simple: When introduced to human populations, vaccines work. There's no need to show the state-by-state trends then; in fact, we may be overwhelming our reader by providing too much data that doesn't get right to the point. For example, what happened with Polio in Utah, with the infection rate more than doubling after the introduction of the Polio vaccine? Or what about South Dakota, where Polio seems to have been mostly eliminated even before the vaccines were made available?
These outliers are distractions from the overall trend. We can overcome these distractions by applying a simple statistical analysis to the data and showing the overall trend with confidence bounds. Below, I've done just that by plotting the median Polio infection rate across all states (dark line) with bootstrapped 95% confidence intervals (shaded area).
By summarizing the data with some basic statistics, we've removed the distractions and gotten straight to the point: Overall in the U.S., Polio outbreaks were on the rise from the 1940s onward. Right at the introduction of the Polio vaccine in 1955, we immediately saw a decline in Polio outbreaks until it was practically eliminated in the 1960s.
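For readers who haven't seen bootstrapping before, here is a minimal sketch of a percentile bootstrap for the median of one year's state-level rates. This is one common way to build such an interval and not necessarily the exact procedure behind the chart; the function name `bootstrap_median_ci`, its parameters, and the data are hypothetical.

```python
import random
from statistics import median


def bootstrap_median_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the median of values."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(median(rng.choices(values, k=n)) for _ in range(n_resamples))
    alpha = (1.0 - confidence) / 2.0
    lower = medians[int(alpha * n_resamples)]
    upper = medians[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper


# Hypothetical rates (cases per 100,000 people) across states for one year.
rates = [2.1, 3.5, 4.0, 5.2, 6.8, 7.7, 9.3, 12.0, 15.4, 30.1]
lower, upper = bootstrap_median_ci(rates)
```

Repeating this for every year gives the lower and upper edges of the shaded band around the median line.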
Again, we should always consider our story when designing data visualizations. If we have one clear story that we want to communicate, we should consider reducing the amount of data we show to the point that we can effectively -- and honestly -- communicate our story. There's no point in confusing our reader with unnecessary details, unless those details contain an important caveat.
At face value, these charts only demonstrate correlations: When vaccinations were introduced to the population, the prevalence of infectious disease decreased shortly thereafter. I believe it's important to point out here that even though I want to focus on data visualization techniques in this post, the science behind vaccination is not up for debate, and these charts are in fact demonstrating a proven causal relationship. Please don't waste your time typing out "correlation != causation" in the comments.
To wrap up, these are the lessons we've drawn from revisiting the popularized vaccine visualizations:
If you liked what you saw in this post and want to learn more, check out my Python data visualization video course that I made in collaboration with O'Reilly. In just one hour, I will cover these topics and much more, which will provide you with a strong starting point for your career in data visualization.