Written by Kevin Soo, PhD, a Senior Applied Data Scientist at Civis Analytics, where he helps government teams solve tough data-related problems.
Data science and predictive modeling are increasingly important to modern government, especially as teams focus on responding to COVID-19.
Addressing this public health crisis requires collaboration among team members with varying levels of technical expertise. Because most government data science teams are small, the individuals who understand data science become the go-to resource for questions about modeling, data, and statistics. This knowledge gap can be tricky to navigate: experts may spend time explaining the nuances of technical concepts to colleagues who may not know the right questions to ask.
To help bridge this gap, Civis identified some important considerations for teams thinking about data and models related to COVID-19. Our hope is that this serves as a good high-level starting point for beginners, and can be a resource for data scientists to share with colleagues.
First of all: what is a model and why are there so many of them?
At the most general level, statistical models use math to turn data about what has already happened into projections about what will happen. In the case of COVID-19, there are a number of models, and they may differ for several reasons.
First, they may be predicting different outcomes. For example, the IHME model projects the number of COVID-19-related deaths, while the CHIME model makes projections about hospital capacity. Both are important projections that can help us navigate the crisis, but they can’t be compared apples-to-apples, so it’s important to know which models your state or city government is using.
Second, various models rely on different data and assumptions. Some use data on the number of current confirmed cases, while others focus only on hospitalizations. Some may assume that strong adherence to social distancing will continue indefinitely, while others assume that adherence to social distancing will decrease when the rate of new infections drops.
Learning the ins and outs of each model can be time-consuming, especially for those without a technical background. Here’s what to pay attention to:
Rely on interpretations from public health experts. Any data enthusiast with a COVID-19-related dataset (and a Twitter account) can share their hot take, but it’s vital we focus on the advice of epidemiologists and public health professionals; they’re subject matter experts for a reason. Even a seemingly simple health statistic like a state’s mortality rate needs to be interpreted in light of many factors and nuances that these experts understand better than anyone else. Things like: How many tests have been administered? Did everyone who needs a test get one? How do possible co-infections with other diseases factor in? What about false negatives/positives?
Look at trends and averages, not single data points. Single-day spikes or drops can be outliers due to issues with data quality or quirks in how the data are recorded. For example, in some regions, new cases and deaths come in spikes that reflect reporting lags and slower reporting over weekends. States also use different metric definitions in their reporting (e.g., only some include likely COVID-19 deaths), making them difficult to compare. The tl;dr: don’t read too much into the difference between a 5% increase in cases in your county and a 6% increase in a neighboring county.
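For teams working directly with daily counts, a rolling average is a simple way to look past single-day spikes. Here is a minimal sketch using pandas; the case counts and dates are made up for illustration.

```python
import pandas as pd

# Hypothetical daily new-case counts; weekend dips followed by a
# Monday catch-up spike are common artifacts of reporting lags.
daily_cases = pd.Series(
    [120, 135, 128, 140, 95, 60, 210, 150, 145, 155, 160, 100, 70, 230],
    index=pd.date_range("2020-04-01", periods=14),
)

# A 7-day rolling mean smooths over day-of-week reporting effects,
# making the underlying trend easier to see than any single day's value.
trend = daily_cases.rolling(window=7).mean()

print(pd.DataFrame({"reported": daily_cases, "7-day average": trend}))
```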
Be mindful of overly broad summary statistics. Sometimes, overall summary numbers can gloss over how the COVID-19 crisis may be affecting certain groups of people differently. For example, our own polling from April showed that 57% of adults “strongly agreed” with shelter-in-place policies. However, breaking this result down by age shows that only 43% of those between 18 and 34 “strongly agreed,” compared to 71% of those 65 and older.
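A quick way to check whether a headline number hides group differences is to recompute it within each group. The sketch below uses pandas on made-up survey responses; the column names and values are illustrative, not our actual polling data.

```python
import pandas as pd

# Hypothetical survey responses: 1 = "strongly agree" with
# shelter-in-place policies, 0 = any other response.
responses = pd.DataFrame({
    "age_group": ["18-34", "18-34", "18-34", "35-64", "35-64", "65+", "65+", "65+"],
    "strongly_agree": [1, 0, 0, 1, 1, 1, 1, 0],
})

# The overall rate can mask large differences between groups.
print("Overall rate:", responses["strongly_agree"].mean())
print(responses.groupby("age_group")["strongly_agree"].mean())
```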
Pay attention to the predicted range of possible outcomes. No statistical model can predict the future; models make projections, which always contain uncertainty. Most models will report a range of likely outcomes (also referred to as a “prediction interval/range,” “confidence interval,” or “margin of error”) and may single out one outcome as the most likely.
Think of it like rolling two dice: the sum will fall between 2 and 12 (the prediction interval), with 7 as the likeliest outcome. Just because 7 is the likeliest outcome doesn’t mean it will happen every time (or the game of craps wouldn’t exist).
The narrower a model’s prediction interval, the more certain we can be about its projections. If a model has a wide prediction interval, it is telling us that we lack the information to make projections with confidence.
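The dice analogy can be made concrete with a quick simulation. This sketch estimates the most likely sum and a 95% range from repeated rolls; the number of simulations and the random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate rolling two six-sided dice many times and summing each pair.
rolls = rng.integers(1, 7, size=(100_000, 2)).sum(axis=1)

# The single most likely outcome...
values, counts = np.unique(rolls, return_counts=True)
most_likely = values[counts.argmax()]

# ...and a 95% "prediction interval" covering most of the possible sums.
low, high = np.percentile(rolls, [2.5, 97.5])

print(f"Most likely sum: {most_likely}")                       # 7
print(f"95% of sums fall between {low:.0f} and {high:.0f}")    # roughly 3 and 11
```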
Change isn’t always a bad thing. Most models are dynamic: they change over time as new data is incorporated. If something about the world changes (for example, people stay at home) and that’s reflected in the data, models will pick up on these changes and adjust their projections. If a model previously projected that hospitals would be overrun, but now shows that is no longer the case, the “change” may reflect our adherence to social distancing guidelines: hospitals won’t be overrun precisely because we have taken steps to avoid the trajectory we were on.
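To see how projections shift as new data arrives, here is a toy illustration (not any specific COVID-19 model): a simple exponential-growth fit that is re-run after more recent observations are added. The counts are invented for the example.

```python
import numpy as np

def project(daily_cases, days_ahead=14):
    """Fit log-linear (exponential) growth and project cases days_ahead out.

    A toy stand-in for a real epidemiological model, for illustration only.
    """
    days = np.arange(len(daily_cases))
    slope, intercept = np.polyfit(days, np.log(daily_cases), 1)
    future_day = len(daily_cases) + days_ahead - 1
    return np.exp(intercept + slope * future_day)

# Hypothetical counts: early rapid growth...
early = [10, 14, 20, 27, 39, 55, 76]
print(f"Projection using early data: {project(early):.0f}")

# ...followed by a slowdown after distancing measures take hold.
# Refitting on the fuller dataset produces a much lower projection.
later = early + [90, 100, 105, 108, 110, 111, 112]
print(f"Projection after new data:   {project(later):.0f}")
```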
A handy checklist to consider when trying to understand a model’s projections
A good model should meet all of the criteria below. A negative answer on one isn’t fatal, but a model that fails on multiple counts warrants extra skepticism.
Is the model’s methodology clear? Even if the details are too technical to fully grasp, there should be a clear description of how the data is used to generate projections.
Have the modelers specified the data used by the model? Check whether others have reported any issues with the dataset: perhaps it is incomplete, unreliable, or biased. (Does the data represent everyone of interest in your community? Is the sample size large enough?)
Have the modelers specified the assumptions made by their model? Perhaps a model assumes perfect adherence to social distancing across an entire state. Any assumptions in the model should be reasonable.
Have the data and code behind the model been released so others can replicate and check the modelers’ work?
Any other tips that government data teams have found helpful? We’d love to hear them: ksoo@civisanalytics.com.