Bad data? Fix it, don't scrap it. Even when it's about a pandemic.

Trump now says hospitals should send patient information to a database in the Department of Health and Human Services, bypassing the Centers for Disease Control and Prevention (CDC). This will doubtless be followed by complaints about it being bad data, after which they’ll cut off access to the data, which at the CDC was publicly available.

All data is imperfect. (And yes, I’ll be sticking to the singular “data” for this post.) This does not make it right or wrong, or good or bad. Any time you hear complaints about “bad data,” especially in a political setting, keep this in mind.

For a politician, “good data” means “data that backs up my position” and “bad data” is “numbers that make us look bad.” Consider unemployment rates, which were cited widely by the Trump administration except when they made it look bad.

The pandemic numbers — cases, hospitalizations, deaths, mortality rates, and infection rates, for example — are noisy and flawed. Among other problems, they include:

Repeat testing of some of the same people.
Asymptomatic people who don’t get tested, but may have the virus.
Inaccuracies in test results.
Variations in reporting criteria among states.
Variations in what is reported as a case of COVID-19 vs. pneumonia or other conditions.
Lags in reporting of cases and deaths.
Variations in what is reported as a “probable” case.
Questions about whether COVID-19 or other health conditions caused deaths.

This is less than ideal. It’s also typical of any data collection. Data always has problems. These problems include inaccuracies, false positives and negatives, human errors, missing data, biases, imprecise definitions, and uncertainty.

Do you throw out the data? No. It is better to have a flawed set of indicators than no information at all.

What to do about flawed data

When data has problems and you know about them, you keep working to improve it — as well as to improve the ways you use it. These are all strategies that work:

Find and fix problems. If there’s a reporting issue, improve reporting systems.
Create consistent methods and definitions. For example, there should be a single definition of a death from COVID-19, or a set of definitions like “caused by COVID-19,” “COVID-19 contributing factor,” and “suspected COVID-19.” The pursuit of consistency is an endless task, but it’s always possible to improve.
Learn from others. How did other countries or states do it? Can you use their methods?
Identify sources of bias. Is something skewing the data? Can you compensate?
Smooth out noise. This is why you often see COVID-19 statistics with a 7- or 14-day moving average. Such a method dampens daily variations due to coincidences or glitches in data collection and replaces them with a steadier picture that is less subject to daily anomalies.
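To make the moving-average idea concrete, here is a minimal sketch in Python. The function name and the daily case counts are invented for illustration; they are not real figures from any agency. It computes a trailing 7-day average, one common choice among several (some dashboards use a centered window instead).

```python
from collections import deque


def trailing_moving_average(daily_counts, window=7):
    """Return the trailing moving average of a list of daily counts.

    Days that do not yet have a full window behind them get None,
    rather than an average over fewer days.
    """
    if window < 1:
        raise ValueError("window must be at least 1")

    averages = []
    recent = deque(maxlen=window)  # holds the most recent `window` counts
    running_total = 0

    for count in daily_counts:
        if len(recent) == window:
            # The oldest count is about to drop out of the window.
            running_total -= recent[0]
        recent.append(count)
        running_total += count
        if len(recent) == window:
            averages.append(running_total / window)
        else:
            averages.append(None)

    return averages


# Hypothetical daily case counts with a weekend reporting dip (made-up numbers).
daily_cases = [100, 120, 130, 125, 140, 60, 50,
               110, 130, 140, 135, 150, 70, 60]

smoothed = trailing_moving_average(daily_cases, window=7)

for day, (raw, avg) in enumerate(zip(daily_cases, smoothed), start=1):
    shown = "n/a" if avg is None else f"{avg:.1f}"
    print(f"day {day:2d}: raw={raw:3d}  7-day avg={shown}")
```

In this made-up series the raw counts drop by more than half on the two low-reporting days each week, while the 7-day average drifts gently upward. The flaw in the daily numbers is still there; the average just keeps it from hiding the trend.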

Even with flawed data, it’s possible to see trends. You can make comparisons between states with different strategies. You can observe lag times between case reports, hospitalizations, deaths, and recoveries. You can identify correlations between mask usage and infections, or treatments and deaths. Most importantly, you can see how an indicator, flawed though it might be, changes over time, to see when the news is good, and when disaster is on the way.

Hide the data and you’re flying blind, and you can do none of these things. We, the public, deserve to see the data that is available.

Epidemiologists and statisticians can address the flaws in the data and the limitations in the conclusions you can draw from it. They can tell you where the uncertainty is, and where the certainty is as well. They can help you understand what you are seeing and the possible ways to interpret it.

None of this is perfect. It’s all subject to debate. But there can be no debate and no interpretation if the data is not public, or is being manipulated to a specific political end.

So here’s to flawed data. We’ll work to make it better. But when a politician says “this is bad data” as an excuse to hide it or twist it, then you’d better recognize that you’re being hoodwinked.

4 Comments

Tim Eiler says:

July 15, 2020 at 12:25 pm

The fundamental problem is two-fold, I believe:
1) The administration is driven by branding – if the President doesn’t look good, the thing that causes such is 100% bad.
2) If “science” causes the President to look bad, see #1. Statistics and epidemiology and public health are “science.” If they make the President look bad with “facts,” again, see #1.

Phil Simon says:

July 15, 2020 at 1:01 pm

What Tim said.

Yet another disgraceful and indefensible position by this administration. I have little doubt that they’ll lie or remove what they consider outliers.

Meanwhile, thousands of people will unnecessarily die.

The GOP should be ashamed of itself.

Terry Nugent says:

July 15, 2020 at 4:24 pm

Great post! I’m a data guy and you’re spot on. It doesn’t really matter where the data is housed as long as it is handled honestly and properly.

BM says:

July 17, 2020 at 12:26 pm

Had a co-worker who came from a former Soviet-bloc country. That person worked in a government department, in a role of collecting data for a segment of the economy.

That co-worker’s department boss always made “adjustments” to the data. That boss’ boss made their “adjustments” and so on all the way to the top.

In the end, what was published was essentially fake data that had nothing to do with reality – that is, they were lies.

Transparency – making the raw data available, goes a long way to establishing trust in what we are told. Obscuring, or hiding the data will create suspicion, even if what we are told is honest and truthful (without salient omission).

The rationale to move the data doesn’t make sense (vs other alternatives), thus smells like that Soviet-bloc story – an attempt to reduce transparency, to feed a government leader’s desired story.
