A bluffer's guide to evaluating scientific results, Part 1: Systematic Bias

-February 12, 2016

Science journalism is more popular than ever, but it comes with plenty of sensationalism. Stories reporting that "everything you ever thought about xxx was wrong" generate the most page clicks, so editors love them. Plus, journalists who genuinely love science often lack the research experience or the understanding of statistical analysis necessary to guide their readers. As a result, uncertainties in research results are almost never reported. If they were, you could gauge how many grains of salt to take with the claims.

In this three-part series, I'll give you the concepts you need to distinguish strong results from weak and to understand why some results seem more conclusive than they really are. With these tools, you can estimate the uncertainties yourself and decide how much to believe the next thing you read from "I love science" or whatever your friends share on Facebook and Twitter.

The spectrum of research results: inconclusive to conclusive
Scientific results cover a spectrum from inconclusive to conclusive. They range from "weak indications of" to "evidence for" to "discovery" or "confirmation." It's not a spectrum of bad science to good. As humanity writes its book of knowledge, inconclusive results are just as important as conclusive ones ("the worst data are better than the best theory," said Antonio Ereditato); they just aren't as fascinating to casual observers. As for bad science, let's assume the goodwill of researchers and worry about fraud some other time.

Misunderstandings arise when conclusions are inflated. You see it all the time.

It's not that the claims are wrong, just that the evidence reported isn't nearly as strong as the articles indicate.

Every measurement is uncertain
No measurement is exact. Experimental precision is limited by experimental uncertainty, and a measurement quoted without its uncertainty has no meaning. If I tell you that my random survey indicates that 100% of football fans think the Raiders are going to the Super Bowl, you might ask how many people I polled, where I conducted the poll, and what question I asked (40,000, at the Oakland Coliseum, "Who’s the best?"). You might reasonably conclude that my measurement is biased.

A group of unbiased observers suitable for polling.

Experimental uncertainties fall into two categories: statistical and systematic. Statistical uncertainties come from the limited amount of data that goes into the measurement. Because we have rigorous tools for analyzing probability and statistics, statistical uncertainties are easy to calculate. We'll cover them in Part 2.

No experiment can be performed without some bias. Systematic uncertainties come primarily from unknown biases.

Comparison of systematic uncertainty and statistical uncertainty.
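
To make the distinction concrete, here is a minimal Python sketch, not taken from any real experiment. It simulates repeated readings of a quantity whose true value is known, with a fixed offset standing in for an unknown bias. The function name `simulate_measurement` and every number in it are illustrative assumptions.

```python
import random
import statistics

def simulate_measurement(true_value, bias, noise_sd, n, seed=0):
    """Return (mean, standard_error) of n noisy, biased readings.

    All inputs are made-up illustration values, not real data.
    """
    rng = random.Random(seed)
    # Each reading = truth + constant bias + random scatter.
    readings = [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]
    mean = statistics.mean(readings)
    # Standard error of the mean: the statistical uncertainty.
    std_err = statistics.stdev(readings) / n ** 0.5
    return mean, std_err

# Taking more data shrinks the statistical uncertainty,
# but the offset from the true value (the bias) stays put.
for n in (100, 10_000):
    mean, std_err = simulate_measurement(10.0, 0.5, 1.0, n)
    print(f"n={n}: mean={mean:.3f} +/- {std_err:.3f} (true value 10.0)")
```

With more readings the quoted error bar gets smaller while the mean stays about 0.5 away from the truth, which is why a small statistical uncertainty alone says little about how right a result is.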

Estimating Systematic Uncertainties
Systematic uncertainties aren't sinister. When experimenters discover their biases, they find ways to remove them by using control tests such as double-blind testing. To estimate biases that they're unaware of, researchers try to estimate how the results change if the experimental techniques are altered, and readers of popular science should do the same. By approaching a measurement from different perspectives and comparing their best results to those that they know are biased, experimenters can determine the scale of the unavoidable inherent bias. From that scale, they can estimate their systematic uncertainty. For example, time domain and frequency domain measurements of the same quantity should always agree, but never exactly; the difference measures the bias.
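
As a rough sketch of that idea, the Python snippet below takes the results of the same measurement made by different methods and uses half the spread between them as the scale of the systematic uncertainty. This is one simple convention among several, and the helper `systematic_from_methods` and the two readings are hypothetical values in arbitrary units.

```python
def systematic_from_methods(results):
    """Estimate a central value and a systematic scale from method-to-method spread.

    `results` maps a method name to its measured value. Using half the
    full spread is an assumed convention, not a universal rule.
    """
    values = list(results.values())
    central = sum(values) / len(values)
    half_spread = (max(values) - min(values)) / 2
    return central, half_spread

# Hypothetical readings of one quantity from two techniques.
readings = {"time_domain": 12.4, "frequency_domain": 12.9}
central, syst = systematic_from_methods(readings)
print(f"result = {central:.2f} +/- {syst:.2f} (systematic)")
```

The more independent approaches you can compare, the better the spread reflects the biases nobody knew to look for.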

Here's another example, a much trickier one that is unlikely to be faced by engineers, physicists, or chemists. Consider the paper published in Science reporting that readers of literary fiction are more empathic than readers of genre fiction (thrillers, mysteries, science fiction, romance, etc.). If the researchers had examined their systematic uncertainty, either they wouldn't have made the claim or Science wouldn't have published it (Science has a reputation for publishing a few sensational results each year, probably to recruit subscribers).

They could have estimated their systematic uncertainty by performing the experiment with separate but consistent definitions of “literary” and “genre.” They used excerpts from an anthology for literary fiction but could easily have used critically acclaimed but diverse works of fiction. By comparing their results under separate definitions of “literary,” they could have estimated their systematic error, and it probably would have dwarfed their statistical error.
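
To see what "dwarfed" can mean in practice, here is a small Python sketch with invented numbers; none of them come from the Science paper. It assumes the statistical and systematic errors are independent, so they can be combined in quadrature, and it expresses the measured effect in units of the total uncertainty. The function name `significance` is a hypothetical helper.

```python
import math

def significance(effect, stat_err, syst_err=0.0):
    """Effect size divided by the total uncertainty.

    Assumes independent errors, combined in quadrature:
    total = sqrt(stat_err**2 + syst_err**2).
    """
    total = math.hypot(stat_err, syst_err)
    return effect / total

# Invented example: an effect of 2.0 with a statistical error of 0.5.
print(significance(2.0, 0.5))        # statistical error only: 4.0
# The same effect if alternative definitions shifted the result by about 2.0.
print(significance(2.0, 0.5, 2.0))   # drops below 1
```

An effect that looks like four standard deviations on statistics alone can shrink to less than one once a large systematic uncertainty is included.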

The paper quotes results derived by commercial statistical analysis software that indicate a rather convincing level of statistical significance. Without the systematic uncertainty, however, their conclusions are as specious as my poll of the Raider Nation. There was, however, one claim they could have made: careful analysis of their results gives compelling, if not quite conclusive, evidence that reading fiction helps people develop empathy. Their experiment just didn’t have the precision to resolve any dependence of the level of empathy on the category of fiction.
