A bluffers guide to evaluating scientific results, Part 3: Why Some Results are Irreproducible
In the last few years, many scientific results that society deems important have been difficult if not impossible to reproduce. The phenomenon has been called everything from a widespread problem to a crisis.
In Part 1, we discussed the important but difficult process of estimating unknown measurement biases called systematic uncertainties and how to deal with them when interpreting results. Systematic uncertainties can be estimated by comparing results of a given measurement performed with different techniques. In Part 2, we developed rules of thumb for quick and dirty estimation of statistical uncertainties and statistical significance: statistical uncertainty is roughly the square root of the number of observations, √N, and statistical significance is roughly the ratio of a signal to the square root of its background noise, NSIGNAL/√NBACKGROUND.
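As a quick refresher, the two rules of thumb from Part 2 can be written as a couple of one-line functions. This is only a sketch: the function names and the example counts (100 observations, 60 signal events over 100 background events) are made up for illustration and don't come from any real measurement.

```python
import math

def statistical_uncertainty(n_observations):
    """Rule of thumb: the uncertainty on a count of N is roughly sqrt(N)."""
    return math.sqrt(n_observations)

def significance(n_signal, n_background):
    """Rule of thumb: significance in sigma is roughly NSIGNAL / sqrt(NBACKGROUND)."""
    return n_signal / math.sqrt(n_background)

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
print(statistical_uncertainty(100))  # 10.0
print(significance(60, 100))         # 6.0
```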
In Part 3, the last part of this series (as far as you and I know, anyway), I'll show you how easy it is for the demons of random processes to wipe out otherwise convincing results. As in Part 1 and Part 2, we'll assume researchers are sincere and leave fraud for another time.
Before we get in too deep, let me disclaim. While we should maintain our skeptical scrutiny, many irreproducible results come from attempts by neuroscientists, sociologists, and psychologists to reverse engineer the brain by analyzing the behavior of fewer than 1000 people, usually fewer than fifty. They're doing something far more fraught with uncertainty than anything in the physical sciences or engineering. Plus, the instrumentation available just isn't up to the task. For example, neuroscience experiments rely on fMRI (functional magnetic resonance imaging), which has a spatial resolution of a few millimeters and a temporal resolution of seconds. The equipment attempts to measure the behavior of neurons, whose axons have diameters of a thousandth of a millimeter, as they exchange signals in dozens of milliseconds. It's like searching for a needle in a haystack with a backhoe.
Find the Signal!
The six graphics in Figure 1 show the number of times that something happens, the vertical axis, at some time or place or energy, the horizontal axis. In which of these plots has something special happened at a specific time, place, or energy?
Can you find the signal without being tricked by the noise? Or, equivalently, can you determine whether the signal you see is real or just a fluctuation of the noise? In which plots do you see nothing but noise, which show evidence for a signal, and which have enough evidence for you to say that it’s conclusive?
Figure 1. Where is the signal in each of these plots, and would you consider that signal significant evidence for the existence of something other than noise?
Let's start with a simple example. In Figure 2, the background noise barely fluctuates and the signal is quite pronounced. Count the number of signal events above the noise to get NSIGNAL; don't put any more effort into it than you would while reading "I Freaking Love Science" or the New York Times. Then count the number of background events below the signal to get NBACKGROUND. Because NSIGNAL/√NBACKGROUND = 6, it's a 6-sigma effect: the sort of signal that random processes could conspire to produce in less than one in every twenty trillion repetitions of the experiment (cf. Table 1 and Table 2 in Part 2 of this series). That's convincing but still in need of independent verification.
Figure 2: In this plot, the signal is obvious.
Now do the same thing for the six plots in Figure 1 and determine which have significant signals. Your calculations aren't likely to be the same as mine, but they're probably close enough.
Because I made the six plots with a simple simulation, I can peer under the rug and tell you that all six have the same signal and the same background. The only thing that differs is the role of random fluctuations. That is, I used the same parameters to create each plot except for the initial random number seed. I chose the six plots from 30 different runs and (I admit it!) I picked two where the signal looked drowned by background, two where it peaked over the noise, and two that looked like most of the other 24 runs.
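My own simulation isn't reproduced here, but the idea is easy to sketch. The snippet below is a stand-in, not the code behind Figure 1: it assumes a flat background of 100 events per bin, a signal of 15 extra events in each of four bins, and Poisson counting noise. Every name and number in it is an assumption chosen so that the expected NSIGNAL/√NBACKGROUND works out to 60/√400 = 3. The only thing that changes from run to run is the random number seed.

```python
import math
import random

# All parameters below are illustrative assumptions, not the article's values.
N_BINS = 40
BACKGROUND_PER_BIN = 100.0    # flat background
SIGNAL_BINS = range(18, 22)   # four-bin signal window
SIGNAL_PER_BIN = 15.0         # 60 signal events over 400 background events

def poisson(rng, mean):
    """Draw one Poisson-distributed count (Knuth's method, fine for modest means)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate(seed):
    """Return one histogram; identical physics, different random fluctuations."""
    rng = random.Random(seed)
    counts = []
    for b in range(N_BINS):
        mean = BACKGROUND_PER_BIN + (SIGNAL_PER_BIN if b in SIGNAL_BINS else 0.0)
        counts.append(poisson(rng, mean))
    return counts

def estimated_significance(counts):
    """Apply the rule of thumb, estimating background from bins outside the window."""
    sideband = [c for b, c in enumerate(counts) if b not in SIGNAL_BINS]
    n_background = sum(sideband) / len(sideband) * len(SIGNAL_BINS)
    n_signal = sum(counts[b] for b in SIGNAL_BINS) - n_background
    return n_signal / math.sqrt(n_background)

# Six runs that differ only in the seed.
for seed in range(6):
    print(seed, round(estimated_significance(simulate(seed)), 2))
```

Plot the output of simulate() for a handful of seeds and you'll get your own version of the six panels: same signal, same background, different luck.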
A perfect measurement of this system would show a 3-σ effect, which means that noise alone would mimic a signal this large less than 0.3% of the time. Our estimates differ over the six plots, not because the rule of thumb is inadequate, but because random fluctuations shift things around in ways that, to our pattern-predicting brains, don't look random. In other words, our results might be irreproducible, not because of any systematic bias, but because this is how people interact with nature.
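To get a feel for how far the estimate can wander, here's a rough sketch that repeats the measurement many times. It assumes hypothetical true values of 60 signal events on top of 400 background events (a true 3-σ effect), treats the background level as known, and approximates the counting noise as Gaussian. None of those numbers come from the plots above.

```python
import math
import random

# Hypothetical true values: 60 / sqrt(400) = 3 sigma.
TRUE_SIGNAL = 60.0
TRUE_BACKGROUND = 400.0

def observed_significance(rng):
    """One repetition: fluctuate the total count, then apply the rule of thumb."""
    expected_total = TRUE_SIGNAL + TRUE_BACKGROUND
    # Gaussian approximation to Poisson noise on the total count in the signal window.
    observed_total = rng.gauss(expected_total, math.sqrt(expected_total))
    return (observed_total - TRUE_BACKGROUND) / math.sqrt(TRUE_BACKGROUND)

def spread(n_runs, seed=0):
    """Fraction of repetitions that look weaker than 2 sigma or stronger than 4 sigma."""
    rng = random.Random(seed)
    values = [observed_significance(rng) for _ in range(n_runs)]
    below_2 = sum(v < 2.0 for v in values) / n_runs
    above_4 = sum(v > 4.0 for v in values) / n_runs
    return below_2, above_4

below_2, above_4 = spread(20000)
print(f"looks weaker than 2 sigma:   {below_2:.1%}")
print(f"looks stronger than 4 sigma: {above_4:.1%}")
```

Under these assumptions the estimate scatters by about one sigma around the true value, so a real 3-σ effect can easily pass for a shrug or a near-discovery in any single run.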
On the other hand, the "crisis" of irreproducible results (at least those not caused by fraud) could be averted if (a) experimenters always quoted both systematic and statistical uncertainties, and (b) journalists reported both uncertainties in a way that advised the reader where the results belong on the scale from inconclusive (evidence for) to conclusive (discovery of). But that would de-sensationalize the results and reduce the click-bait, so editors won't allow it. Except here at EDN?
Generally, when reading science journalism, it's useful to keep in mind the immortal words of Miles Dylan, from his book, Everything: "There's more to it than that."
- A bluffers guide to evaluating scientific results, Part 1: Systematic Bias
- A bluffers guide to evaluating scientific results, Part 2: Rules of Thumb
- Test and Diagnoses Strategy Metrics: A New Perspective, Part I, which looks at how engineers interpret test results as compared to doctors