By John McLaughlin
After completing an experiment, most of us dutifully perform statistical tests to determine whether our results are “significant.” These tests largely determine whether experimental findings are considered robust, interesting, and publishable. P-values are commonly used to report statistical significance in the biology literature, but biologists have been chastised in recent years for misunderstanding and misusing this statistic. Underscoring this problem, a recent paper in PLOS Biology surveyed the scientific literature and found widespread evidence of “p-hacking”: adjusting the analysis, for example by changing the sample size or removing outlier data points, for the sole purpose of obtaining statistically significant p-values.
What is the precise definition of the p-value, as it is most commonly used in biological research? It is important to note that there are several different interpretations of the concept of “probability”, perhaps the two most notable belonging to the Bayesian and frequentist schools of statistics. According to the Bayesian approach (named for the 18th-century mathematician Thomas Bayes), probability is best thought of as a degree of belief in a particular outcome, given our prior knowledge of the situation in addition to newly acquired data. To give a commonplace example: when searching for a lost set of keys in your home, you will want to estimate the probability that they are in a given location, most likely by remembering previous occasions on which the keys were lost and where they were recovered. This “prior” knowledge will factor heavily into your probability estimate. You can then incorporate new data to update the estimate, for example once you know with certainty that the keys are not in one of these locations. The Bayesian interpretation of probability accords more closely with our common, everyday usage of the term.
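The lost-keys example can be sketched in a few lines of Python. This is only an illustration: the locations and prior probabilities are made up, and the function name `rule_out` is hypothetical. It covers the simplest kind of update described above, in which one location is ruled out with certainty and the remaining beliefs are rescaled so that they again sum to 1.

```python
def rule_out(prior, location):
    """Update beliefs after learning the keys are definitely not at `location`."""
    # Keep the prior weight of every location that is still possible.
    remaining = {place: p for place, p in prior.items() if place != location}
    total = sum(remaining.values())
    # Rescale so the surviving probabilities sum to 1 again.
    posterior = {place: p / total for place, p in remaining.items()}
    posterior[location] = 0.0
    return posterior


# Hypothetical prior beliefs, based on where the keys turned up in the past.
prior = {"kitchen counter": 0.5, "coat pocket": 0.3, "desk": 0.2}

# New data: the kitchen counter has been searched thoroughly.
posterior = rule_out(prior, "kitchen counter")
for place, p in posterior.items():
    print(f"{place}: {p:.2f}")
```

With these invented numbers, the belief in the coat pocket rises from 0.3 to about 0.6 and the belief in the desk from 0.2 to about 0.4, because the two remaining locations keep their relative weights.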
However, the understanding of probability that dominates in the biological sciences is known as frequentism; most p-value statistics in biological research are computed using this school’s methods. According to frequentist statistics, the probability of a given event is simply the long-run frequency with which it occurs. To give a simple example: if a coin is flipped 100 times and lands “heads” on 58 flips, the estimated probability of the coin’s landing heads is 0.58. If the coin is fair, then as the number of flips approaches infinity, the observed frequency of heads will approach the “true” probability of 0.5. Frequentism is based on the notion that repeated randomized trials, or experiments, will in the long run approximate the true probability of an event.
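A short simulation makes the long-run idea concrete. The sketch below assumes a fair coin and uses Python's built-in pseudorandom number generator; the function name `heads_frequency` and the chosen flip counts are illustrative only.

```python
import random


def heads_frequency(n_flips, p_heads=0.5, seed=0):
    """Simulate n_flips coin flips and return the observed fraction of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < p_heads for _ in range(n_flips))
    return heads / n_flips


# The observed frequency should settle toward 0.5 as the number of flips grows.
for n in (100, 10_000, 200_000):
    print(n, heads_frequency(n))
```

With only 100 flips, a result such as 0.58 is not unusual; with hundreds of thousands of flips, the observed frequency should sit very close to 0.5.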
When running an experiment in the lab, a biologist may want to know the probability of her hypothesis being true, given the experimental data she observes. A p-value calculated using a standard t-test, however, tells her something closer to the reverse: how probable data like hers would be if the null hypothesis were true. A common experimental “null hypothesis” is a statement of no relationship between the variables under observation (e.g. the means of the two populations being sampled are equal). The p-value is therefore the probability of observing the experimental data, or a data set more extreme, when assuming that this null hypothesis is correct; a lower p-value makes a stronger case for rejecting this null hypothesis.
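The “data or a data set more extreme” definition can be worked through by hand for the coin example of 58 heads in 100 flips. The sketch below is not a t-test: it is an exact two-sided binomial calculation, assuming the null hypothesis is a fair coin, and the function name `binomial_two_sided_p` is made up for this illustration.

```python
from math import comb


def binomial_two_sided_p(heads, flips):
    """Exact two-sided p-value for `heads` in `flips`, under a fair-coin null."""
    expected = flips / 2
    observed_distance = abs(heads - expected)
    # Count every outcome at least as far from the expected value as the
    # observed one; under a fair coin, all 2**flips sequences are equally likely.
    extreme_sequences = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_distance
    )
    return extreme_sequences / 2 ** flips


print(binomial_two_sided_p(58, 100))
```

For 58 heads in 100 flips this comes out to roughly 0.13: a fair coin would produce a result at least this lopsided about 13% of the time, so by the conventional 0.05 threshold the data give little reason to reject the null hypothesis.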
There are a few things that the p-value statistic definitely does not tell a scientist. First, do experimental results with a low p-value tell a scientist that her hypothesis is correct? No. Rejecting the statistical null hypothesis is not equivalent to accepting her particular biological hypothesis. Is the p-value the probability that the null hypothesis is correct? Again, no: the p-value is calculated by assuming that the null hypothesis is true, so it cannot also measure how likely that assumption is. Biologists and statisticians use the term “hypothesis” very differently. When the statistician and evolutionary biologist Ronald Fisher popularized use of the p-value in the 1920s, it was never intended as a metric for confirming or refuting biological hypotheses. It was meant to be a general heuristic for judging whether a data set might warrant a second look or follow-up experiments; the p-value itself does not decisively settle any experimental questions.
What should researchers do to avoid p-hacking? One recent paper on this topic recommends choosing the experimental sample sizes in advance, detailing the removal of any outlier data points, and allowing other researchers access to the raw data. P-value statistics can be useful when employed properly, but they are not the whole story. As scientists face continued pressure to report “significant” findings and publish in high-tier journals, understanding procedures for proper data interpretation will be increasingly important. Hopefully, the trend towards open access publication will encourage greater transparency and scrutiny of experimental data reporting, along with a better understanding of p-value statistics and their applications.
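To see why fixing the sample size in advance matters, consider a rough simulation of one form of p-hacking: checking the p-value after every batch of data and stopping as soon as it dips below 0.05. Everything here is an assumption made for illustration. The data are drawn from a normal distribution with mean 0 and standard deviation 1, so the null hypothesis is true in every simulated experiment, and a simple z-test with known standard deviation stands in for whatever test a real study would use. The function names, batch size, and number of experiments are hypothetical.

```python
import math
import random


def z_test_p(values):
    """Two-sided p-value for 'mean is 0', assuming a known standard deviation of 1."""
    z = sum(values) / math.sqrt(len(values))
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(peek, n_experiments=2000, max_n=100, step=10,
                        alpha=0.05, seed=1):
    """Fraction of null experiments declared significant at level alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        data = []
        significant = False
        while len(data) < max_n:
            data.extend(rng.gauss(0, 1) for _ in range(step))
            # p-hacking variant: test after every batch and stop at the first "hit".
            if peek and z_test_p(data) < alpha:
                significant = True
                break
        if not peek:
            # Honest variant: test once, at the sample size chosen in advance.
            significant = z_test_p(data) < alpha
        hits += significant
    return hits / n_experiments


print("fixed sample size:", false_positive_rate(peek=False))
print("peek and stop early:", false_positive_rate(peek=True))
```

Under these assumptions, the fixed-sample-size rate should land near the nominal 5%, while the peek-and-stop rate should come out noticeably higher, even though no real effect exists in any simulated experiment.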