P-Values Part 2: The Basic Statistics
Statistics Terminology: Hypotheses
Following my previous blog post on why science needs statistics, here I will introduce a more technical definition of a p-value. But first, I need to introduce a couple of terms. In statistics, we talk about hypotheses and models all the time. Perhaps surprisingly, these two words are used almost interchangeably. Both hypotheses and models are quantitative ways of describing how something about the world might work. Hypotheses typically come in pairs -- the null and the alternative -- and are formally set up in order to conduct a statistical test and produce a p-value. In contrast, the term model can refer to just about anything: a hypothesis, a probability distribution, an equation, or a set of equations. For our purposes, however, the term model will only refer to the null hypothesis. You'll see why.
In any Introduction to Statistics class, students learn to set up null and alternative hypotheses, often with little idea why. In the case of our hairy relatives, the null hypothesis (Ho) would be that the average heights of the two genders are equal, and the alternative hypothesis would be that they are not. Typically, the alternative hypothesis is the pattern or effect you're looking to test for. It will be your headline -- if you find it. The null hypothesis is simply the absence of that pattern. It's the baseline. The null hypothesis is the basis upon which the statistical test is performed. So null hypothesis = Ho = "the model" = "the baseline scenario."
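Written out symbolically for the height example, with mu standing for each group's true average height, the pair looks like this:

$$H_0: \mu_{\text{male}} = \mu_{\text{female}} \qquad\qquad H_a: \mu_{\text{male}} \neq \mu_{\text{female}}$$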
Hypothesis Testing
The fundamental principle of any statistical test is this: you collect some data and compare it to your null hypothesis. If it looks similar to the data the null hypothesis produces, then you have no reason to think that anything is going on beyond the null hypothesis. This is called "failing to reject" Ho. If the data you collected looks very different from what Ho produces, it leads you to reject Ho. The metric of how similar your data is to the null hypothesis is the p-value.
Students also learn that when you reject the null hypothesis, that does not mean that you accept the alternative hypothesis. This critical point depends on the fact that the statistical test is conducted based only on the null hypothesis. The alternative hypothesis is not involved; honestly, it's mostly there for show. It's just a statement: "the average height is different between males and females." How different? We don't specify, and therefore we cannot model it. If we cannot model it, it can't be used for a statistical test, because it can't produce any data for comparison. We can, however, model the null hypothesis, because statistically we know what it would look like if the average heights are equal. That is a single, well-defined scenario. We can therefore test our actual measurements against that particular -- albeit theoretical -- baseline scenario.
So, to review: First, we set up a null and alternative hypothesis. The null hypothesis (Ho) is what we know already, with all other effects set to zero. The alternative hypothesis is what we want to show. Ho provides a model that predicts expected baseline measurements. Separately, we make actual measurements, and then compare them to those predicted by the baseline model. If the two sets of measurements are similar, we fail to reject Ho based on what we measured, and seek out comfort food. If they are different, we reject Ho in favor of the more interesting alternative hypothesis, and hope the tenure committee will notice our outstanding work. We make this comparison -- between the baseline model and the real world measurements -- by looking at the p-value.
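To make that workflow concrete, here is a minimal sketch in Python. The height measurements are made up, and the post doesn't commit to a particular test; a two-sample t-test is just one common choice for comparing two group means.

```python
# A minimal sketch of the workflow above, using made-up height data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical measurements (cm); in a real study these would be collected data.
male_heights = rng.normal(loc=175, scale=7, size=30)
female_heights = rng.normal(loc=165, scale=7, size=30)

# Ho: the two population means are equal.  Ha: they are not.
result = stats.ttest_ind(male_heights, female_heights)
p_value = result.pvalue

alpha = 0.05  # significance level, chosen before looking at the data
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject Ho")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject Ho")
```

Note that alpha is chosen before looking at the data -- a point that comes up again below.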
P-Values: A Good-Enough Definition
Okay, here we go.
Roughly speaking, a p-value is the rate at which the baseline scenario (Ho) produces results like those you observed.
That's not so bad, right?
A high p-value means that Ho often produces results like those observed, and therefore there's little reason to think the baseline model is wrong. In contrast, a low p-value means that the baseline model doesn't often produce results like those observed -- those unusual observations make you think the model might be wrong. Maybe there's something more going on! Yay, science!
The key is that a p-value does not tell you the probability that a model (or hypothesis) is right or wrong. Rather, it only tells you how often the null hypothesis would produce results like those you observed. From there, it is up to a person to interpret the p-value, in context.
If, after running an experiment and analyzing the data, you get a p-value of 0.1, that means that 10% of the time, the baseline model would produce results like those you observed. So is that model wrong, since what you actually measured would only be an outcome of the model 10% of the time? Maybe. Or maybe the model is right and this is just one of the 10% of cases. There's no way to know. But what you can say is that there's much more reason to doubt the baseline model if the p-value is 0.001 than if it's 0.1. The finding at 0.001 is much more robust against noise, because the model of baseline noise would produce that result only 1 in 1,000 times. Making those kinds of distinctions, as we did for the hairy height example, is the primary purpose of p-values. P-values are only really useful as a measure of relative robustness.
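If the rough definition above still feels abstract, here is a small simulation sketch that computes a p-value directly as the rate at which the baseline scenario produces results at least as extreme as the one observed. All of the numbers (the observed difference, group size, and spread) are invented purely for illustration.

```python
# Simulation sketch: the p-value as the fraction of baseline (Ho) runs
# that produce a result at least as extreme as the one actually observed.
import numpy as np

rng = np.random.default_rng(0)

observed_difference = 4.0   # hypothetical observed difference in mean height (cm)
n_per_group = 30
noise_sd = 7.0              # assumed spread of individual heights (cm)

# Under Ho the two groups share the same mean, so any observed difference
# is just sampling noise.  Simulate that baseline scenario many times.
n_simulations = 100_000
baseline_diffs = (
    rng.normal(0, noise_sd, (n_simulations, n_per_group)).mean(axis=1)
    - rng.normal(0, noise_sd, (n_simulations, n_per_group)).mean(axis=1)
)

# Fraction of baseline runs at least as extreme as what we observed
# (two-sided, since the alternative is just "the means differ").
p_value = np.mean(np.abs(baseline_diffs) >= observed_difference)
print(f"simulated p-value ~ {p_value:.4f}")
```

Shrink the observed difference and the p-value climbs, because the baseline scenario reproduces a smaller difference more often; that is exactly the comparison described above.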
Evaluating a P-Value
Let's return to particle physics. Recall that the criterion for "evidence of a particle" is 0.003. This criterion, widely denoted by the Greek letter alpha and also known as the significance level, can in principle be set at any value between 0 and 1, though in practice it is always well below 0.5 and typically below 0.1. Once the significance level is set, we calculate the p-value from the experimental data and then compare those two numbers. If the p-value is > alpha, we fail to reject Ho; if the p-value is < alpha, we reject Ho. Therefore, the lower the alpha, the stricter the standard for how small a p-value needs to be in order to reject the baseline model. Remember that when we reject the baseline model, it implies that something other than the baseline is happening, like the presence of subatomic particles we didn't expect to produce.
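As a tiny illustration of that decision rule, here is what the comparison looks like in code, using the two particle-physics thresholds mentioned in this post. The p-values fed in are placeholders, not real experimental results.

```python
# The decision rule: compare a p-value to a pre-chosen significance level.
ALPHA_EVIDENCE = 0.003        # threshold for "evidence of a particle"
ALPHA_DISCOVERY = 0.0000003   # threshold for an official discovery

def evaluate(p_value: float, alpha: float) -> str:
    """Reject Ho only if the p-value falls below the significance level."""
    return "reject Ho" if p_value < alpha else "fail to reject Ho"

print(evaluate(0.001, ALPHA_EVIDENCE))    # reject Ho (meets the evidence bar)
print(evaluate(0.001, ALPHA_DISCOVERY))   # fail to reject Ho (not a discovery)
```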
It really frustrated me in my Introduction to Statistics class that this criterion, alpha, against which we evaluate the p-value, is set almost arbitrarily. It makes the whole exercise -- designing the experiment, collecting data, comparing the null hypothesis to the observed data, and either rejecting or failing to reject Ho -- seem meaningless if you can set whatever you want as the ultimate criterion. Many researchers feel similarly, and as a result much of the scientific community has converged on 0.05 as a default significance level: not because it's the most appropriate value in most cases, but just because it's a number. As a result, we have now reached a point where almost any study that produces a p-value < 0.05 can be called a scientific discovery, and any study that doesn't produce such a p-value goes unpublished or overlooked. This is quite problematic, for four reasons:
Overlooked findings: Many studies that do not produce a p-value < 0.05 still deserve serious attention due to a number of critical but often-forgotten factors, which I attempt to describe in my second post.
Negative studies: Studies that show no significant result typically don't make a big impact, but they nevertheless ought to be widely available as examples of ideas that didn't pan out.
Deliberate p-hacking: Sadly but realistically, some studies are carefully engineered to produce a p-value < 0.05, but nevertheless are not meaningful and therefore risk leading researchers down dead ends.
Accidental p-hacking: Even highly ethical researchers dedicate considerable time (perhaps subconsciously) to designing and tweaking analyses such that they will produce a p-value < 0.05, rather than to the most intellectually honest investigation possible.
Learning from our Peers
In some fields, like particle physics, researchers do a somewhat better job. Based on the kinds of experiments they conduct and the kinds of questions they're trying to answer, that community set a rational threshold of 0.003 for evidence of a particle. However, in order for a new particle to be recognized as an official discovery, like the Higgs boson, the bar is much higher: the p-value must be below 0.0000003, or about 1 in 3.5 million. That means the baseline model -- in this case the Standard Model of particle physics -- would produce the results measured at CERN only about once in every 3.5 million tries. That made physicists pretty confident that there was something more than just the Standard Model going on. That something more was the Higgs boson. And that number, 0.0000003, was a p-value put to good use.
The lesson many researchers should learn from physicists is how to set appropriate values of alpha. If physicists want to show evidence of a particle, that finding should have to meet a fairly strict criterion; thus, alpha = 0.003. However, if physicists want to officially announce that a new particle has been discovered, that should require an even stricter criterion. The Standard Model doesn't change lightly (get it?!), thus alpha is set to 0.0000003 in this case. That decision -- made long before the actual experiments were conducted -- was based not only on knowledge of what kinds of p-values particle accelerators can realistically produce, but also on the sociotechnical history of particle physics and how false discoveries have managed to slip through before.
So setting an appropriate level for alpha is difficult, but important. Another important point is that p-values are not the be-all and end-all of scientific robustness. There are several other statistical calculations and methodological considerations that should play a role as well. I discuss these and other more advanced topics in subsequent blog posts.