P-Values Part 4: Understanding Errors
Reintroduction
Through a series of blog posts (part 1, part 2, part 3), I've made a perhaps futile attempt to explain p-values. Not to define them -- I think that's largely an objective without a purpose -- but to explain them. I'll assume you (mostly) understand the previous post for the purposes of this more advanced discussion. While the previous post was largely intended for those who just sometimes encounter p-values, this post is aimed more at those who use them.
First, let me summarize a few key points from earlier:
A p-value is a way of communicating the robustness of a scientific finding.
The lower the p-value, the lower the risk that the scientific finding is a false positive.
The p-value is not, however, the probability that the finding is a false positive.
The null hypothesis (Ho) is a model of the baseline scenario.
Interesting potential findings are set to zero and only currently known or expected variation is included in Ho.
Observed data (from an experiment) is compared to expected baseline data (from Ho) at a predefined level of stringency.
Specifically, a p-value is calculated and checked against alpha (the level of stringency).
If p > alpha, you fail to reject Ho. This suggests your data is just a result of expected variation, or noise.
If p < alpha, you reject Ho. This suggests your data is the result of some new effect or pattern. (This decision rule is sketched in code just after this summary.)
P-values are not the be-all and end-all of scientific robustness.
The "good-enough" definition of a p-value we've been working with is this: the rate at which the baseline scenario (Ho) produces results like those you observed.
Errors and Rates
Much of the confusion around p-values stems from more fundamental confusion about the concept of a false positive. In particular, we often gloss over the distinction between a false positive event and the rates or probabilities that involve false positive events. It's overwhelming, at times, how many terms there are that have to do with false positives and/or false negatives:
Type I Error
Type II Error
Accuracy
Significance
Power
False Discovery Rate
False Omission Rate
Positive Predictive Value
Negative Predictive Value
False Positive Rate
False Negative Rate
Sensitivity
Specificity
Likelihood Ratios
These are all different quantities. I often rely on this table to keep it all straight for myself. The terms Type I and Type II error have very specific definitions for hypothesis testing, as described above. The terms false positive and false negative are loosely associated with Type I and Type II errors, respectively, but they are also used widely outside of the context of hypothesis testing. Broadly speaking, a false positive event occurs when you conclude something is true when in fact it is not (p-values are related to this problem). A false negative event occurs when you conclude something is not true, when in fact it is. This terminology is incredibly common in the medical diagnostics literature. That list of quantities above represents calculations that can be performed in a variety of contexts based on the occurrence of or concern about false positive and negative events.
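As a rough illustration of how many of these terms are just different ratios of the same four counts (true/false positives and true/false negatives), here's a short sketch in Python. The counts themselves are made up, in the spirit of a diagnostic-test example.

```python
# A sketch showing how many of the terms above are different ratios of the
# same four counts. The counts here are made up for illustration.
TP, FP = 40, 10    # true and false positives (e.g., test says "disease")
FN, TN = 5, 945    # false and true negatives (e.g., test says "healthy")

sensitivity = TP / (TP + FN)            # true positive rate (a.k.a. recall)
specificity = TN / (TN + FP)            # true negative rate
false_positive_rate = FP / (FP + TN)    # denominator: all actual negatives
false_negative_rate = FN / (FN + TP)    # denominator: all actual positives
false_discovery_rate = FP / (FP + TP)   # denominator: all positive calls
false_omission_rate = FN / (FN + TN)    # denominator: all negative calls
ppv = TP / (TP + FP)                    # positive predictive value = 1 - FDR
npv = TN / (TN + FN)                    # negative predictive value = 1 - FOR
accuracy = (TP + TN) / (TP + FP + FN + TN)

print(f"False Positive Rate:  {false_positive_rate:.1%}")   # about 1% here
print(f"False Discovery Rate: {false_discovery_rate:.1%}")  # 20% here
```

Notice that with these made-up counts the False Positive Rate and the False Discovery Rate are wildly different numbers, even though both have the same numerator. That difference is the subject of the next section.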
Formulating a Rate
The most important question to ask whenever you look at a probability or rate such as these is: what exactly is in the numerator, and what exactly is in the denominator? As simple as it sounds, most misinterpretations of p-values come down to not sufficiently wrestling with that question. The numerator is typically easy: it's the main thing you're counting or measuring. The more of it there is, the higher your rate or probability. The rate of rainy days would have the number of rainy days in the numerator. The False Positive Rate has the number of false positive events in the numerator, and it increases as the number of false positives increases. That's also true, however, of the False Discovery Rate -- uh-oh!
The False Positive Rate and False Discovery Rate are critically different quantities with critically different meanings -- and it all comes down to the denominator. The denominator is your assumed scenario. It's the population in which you are looking for those events that you count in the numerator. The rate of rainy days per year could be calculated as the number of rainy days in 2015 divided by the number of calendar days in 2015. 85 rainy days / 365 days means 1 rainy day per 4.29 days, or a 23.3% chance of rain on any given day. But when you count false positive events instead of simply rainy days, picking a denominator is trickier -- there are two options to choose from.
Using Errors and Rates
Imagine that you are the meteorologist for your local news station and you must each day announce whether it's going to rain tomorrow. Each day you make a prediction (rain or no rain) and then you find out if you were right or wrong. You aren't always right, and it's costing your station viewers! Some days you say it will rain and then it doesn't (a false positive), and other days you say it'll be dry and then it rains (a false negative). In the last year on the job, you had 12 false negative days and 27 false positive days. The station is more worried about the 27 false positives. They want to know how bad that is. Well, there are (at least) two ways to answer that question: with a False Positive Rate, or a False Discovery Rate.
The key difference between them is the denominator, the population from which you are going to count false positive events. Do you care about the rate at which you make false predictions of rain out of all the days that have no rain? Or are you interested in how often you make false predictions of rain, out of all your predictions of rainy days? Maybe you pick one of the two as more important, or maybe you answer "I don't know," "Both?" or "Huh?" You could look at it from the viewer's perspective: do they want to know a) given that it's not going to rain, how likely are you to falsely call for rain, or b) given that you made a prediction of rain, how likely is it to actually rain? This may again sound like a distinction without a difference, but it certainly is not. Those two probabilities can be quite different.
I would posit that from the viewer's perspective, they'd be able to make better intuitive use of b): given that you made a prediction of rain, will it actually rain? Because that's what they actually get, your prediction, and then they have to interpret it and make decisions accordingly. Most people don't think in terms of "given that this hypothesis is true, what's the probability that it would then lead to a certain conclusion," and they certainly don't readily know how to interpret and make decisions based on that sort of probability. Unfortunately, that's exactly the kind of probability that a p-value is.
Wrapping up the example, you would calculate the False Discovery Rate (FDR) as the number of false rainy day predictions divided by the total number of rainy day predictions. Given 85 actually rainy days, 12 false negative days, and 27 false positive days, you made 73 correct rain predictions (85 - 12) plus 27 false ones, for 100 rain predictions in total, so FDR = 27/100 = 27%. So given that you predict rain, there's a 27% chance it won't rain. That's a useful piece of information. You could also calculate the False Positive Rate = 27/280 = 9.6%, where 280 is the number of days with no rain (365 - 85). See, it's different! Quite different. And not as useful.
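Here's the same arithmetic as a short sketch, using the counts from the example above:

```python
# The meteorologist's year, using the counts from the example above.
rainy_days = 85
days_in_year = 365
false_negatives = 12       # said "dry," then it rained
false_positives = 27       # said "rain," then it stayed dry

correct_rain_calls = rainy_days - false_negatives           # 73
rain_predictions = correct_rain_calls + false_positives     # 100
dry_days = days_in_year - rainy_days                        # 280

false_discovery_rate = false_positives / rain_predictions   # 27 / 100
false_positive_rate = false_positives / dry_days            # 27 / 280

print(f"False Discovery Rate: {false_discovery_rate:.1%}")  # 27.0%
print(f"False Positive Rate:  {false_positive_rate:.1%}")   # 9.6%
```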
Available Statistics
So this raises the question: why do researchers use p-values, which are analogous to the non-intuitive False Positive Rate, at all? Why don't they use something analogous to the False Discovery Rate that would tell them the probability of the truth, given what they observed? It helped you as a meteorologist explain your science to viewers; why can't it help researchers? Researchers would use p-values this way if they could. In fact they do, mistakenly. That's the whole reason I'm writing these blog posts.
The distinction between the meteorologist and the researchers I've described is their philosophy: the researchers are Frequentists, but the meteorologist is a Bayesian. What we did to estimate that 27% FDR was a Bayesian calculation. We estimated a probability based only on the data available. See, Bayesian statistics can be pretty helpful, huh?
Frequentists cannot ever produce a probability analogous to a False Discovery Rate. This is because of the denominator required. Surprise, denominators are important! Recall that the denominator of the FDR is all the positive predictions. This is fine for the meteorologist, because he is making predictions every day, and a bunch of them are positive, so he has data on which to perform a Bayesian analysis.
The researcher, however, only conducts her experiment once. So she can't organize her data that way. Instead, her frequentist philosophy, based on universal long-term truths, suggests a different baseline scenario. The Frequentist's denominator when analyzing the risk of a false positive is the null hypothesis. Whoa, we're coming full circle here. The p-value is the probability of obtaining results like those you got, given that the null hypothesis is true. It's the non-intuitive way to measure the risk of a false positive, but it's the only option available to the Frequentist.
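To see what that denominator looks like in practice, here's a minimal simulation sketch in Python: generate data under Ho many times and count how often the baseline scenario produces a result at least as extreme as the one observed. The observed mean, sample size, and null model below are all invented for illustration.

```python
# A sketch of the Frequentist's denominator: simulate the null hypothesis many
# times and count how often it produces a result at least as extreme as the
# one observed. The observed mean, sample size, and null model are invented.
import numpy as np

rng = np.random.default_rng(1)
observed_mean = 0.35           # the result of the researcher's single experiment
n, n_simulations = 30, 100_000

# Ho: no effect -- samples of size n drawn from a standard normal (mean 0)
null_means = rng.normal(loc=0.0, scale=1.0, size=(n_simulations, n)).mean(axis=1)

# The p-value: the rate at which Ho produces results like those observed
p_value = np.mean(np.abs(null_means) >= abs(observed_mean))
print(f"Estimated p-value under Ho: {p_value:.4f}")
```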
A Family of Methods
It's important to recognize that researchers need not follow one of these philosophies arbitrarily or absolutely. Rather, the two approaches are complementary. Bayesian statistics are much more appropriate for the engineer measuring the same phenomenon over and over again, trying to do increasingly well at predicting it. Frequentist statistics are for the researcher who has a single hypothesis about a universal pattern or principle that they would like to test. Bayesians try to predict probabilistic outcomes, while Frequentists try to explain why those outcomes happened. Of course, effects and their causes are intimately tied, so research about either benefits the other.
Finally, recall all the other rates and probabilities on that list. I'm not going to go through them all (you're welcome), but I want to point out that correctly identifying the denominator is a common problem people encounter across that list. Numerators are certainly important too, but they're typically easier to grasp. Across disciplines, needs and preferences vary, so plenty of researchers and others do use the terms I'm skipping over here -- such as Power, Predictive Values, Sensitivity/Specificity, and Likelihood Ratios -- to describe the risk of errors in various unique but complementary ways.
Up next, I discuss common statistical mistakes and how to resolve them.