set.seed(1917)
In Null Hypothesis Significance Testing, a \(p\)-value is the probability, under \(H_0\), of observing a value of the test statistic \(t(x)\) as or more extreme than the value you actually observed.
Rough intuition for NHST:
The idea of NHST boils down to the following aphorism: if the data doesn’t fit the hypothesis well, then the hypothesis doesn’t fit the data well.
If the null hypothesis is a good one, then the data that we observe should be pretty common under the null hypothesis: we should expect higher-probability events to occur (by definition!).
If we observe events that are not very likely (i.e. extreme) under the null hypothesis, then we should start to view the null hypothesis with skepticism. Of course, such events are possible, but they shouldn’t be common, i.e. we shouldn’t expect them. We won’t simply throw away our null hypothesis because the data doesn’t perfectly fit our expectations under that hypothesis.
However, we will decide at some point to “reject” the null hypothesis. If the data we observe is too extreme, then it’s clear to us that the null hypothesis is not a good model of the world, so we can safely conclude that it’s not true. Of course, we haven’t disproven it, and we might be wrong to say it’s not true, but we’re comfortable taking that chance.
Example: coin-flipping
Suppose we have a coin. We want to test whether the coin is fair. We flip it a bunch of times - say, 100 - and record the outcomes.
We set up the following null and alternative hypotheses for the binomial parameter \(p\).
\[ \begin{aligned} H_0: p = 0.5\\ H_1: p \neq 0.5\\ \end{aligned} \]
So, we’ll observe how many heads we get, and use that number to determine what to do with our hypotheses: in particular, whether to “reject” \(H_0\).
Of course, we should expect to see 50 heads and 50 tails if the coin is fair, i.e. the null hypothesis is true. But what if we see 49 heads? Do we conclude that the coin is unfair? Do we conclude that the true probability of heads must be, say, 0.49?
Obviously not: if the coin really is fair, then getting 49 heads is almost as likely as getting 50 heads. This could easily happen due to random chance if we have a fair coin.
# simulate 1000 repetitions of the 100-toss experiment under H_0
n_exp = 1000
x_mc = rbinom(n = n_exp, size = 100, prob = 0.5)
hist(x_mc)
Above, I’ve repeated our experiment 1000 times and plotted the distribution. There were exactly 50 heads 76 times and exactly 49 heads 83 times.
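As a sketch of where tallies like those come from, here is how the counts can be read off the simulated draws. I use a seed of my own choosing here, so the exact numbers will differ slightly from the ones quoted above:

```r
# Re-run the simulation; a different seed means slightly different tallies
set.seed(1)
x_mc = rbinom(n = 1000, size = 100, prob = 0.5)

# Tally how often particular head counts occurred across the 1000 experiments
sum(x_mc == 50)  # experiments with exactly 50 heads
sum(x_mc == 49)  # experiments with exactly 49 heads
table(x_mc)      # full frequency table of head counts
```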
p_val_49 = 2 * pbinom(49, 100, 0.5)
p_val_49
[1] 0.9204108
In the language of NHST, the \(p\)-value is 0.9204. That is, if \(H_0\) is true, we would observe data as or more extreme than this about 92.04% of the time; equivalently, we would observe less extreme data only about 7.96% of the time. Should we reject \(H_0\)?
It may be more intuitive to focus on \(1 -\) the \(p\)-value: the probability of the events that are more likely than the one you actually observed. Rejecting \(H_0\) here would be tantamount to saying we only keep \(H_0\) when we see data it should produce just 7.96% of the time, which doesn’t seem very reasonable to me: in the best-case scenario, when \(H_0\) is actually true, we would be rejecting it 92.04% of the time. Observing 49 heads isn’t exactly compelling evidence against the null hypothesis in my book.
x_seq = 1:100
plot(dbinom(x_seq, 100, 0.5), type = "l")
abline(v = c(49, 51), col = "orange")
title(main = "Should we reject the null if the data is outside the orange lines?")
What about 48, though? 47? 46? There is some point at which we will say “enough is enough”: that is, our observations will be extreme enough to reject the null hypothesis.
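To get a feel for where that cutoff might land, we can tabulate the two-sided \(p\)-value for a range of observed head counts. (I cap the doubled tail probability at 1, since doubling overshoots at exactly 50 heads; this is a sketch, not part of the original analysis.)

```r
# Two-sided p-values for observed head counts from 40 up to 50,
# out of 100 tosses of a hypothetically fair coin
heads = 40:50
p_vals = pmin(1, 2 * pbinom(heads, size = 100, prob = 0.5))
data.frame(heads, p_value = round(p_vals, 4))
```

The \(p\)-value shrinks as the observed count moves away from 50, which is exactly the "enough is enough" gradient described above.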
# two-sided alternative
p_val_34 = 2 * pbinom(34, size = 100, prob = 0.5)
p_val_34
[1] 0.00178993
Let’s say you only actually observed 34 heads. Do you believe the coin is fair? Theoretically, this could happen if the coin is fair, but the \(p\)-value for your sample is 0.0017899. In other words, if the null hypothesis is true, then what just happened is extremely unlikely.
Imagine 1000 parallel universes where you repeat this 100-toss experiment. Less extreme results should occur around 998 times (rounded down), and results “as or more extreme” than 34 heads (read: as or less likely) should occur around 2 times (rounded up). In fact, you don’t have to imagine: this is exactly what I simulated at the beginning, and such a result happened exactly 1 time. Do you believe that your real sample is that one time (\(H_0\))? Or maybe the parallel universes actually behave in a different way (\(H_1\))?
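That parallel-universe count can be computed directly from a fresh batch of simulated experiments. Again, I use my own seed here, so the count may differ from the single occurrence reported above:

```r
set.seed(1)  # a seed of my own; the count below may differ from the text
x_mc = rbinom(n = 1000, size = 100, prob = 0.5)

# How many of the 1000 "parallel universes" gave a result as or more
# extreme than 34 heads, i.e. at most 34 or at least 66 heads?
n_extreme = sum(x_mc <= 34 | x_mc >= 66)
n_extreme
```

With the expected count around \(1000 \times 0.00179 \approx 1.8\), seeing 0, 1, or a small handful of such universes is typical.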
It may be more intuitive to focus on \(1 -\) the \(p\)-value: the events that are more likely than the event you actually observed. With our \(p\)-value of 0.00179, we observe a less extreme value (40 heads, 60 heads, etc.) with probability 0.99821, i.e. 99.82% of the time. In other words, a proportion of heads closer to 0.5 (the null hypothesis) is overwhelmingly more likely. So, if the null hypothesis is true, you have to be in the 0.18% of cases where data this extreme is truly generated by \(H_0\), i.e. the fair coin in this example. Do you believe that? This would be so unlikely under the null hypothesis that the null hypothesis no longer seems compelling: rather than saying we just got an extreme sample by chance, we might decide to reject the null hypothesis altogether.
Recall the main idea of NHST: if the data doesn’t fit the hypothesis well, then the hypothesis doesn’t fit the data well. In this example, when the data fit the null hypothesis well (49 heads), we didn’t reject the null hypothesis: the hypothesis fit the data well. When the data didn’t fit the null hypothesis well (34 heads), we rejected the null hypothesis: the hypothesis didn’t fit the data well.
So, why do we care about data “as or more extreme” than our sample? Usually, our \(p\)-values will be small, so the reasoning goes: under \(H_0\), I will usually observe data less extreme than this (with probability \(1 -\) my \(p\)-value, a pretty high number). So, having observed extreme data, either I accept that I am in the very unlikely situation where \(H_0\) produced extreme data (fail to reject \(H_0\)), or I decide that this is so unlikely that I no longer believe \(H_0\) at all (reject \(H_0\)).
Of course, if you were conducting a “proper” hypothesis test, you would pre-specify a decision rule before collecting data: if the \(p\)-value you observe is less than some number \(\alpha\) (commonly 0.05), then you reject the null hypothesis.
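One way to package this whole procedure is base R’s `binom.test()`, which runs the exact binomial test. For the symmetric null \(p = 0.5\), its two-sided \(p\)-value agrees with the doubled tail probability computed earlier, and the pre-specified decision rule is just a comparison against \(\alpha\):

```r
alpha = 0.05  # significance level chosen before collecting data

# Exact binomial test of H_0: p = 0.5 given 34 heads in 100 tosses
test = binom.test(x = 34, n = 100, p = 0.5, alternative = "two.sided")
test$p.value  # equals 2 * pbinom(34, 100, 0.5), since the null is symmetric

# The pre-specified decision rule
if (test$p.value < alpha) "reject H0" else "fail to reject H0"
```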