Confidence Intervals and Hypothesis Testing
Let’s return to our affective polarization example.
As it turns out, the Heineken experiment was part of a larger project called the Strengthening Democracy Challenge, which sought to crowdsource different interventions.
Another intervention they tried involved a chatbot quiz.
Suppose that we have the following data generating process under this intervention:
\[ Y_i(0) = 50 + \epsilon_i \] where \(\epsilon_i\) is randomly drawn from a normal distribution with mean = 0 and sd = 10:
\[ \epsilon_i \sim \mathcal{N}(0, 10) \]
The potential treated outcome is given by:
\[ Y_i(1) = Y_i(0) - 10 \]
Let’s code that up:
# load packages
library(tidyverse)
library(randomizr)
# set seed
set.seed(123)
# Parameters
# ----------
# total number of people in the study
N <- 600

# generate data:
# --------------
# creating potential outcomes
y0 <- rnorm(N, mean = 50, sd = 10)
y1 <- y0 - 10

dta <- tibble(y0, y1)

# run a single experiment and reveal the outcome
exp <- dta |>
  mutate(tx = complete_ra(N, prob = 0.5),
         y = ifelse(tx == 1, y1, y0))
Next let’s estimate a model:
\[ Y_i = \beta_0 + \beta_1 Tx_i + \epsilon_i \]
library(broom)
m <- lm(y ~ tx, data = exp)

results <- tidy(m, conf.int = TRUE)
results
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 50.1 0.559 89.7 0 49.0 51.2
2 tx -9.82 0.790 -12.4 1.00e-31 -11.4 -8.27
Look at the last two columns of output: we see conf.low and conf.high. You probably guessed that these two numbers define the confidence interval surrounding our estimates.
You may even remember how a 95% confidence interval is constructed:
\[ \hat{\beta} \pm 1.96 \times SE(\hat{\beta}) \]
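To make this concrete, here is a quick check against the output above (a small sketch using the results object; note that lm() uses the exact t critical value rather than 1.96, so the numbers will be very close to, but not exactly, conf.low and conf.high):
# reconstructing the 95% CI for the treatment effect by hand
beta_hat <- results$estimate[2]
se_hat <- results$std.error[2]

c(lower = beta_hat - 1.96 * se_hat,
  upper = beta_hat + 1.96 * se_hat)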
But conceptually, what IS a confidence interval?
An Analogy to Fishing
To understand confidence intervals, here’s a nice analogy I borrowed from moderndive.
Say you are trying to catch a fish. On the one hand, you could use a spear, while on the other you could use a net. Using the net will probably allow you to catch more fish!
Now suppose that the “true” treatment effect in the population is the fish. Let’s denote this population parameter \(\tau\).
We could use our point estimate from our experiment (\(\hat{\beta_1}\) = -9.82) to estimate \(\tau\). Think of this as “fishing with a spear.”
What about “fishing with a net”? That’s like using the confidence interval: [-11.37, -8.27].
Since we are using a 95% confidence interval, we are saying that our “net” [-11.37, -8.27] has a 95% chance of catching the fish (or, a 95% chance of containing or “bracketing” \(\tau\)).
A More Precise Explanation
That statement should strike you as kind of weird.
Either the interval [-11.37, -8.27] contains the true population parameter \(\tau\) or it doesn’t.
So what does it mean to say that the CI has a 95% chance of bracketing the “truth”?
Now here’s more precisely what we mean:
Suppose we were to repeat our experiment 1000 times. For each experiment:
- we estimate \(\hat{\beta}\) and SE(\(\hat{\beta}\))
- we construct a 95% CI as \(\hat{\beta} \pm 1.96 \times SE(\hat{\beta})\)
If we followed this procedure, then about 950 times out of 1000, our constructed CIs will “bracket” the true treatment effect.
Let’s try this out via a simulation:
# preliminaries
set.seed(2)
n_sims <- 1000

# empty vectors to store the results
v_low  <- c()
v_high <- c()

# looping
for (i in 1:n_sims) {

  # re-randomizing the treatment and revealing observed outcomes
  exp <- dta |>
    mutate(tx = complete_ra(N, prob = 0.5),
           y = ifelse(tx == 1, y1, y0))

  # estimating the model and storing the results
  results <- tidy(lm(y ~ tx, data = exp), conf.int = TRUE)

  # saving the estimates to the empty vectors
  v_low[i]  <- as.numeric(results[2, 6])
  v_high[i] <- as.numeric(results[2, 7])
}

# putting together the dataset of simulated results
# and checking if the CI contains the "truth" == -10
to_plot <- tibble(v_low, v_high) |>
  mutate(bracket = ifelse(v_low < -10 & v_high > -10, 1, 0))
What percentage of CIs caught the fish?
mean(to_plot$bracket)
[1] 0.948
That’s pretty close to 95%!
In fact, we can make it arbitrarily close to 95% by running more and more simulations.
Here’s what those estimates look like:
# sorting estimates from lowest to highest
# and indexing them
to_plot <- to_plot |>
  arrange(v_low) |>
  mutate(index = row_number())

ggplot(to_plot, aes(y = index,
                    xmin = v_low,
                    xmax = v_high,
                    color = factor(bracket))) +
  geom_errorbar(alpha = 0.6) +
  geom_vline(xintercept = -10,
             color = "black",
             lty = "dashed") +
  theme_minimal() +
  labs(y = "Simulation Number",
       x = "Treatment Effect",
       color = "Contains Tau?") +
  scale_color_discrete(labels = c("No", "Yes")) +
  theme(legend.position = "inside",
        legend.position.inside = c(0.85, 0.15))
Statistical Significance
Suppose we were to test this same intervention again, but we only had enough time / money to recruit 16 participants:
set.seed(1)
N <- 16

# sampling 16 people from the original data
# and assigning 8 to treatment
small_exp <- sample_n(dta, N) |>
  mutate(tx = complete_ra(N, prob = 0.5),
         y = ifelse(tx == 1, y1, y0))
We nonetheless conduct the study and estimate our ATE:
# estimate the model
model <- lm(y ~ tx, data = small_exp)
# look at it
summary(model)
Call:
lm(formula = y ~ tx, data = small_exp)
Residuals:
Min 1Q Median 3Q Max
-28.064 -7.197 2.405 6.944 19.788
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 49.966 4.322 11.562 1.51e-08 ***
tx -12.630 6.112 -2.067 0.0578 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.22 on 14 degrees of freedom
Multiple R-squared: 0.2338, Adjusted R-squared: 0.179
F-statistic: 4.271 on 1 and 14 DF, p-value: 0.05778
# look specifically at the coefficients
model$coefficients
(Intercept) tx
49.96587 -12.63046
# saving the effect for later
exp_effect <- model$coefficients[2]
We estimate a treatment effect of about -12.6.
But since we only have 16 people in our study, maybe our result was produced by chance?
You may remember that to answer such a question, you check whether the absolute value of the t-statistic is greater than (roughly) 1.96.
Equivalently, you may also remember to check whether the p.value is < 0.05.
Or, you may just look at your regression output and see if there are any stars.
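If you want to pull those quantities out of the fitted model directly, here is one way (a small sketch; summary() stores them in its coefficient matrix):
# extracting the t-statistic and p-value for the treatment indicator
coef_tab <- summary(model)$coefficients
coef_tab["tx", "t value"]   # the t-statistic (about -2.07 here)
coef_tab["tx", "Pr(>|t|)"]  # the p-value (about 0.058 here)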
But what is the logic behind doing any of this stuff?
Hypothesis Testing
The basic idea behind hypothesis testing is to ask:
If the true treatment effect were 0, how likely is it that I would observe an effect at least as large (in absolute value) as the one I obtained?
The answer to this question is provided by the p.value.
This is a bit like asking: if I had a fair coin (that’s the null hypothesis), what is the likelihood that I would have obtained 10 heads in a row?
If it is extremely unlikely (less than 1 in 20 chance, or 0.05) that a fair coin would have produced your actual results, you reject the null hypothesis that the coin is fair.
Similarly, if it is extremely unlikely that a treatment which has truly no effect would produce a difference between treatment and control groups as large as what you observed in your experiment, you can conclude that the true treatment effect is (probably) not 0.
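To put a number on the coin analogy: the chance that a fair coin produces 10 heads in 10 flips is well below the usual 0.05 cutoff.
# chance a fair coin gives 10 heads in 10 flips
(1 / 2)^10                                            # about 0.001
# the same quantity via the binomial distribution
pbinom(9, size = 10, prob = 0.5, lower.tail = FALSE)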
Randomization Inference
To illustrate the logic, let’s run a simulation!
We need to imagine for the moment that we have only our experimental data. We don’t know the true treatment effect, and we don’t know the data generating process or the potential outcomes.
small_exp <- small_exp |>
  select(-y0, -y1)
head(small_exp)
# A tibble: 6 × 2
tx y
<int> <dbl>
1 0 40.4
2 1 41.5
3 0 69.8
4 1 40.2
5 1 40.9
6 1 51.1
We begin with a special kind of null hypothesis called the sharp null: the hypothesis that the treatment has exactly 0 effect for each and every individual, i.e. \(Y_i(1) = Y_i(0)\) for all \(i\).
The sharp null is particularly useful for us because it allows us to fill in the full schedule of potential outcomes:
small_exp <- small_exp |>
  mutate(y0 = y,
         y1 = y)
We then ask: how many ways could my experiment have allocated 16 people evenly into treatment and control groups?
The answer is \(\frac{16!}{8! \times 8!} = 12870\)
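You can verify this count directly in R:
# number of ways to assign 8 of 16 subjects to treatment
choose(16, 8)   # = 12870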
Well, we have the full schedule of potential outcomes (under the sharp null). Let’s just simulate each one of these 12870 possible randomizations and calculate the ATE.
We then see how many of these experiments – under the assumption that the treatment has 0 effect for each and every individual – produced a result at least as large in absolute value as -12.63.
This is our (exact) p-value!
Let’s simulate this:
# setting a seed
set.seed(1)
# creating a vector of 16 subjects
subjects <- 1:16

# Generate all possible ways of assigning 8 out of 16 subjects to treatment
all_permutations <- combn(subjects, 8)

# a scalar storing exactly how many permutations to loop through
n_permutations <- ncol(all_permutations)

# an empty vector to store all of the simulation results
v_ri <- c()

# looping
for (i in 1:n_permutations) {

  # assigning each column one-at-a-time to treatment
  treatment_group <- all_permutations[, i]

  # assigning the rest to control
  control_group <- setdiff(subjects, treatment_group)

  # reveal observed outcomes based on the current permutation
  temp_experiment <- small_exp |>
    select(y1, y0) |>
    mutate(id = row_number(),
           tx = ifelse(id %in% treatment_group, 1, 0),
           y = ifelse(id %in% treatment_group, y1, y0))

  # calculate the ate
  model <- lm(y ~ tx, data = temp_experiment)

  # store the result
  v_ri[i] <- model$coefficients[2]
}

# saving the vector to a dataframe
# and creating a variable that counts whether a simulated effect
# is larger than what we obtained in our experiment
ri_pvalues <- tibble(v_ri) |>
  mutate(larger = ifelse(abs(v_ri) > abs(exp_effect), 1, 0))
# exact p.value
mean(ri_pvalues$larger)
[1] 0.05703186
And that’s pretty close to what our regression gave us!
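As a quick side-by-side (a small sketch; refitting the model for the small experiment, since model was overwritten inside the loop above):
# model-based p-value for tx vs. the exact randomization-inference p-value
tidy(lm(y ~ tx, data = small_exp))$p.value[2]   # about 0.058
mean(ri_pvalues$larger)                         # about 0.057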