Confidence Intervals and Hypothesis Testing
Let’s return to our affective polarization example.
As it turns out, the Heineken experiment was part of a larger project called the Strengthening Democracy Challenge, which sought to crowdsource different interventions.
Another intervention they tried involved a chatbot quiz.
Suppose that we have the following data generating process under this intervention:
\[ Y_i(0) = 50 + \epsilon_i \] where \(\epsilon_i\) is randomly drawn from a normal distribution with mean = 0 and sd = 10:
\[ \epsilon_i \sim \mathcal{N}(0, 10) \]
The potential treated outcome is given by:
\[ Y_i(1) = Y_i(0) - 10 \]
Let’s code that up:
# load packages
library(tidyverse)
library(randomizr)
# set seed
set.seed(123)
# Parameters
# ----------
# total number of people in the study
N <- 600
# generate data:
# --------------
# creating potential outcomes
y0 <- rnorm(N, mean=50, sd=10)
y1 <- y0 - 10
dta <- tibble(y0,y1)
# run a single experiment and reveal the outcome
exp <- dta |>
mutate(tx = complete_ra(N, prob = 0.5),
y = ifelse(tx == 1, y1, y0))

Next let’s estimate a model:
\[ Y_i = \beta_0 + \beta_1 \times Tx_i + \epsilon_i \]
library(broom)
m <- lm(y ~ tx, data=exp)
results <- tidy(m, conf.int = TRUE)
results
# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)    50.1      0.559      89.7 0            49.0     51.2
2 tx             -9.82     0.790     -12.4 1.00e-31    -11.4     -8.27
Look at the last two columns of output: we see conf.low and conf.high. You probably guessed that these two numbers define the confidence interval surrounding our estimates.
You may even remember how a 95% confidence interval is constructed:
\[ \hat{\beta} \pm 1.96 \times SE(\hat{\beta}) \]
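As a quick check, we can rebuild that interval by hand from the tidy() output above (b_hat and se_hat are just placeholder names; any tiny discrepancy from conf.low and conf.high comes from lm() using the t distribution’s critical value rather than exactly 1.96):

# rebuild the 95% CI for the treatment coefficient by hand
b_hat  <- results$estimate[2]
se_hat <- results$std.error[2]
c(lower = b_hat - 1.96 * se_hat, upper = b_hat + 1.96 * se_hat)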
But conceptually, what IS a confidence interval?
An Analogy to Fishing
To understand confidence intervals, here’s a nice analogy I borrowed from moderndive.
Say you are trying to catch a fish. You could use a spear, or you could use a net. Using the net will probably allow you to catch more fish!
Now suppose that the “true” treatment effect in the population is the fish. Let’s denote this population parameter \(\tau\).
We could use our point estimate from our experiment (\(\hat{\beta_1}\) = -9.82) to estimate \(\tau\). Think of this as “fishing with a spear.”
What about “fishing with a net”? That’s like using the confidence interval: [-11.37, -8.27].
Since we are using a 95% confidence interval, we are saying that our “net” [-11.37, -8.27] has a 95% chance of catching the fish (that is, a 95% chance of containing or “bracketing” \(\tau\)).
A More Precise Explanation
That statement should strike you as kind of weird.
Either the interval [-11.37, -8.27] contains the true population parameter \(\tau\) or it doesn’t.
So what does it mean to say that the CI has a 95% chance of bracketing the “truth”?
Here is what we mean, more precisely:
Suppose we were to repeat our experiment 1000 times. For each experiment:
- we estimate \(\hat{\beta}\) and SE(\(\hat{\beta}\))
- we construct a 95% CI as \(\hat{\beta} \pm 1.96 \times SE(\hat{\beta})\)
If we followed this procedure, then about 950 times out of 1000, our constructed CIs will “bracket” the true treatment effect.
Let’s try this out via a simulation:
# preliminaries
set.seed(2)
n_sims <- 1000
# empty vectors to store the results
v_low <- c()
v_high <- c()
# looping
for (i in 1:n_sims) {
# re-randomizing the treatment and revealing observed outcomes
exp <- dta |>
mutate(tx = complete_ra(N, prob = 0.5),
y = ifelse(tx == 1, y1, y0))
# estimating the model and storing the results
results <- tidy(lm(y ~ tx, data=exp), conf.int=TRUE)
# saving the estimates to the empty vectors
v_low[i] <- as.numeric(results[2,6])
v_high[i] <- as.numeric(results[2,7])
}
# putting together the dataset of simulated results
# and checking if the CI contains the "truth" == -10
to_plot <- tibble(v_low, v_high) |>
mutate(bracket = ifelse(v_low < -10 & v_high > -10, 1, 0))

What percentage of CIs caught the fish?
mean(to_plot$bracket)
[1] 0.948
That’s pretty close to 95%!
In fact, we can make it arbitrarily close to 95% by running more and more simulations.
Here’s what those estimates look like:
# sorting estimates from lowest to highest
# and indexing them
to_plot <- to_plot |>
arrange(v_low) |>
mutate(index = row_number())
ggplot(to_plot, aes(y = index,
xmin = v_low,
xmax = v_high,
color = factor(bracket))) +
geom_errorbar(alpha= 0.6) +
geom_vline(xintercept = -10,
color = "black",
lty="dashed") +
theme_minimal() +
labs(y = "Simulation Number",
x = "Treatment Effect",
color = "Contains Tau?") +
scale_color_discrete(labels = c("No", "Yes")) +
theme(legend.position = "inside",
legend.position.inside = c(0.85,0.15))

Statistical Significance
Suppose we were to test this same intervention again, but we only had enough time / money to recruit 12 participants:
set.seed(2)
N <- 12
# sampling 12 people from the original data
# and assigning 6 to treatment
small_exp <- sample_n(dta, N) |>
mutate(tx = complete_ra(N, prob = 0.5),
y = ifelse(tx == 1, y1, y0))

We nonetheless conduct the study and estimate our ATE:
# estimate the model
model <- lm(y ~ tx, data= small_exp)
# look at it
summary(model)
Call:
lm(formula = y ~ tx, data = small_exp)
Residuals:
Min 1Q Median 3Q Max
-23.6197 -10.1682 0.2508 7.0305 26.9795
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.379 6.263 8.044 1.12e-05 ***
tx -13.191 8.857 -1.489 0.167
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.34 on 10 degrees of freedom
Multiple R-squared: 0.1815, Adjusted R-squared: 0.09968
F-statistic: 2.218 on 1 and 10 DF, p-value: 0.1673
# look specifically at the coefficients
model$coefficients
(Intercept)          tx
   50.37887   -13.19068
# saving the effect for later
exp_effect <- model$coefficients[2]

We estimate a treatment effect of about -13.
But since we only have 12 people in our study, could this result simply have been produced by chance?
You may remember that to answer such a question, you check whether the t-statistic is > 1.96.
Equivalently, you may also remember to check whether the p.value < 0.05.
Or, you may just look at your regression output and see if there are any stars.
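All three of these checks read off the same row of the regression output. For instance, one way to pull the t-statistic and p-value for the treatment term from the small experiment’s model:

# extract the t-statistic and p-value for the treatment term
tidy(model) |>
  filter(term == "tx") |>
  select(statistic, p.value)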
But what is the logic behind doing any of this stuff?
Hypothesis Testing
The basic idea behind hypothesis testing is to ask:
If the true treatment effect were 0, how likely is it that I would observe an effect as large as the one I obtained?
The answer to this question is provided by the p.value.
This is a bit like asking: if I had a fair coin (that’s the null hypothesis), what is the likelihood that I would have obtained 10 heads in a row?
If it is extremely unlikely (less than 1 in 20 chance, or 0.05) that a fair coin would have produced your actual results, you reject the null hypothesis that the coin is fair.
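To make the coin example concrete, here is the chance that a fair coin produces 10 heads in 10 flips:

# probability of 10 heads in 10 flips of a fair coin
dbinom(10, size = 10, prob = 0.5)   # same as 0.5^10, roughly 0.001 -- well below 0.05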
Similarly, if it is extremely unlikely that a treatment which has truly no effect would produce a difference between treatment and control groups as large as what you observed in your experiment, you can conclude that the true treatment effect is (probably) not 0.
Randomization Inference
To illustrate the logic, let’s run a simulation!
We need to imagine for the moment that we have only our experimental data. We don’t know the true treatment effect, and we don’t know the data generating process or the potential outcomes.
small_exp <- small_exp |>
select(-y0, -y1)
head(small_exp)
# A tibble: 6 × 2
     tx     y
  <int> <dbl>
1     1  13.6
2     1  18.4
3     0  56.8
4     0  40.3
5     1  45.4
6     0  50.3
We begin with a special kind of null hypothesis called the sharp null: the hypothesis that the treatment effect is exactly 0 for each and every individual, i.e. \(Y_i(1) = Y_i(0)\) for all \(i\).
The sharp null is particularly useful for us because it allows us to fill in the full schedule of potential outcomes:
small_exp <- small_exp |>
mutate(y0 = y,
y1 = y)

We then ask: how many ways could my experiment have allocated 12 people evenly into treatment and control groups?
The answer is \(\frac{12!}{6! \times 6!} = 924\)
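You can verify this count directly in R:

# number of ways to choose 6 treated units out of 12 subjects
choose(12, 6)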
Well, we have the full schedule of potential outcomes (under the sharp null). Let’s just simulate each one of these 924 possible randomizations and calculate the ATE.
We then see how many of these experiments – under the assumption that the treatment has 0 effect for each and every individual – produced a result as large (in absolute value) as our estimate of -13.19.
This is our (exact) p-value!
Let’s simulate this:
# setting a seed
set.seed(2)
# creating a vector of 12 subjects
subjects <- 1:N
# Generate all possible ways of assigning 6 out of 12 subjects to treatment
all_permutations <- combn(subjects, N/2)
# a scalar storing exactly how many permutations to loop through
n_permutations <- ncol(all_permutations)
# an empty vector to store all of the simulation results
v_ri <- c()
# looping
for (i in 1:n_permutations) {
# assigning each column one-at-a-time to treatment
treatment_group <- all_permutations[, i]
# assigning the rest to control
control_group <- setdiff(subjects, treatment_group)
# reveal observed outcomes based on the current permutation
temp_experiment <- small_exp |>
select(y1,y0) |>
mutate(id = row_number(),
tx = ifelse(id %in% treatment_group, 1, 0),
y = ifelse(id %in% treatment_group, y1, y0)
)
# calculate the ate
model <- lm(y ~ tx, data=temp_experiment)
# store the result
v_ri[i] <- model$coefficients[2]
}
# saving the vector to a dataframe
# creating a variable that counts whether a simulated effect is larger than what we obtained in our experiment
ri_pvalues <- tibble(v_ri) |>
mutate(larger = ifelse(abs(v_ri) >= abs(exp_effect), 1, 0))
# exact p.value
mean(ri_pvalues$larger)
[1] 0.1655844
And that’s pretty close to what our regression gave us!
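For comparison, we could pull the regression-based p-value for the small experiment again (this simply re-fits the same model as above):

# p-value for the treatment term from the regression on the small experiment
tidy(lm(y ~ tx, data = small_exp))$p.value[2]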
In-class Exercise
To really nail down this concept, let’s do another example by hand.
Suppose you had the following very small experiment:
| Subject ID | Tx | Y |
|---|---|---|
| A | 0 | 3 |
| B | 1 | 6 |
| C | 0 | 2 |
| D | 0 | 4 |
| E | 1 | 8 |
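Try working through the logic on paper first. If you want to check your answer in R afterwards, here is one way to enter the data (the object name ex is just a placeholder):

# enter the exercise data to check your by-hand answer later
ex <- tibble(id = c("A", "B", "C", "D", "E"),
             tx = c(0, 1, 0, 0, 1),
             y  = c(3, 6, 2, 4, 8))
# with 2 of 5 subjects treated, there are choose(5, 2) = 10 possible assignments
choose(5, 2)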
RI in practice
Now what happens when your experiments get big?
Let’s imagine you had even just 100 subjects, 50 of whom are assigned to treatment. Here are the total possible randomizations:
\(\frac{100!}{50! \times 50!} = 1.0089134 \times 10^{29}\)
That will crash your computer!
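Counting those assignments is easy; it is enumerating and looping over all of them that is infeasible:

# the count matches the formula above (about 1.0089 x 10^29)
choose(100, 50)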
So instead of calculating an ATE for every possible randomization, we just sample from this universe (e.g. 1000 times):
# set a seed
set.seed(1)
# start with our large experimental dataset and calculate / store the ATE from our single experiment
model <- lm(y ~ tx, data=exp)
our_ate <- model$coefficients["tx"]
# also calculate the number of subjects
nsubjects <- nrow(exp)
# define the number of permutations you want to sample
num_permutations <- 1000
# Create an empty vector to store the ATEs
ri_results <- c()
# looping
for (i in 1:num_permutations) {
# shuffle the treatments
rilarge <- exp |>
mutate(shuffletx = complete_ra(nsubjects, prob=0.5))
# estimate the ATE under the shuffled treatment
model <- lm(y ~ shuffletx, data=rilarge)
# store the estimated ate
ri_results[i] <- model$coefficients["shuffletx"]
}
# turning our simulation results into a dataset, and calculating what percentage of estimates are larger (in absolute value) than our actual results
ri_results <- tibble(ri_results) |>
mutate(larger = ifelse(abs(ri_results) >= abs(our_ate), 1, 0))
# calculating the exact p.value
mean(ri_results$larger)
[1] 0

None of the 1,000 shuffled assignments produced an effect as large (in absolute value) as our actual estimate, so the randomization-inference p-value is effectively 0.