Confidence Intervals and Hypothesis Testing

Let’s return to our affective polarization example.

As it turns out, the Heineken experiment was part of a larger project called the Strengthening Democracy Challenge, which sought to crowdsource different interventions.

Another intervention they tried involved a chatbot quiz.

Suppose that we have the following data generating process under this intervention:

\[ Y_i(0) = 50 + \epsilon_i \] where \(\epsilon_i\) is randomly drawn from a normal distribution with mean = 0 and sd = 10:

\[ \epsilon_i \sim \mathcal{N}(0, 10) \]

The potential treated outcome is given by:

\[ Y_i(1) = Y_i(0) - 10 \]

Let’s code that up:

# load packages
library(tidyverse)
library(randomizr)

# set seed
set.seed(123)

# Parameters
# ----------

# total number of people in the study
N <- 600


# generate data:
# --------------

# creating potential outcomes
y0 <- rnorm(N, mean=50, sd=10)
y1 <- y0 - 10

dta <- tibble(y0,y1)


# run a single experiment and reveal the outcome
exp <- dta |> 
  mutate(tx = complete_ra(N, prob = 0.5),
         y = ifelse(tx == 1, y1, y0))

Next let’s estimate a model:

\[ Y_i = \beta_0 + \beta_1 \times Tx_i + \epsilon_i \]

library(broom)

m <- lm(y ~ tx, data=exp)

results <- tidy(m, conf.int = TRUE)

results
# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)    50.1      0.559      89.7 0            49.0     51.2 
2 tx             -9.82     0.790     -12.4 1.00e-31    -11.4     -8.27

Look at the last two columns of output: we see conf.low and conf.high. You probably guessed that these two numbers define the confidence interval surrounding our estimates.

You may even remember how a 95% confidence interval is constructed:

\[ \hat{\beta} \pm 1.96 \times SE(\hat{\beta}) \]
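
To see where conf.low and conf.high come from, here is a minimal sketch that rebuilds the interval for tx by hand from the tidy() output above. (lm() actually uses the t distribution rather than 1.96, so its numbers will differ very slightly.)

# rebuilding the 95% CI for the treatment effect by hand
beta_hat <- results$estimate[2]   # coefficient on tx
se_hat   <- results$std.error[2]  # its standard error

c(lower = beta_hat - 1.96 * se_hat,
  upper = beta_hat + 1.96 * se_hat)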

But conceptually, what IS a confidence interval?


An Analogy to Fishing

To understand confidence intervals, here’s a nice analogy I borrowed from moderndive.

Say you are trying to catch a fish. You could use a spear, or you could use a net. Using the net gives you a much better chance of catching the fish!

Now suppose that the “true” treatment effect in the population is the fish. Let’s denote this population parameter \(\tau\).

We could use our point estimate from our experiment (\(\hat{\beta_1}\) = -9.82) to estimate \(\tau\). Think of this as “fishing with a spear.”

What about “fishing with a net”? That’s like using the confidence interval: [-11.37, -8.27].

Since we are using a 95% confidence interval, we are saying that our “net” [-11.37, -8.27] has a 95% chance of catching the fish (or, a 95% chance of containing or “bracketing” \(\tau\)).


A More Precise Explanation

That statement should strike you as kind of weird.

Either the interval [-11.37, -8.27] contains the true population parameter \(\tau\) or it doesn’t.

So what does it mean to say that the CI has a 95% chance of bracketing the “truth”?

Here is, more precisely, what we mean:

Suppose we were to repeat our experiment 1000 times. For each experiment:

  • we estimate \(\hat{\beta}\) and SE(\(\hat{\beta}\))
  • we construct a 95% CI as \(\hat{\beta} \pm 1.96 \times SE(\hat{\beta})\)

If we followed this procedure, then about 950 times out of 1000, our constructed CIs will “bracket” the true treatment effect.

Let’s try this out via a simulation:

# preliminaries
set.seed(2)
n_sims <- 1000

# empty vectors to store the results
v_low <- c()
v_high <- c()

# looping
for (i in 1:n_sims) {
  
  # re-randomizing the treatment and revealing observed outcomes
  exp <- dta |> 
    mutate(tx = complete_ra(N, prob = 0.5),
           y = ifelse(tx == 1, y1, y0))
  
  # estimating the model and storing the results
  results <- tidy(lm(y ~ tx, data=exp), conf.int=TRUE)
  
  # saving the estimates to the empty vectors
  v_low[i] <- as.numeric(results[2,6])
  v_high[i] <- as.numeric(results[2,7])

} 

# putting together the dataset of simulated results
# and checking if the CI contains the "truth" == -10
to_plot <- tibble(v_low, v_high) |> 
  mutate(bracket = ifelse(v_low < -10 & v_high > -10, 1, 0))

What percentage of CIs caught the fish?

mean(to_plot$bracket)
[1] 0.948

That’s pretty close to 95%!

In fact, we can make it arbitrarily close to 95% by running more and more simulations.

Here’s what those estimates look like:

# sorting estimates from lowest to highest
# and indexing them
to_plot <- to_plot |> 
  arrange(v_low) |> 
  mutate(index = row_number())


ggplot(to_plot, aes(y = index, 
                    xmin = v_low,
                    xmax = v_high,
                    color = factor(bracket))) + 
  geom_errorbar(alpha= 0.6) +
  geom_vline(xintercept = -10,
             color = "black",
             lty="dashed") +
  theme_minimal() +
  labs(y = "Simulation Number",
       x = "Treatment Effect",
       color = "Contains Tau?") +
  scale_color_discrete(labels = c("No", "Yes")) +
  theme(legend.position = "inside",
        legend.position.inside = c(0.85,0.15))

Understanding Checks:

What happens to our confidence intervals as our standard errors decrease?

Is it better or worse to have narrower confidence intervals?

What should we conclude if the confidence interval contains 0?


Statistical Significance

Suppose we were to test this same intervention again, but we only had enough time / money to recruit 16 participants:

set.seed(1)
N <- 16

# sampling 16 people from the original data
# and assigning 8 to treatment
small_exp <- sample_n(dta, N) |> 
  mutate(tx = complete_ra(N, prob = 0.5),
         y = ifelse(tx == 1, y1, y0))

We nonetheless conduct the study and estimate our ATE:

# estimate the model
model <- lm(y ~ tx, data= small_exp)

# look at it
summary(model)

Call:
lm(formula = y ~ tx, data = small_exp)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.064  -7.197   2.405   6.944  19.788 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   49.966      4.322  11.562 1.51e-08 ***
tx           -12.630      6.112  -2.067   0.0578 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.22 on 14 degrees of freedom
Multiple R-squared:  0.2338,    Adjusted R-squared:  0.179 
F-statistic: 4.271 on 1 and 14 DF,  p-value: 0.05778
# look specifically at the coefficients
model$coefficients
(Intercept)          tx 
   49.96587   -12.63046 
# saving the effect for later
exp_effect <- model$coefficients[2]

We estimate a treatment effect of about -12.6.

But since we only have 16 people in our study, maybe our results were produced by chance?

You may remember that to answer such a question, you check whether the t-statistic is greater than 1.96 in absolute value.

Equivalently, you may also remember to check whether the p.value < 0.05.

Or, you may just look at your regression output and see if there are any stars.
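
All three of these checks rely on the same quantities. As a minimal sketch, you can pull the t-statistic and p-value for tx straight out of the fitted model:

# extracting the t-statistic and p-value for tx from the model
coef(summary(model))["tx", c("t value", "Pr(>|t|)")]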

But what is the logic behind doing any of this stuff?

Understanding Checks:

What is a p-value?

Or better yet: write down the precise question to which the p-value provides an answer.


Hypothesis Testing

The basic idea behind hypothesis testing is to ask:

If the true treatment effect were 0, how likely is it that I would observe an effect as large as the one I actually obtained?

The answer to this question is provided by the p.value.

This is a bit like asking: if I had a fair coin (that’s the null hypothesis), what is the likelihood that I would have obtained 10 heads in a row?

If it is extremely unlikely (less than 1 in 20 chance, or 0.05) that a fair coin would have produced your actual results, you reject the null hypothesis that the coin is fair.
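
In the coin example, that probability is easy to compute directly:

# probability that a fair coin produces 10 heads in a row
0.5^10
[1] 0.0009765625

That is far below 0.05, so we would reject the null hypothesis that the coin is fair.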

Similarly, if it is extremely unlikely that a treatment which has truly no effect would produce a difference between treatment and control groups as large as what you observed in your experiment, you can conclude that the true treatment effect is (probably) not 0.


Randomization Inference

To illustrate the logic, let’s run a simulation!

We need to imagine for the moment that we have only our experimental data. We don’t know the true treatment effect, the data generating process, or the potential outcomes.

small_exp <- small_exp |> 
  select(-y0, -y1)

head(small_exp)
# A tibble: 6 × 2
     tx     y
  <int> <dbl>
1     0  40.4
2     1  41.5
3     0  69.8
4     1  40.2
5     1  40.9
6     1  51.1

We begin with a special kind of null hypothesis called the sharp null:

The sharp null posits that the true treatment effect is exactly 0 for each and every single individual.

Understanding Checks:

How is this sharp null different from the null hypothesis that you are used to?

The sharp null is particularly useful for us because it allows us to fill in the full schedule of potential outcomes:

small_exp <- small_exp |> 
  mutate(y0 = y,
         y1 = y)

We then ask: how many ways could my experiment have allocated 16 people evenly into treatment and control groups?

The answer is \(\frac{16!}{8! \times 8!} = 12870\)
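
You can verify this count in R:

# number of ways to assign 8 of 16 subjects to treatment
choose(16, 8)
[1] 12870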

Well, we have the full schedule of potential outcomes (under the sharp null). Let’s just simulate each one of these 12870 possible randomizations and calculate the ATE.

We then see how many of these experiments – under the assumption that the treatment has 0 effect for each and every individual – produced a result as large in absolute value as -12.63.

This is our (exact) p-value!

Let’s simulate this:

# setting a seed
set.seed(1)

# creating a vector of 16 subjects
subjects <- 1:16

# Generate all possible ways of assigning 8 out of 16 subjects to treatment
all_permutations <- combn(subjects, 8)

# a scalar storing exactly how many permutations to loop through
n_permutations <- ncol(all_permutations)

# an empty vector to store all of the simulation results
v_ri <- c()

# looping
for (i in 1:n_permutations) {
  
  # assigning each column one-at-a-time to treatment
  treatment_group <- all_permutations[, i]
  
  # assigning the rest to control
  control_group <- setdiff(subjects, treatment_group)
  
  # reveal observed outcomes based on the current permutation
  temp_experiment <- small_exp |> 
    select(y1,y0) |> 
    mutate(id = row_number(),
           tx = ifelse(id  %in% treatment_group, 1, 0),
           y = ifelse(id  %in% treatment_group, y1, y0)
    )
  
  # calculate the ate
  model <- lm(y ~ tx, data=temp_experiment)
  
  # store the result
  v_ri[i] <- model$coefficients[2]
  
}

# saving the vector to a dataframe
# creating a variable that flags whether a simulated effect is larger (in absolute value) than what we obtained in our experiment
ri_pvalues <- tibble(v_ri) |> 
  mutate(larger = ifelse(abs(v_ri) > abs(exp_effect), 1, 0))

# exact p.value
mean(ri_pvalues$larger)
[1] 0.05703186

And that’s pretty close to what our regression gave us!
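
If you want to see the whole randomization distribution at once, here is an optional sketch (not part of the output above) that plots all 12870 simulated effects and marks the effect we actually observed:

# plotting the randomization distribution under the sharp null
# and marking our observed treatment effect
ggplot(ri_pvalues, aes(x = v_ri)) +
  geom_histogram(bins = 50, alpha = 0.6) +
  geom_vline(xintercept = as.numeric(exp_effect),
             color = "black",
             lty = "dashed") +
  theme_minimal() +
  labs(x = "Simulated Treatment Effect",
       y = "Count")

The exact p-value is simply the share of this distribution that falls farther from 0 (in absolute value) than the dashed line.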