Confidence Intervals and Hypothesis Testing

Let’s return to our affective polarization example.

As it turns out, the Heineken experiment was part of a larger project called the Strengthening Democracy Challenge, which sought to crowdsource different interventions.

Another intervention they tried involved a chatbot quiz.

Suppose that we have the following data generating process under this intervention:

\[ Y_i(0) = 50 + \epsilon_i \] where \(\epsilon_i\) is randomly drawn from a normal distribution with mean = 0 and sd = 10:

\[ \epsilon_i \sim \mathcal{N}(0, 10) \]

The potential treated outcome is given by:

\[ Y_i(1) = Y_i(0) - 10 \]

Let’s code that up:

# load packages
library(tidyverse)
library(randomizr)

# set seed
set.seed(123)

# Parameters
# ----------

# total number of people in the study
N <- 600


# generate data:
# --------------

# creating potential outcomes
y0 <- rnorm(N, mean=50, sd=10)
y1 <- y0 - 10

dta <- tibble(y0,y1)


# run a single experiment and reveal the outcome
exp <- dta |> 
  mutate(tx = complete_ra(N, prob = 0.5),
         y = ifelse(tx == 1, y1, y0))

Next let’s estimate a model:

\[ Y_i = \beta_0 + \beta_1 \times Tx_i + \epsilon_i \]

library(broom)

m <- lm(y ~ tx, data=exp)

results <- tidy(m, conf.int = TRUE)

results
# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)    50.1      0.559      89.7 0            49.0     51.2 
2 tx             -9.82     0.790     -12.4 1.00e-31    -11.4     -8.27

Look at the last two columns of the output: conf.low and conf.high. You probably guessed that these two numbers define the confidence interval surrounding each estimate.

You may even remember how a 95% confidence interval is constructed:

\[ \hat{\beta} \pm 1.96 \times SE(\hat{\beta}) \]
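As a quick sanity check, we can build this interval by hand from the results tibble above (a minimal sketch; note that lm() actually uses the exact t critical value rather than 1.96, so the numbers will differ very slightly from conf.low and conf.high):

# constructing the 95% CI for the treatment coefficient by hand
b1 <- results$estimate[2]     # the tx coefficient
se1 <- results$std.error[2]   # its standard error
c(lower = b1 - 1.96 * se1, upper = b1 + 1.96 * se1)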

But conceptually, what IS a confidence interval?


An Analogy to Fishing

To understand confidence intervals, here’s a nice analogy I borrowed from moderndive.

Say you are trying to catch a fish. You could use a spear, or you could use a net. Using the net gives you a much better chance of catching the fish!

Now suppose that the “true” treatment effect in the population is the fish. Let’s denote this population parameter \(\tau\).

We could use our point estimate from our experiment (\(\hat{\beta_1}\) = -9.82) to estimate \(\tau\). Think of this as “fishing with a spear.”

What about “fishing with a net”? That’s like using the confidence interval: [-11.37, -8.27].

Since we are using a 95% confidence interval, we are saying that our “net” [-11.37, -8.27] has a 95% chance of catching the fish (or, a 95% chance of containing or “bracketing” \(\tau\)).


A More Precise Explanation

That statement should strike you as kind of weird.

Either the interval [-11.37, -8.27] contains the true population parameter \(\tau\) or it doesn’t.

So what does it mean to say that the CI has a 95% chance of bracketing the “truth”?

Now here’s more precisely what we mean:

Suppose we were to repeat our experiment 1000 times. For each experiment:

  • we estimate \(\hat{\beta}\) and SE(\(\hat{\beta}\))
  • we construct a 95% CI as \(\hat{\beta} \pm 1.96 \times SE(\hat{\beta})\)

If we followed this procedure, then about 950 times out of 1000, our constructed CIs will “bracket” the true treatment effect.

Let’s try this out via a simulation:

# preliminaries
set.seed(2)
n_sims <- 1000

# empty vectors to store the results
v_low <- c()
v_high <- c()

# looping
for (i in 1:n_sims) {
  
  # re-randomizing the treatment and revealing observed outcomes
  exp <- dta |> 
  mutate(tx = complete_ra(N, prob = 0.5),
         y = ifelse(tx == 1, y1, y0))
  
  # estimating the model and storing the results
  results <- tidy(lm(y ~ tx, data=exp), conf.int=TRUE)
  
  # saving the CI bounds to the storage vectors
  v_low[i] <- results$conf.low[2]
  v_high[i] <- results$conf.high[2]

} 

# putting together the dataset of simulated results
# and checking if the CI contains the "truth" == -10
to_plot <- tibble(v_low, v_high) |> 
  mutate(bracket = ifelse(v_low < -10 & v_high > -10, 1, 0))

What percentage of CIs caught the fish?

mean(to_plot$bracket)
[1] 0.948

That’s pretty close to 95%!

In fact, we can make it arbitrarily close to 95% by running more and more simulations.

Here’s what those estimates look like:

# sorting estimates from lowest to highest
# and indexing them
to_plot <- to_plot |> 
  arrange(v_low) |> 
  mutate(index = row_number())


ggplot(to_plot, aes(y = index, 
                    xmin = v_low,
                    xmax = v_high,
                    color = factor(bracket))) + 
  geom_errorbar(alpha= 0.6) +
  geom_vline(xintercept = -10,
             color = "black",
             lty="dashed") +
  theme_minimal() +
  labs(y = "Simulation Number",
       x = "Treatment Effect",
       color = "Contains Tau?") +
  scale_color_discrete(labels = c("No", "Yes")) +
  theme(legend.position = "inside",
        legend.position.inside = c(0.85,0.15))

Understanding Checks:

What happens to our confidence intervals as our standard errors decrease?

Is it better or worse to have narrower confidence intervals?

What should we conclude if the confidence interval contains 0?


Statistical Significance

Suppose we were to test this same intervention again, but we only had enough time / money to recruit 12 participants:

set.seed(2)
N <- 12

# sampling 12 people from the original data
# and assigning 6 to treatment
small_exp <- sample_n(dta, N) |> 
  mutate(tx = complete_ra(N, prob = 0.5),
         y = ifelse(tx == 1, y1, y0))

We nonetheless conduct the study and estimate our ATE:

# estimate the model
model <- lm(y ~ tx, data= small_exp)

# look at it
summary(model)

Call:
lm(formula = y ~ tx, data = small_exp)

Residuals:
     Min       1Q   Median       3Q      Max 
-23.6197 -10.1682   0.2508   7.0305  26.9795 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   50.379      6.263   8.044 1.12e-05 ***
tx           -13.191      8.857  -1.489    0.167    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.34 on 10 degrees of freedom
Multiple R-squared:  0.1815,    Adjusted R-squared:  0.09968 
F-statistic: 2.218 on 1 and 10 DF,  p-value: 0.1673
# look specifically at the coefficients
model$coefficients
(Intercept)          tx 
   50.37887   -13.19068 
# saving the effect for later
exp_effect <- model$coefficients[2]

We estimate a treatment effect of about -13 (recall that the true effect in the data generating process is -10).

But since we only have 12 people in our study, maybe our result was just produced by chance?

You may remember that to answer such a question, you check whether the t-statistic is > 1.96 in absolute value.

Equivalently, you may also remember to check whether the p.value < 0.05.

Or, you may just look at your regression output and see if there are any stars.
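If you prefer to pull those quantities out of the model object rather than eyeballing the printed output, here is one way to do it with broom (a small sketch using the model we just fit):

# extracting the estimate, t-statistic, and p-value for the treatment coefficient
tidy(model) |> 
  filter(term == "tx") |> 
  select(estimate, statistic, p.value)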

But what is the logic behind doing any of this stuff?

Understanding Checks:

What is a p-value?

Or better yet: write down the precise question to which the p-value provides an answer.


Hypothesis Testing

The basic idea behind hypothesis testing is to ask:

If the true treatment effect were 0, how likely is it that I would observe an effect as large as the one I actually obtained?

The answer to this question is provided by the p.value.

This is a bit like asking: if I had a fair coin (that’s the null hypothesis), what is the likelihood that I would have obtained 10 heads in a row?

If it is extremely unlikely (less than 1 in 20 chance, or 0.05) that a fair coin would have produced your actual results, you reject the null hypothesis that the coin is fair.
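To make the coin example concrete, the chance that a fair coin produces 10 heads in a row is easy to compute:

# probability of 10 heads in a row with a fair coin
0.5^10
[1] 0.0009765625

That is far below the 1-in-20 threshold, so we would reject the null hypothesis that the coin is fair.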

Similarly, if it is extremely unlikely that a treatment which has truly no effect would produce a difference between treatment and control groups as large as what you observed in your experiment, you can conclude that the true treatment effect is (probably) not 0.


Randomization Inference

To illustrate the logic, let’s run a simulation!

We need to imagine for the moment that we have only our experimental data. We don’t know the true treatment effect, and we don’t know the data generating process or the potential outcomes.

small_exp <- small_exp |> 
  select(-y0, -y1)

head(small_exp)
# A tibble: 6 × 2
     tx     y
  <int> <dbl>
1     1  13.6
2     1  18.4
3     0  56.8
4     0  40.3
5     1  45.4
6     0  50.3

We begin with a special kind of null hypothesis called the sharp null:

The sharp null posits that the true treatment effect is exactly 0 for each and every single individual.

Understanding Check:

How is this sharp null different from the null hypothesis that you are used to?

The sharp null is particularly useful for us because it allows us to fill in the full schedule of potential outcomes:

small_exp <- small_exp |> 
  mutate(y0 = y,
         y1 = y)

We then ask: how many ways could my experiment have allocated 12 people evenly into treatment and control groups?

The answer is \(\frac{12!}{6! \times 6!} = 924\)
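You can verify that count directly in R:

# number of ways to choose 6 treated units out of 12
choose(12, 6)
[1] 924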

Well, we have the full schedule of potential outcomes (under the sharp null). Let’s just simulate each one of these 924 possible randomizations and calculate the ATE.

We then count how many of these simulated experiments (each assuming the treatment has exactly 0 effect for every individual) produced a result as large (in absolute value) as -13.19.

This is our (exact) p-value!

Let’s simulate this:

# setting a seed
set.seed(2)

# creating a vector of 12 subjects
subjects <- 1:N

# Generate all possible ways of assigning 6 out of 12 subjects to treatment
all_permutations <- combn(subjects, N/2)

# a scalar storing exactly how many permutations to loop through
n_permutations <- ncol(all_permutations)

# an empty vector to store all of the simulation results
v_ri <- c()

# looping
for (i in 1:n_permutations) {
  
  # assigning each column one-at-a-time to treatment
  treatment_group <- all_permutations[, i]
  
  # assigning the rest to control
  control_group <- setdiff(subjects, treatment_group)
  
  # reveal observed outcomes based on the current permutation
  temp_experiment <- small_exp |> 
    select(y1,y0) |> 
    mutate(id = row_number(),
           tx = ifelse(id  %in% treatment_group, 1, 0),
           y = ifelse(id  %in% treatment_group, y1, y0)
    )
  
  # calculate the ate
  model <- lm(y ~ tx, data=temp_experiment)
  
  # store the result
  v_ri[i] <- model$coefficients[2]
  
}

# saving the vector to a dataframe
# creating a variable that flags whether a simulated effect is at least as large (in absolute value) as what we obtained in our experiment
ri_pvalues <- tibble(v_ri) |> 
  mutate(larger = ifelse(abs(v_ri) >= abs(exp_effect), 1, 0))

# exact p.value
mean(ri_pvalues$larger)
[1] 0.1655844

And that’s pretty close to what our regression gave us!
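It can also help to look at the whole randomization distribution. Here is a minimal sketch of a plot (reusing ri_pvalues and exp_effect from above) showing where our observed estimate falls among the 924 simulated effects under the sharp null:

# plotting the randomization distribution with our observed effect marked
ggplot(ri_pvalues, aes(x = v_ri)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = exp_effect, color = "black", lty = "dashed") +
  theme_minimal() +
  labs(x = "Simulated treatment effect under the sharp null",
       y = "Count")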


In-class Exercise

To really nail down this concept, let’s do another example by hand.

Suppose you had the following very small experiment:

Subject ID   Tx   Y
A             0   3
B             1   6
C             0   2
D             0   4
E             1   8
  1. Calculate the ATE from this single experiment.
  2. How many different randomizations were possible? Write them all down.
  3. For each possible randomization, calculate the treatment effect under the sharp null.
  4. Calculate the exact p-value for our single experiment.
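If you want to check your hand calculations in R afterwards, here is the exercise data as a tibble (the names exercise_dta, id, tx, and y are just illustrative):

# the exercise data, for checking your work later
exercise_dta <- tibble(
  id = c("A", "B", "C", "D", "E"),
  tx = c(0, 1, 0, 0, 1),
  y  = c(3, 6, 2, 4, 8)
)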

RI in practice

Now what happens when your experiments get big?

Let’s imagine you had even just 100 subjects, 50 of whom are assigned to treatment. Here are the total possible randomizations:

\(\frac{100!}{50! \times 50!} = 1.0089134 \times 10^{29}\)

That will crash your computer!
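The count itself is cheap to compute; it is enumerating and looping over all of those assignments that is infeasible:

# number of ways to assign 50 of 100 subjects to treatment
choose(100, 50)
[1] 1.008913e+29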

So instead of calculating an ATE for every possible randomization, we just sample from this universe (e.g. 1000 times):

# set a seed
set.seed(1)

# start with our large experimental dataset and calculate / store the ATE from our single experiment
model <- lm(y ~ tx, data=exp)

our_ate <- model$coefficients["tx"]

# also calculate the number of subjects
nsubjects <- nrow(exp)

# define the number of permutations you want to sample
num_permutations <- 1000

# Create an empty vector to store the ATEs
ri_results <- c()

# looping 
for (i in 1:num_permutations) {

    # shuffle the treatments
    rilarge <- exp |> 
      mutate(shuffletx = complete_ra(nsubjects, prob=0.5))
    
    # estimate the ATE under the shuffled treatment
    model <- lm(y ~ shuffletx, data=rilarge)
    
    # store the estimated ate
    ri_results[i] <- model$coefficients["shuffletx"]
    
}

# turning our simulation results into a dataset, and calculating what percentage of estimates are at least as large (in absolute value) as our actual result

ri_results <- tibble(ri_results) |> 
  mutate(larger = ifelse(abs(ri_results) >= abs(our_ate), 1, 0))

# calculating the exact p.value
mean(ri_results$larger)
[1] 0

None of the 1,000 shuffled assignments produced an effect as large (in absolute value) as our actual estimate, so our estimated p-value is 0. Since we only sampled 1,000 of the possible randomizations, this is more accurately reported as p < 1/1000.