Standard Errors and Covariates

A Motivating Example

In the US, affective polarization – dislike of opposing partisans – has increased dramatically in recent decades.

In response to these trends, scholars and policymakers have tested interventions aimed at reducing affective polarization.

As it turns out, one of the most effective interventions involves watching a Heineken commercial.

Suppose we ran an experiment involving 600 Democrats and Republicans, half of whom were randomly assigned to view the commercial. These people include both “strong partisans” and “partisan leaners”. Strong partisans are more affectively polarized than “leaners.”

Here’s a model of the data generating process:

\[ Y_i(0) = \begin{cases} 50 + \epsilon_i & \text{if} \;\; i \;\; \text{is a partisan} \\ 25 + \epsilon_i & \text{if} \;\; i \;\; \text{is a "leaner"} \\ \end{cases} \] where \(\epsilon_i\) is randomly drawn from a normal distribution with mean = 0 and sd = 5:

\[ \epsilon_i \sim \mathcal{N}(0, 5) \]

The potential treated outcome is given by:

\[ Y_i(1) = Y_i(0) - 10 \]

Note that the treatment effect (-10) is constant across all individuals.

Let’s code that up:

# load packages
library(tidyverse)
library(randomizr)

# set seed
set.seed(123)

# Parameters
# ----------

# total number of people in the study
N <- 600

# types
types <- c("partisan", "leaner")


# generate data:
# --------------

type <- sample(types, N, replace = TRUE)

# creating potential outcomes
dta <- tibble(type) |> 
  mutate(y0 = ifelse(type == "partisan", 
                     rnorm(N, mean=50, sd=5),
                     rnorm(N, mean=25, sd=5)),
         y1 = y0 - 10)



# run a single experiment and reveal the outcome
exp <- dta |> 
  mutate(tx = complete_ra(N, prob = 0.5),
         y = ifelse(tx == 1, y1, y0))

We can observe that the mean level of affective polarization is 37.82 in the control condition, compared to 27.74 in the treatment condition.

The difference is -10.08.
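If you want to reproduce these numbers yourself, here is one way to do it (a sketch):

# mean polarization by treatment status
exp |> 
  group_by(tx) |> 
  summarise(mean_polarization = mean(y))

# difference-in-means estimate
mean(exp$y[exp$tx == 1]) - mean(exp$y[exp$tx == 0])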

Since we made up the data, we know that this is pretty close to the “truth”.

However, if we were to run this experiment for real, we wouldn’t have that information.

Instead, all we would have is our single dataset, and we would use it to make inferences about the data generating process.

We do so by estimating the following regression model (which is equivalent to running a t-test):

\[ Y_i = \beta_0 + \beta_1 * Tx_i + \epsilon_i \]
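As an illustrative aside, a two-sample t-test recovers the same difference in means. (Note that t.test() reports mean(control) minus mean(treatment), so the sign is flipped relative to the regression coefficient.)

# two-sample t-test of polarization by treatment status
t.test(y ~ tx, data = exp)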

library(estimatr)
library(broom)

# running the model
mod <- lm_robust(y ~ tx, data=exp)


# displaying the results
results <- tidy(mod, conf.int = TRUE)

results[, 1:7]
         term  estimate std.error statistic       p.value  conf.low conf.high
1 (Intercept)  37.81852 0.7902284 47.857709 1.154751e-206  36.26656 39.370482
2          tx -10.08129 1.1078340 -9.099998  1.332626e-18 -12.25701 -7.905569
Exercise:

Can you interpret the coefficients \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\)?

Write up these results as if you were writing for your BA thesis. What does this table tell you?

My answer here


Standard Errors

In addition to coefficient estimates, we also obtain:

  • SE(\(\widehat{\beta_0}\)) = 0.7902284
  • SE(\(\widehat{\beta_1}\)) = 1.107834

As a rule of thumb, when the coefficient is more than about twice its standard error (in absolute value), the result is statistically significant at the 5% level.

You may even know the formula for the standard error of the slope estimate in a bivariate regression:

\[ SE(\hat{\beta_1}) = \frac{\sigma_e}{\sqrt{n}} \times \frac{1}{\sigma_{X}} \] where \(\sigma_e\) is the standard deviation of the regression residuals, and \(\sigma_X\) is the standard deviation of the regressor, \(X_i\).
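As a rough, illustrative check of this formula against the output above (refitting the same model with plain lm() so that resid() is available; the match is only approximate):

# residual SD, divided by sqrt(n), times 1 / SD of the treatment indicator
m_ols <- lm(y ~ tx, data = exp)
sd(resid(m_ols)) / (sqrt(nrow(exp)) * sd(exp$tx))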

Understanding check:

But conceptually, what does the standard error represent?

The textbook definition is: “The standard error summarizes the variability in an estimate due to random sampling.”

What does that mean (and where does the random sampling come in)?

Let’s go back to our simulated data.

This time, we will run 1000 simulations and see how much variability there is in our estimates:

set.seed(123)

# vectors to store the simulated effects:
v_tx <- c()

# number of simulations
nsims <- 1000

# looping
for (i in 1:nsims) {
  
  # randomly re-assigning treatments
  # and revealing outcomes
  exp <- dta |> 
    mutate(tx = complete_ra(N, prob = 0.5),
           y = ifelse(tx == 1, y1, y0))
  
  # running the model
  mod <- lm_robust(y ~ tx, data=exp)

  
  # storing the estimated ATEs
  v_tx[i] <- mod$coefficients[2]
  
  
}

Let’s again confirm that the mean across our simulations is close to “the truth”:

mean(v_tx)
[1] -10.01909

Here’s the sampling distribution of our estimates:

# putting together the data for plotting
toplot <- tibble(v_tx)

# making a histogram
hist <- ggplot(data=toplot, aes(x=v_tx)) + 
  geom_histogram(color="white",
                 bins=40) +
  geom_vline(xintercept = mean(v_tx),
             color="red",
             linetype = "dashed") +
  theme_minimal() +
  coord_cartesian(xlim=c(-14,-5))


hist

While the average across our estimates is spot on, we see that our estimates also have some variability.

We can look at the standard deviation to quantify this variability:

sd(v_tx)
[1] 1.094896

The standard deviation of this distribution is 1.0949.

Understanding check:

Does that number look familiar?

Now what is the standard error again?


Larger Samples Yield Smaller Standard Errors

As we talked about before, one way to reduce sampling variability is to increase sample size.

Let’s see that now:

# doubling the number of observations
N2 <- 1200

# generating data with twice the number of observations
type2 <- sample(types, N2, replace = TRUE)

# creating potential outcomes
dta2 <- tibble(type2) |> 
  mutate(y0 = ifelse(type2 == "partisan", 
                     rnorm(N2, mean=50, sd=5),
                     rnorm(N2, mean=25, sd=5)),
         y1 = y0 - 10)


# setting seed
set.seed(123)

# vectors to store the simulated effects:
v_tx2 <- c()

# number of simulations
nsims <- 1000


# looping
for (i in 1:nsims) {
  
  # randomly re-assigning treatments
  # and revealing outcomes
  exp2 <- dta2 |> 
    mutate(tx = complete_ra(N2, prob = 0.5),
           y = ifelse(tx == 1, y1, y0))
  
  # running the model
  mod2 <- lm_robust(y ~ tx, data=exp2)

  # storing the estimated ATEs
  v_tx2[i] <- mod2$coefficients[2]
  
}

# putting together the data for plotting
toplot <- tibble(v_tx2)

# making a histogram
hist2 <- ggplot(data=toplot, aes(x=v_tx2)) + 
  geom_histogram(color="white",
                 bins=40) +
  geom_vline(xintercept = mean(v_tx),
             color="red",
             linetype = "dashed") +
  theme_minimal() + 
  coord_cartesian(xlim=c(-14,-5))

# plotting side by side
library(patchwork)

hist / hist2 +
   plot_layout(axes = "collect")

Here’s the standard error with twice the number of observations:

sd(v_tx2)
[1] 0.7733885

Notice that sd(v_tx2) \(\approx \frac{\sqrt{600}}{\sqrt{1200}} \times\) sd(v_tx):

sd(v_tx2)
[1] 0.7733885
sd(v_tx) * 600^0.5 / 1200^0.5
[1] 0.7742086

That’s the formula in action!


Including Covariates Reduces Standard Errors

Recall that when making up the data, we had both “strong partisans” and “leaners”.

Even though the treatment effect was identical for both types, “leaners” had a lower baseline level of affective polarization:

mean(dta$y0[type=="leaner"])
[1] 25.02798
mean(dta$y0[type=="partisan"])
[1] 50.19225

Here’s what that looks like as a directed acyclic graph (DAG): \(type\) affects polarization, treatment affects polarization, and (thanks to random assignment) \(type\) is unrelated to treatment.

Suppose we control for \(type\) in our regression model:

\[ Y_i = \beta_0 + \beta_1 * Tx_i + \beta_2 * \text{Type}_i + \epsilon_i \]

This means that there’s less “unexplained” variance in the outcome.

In terms of the SE formula:

\[ SE(\hat{\beta_1}) = \frac{\sigma_e}{\sqrt{n}} \times \frac{1}{\sigma_{X}} \]

\(\sigma_e\) goes down, leading to a reduction in the standard error.
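Here’s a quick, illustrative way to see \(\sigma_e\) shrink, using plain lm() so that resid() is available:

# residual SD without the covariate: still contains the partisan/leaner gap
sd(resid(lm(y ~ tx, data = exp)))

# residual SD with the covariate: close to the noise SD (about 5 by construction)
sd(resid(lm(y ~ tx + type, data = exp)))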

We can see that in our model:

mod3 <- lm_robust(y ~ tx + type, data=exp) 

results3 <- tidy(mod3)

results3[, 1:7]
          term  estimate std.error statistic       p.value  conf.low conf.high
1  (Intercept)  25.06359 0.3589955  69.81589 2.228848e-289  24.35854 25.768639
2           tx -10.06844 0.4106571 -24.51789  2.280728e-92 -10.87495 -9.261936
3 typepartisan  25.16153 0.4107754  61.25374 1.300784e-259  24.35479 25.968273

Better still, let’s compare these results against our original results using modelsummary():

library(modelsummary)

models <- list(mod, mod3)

modelsummary(models, 
             coef_map = c('tx' = 'Heineken Treatment',
                          'typepartisan' = 'Strong Partisan',
                          '(Intercept)' = 'Constant'),
             gof_map = c("nobs", "r.squared"))
                        (1)        (2)
Heineken Treatment   −11.075    −10.068
                      (1.107)    (0.411)
Strong Partisan                  25.162
                                 (0.411)
Constant              38.315     25.064
                      (0.790)    (0.359)
Num.Obs.                  600        600
R2                      0.143      0.882
Exercise:

Write up these results as if you were writing for your BA thesis.

My answer here

We can also see the reduction in standard errors via simulation:

set.seed(123)

# vectors to store the simulated effects:
v_tx3 <- c()

# looping
for (i in 1:nsims) {
  
  # randomly re-assigning treatments
  # and revealing outcomes
  exp <- dta |> 
    mutate(tx = complete_ra(N, prob = 0.5),
           y = ifelse(tx == 1, y1, y0))
  
  # running the model
  mod3 <- lm_robust(y ~ tx + type, data=exp)

  
  # storing the estimated ATEs
  v_tx3[i] <- mod3$coefficients[2]
  
}


# putting together the data for plotting
toplot <- tibble(v_tx3)

# making a histogram
hist3 <- ggplot(data=toplot, aes(x=v_tx3)) + 
  geom_histogram(color="white",
                 bins=40) +
  geom_vline(xintercept = mean(v_tx),
             color="red",
             linetype = "dashed") +
  theme_minimal() + 
  coord_cartesian(xlim=c(-14,-5))

# plotting side by side
library(patchwork)

hist / hist3 +
   plot_layout(axes = "collect")

And here’s the standard error of the sampling distribution:

sd(v_tx3)
[1] 0.3940202

Here’s another way of understanding why the distribution became so much skinnier.

In the top histogram, where did those very “extreme” estimates come from?

Suppose that, just by chance, we get lots of partisans (high polarization) in the control group, and lots of leaners (lower polarization) in the treatment group. This would make the treatment seem very effective (= a very negative effect).

And if, just by chance, we get lots of partisans in the treatment group, and lots of leaners in the control group, this would make the treatment effect seem quite small.

But controlling for \(type\) is a bit like running two separate experiments: one for “strong partisans” and one for “leaners”.

This is what we mean by “holding \(type\) constant”.

We then estimate a separate ATE for each \(type\) and average the two together.

In this way, we eliminate the “extreme” randomizations described above, which is why our sampling distribution shrinks.
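Here’s a rough sketch of that logic in code. (The regression adjustment doesn’t weight the two subgroup estimates in exactly this way, but the intuition is the same.)

# difference-in-means within each type, then a sample-size-weighted average
by_type <- exp |> 
  group_by(type) |> 
  summarise(ate_hat = mean(y[tx == 1]) - mean(y[tx == 0]),
            n = n())

by_type

weighted.mean(by_type$ate_hat, by_type$n)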


Key Takeaway

Even when your treatment and a covariate are uncorrelated (as they will be, in expectation, under random assignment), it’s still often helpful to control for that covariate.

Your estimated treatment effect won’t change (much), but your standard errors might shrink.

In other words, your estimates will become more precise.

The more that your covariate predicts the outcome, the more you will gain in precision.

So when you have an experiment, adding controls to your regression model is not “wrong” per se (but see next section)…just be clear that you are doing it to increase precision, not to account for bias!


Mediators and Colliders

When deciding to add covariates to your model, it’s important to only control for variables that are causally prior to treatment.

Adding post-treatment variables to your model will introduce bias to your results.

There are two types of post-treatment variables to be wary of: mediators and colliders.

Mediators are the mechanism through which your treatment affects the outcome: treatment → mediator → outcome.

Example for discussion:

Suppose you were interested in the effects of smoking on life expectancy. And you control for whether people have lung cancer.

What happens to the effect of smoking once you control for lung cancer?
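If you want to see this mechanically, here is a toy simulation (all numbers are made up purely for illustration):

# toy simulation of post-treatment (mediator) bias -- made-up numbers
set.seed(42)
n <- 10000

smoking <- rbinom(n, 1, 0.3)                        # who smokes
cancer  <- rbinom(n, 1, plogis(-3 + 2 * smoking))   # smoking raises cancer risk
life    <- 80 - 15 * cancer - 2 * smoking + rnorm(n, 0, 5)

# total effect of smoking on life expectancy
coef(lm(life ~ smoking))["smoking"]

# "controlling for" the mediator hides most of that effect
coef(lm(life ~ smoking + cancer))["smoking"]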


In contrast to mediators, colliders are variables that are affected by both the treatment and the outcome: treatment → collider ← outcome.

Example for discussion:

Suppose you are interested in the relationship between being tall and being good at basketball.

You control for whether someone plays in the NBA.

Amongst NBA players, what is the relationship between height and e.g. points per game?
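If it helps to see this in action, here is a toy simulation (again, every number is made up purely for illustration):

# toy simulation of collider bias -- made-up numbers
set.seed(42)
n <- 10000

height <- rnorm(n)                         # standardized height
skill  <- rnorm(n)                         # basketball skill, independent of height
nba    <- as.numeric(height + skill > 2)   # only the tallest and/or most skilled make the NBA

# in the full population, height and skill are unrelated
cor(height, skill)

# among NBA players, conditioning on the collider induces a negative relationship
cor(height[nba == 1], skill[nba == 1])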


Moderators (Subgroup Effects)

Finally, covariates can be moderators. That is, they can define subgroups which respond differently to the treatment.

Suppose that “leaners” are much more responsive to treatment than partisans.

For partisans, let’s make the treatment effect 0, and for “leaners” let’s make it -20 (a 20-point reduction in polarization).

Let’s modify our simulated data:

# set seed
set.seed(123)

# Parameters
# ----------

# total number of people in the study
N <- 600

# types
types <- c("partisan", "leaner")


# generate data:
# --------------

type <- sample(types, N, replace = TRUE)

# creating potential outcomes
dta <- tibble(type) |> 
  mutate(y0 = ifelse(type == "partisan", 
                     rnorm(N, mean=50, sd=5),
                     rnorm(N, mean=25, sd=5)),
         y1 = ifelse(type == "partisan", 
                     y0,
                     y0 - 20),
         partisan = ifelse(type == "partisan", 1, 0))



# run a single experiment and reveal the outcome
exp <- dta |> 
  mutate(tx = complete_ra(N, prob = 0.5),
         y = ifelse(tx == 1, y1, y0))

We can still estimate our original model to get the ATE:

\[ Y_i = \beta_0 + \beta_1 * Tx_i + \epsilon_i \]

However, this model does not tell us about the extent to which the treatment effect differs between partisans and leaners.

Here’s what the estimated treatment effect looks like:

library(ggforce)

ggplot(data=exp, aes(x = tx, y=y)) +
  geom_point(aes(color=type),
             position = position_jitternormal(sd_x = 0.05, sd_y = 0.05),
             alpha=0.5) +
  geom_smooth(method="lm",
              color = "black",
              se = FALSE) +
  theme_minimal() +
  scale_x_continuous(breaks=c(0,1), 
                     label=c("Control", "Treatment")) +
  labs(x = "",
       y = "Polarization",
       color = "Type")


This model is also not quite right:

\[ Y_i = \beta_0 + \beta_1 * Tx_i + \beta_2 * \text{Partisan}_i + \epsilon_i \]

Exercise:

Can you draw what the model looks like?

Here’s the answer:

# estimate the model first:
m <- lm(y ~ tx + type, data=exp)

# predict the outcomes and store them
exp <- exp |> 
  mutate(m_fit = predict(m))


# plotting
ggplot(data=exp, aes(x = tx, color = type)) +
  geom_point(aes(y=y),
             position = position_jitternormal(sd_x = 0.05, sd_y = 0.05),
             alpha=0.5) +
  geom_smooth(aes(y=m_fit),
              method="lm",
              se = FALSE) +
  theme_minimal() +
  scale_x_continuous(breaks=c(0,1), 
                     label=c("Control", "Treatment")) +
  labs(x = "",
       y = "Polarization",
       color = "Type")

Note that the slopes are still the same!


Instead, we will need an interaction model:

\[ Y_i = \beta_0 + \beta_1 * Tx_i + \beta_2 * \text{Partisan}_i + \beta_3 * Tx_i * \text{Partisan}_i + \epsilon_i \]

This is the correct model visualized:

ggplot(data=exp, aes(x = tx, y=y, color = type)) +
  geom_point(position = position_jitternormal(sd_x = 0.05, sd_y = 0.05),
             alpha=0.5) +
  geom_smooth(method="lm",
              se = FALSE) +
  theme_minimal() +
  scale_x_continuous(breaks=c(0,1), 
                     label=c("Control", "Treatment")) +
  labs(x = "",
       y = "Polarization",
       color = "Type")


Let’s estimate all three models and display them side-by-side:

m1 <- lm(y ~ tx, data = exp)
m2 <- lm(y ~ tx + type, data=exp)
m3 <- lm(y ~ tx*type, data=exp)

library(modelsummary)
models <- list(m1,m2,m3)

modelsummary(models,
             coef_map = c('(Intercept)' = 'Constant',
                          'tx' = 'Treatment',
                          'typepartisan' = 'Partisan',
                          'tx:typepartisan' = 'Treatment x Partisan'),
             gof_map = c("nobs", "r.squared"))
                          (1)        (2)        (3)
Constant                37.819     20.464     25.348
                        (1.095)    (0.497)    (0.407)
Treatment               −9.681    −10.619    −20.657
                        (1.549)    (0.576)    (0.584)
Partisan                           35.177     25.279
                                   (0.577)    (0.580)
Treatment x Partisan                          19.811
                                              (0.821)
Num.Obs.                   600        600        600
R2                       0.061      0.870      0.934
Understanding check:

Can you interpret the coefficients on the interaction model (model 3)?

According to model 3, what is the treatment effect for strong partisans?

Exercise:

Write up these results as if you were writing for your BA thesis.

Pay attention to how the interpretation of the coefficients changes across the models.


Step-by-step guide to interpreting interaction models

Let’s write up the result from model 3:

\[ Y_i = 25 - 20 * Treatment_i + 25 * Partisan_i + 20 * Treatment_i * Partisan_i \]

We also have subgroups:

Subgroup              Treatment Value   Partisan Value
Untreated Leaners             0                0
Treated Leaners               1                0
Untreated Partisans           0                1
Treated Partisans             1                1

By plugging these numbers into the formula, we can calculate the average outcome in each of our four groups.

Exercise:

Calculate these four marginal means.

Substantive Meaning                                 Regression Coefficient                         Value
Average outcome: Untreated Leaners                  \(\beta_0\)                                    25
Average outcome: Treated Leaners                    \(\beta_0 + \beta_1\)                          25 - 20 = 5
Row 2 - Row 1: Treatment effect for Leaners only    \((\beta_0 + \beta_1) - \beta_0 = \beta_1\)    (25 - 20) - 25 = -20

Notice how the interpretation of \(\beta_1\) is now conditional upon a particular subgroup.

We can do a similar exercise for Partisans.

Substantive Meaning                                   Regression Coefficient                                                                  Value
Average outcome: Untreated Partisans                  \(\beta_0 + \beta_2\)                                                                   25 + 25 = 50
Average outcome: Treated Partisans                    \(\beta_0 + \beta_1 + \beta_2 + \beta_3\)                                               25 - 20 + 25 + 20 = 50
Row 2 - Row 1: Treatment effect for Partisans only    \((\beta_0 + \beta_1 + \beta_2 + \beta_3) - (\beta_0 + \beta_2) = \beta_1 + \beta_3\)    (25 - 20 + 25 + 20) - (25 + 25) = 0
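You can also check these marginal means directly in R (illustrative). Because model 3 is fully interacted, its predictions reproduce the raw subgroup means:

# average observed outcome in each of the four subgroups
exp |> 
  group_by(type, tx) |> 
  summarise(mean_y = mean(y), .groups = "drop")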

Understanding check:

How would you interpret \(\beta_2\)?

And what about \(\beta_3\)?


Linear Combinations

In the exercise above, we had to add the coefficients manually (\(\beta_1 + \beta_3\)) to calculate the treatment effect for partisans.

But is this significantly different from 0?

How do you get the standard errors?

To get the standard errors, you will need to use linear combinations:

library(multcomp)

# showing the standard error
summary(glht(m3, linfct = c("tx + tx:typepartisan = 0")))

     Simultaneous Tests for General Linear Hypotheses

Fit: lm(formula = y ~ tx * type, data = exp)

Linear Hypotheses:
                          Estimate Std. Error t value Pr(>|t|)
tx + tx:typepartisan == 0  -0.8460     0.5763  -1.468    0.143
(Adjusted p values reported -- single-step method)
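For reference, here is the same calculation done by hand from the model’s variance-covariance matrix (a sketch of what glht() computes for you):

# estimate and standard error of (beta_1 + beta_3) computed by hand
b <- coef(m3)
V <- vcov(m3)

est <- unname(b["tx"] + b["tx:typepartisan"])
se  <- sqrt(V["tx", "tx"] +
            V["tx:typepartisan", "tx:typepartisan"] +
            2 * V["tx", "tx:typepartisan"])

c(estimate = est, std.error = se)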