Standard Errors and Covariates
A Motivating Example
In the US, affective polarization – dislike of opposing partisans – has increased dramatically in recent decades.
In response to these trends, scholars and policymakers have tested interventions aimed at reducing affective polarization.
As it turns out, one of the most effective interventions involves watching a Heineken commercial.
Suppose we ran an experiment involving 600 Democrats and Republicans, half of whom were randomly assigned to view the commercial. These people include both “strong partisans” and “partisan leaners”. Strong partisans are more affectively polarized than “leaners.”
Here’s a model of the data generating process:
\[ Y_i(0) = \begin{cases} 50 + \epsilon_i & \text{if} \;\; i \;\; \text{is a partisan} \\ 25 + \epsilon_i & \text{if} \;\; i \;\; \text{is a "leaner"} \\ \end{cases} \] where \(\epsilon_i\) is randomly drawn from a normal distribution with mean = 0 and sd = 5:
\[ \epsilon_i \sim \mathcal{N}(0, 5) \]
The potential treated outcome is given by:
\[ Y_i(1) = Y_i(0) - 10 \]
Note that the treatment effect (-10) is constant across all individuals.
Let’s code that up:
# load packages
library(tidyverse)
library(randomizr)

# set seed
set.seed(123)

# Parameters
# ----------

# total number of people in the study
N <- 600

# types
types <- c("partisan", "leaner")

# generate data:
# --------------
type <- sample(types, N, replace = TRUE)

# creating potential outcomes
dta <- tibble(type) |>
  mutate(y0 = ifelse(type == "partisan",
                     rnorm(N, mean = 50, sd = 5),
                     rnorm(N, mean = 25, sd = 5)),
         y1 = y0 - 10)

# run a single experiment and reveal the outcome
exp <- dta |>
  mutate(tx = complete_ra(N, prob = 0.5),
         y = ifelse(tx == 1, y1, y0))
We can observe that the mean level of affective polarization is 37.82 in the control condition, compared to 27.74 in the treatment condition.
The difference is -10.08.
Since we made up the data, we know that this is pretty close to the “truth”.
However, if we were to run this experiment for real, we wouldn’t have that information.
Instead, all we would have is our single dataset, and we would use it to make inferences about the data generating process.
We do so by estimating the following regression model (which is equivalent to running a t-test):
\[ Y_i = \beta_0 + \beta_1 * Tx_i + \epsilon_i \]
library(estimatr)
library(broom)

# running the model
mod <- lm_robust(y ~ tx, data = exp)

# displaying the results
results <- tidy(mod, conf.int = TRUE)
results[, 1:7]
term estimate std.error statistic p.value conf.low conf.high
1 (Intercept) 37.81852 0.7902284 47.857709 1.154751e-206 36.26656 39.370482
2 tx -10.08129 1.1078340 -9.099998 1.332626e-18 -12.25701 -7.905569
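As a quick check on the "equivalent to a t-test" claim, the coefficient on tx is just the difference in means between the treatment and control groups. A minimal sketch, assuming exp from the chunk above is still in memory:

# difference in means: should match the tx coefficient above
with(exp, mean(y[tx == 1]) - mean(y[tx == 0]))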
Standard Errors
In addition to coefficient estimates, we also obtain:
- SE(\(\widehat{\beta_0}\)) = 0.7902284
- SE(\(\widehat{\beta_1}\)) = 1.107834
As a rule of thumb, when the coefficient is more than about twice its standard error (in absolute value), the result is statistically significant at the conventional 5% level.
You may even know the formula for the standard error of the slope estimate in a bivariate regression:
\[ SE(\hat{\beta_1}) = \frac{\sigma_e}{\sqrt{n}} \times \frac{1}{\sigma_{X}} \] where \(\sigma_e\) is the standard deviation of the regression residuals, and \(\sigma_X\) is the standard deviation of the regressor, \(X_i\).
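We can plug our single experiment into this formula. A minimal sketch using plain lm() (the fit object is introduced just for this check; lm_robust() reports HC2 robust standard errors, so the two numbers will match only approximately):

# the SE formula "by hand"
fit <- lm(y ~ tx, data = exp)
sigma_e <- sd(residuals(fit))   # sd of the regression residuals
sigma_x <- sd(exp$tx)           # sd of the regressor (the treatment indicator)
sigma_e / sqrt(nrow(exp)) * (1 / sigma_x)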
Let’s go back to our simulated data.
This time, we will run 1000 simulations and see how much variability there is in our estimates:
set.seed(123)

# vector to store the simulated effects:
v_tx <- c()

# number of simulations
nsims <- 1000

# looping
for (i in 1:nsims) {

  # randomly re-assigning treatments
  # and revealing outcomes
  exp <- dta |>
    mutate(tx = complete_ra(N, prob = 0.5),
           y = ifelse(tx == 1, y1, y0))

  # running the model
  mod <- lm_robust(y ~ tx, data = exp)

  # storing the estimated ATEs
  v_tx[i] <- mod$coefficients[2]
}
Let’s again confirm that the mean across our simulations is close to “the truth”:
mean(v_tx)
[1] -10.01909
Here’s the sampling distribution of our estimates:
# putting together the data for plotting
toplot <- tibble(v_tx)

# making a histogram
hist <- ggplot(data = toplot, aes(x = v_tx)) +
  geom_histogram(color = "white",
                 bins = 40) +
  geom_vline(xintercept = mean(v_tx),
             color = "red",
             linetype = "dashed") +
  theme_minimal() +
  coord_cartesian(xlim = c(-14, -5))

hist
While the average across our estimates is spot on, we see that our estimates also have some variability.
We can look at the standard deviation to quantify this variability:
sd(v_tx)
[1] 1.094896
The standard deviation of this distribution is 1.0949.
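This is the quantity that the standard error from our single regression was estimating: SE(\(\widehat{\beta_1}\)) = 1.108 is the model's guess at the standard deviation of this sampling distribution, and the simulation puts it at about 1.095. A rough check, assuming v_tx and results from the chunks above are still in memory:

# simulated sd of the estimates vs. the SE reported by one regression
sd(v_tx)
results$std.error[2]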
Larger Samples Yield Smaller Standard Errors
As we talked about before, one way to reduce sampling variability is to increase sample size.
Let’s see that now:
# doubling the number of observations
N2 <- 1200

# generating data with twice the number of observations
type2 <- sample(types, N2, replace = TRUE)

# creating potential outcomes
dta2 <- tibble(type2) |>
  mutate(y0 = ifelse(type2 == "partisan",
                     rnorm(N2, mean = 50, sd = 5),
                     rnorm(N2, mean = 25, sd = 5)),
         y1 = y0 - 10)

# setting seed
set.seed(123)

# vector to store the simulated effects:
v_tx2 <- c()

# number of simulations
nsims <- 1000

# looping
for (i in 1:nsims) {

  # randomly re-assigning treatments
  # and revealing outcomes
  exp2 <- dta2 |>
    mutate(tx = complete_ra(N2, prob = 0.5),
           y = ifelse(tx == 1, y1, y0))

  # running the model
  mod2 <- lm_robust(y ~ tx, data = exp2)

  # storing the estimated ATEs
  v_tx2[i] <- mod2$coefficients[2]
}
# putting together the data for plotting
toplot <- tibble(v_tx2)

# making a histogram
hist2 <- ggplot(data = toplot, aes(x = v_tx2)) +
  geom_histogram(color = "white",
                 bins = 40) +
  geom_vline(xintercept = mean(v_tx),
             color = "red",
             linetype = "dashed") +
  theme_minimal() +
  coord_cartesian(xlim = c(-14, -5))

# plotting side by side
library(patchwork)
hist / hist2 +
  plot_layout(axes = "collect")
Here’s the standard error with twice the number of observations:
sd(v_tx2)
[1] 0.7733885
Notice that sd(v_tx2) \(= \frac{\sqrt{600}}{\sqrt{1200}} \times\) sd(v_tx):
sd(v_tx2)
[1] 0.7733885
sd(v_tx) * 600^0.5 / 1200^0.5
[1] 0.7742086
That’s the formula in action!
Including Covariates Reduces Standard Errors
Recall that when making up the data, we had both “strong partisans” and “leaners”.
Even though the treatment effect was identical for both types, “leaners” had a lower baseline level of affective polarization:
mean(dta$y0[type=="leaner"])
[1] 25.02798
mean(dta$y0[type=="partisan"])
[1] 50.19225
Here’s what that looks like as a directed acyclic graph (DAG):
Suppose we control for \(type\) in our regression model:
\[ Y_i = \beta_0 + \beta_1 * Tx_i + \beta_2 * \text{Type} + \epsilon_i \]
This means that there’s less “unexplained” variance in the outcome.
In terms of the SE formula:
\[ SE(\hat{\beta_1}) = \frac{\sigma_e}{\sqrt{n}} \times \frac{1}{\sigma_{X}} \]
\(\sigma_e\) goes down, leading to a reduction in the standard error.
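Before looking at the standard errors, we can confirm that \(\sigma_e\) itself drops once \(type\) is in the model. A minimal sketch using plain lm(), since residuals are easiest to extract there:

# residual sd without and with the covariate
sd(residuals(lm(y ~ tx, data = exp)))
sd(residuals(lm(y ~ tx + type, data = exp)))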
We can see that in our model:
mod3 <- lm_robust(y ~ tx + type, data = exp)

results3 <- tidy(mod3)
results3[, 1:7]
term estimate std.error statistic p.value conf.low conf.high
1 (Intercept) 25.06359 0.3589955 69.81589 2.228848e-289 24.35854 25.768639
2 tx -10.06844 0.4106571 -24.51789 2.280728e-92 -10.87495 -9.261936
3 typepartisan 25.16153 0.4107754 61.25374 1.300784e-259 24.35479 25.968273
Better still, let’s compare these results against our original results using modelsummary():
library(modelsummary)

models <- list(mod, mod3)

modelsummary(models,
             coef_map = c('tx' = 'Heineken Treatment',
                          'typepartisan' = 'Strong Partisan',
                          '(Intercept)' = 'Constant'),
             gof_map = c("nobs", "r.squared"))
|                    | (1)     | (2)     |
|--------------------|---------|---------|
| Heineken Treatment | −11.075 | −10.068 |
|                    | (1.107) | (0.411) |
| Strong Partisan    |         | 25.162  |
|                    |         | (0.411) |
| Constant           | 38.315  | 25.064  |
|                    | (0.790) | (0.359) |
| Num.Obs.           | 600     | 600     |
| R2                 | 0.143   | 0.882   |
We can also see the reduction in standard errors via simulation:
set.seed(123)

# vector to store the simulated effects:
v_tx3 <- c()

# looping
for (i in 1:nsims) {

  # randomly re-assigning treatments
  # and revealing outcomes
  exp <- dta |>
    mutate(tx = complete_ra(N, prob = 0.5),
           y = ifelse(tx == 1, y1, y0))

  # running the model
  mod3 <- lm_robust(y ~ tx + type, data = exp)

  # storing the estimated ATEs
  v_tx3[i] <- mod3$coefficients[2]
}
# putting together the data for plotting
toplot <- tibble(v_tx3)

# making a histogram
hist3 <- ggplot(data = toplot, aes(x = v_tx3)) +
  geom_histogram(color = "white",
                 bins = 40) +
  geom_vline(xintercept = mean(v_tx),
             color = "red",
             linetype = "dashed") +
  theme_minimal() +
  coord_cartesian(xlim = c(-14, -5))

# plotting side by side
library(patchwork)
hist / hist3 +
  plot_layout(axes = "collect")
And here’s the standard error of the sampling distribution:
sd(v_tx3)
[1] 0.3940202
Here’s another way of understanding why the distribution became so much skinnier.
In the top histogram, where did those very "extreme" estimates come from?
Suppose that, just by chance, we get lots of partisans (high polarization) in the control group, and lots of leaners (lower polarization) in the treatment group. This would make the treatment seem very effective (= a very negative effect).
And if, just by chance, we get lots of partisans in the treatment group, and lots of leaners in the control group, this would make the treatment effect seem quite small.
But controlling for \(type\) is a bit like running two separate experiments: one for “strong partisans” and one for “leaners”.
This is what we mean by “holding \(type\) constant”.
We then estimate separate ATEs for each \(type\), and then average them together.
In this way, we eliminate the "extreme" randomizations described above, which is why our sampling distribution shrinks.
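Here is that intuition in code: a minimal sketch that computes a difference in means separately for each \(type\) and then averages the two, weighting by group size. (This is illustrative rather than literally what lm_robust() computes, but with a constant treatment effect the logic is the same; it assumes exp from the chunks above is still in memory.)

# subgroup difference-in-means, then a sample-share weighted average
ate_by_type <- exp |>
  group_by(type) |>
  summarize(ate = mean(y[tx == 1]) - mean(y[tx == 0]),
            n = n())

ate_by_type
weighted.mean(ate_by_type$ate, ate_by_type$n)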
Key Takeaway
Even though your treatment and some covariate may be uncorrelated, it’s still often helpful to control for covariates.
Your estimated treatment effect won’t change (much), but your standard errors might shrink.
In other words, your estimates will become more precise.
The more that your covariate predicts the outcome, the more you will gain in precision.
So when you have an experiment, adding controls to your regression model is not "wrong" per se (but see the next section); just be clear that you are doing it to increase precision, not to account for bias!
Mediators and Colliders
When deciding to add covariates to your model, it’s important to only control for variables that are causally prior to treatment.
Adding post-treatment variables to your model will introduce bias to your results.
There are two types of post-treatment variables to be wary of: mediators and colliders.
Mediators are the mechanism through which your treatment affects the outcome: treatment → mediator → outcome.
Colliders, in contrast, are variables that are affected by both the treatment and the outcome: treatment → collider ← outcome. Controlling for a collider also introduces bias.
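If you want to draw these two DAGs yourself, here is a rough sketch using the ggdag package (ggdag is not used elsewhere in this document, and the node names Tx, M, C, and Y are purely illustrative):

library(ggdag)
library(patchwork)

# mediator: Tx -> M -> Y (M transmits the treatment's effect on Y)
p_mediator <- ggdag(dagify(M ~ Tx, Y ~ M)) + theme_dag()

# collider: Tx -> C <- Y (C is caused by both the treatment and the outcome)
p_collider <- ggdag(dagify(C ~ Tx + Y, Y ~ Tx)) + theme_dag()

p_mediator / p_collider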
Moderators (Subgroup Effects)
Finally, covariates can be moderators. That is, they can define subgroups which respond differently to the treatment.
Suppose that "leaners" are much more responsive to the treatment than partisans.
For partisans, let's make the treatment effect 0; for "leaners", let's make it -20 (a 20-point drop in polarization).
Let’s modify our simulated data:
# set seed
set.seed(123)

# Parameters
# ----------

# total number of people in the study
N <- 600

# types
types <- c("partisan", "leaner")

# generate data:
# --------------
type <- sample(types, N, replace = TRUE)

# creating potential outcomes
dta <- tibble(type) |>
  mutate(y0 = ifelse(type == "partisan",
                     rnorm(N, mean = 50, sd = 5),
                     rnorm(N, mean = 25, sd = 5)),
         y1 = ifelse(type == "partisan",
                     y0,
                     y0 - 20),
         partisan = ifelse(type == "partisan", 1, 0))

# run a single experiment and reveal the outcome
exp <- dta |>
  mutate(tx = complete_ra(N, prob = 0.5),
         y = ifelse(tx == 1, y1, y0))
We can still estimate our original model to get the ATE:
\[ Y_i = \beta_0 + \beta_1 * Tx_i + \epsilon_i \]
However, this model does not tell us about the extent to which the treatment effect differs between partisans and leaners.
Here’s what the estimated treatment effect looks like:
library(ggforce)
ggplot(data = exp, aes(x = tx, y = y)) +
  geom_point(aes(color = type),
             position = position_jitternormal(sd_x = 0.05, sd_y = 0.05),
             alpha = 0.5) +
  geom_smooth(method = "lm",
              color = "black",
              se = FALSE) +
  theme_minimal() +
  scale_x_continuous(breaks = c(0, 1),
                     label = c("Control", "Treatment")) +
  labs(x = "",
       y = "Polarization",
       color = "Type")
This model is also not quite right:
\[ Y_i = \beta_0 + \beta_1 * Tx_i + \beta_2 * \text{Partisan}_i + \epsilon_i \]
Here's what that model looks like:
# estimate the model first:
m <- lm(y ~ tx + type, data = exp)

# predict the outcomes and store them
exp <- exp |>
  mutate(m_fit = predict(m))
# plotting
ggplot(data = exp, aes(x = tx, color = type)) +
  geom_point(aes(y = y),
             position = position_jitternormal(sd_x = 0.05, sd_y = 0.05),
             alpha = 0.5) +
  geom_smooth(aes(y = m_fit),
              method = "lm",
              se = FALSE) +
  theme_minimal() +
  scale_x_continuous(breaks = c(0, 1),
                     label = c("Control", "Treatment")) +
  labs(x = "",
       y = "Polarization",
       color = "Type")
Note that the two fitted lines have the same slope: the additive model forces the treatment effect to be identical for partisans and leaners.
Instead, we will need an interaction model:
\[ Y_i = \beta_0 + \beta_1 * Tx_i + \beta_2 * \text{Partisan}_i + \beta_3 * Tx_i * \text{Partisan}_i + \epsilon_i \]
This is the correct model visualized:
ggplot(data = exp, aes(x = tx, y = y, color = type)) +
  geom_point(position = position_jitternormal(sd_x = 0.05, sd_y = 0.05),
             alpha = 0.5) +
  geom_smooth(method = "lm",
              se = FALSE) +
  theme_minimal() +
  scale_x_continuous(breaks = c(0, 1),
                     label = c("Control", "Treatment")) +
  labs(x = "",
       y = "Polarization",
       color = "Type")
Let’s estimate all three models and display them side-by-side:
m1 <- lm(y ~ tx, data = exp)
m2 <- lm(y ~ tx + type, data = exp)
m3 <- lm(y ~ tx * type, data = exp)

library(modelsummary)

models <- list(m1, m2, m3)
modelsummary(models,
             coef_map = c('(Intercept)' = 'Constant',
                          'tx' = 'Treatment',
                          'typepartisan' = 'Partisan',
                          'tx:typepartisan' = 'Treatment x Partisan'),
             gof_map = c("nobs", "r.squared"))
|                      | (1)     | (2)     | (3)     |
|----------------------|---------|---------|---------|
| Constant             | 37.819  | 20.464  | 25.348  |
|                      | (1.095) | (0.497) | (0.407) |
| Treatment            | −9.681  | −10.619 | −20.657 |
|                      | (1.549) | (0.576) | (0.584) |
| Partisan             |         | 35.177  | 25.279  |
|                      |         | (0.577) | (0.580) |
| Treatment x Partisan |         |         | 19.811  |
|                      |         |         | (0.821) |
| Num.Obs.             | 600     | 600     | 600     |
| R2                   | 0.061   | 0.870   | 0.934   |
Step-by-step guide to interpreting interaction models
Let's write up the result from model 3, rounding the coefficients to whole numbers:
\[ Y_i = 25 - 20 * Treatment_i + 25 * Partisan_i + 20 * Treatment_i * Partisan_i \]
We also have subgroups:
| Subgroup            | Treatment Value | Partisan Value |
|---------------------|-----------------|----------------|
| Untreated Leaners   | 0               | 0              |
| Treated Leaners     | 1               | 0              |
| Untreated Partisans | 0               | 1              |
| Treated Partisans   | 1               | 1              |
By plugging these numbers into the formula, we can calculate the average outcome in each of our four groups.
| Substantive Meaning | Regression Coefficient | Value |
|---|---|---|
| Average outcome: Untreated Leaners | \(\beta_0\) | 25 |
| Average outcome: Treated Leaners | \(\beta_0 + \beta_1\) | 25 - 20 = 5 |
| Row 2 - Row 1: Treatment effect for Leaners only | \((\beta_0 + \beta_1) - \beta_0 = \beta_1\) | (25 - 20) - 25 = -20 |
Notice how the interpretation of \(\beta_1\) is now conditional upon a particular subgroup.
We can do a similar exercise for Partisans.
| Substantive Meaning | Regression Coefficient | Value |
|---|---|---|
| Average outcome: Untreated Partisans | \(\beta_0 + \beta_2\) | 25 + 25 = 50 |
| Average outcome: Treated Partisans | \(\beta_0 + \beta_1 + \beta_2 + \beta_3\) | 25 - 20 + 25 + 20 = 50 |
| Row 2 - Row 1: Treatment effect for Partisans only | \((\beta_0 + \beta_1 + \beta_2 + \beta_3) - (\beta_0 + \beta_2) = \beta_1 + \beta_3\) | (25 - 20 + 25 + 20) - (25 + 25) = 0 |
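We can verify this arithmetic directly from the fitted model. A quick sketch, assuming m3 from above is still in memory (the coefficient names come from R's default factor coding of type):

b <- coef(m3)

b["(Intercept)"]                      # average outcome: untreated leaners
b["(Intercept)"] + b["tx"]            # average outcome: treated leaners
b["(Intercept)"] + b["typepartisan"]  # average outcome: untreated partisans
sum(b)                                # average outcome: treated partisans
b["tx"] + b["tx:typepartisan"]        # treatment effect for partisans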
Linear Combinations
In the exercise above, we had to add coefficients together manually (\(\beta_1 + \beta_3\)) to calculate the treatment effect for partisans.
But is this significantly different from 0?
How do you get the standard errors?
To get the standard errors, you will need to use linear combinations:
library(multcomp)
# showing the standard error
summary(glht(m3, linfct = c("tx + tx:typepartisan = 0")))
Simultaneous Tests for General Linear Hypotheses
Fit: lm(formula = y ~ tx * type, data = exp)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
tx + tx:typepartisan == 0 -0.8460 0.5763 -1.468 0.143
(Adjusted p values reported -- single-step method)
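Under the hood, glht() is essentially applying the variance rule for a sum of estimated coefficients: \[ \text{Var}(\hat{\beta_1} + \hat{\beta_3}) = \text{Var}(\hat{\beta_1}) + \text{Var}(\hat{\beta_3}) + 2\,\text{Cov}(\hat{\beta_1}, \hat{\beta_3}) \] Here is a minimal sketch of the same calculation by hand, pulling the pieces from the model's variance-covariance matrix:

# linear combination "by hand" from the variance-covariance matrix
V   <- vcov(m3)
est <- coef(m3)["tx"] + coef(m3)["tx:typepartisan"]
se  <- sqrt(V["tx", "tx"] +
            V["tx:typepartisan", "tx:typepartisan"] +
            2 * V["tx", "tx:typepartisan"])

c(estimate = unname(est), std.error = se)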