Instrumental Variables
A Canvassing Experiment
In the US, political campaigns frequently send volunteers door-to-door to encourage people to vote. This is called “canvassing”.
In the US, you can also find out whether someone voted (but not who they voted for).
Suppose we ran an experiment to determine the effectiveness of door-to-door canvassing.
We randomly select 2000 households, and then randomly allocate 1000 of them (the treatment group) to receive a visit from a volunteer. The remaining 1000 households are our control group.
As it turns out, however, the majority of people in the treatment group were not home when the canvasser came knocking.
Suppose this is how the experiment actually turned out:
The Intent-to-Treat Effect (ITT)
The simplest strategy would be to completely ignore treatment non-compliance and simply compare those who were assigned to treatment vs. those assigned to control.
Let’s use the variable \(Z\) to denote treatment assignment:
\[ Z_i = \begin{cases} 1 & \text{if assigned to treatment} \\ 0 & \text{if assigned to control} \\ \end{cases} \]
Another name for \(Z\) is the instrument.
Since treatment assignment is randomly determined by the researcher, the instrument Z is uncorrelated with potential outcomes, as well as all possible confounders.
Thus, we can just compare the mean outcomes in these two groups determined by random treatment assignment:
\[ \mathbb{E}[Y_i | Z_i = 1] - \mathbb{E}[Y_i | Z_i = 0] = ITT \] The difference constitutes the Intent-to-Treat effect (ITT).
Sometimes, the ITT is what we are after. After all, “real life” canvassing programs must deal with the problem that some people won’t answer the door.
This “real world impact” of such programs is measured by the ITT.
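For instance, here is a minimal sketch of how you could estimate the ITT in R. The data are simulated purely for illustration (the household counts match the setup above, but the turnout rates are made up, not results from the actual experiment):

# hypothetical data: 2000 households, half assigned to receive a canvassing visit
set.seed(42)
canvass <- data.frame(z = rep(c(0, 1), each = 1000))
canvass$voted <- rbinom(2000, 1, 0.45 + 0.05 * canvass$z)  # made-up turnout rates

# ITT: difference in mean turnout between the assigned-to-treatment and control groups
mean(canvass$voted[canvass$z == 1]) - mean(canvass$voted[canvass$z == 0])

# equivalently, the coefficient on z from a regression of turnout on assignment
coef(lm(voted ~ z, data = canvass))["z"]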
On the other hand, the ITT doesn’t tell us about the causal effect of talking to a canvasser on turnout. But theoretically, maybe that’s what we want to learn about.
So what can we do?
Compliance Types
As a first step to answering this question, it’s helpful to split our sample into two “types” of people:
- compliers do what they should: they take the treatment if assigned to the treatment group, and they go untreated if assigned to the control group
- never-takers do not take the treatment, no matter what group they are assigned to
More formally, let’s define \(D_i\) as an individual’s treatment status (that is, whether or not they were actually treated).
We can think of \(D_i\) in terms of potential outcomes, just like \(Y_i\).
So \(D_i(Z_i = 0)\) denotes the potential treatment status of someone who is assigned to control, and \(D_i(Z_i = 1)\) denotes the potential treatment status of someone who is assigned to treatment.
Thus a complier is someone for whom \(D_i(Z_i = 1) = 1\) and \(D_i(Z_i = 0) = 0\).
By contrast, a never-taker is someone for whom \(D_i(Z_i = 1) = 0\) and \(D_i(Z_i = 0) = 0\).
Now let’s go back to our graph:
Complier Average Causal Effect (CACE)
While we cannot estimate the ATE for the entire population, we can estimate the treatment effect for the subgroup of compliers.
This is called the Complier Average Causal Effect (CACE), or sometimes the Local Average Treatment Effect (LATE).
Let’s just go with CACE.
We can calculate the CACE as:
\[ CACE = \frac{ITT}{\pi_c} \]
Why is that true?
Here’s a graphical illustration:
On the left side, we see the potential outcomes if everyone were assigned to control (Z = 0).
And on the right side, we see the potential outcomes if everyone were assigned to treatment (Z=1).
We can calculate \(\mathbb{E}(Y_i(Z_i =0))\) as a weighted average: the proportion of compliers times the average potential control outcome for compliers, plus the proportion of never-takers times the average potential control outcome for never-takers.
In other words:
\[ \mathbb{E}(Y_i(Z_i=0)) = \pi_c \times 0.5 + \pi_{NT} \times 0.4 \]
Geometrically, \(\mathbb{E}(Y_i(Z_i=0))\) is represented by the area of the left side of the graph.
Similarly, we can represent \(\mathbb{E}(Y_i(Z_i=1))\) as the area of the right side of the graph:
\[ \mathbb{E}(Y_i(Z_i=1)) = \pi_c \times 0.6 + \pi_{NT} \times 0.4 \]
Notice that the average outcome for never-takers doesn’t change as we go from the left to the right side, since never-takers don’t “take” the treatment!
By contrast, compliers do respond to treatment, and so their potential outcomes change from 0.5 to 0.6.
The difference in the two areas – represented by the blue box – represents the ITT.
\[ ITT = \mathbb{E}(Y_i(Z_i=1)) - \mathbb{E}(Y_i(Z_i=0)) = (0.6 - 0.5) \times \pi_c \]
The CACE is represented by the height of the blue box.
So we know (or can estimate) the area of the box (ITT), and we know (or can estimate) the width of the box (\(\pi_c\)).
Since \(\text{width} \times \text{height} = \text{area}\), it follows that \(ITT = CACE \times \pi_c\).
Re-arranging, we get the formula for \(CACE = \frac{ITT}{\pi_c}\).
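As a quick numerical check, here is the box arithmetic in R, using the 0.5, 0.6, and 0.4 values from the figure and a made-up complier share of 0.8:

# numbers from the figure: compliers move from 0.5 to 0.6; never-takers stay at 0.4
# pi_c = 0.8 is a hypothetical complier share, chosen only for illustration
pi_c  <- 0.8
pi_nt <- 1 - pi_c

ey_z0 <- pi_c * 0.5 + pi_nt * 0.4   # E[Y(Z=0)]: area of the left side
ey_z1 <- pi_c * 0.6 + pi_nt * 0.4   # E[Y(Z=1)]: area of the right side

itt  <- ey_z1 - ey_z0               # area of the blue box: (0.6 - 0.5) * pi_c = 0.08
cace <- itt / pi_c                  # height of the box: 0.1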
One-sided vs. Two-sided Non-compliance
Let’s do a slightly more complicated example:
Suppose we are interested in the causal effect of military service on an individual’s political attitudes (measured on a liberal-conservative dimension).
We might think that military service makes people more conservative.
On the other hand, maybe people who are more conservative in the first place volunteer to serve in the military.
Thus, the comparison between military veterans and non-veterans is likely to be biased by self-selection into treatment.
To get around this bias, scholars have exploited draft lotteries which randomize an individual’s chances of military service:
The logic is that people who have “bad” draft numbers are, on average, the same as people with “good” draft numbers.
However, just because you are drafted doesn’t mean that you will 100% go into the military. You might, for example, obtain an exemption based on physical health, education, family status, etc. You might also just “dodge” the draft.
And on the flip side, even people who are not drafted can still volunteer.
Here’s what that situation looks like:
Note that we now have more compliance “types” to deal with:
- compliers join the military if drafted, but stay out if not drafted
- always-takers join the military, regardless of their draft status
- never-takers stay out of the military, regardless of their draft status
- defiers volunteer for the military if not drafted, but stay out if drafted
With four types, we can no longer “back out” the proportion of compliers without an additional assumption: namely, that there exist NO DEFIERS. This is sometimes called the monotonicity assumption (I will explain why below).
For now, once we rule out the existence of defiers, we can figure out the proportions of always-takers (\(\pi_{AT}\)) and never-takers (\(\pi_{NT}\)), and thus “back out” the proportion of compliers (\(\pi_{C}\)).
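Concretely, with a randomly assigned instrument and no defiers, anyone who is treated despite being assigned to control must be an always-taker, and anyone who is untreated despite being assigned to treatment must be a never-taker. So:

\[ \pi_{AT} = \Pr(D_i = 1 \mid Z_i = 0), \qquad \pi_{NT} = \Pr(D_i = 0 \mid Z_i = 1), \qquad \pi_{C} = 1 - \pi_{AT} - \pi_{NT} \]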
From there, everything else is the same.
We know that the ITT is only influenced by the response of compliers:
And thus we estimate the CACE in the same way:
\[ CACE = \frac{ITT}{\pi_C} \]
CACE vs. ATE
At this point, it’s worth stressing something: the CACE is not the same thing as the ATE.
Let’s look at a hypothetical example:
| Name    | Type         | Effect of being drafted on military service | Effect of military service on conservatism |
|---------|--------------|---------------------------------------------|--------------------------------------------|
| Axel    | Complier     | 1                                           | 0.5                                        |
| Barbara | Always Taker | 0                                           | 0.1                                        |
| Chris   | Never Taker  | 0                                           | 0.3                                        |
Note that the last column shows the treatment effect that would theoretically obtain if it were possible to switch the treatment status for each individual (e.g. if we could somehow make Barbara stay out of the military, even though she is an “always taker”).
From this, it’s clear that the ATE = 0.3. However, the CACE = 0.5.
Of course, in the real world, we don’t get to observe the last column. We can estimate the CACE, but we don’t know anything about the average treatment effects for “never-takers” and “always-takers”.
As a result, we cannot make inferences about the ATE, and we cannot generalize from the CACE to the ATE.
Profiling Compliers
Given the above, it may be useful to figure out how the demographic profile of compliers (e.g. in terms of sex, age, etc.) differs from that of the “never-takers”, the “always-takers”, and the sample as a whole.
As it turns out, this is pretty straightforward.
Suppose we care about a characteristic like age. The average age in our sample (\(\mu_{sample}\)) is simply a weighted average of the mean ages amongst compliers, “never-takers”, and “always-takers”, with weights equal to the proportion of each group:
\[ \mu_{sample} = (\mu_{c} \times \pi_c) + (\mu_{at} \times \pi_{at}) + (\mu_{nt} \times \pi_{nt}) \]
The only “unknown” in this equation is \(\mu_{c}\). We can estimate everything else.
Consequently, we can just solve for:
\[ \mu_{c} = \frac{1}{\pi_c}\mu_{sample} - \frac{\pi_{nt}}{\pi_c}\mu_{nt} - \frac{\pi_{at}}{\pi_c}\mu_{at} \]
You can implement the procedure easily using the ivdesc
package from Marbach and Hangartner.
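Here is a rough “by hand” sketch of the same idea, assuming a data frame like the exp object we create in the simulation below, plus a hypothetical age column (which the simulation does not actually include); ivdesc handles the estimation details and standard errors for you:

# proportions of each type (assuming no defiers)
pi_at <- mean(exp$veteran[exp$draft == 0])        # treated despite not being drafted
pi_nt <- mean(1 - exp$veteran[exp$draft == 1])    # untreated despite being drafted
pi_c  <- 1 - pi_at - pi_nt

# mean age overall, and among the two groups we can identify directly
mu_sample <- mean(exp$age)
mu_at <- mean(exp$age[exp$draft == 0 & exp$veteran == 1])   # always-takers
mu_nt <- mean(exp$age[exp$draft == 1 & exp$veteran == 0])   # never-takers

# solving for the mean age among compliers
mu_c <- mu_sample / pi_c - (pi_nt / pi_c) * mu_nt - (pi_at / pi_c) * mu_at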
Simulate Some Data
To see this whole machinery in action, let’s simulate some data.
We’ll make a binary instrument (\(drafted_i\)) which denotes whether or not a person was drafted.
We can also make a binary treatment (\(veteran_i\)) which denotes whether a person actually served in the military.
We’ll have compliers, always-takers, and never-takers in the following proportions:
- \(\pi_c = 0.7\)
- \(\pi_{nt} = 0.2\)
- \(\pi_{at} = 0.1\)
Potential untreated outcomes (i.e. potential outcomes if the person doesn’t join the military) are drawn from the following normal distributions:
\[ Y_i(veteran_i=0) = \begin{cases} \mathcal{N}(5,1) & \text{if complier} \\ \mathcal{N}(3,1) & \text{if never-taker} \\ \mathcal{N}(7,1) & \text{if always-taker} \\ \end{cases} \] Finally, potential treated outcomes (i.e. the potential outcomes if the person joins the military) are:
\[ Y_i(veteran_i=1) = \begin{cases} Y_i(veteran_i=0) + 2 & \text{if complier} \\ Y_i(veteran_i=0) + 1 & \text{if never-taker} \\ Y_i(veteran_i=0) & \text{if always-taker} \\ \end{cases} \] Let’s simulate the data:
library(tidyverse)
set.seed(1)
# parameters
pi_c <- 0.7
pi_nt <- 0.2
pi_at <- 0.1
N <- 500

# making the dataset
type <- c(rep("c", pi_c*N), rep("nt", pi_nt*N), rep("at", pi_at*N))
dta <- tibble(type)
# adding treatment status as a function of type and draft status
# veteran_draft0 is the treatment status when undrafted
# veteran_draft1 is the treatment status when drafted
dta <- dta |>
  mutate(
    veteran_draft0 = case_when(
      type == "c" ~ 0,
      type == "nt" ~ 0,
      type == "at" ~ 1),
    veteran_draft1 = case_when(
      type == "c" ~ 1,
      type == "nt" ~ 0,
      type == "at" ~ 1)
  )
# adding potential outcomes as a function of type and veteran status
dta <- dta |>
  mutate(
    y_veteran0 = case_when(
      type == "c" ~ rnorm(N, mean=5, sd=1),
      type == "nt" ~ rnorm(N, mean=3, sd=1),
      type == "at" ~ rnorm(N, mean=7, sd=1)),
    y_veteran1 = case_when(
      type == "c" ~ y_veteran0 + 2,
      type == "nt" ~ y_veteran0 + 1,
      type == "at" ~ y_veteran0)
  )
So in this dataset, we know the “true” CACE is 2, and the “true” proportion of compliers is 0.7.
So the “true” ITT = 2 × 0.7 = 1.4.
We can also create the potential outcomes as a function of draft status:
dta <- dta |>
  mutate(
    y_draft0 = case_when(
      type == "c" ~ y_veteran0,
      type == "nt" ~ y_veteran0,
      type == "at" ~ y_veteran1),
    y_draft1 = case_when(
      type == "c" ~ y_veteran1,
      type == "nt" ~ y_veteran0,
      type == "at" ~ y_veteran1)
  )
# confirm that this equals the true ITT
mean(dta$y_draft1) - mean(dta$y_draft0)
[1] 1.4
Yay, our math works!
Estimation
Now let’s run a single experiment and estimate our quantities of interest.
For simplicity, we can set the probability of being drafted at 50%.
library(randomizr)
set.seed(1)
# assigning the instrument (draft status)
exp <- dta |>
  mutate(draft = complete_ra(N, prob=0.5))

# revealing treatment status and potential outcomes
exp <- exp |>
  mutate(veteran = ifelse(draft == 1, veteran_draft1, veteran_draft0),
         y = ifelse(veteran == 1, y_veteran1, y_veteran0)) |>
  select(y, draft, veteran)
OK now let’s do a couple of things.
First, let’s calculate the “naive” OLS estimate and store the result:
library(broom)
# ols
tidy(lm(y~veteran, data=exp), conf.int=TRUE)
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.29 0.0730 58.7 4.97e-226 4.14 4.43
2 veteran 2.76 0.107 25.9 3.35e- 94 2.55 2.97
ols_est <- summary(lm(y~veteran, data=exp))$coefficients[2]
We see that, on average, veterans are 2.76 points further to the right than non-veterans. That’s much higher than the “truth” of 2, and the confidence interval does not contain the truth.
Maybe we just got unlucky?
Let’s simulate the data a bunch of times, and take the average of the “naive” estimates:
n_sims <- 1000

v_ols <- c()

for (i in 1:n_sims) {

  # assigning the instrument (draft status)
  exp <- dta |>
    mutate(draft = complete_ra(N, prob=0.5))

  # revealing treatment status and potential outcomes
  exp <- exp |>
    mutate(veteran = ifelse(draft == 1, veteran_draft1, veteran_draft0),
           y = ifelse(veteran == 1, y_veteran1, y_veteran0)) |>
    select(y, draft, veteran)

  # storing the estimates
  v_ols[i] <- summary(lm(y~veteran, data=exp))$coefficients[2]

}
# mean of the OLS estimates
mean(v_ols)
[1] 2.715546
So the mean across our simulations is 2.72. It looks like we are consistently getting the wrong answer. That’s because treatment status is self-selected: always-takers (who have the highest baseline outcomes) are always in the veteran group, and never-takers (who have the lowest baseline outcomes) are always in the non-veteran group, so the naive comparison mixes the treatment effect with these baseline differences.
Calculating the CACE “by hand”
Now let’s see if we can get the correct answer applying the IV formula.
We will again create a single experiment (in fact, the same one we used before):
set.seed(1)
# assigning the instrument (draft status)
exp <- dta |>
  mutate(draft = complete_ra(N, prob=0.5))

# revealing treatment status and potential outcomes
exp <- exp |>
  mutate(veteran = ifelse(draft == 1, veteran_draft1, veteran_draft0),
         y = ifelse(veteran == 1, y_veteran1, y_veteran0)) |>
  select(y, draft, veteran)
This time, let’s begin by estimating the ITT:
# ITT: reg y on draft status
tidy(lm(y ~ draft, data=exp), conf.int = TRUE)
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.78 0.104 46.1 5.76e-182 4.58 4.99
2 draft 1.59 0.147 10.9 8.80e- 25 1.30 1.88
itt_est <- summary(lm(y~draft, data=exp))$coefficients[2]
In our case, the ITT is 1.59, and the confidence interval contains the “truth” (1.4).
What about the compliance rate?
It’s actually easy to estimate: just regress \(veteran_i\) on \(draft_i\):
# compliance rate: reg veteran on draft status
tidy(lm(veteran ~ draft, data=exp), conf.int = TRUE)
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0960 0.0211 4.56 6.59e- 6 0.0546 0.137
2 draft 0.744 0.0298 25.0 8.39e-90 0.685 0.803
pi_est <- summary(lm(veteran~draft, data=exp))$coefficients[2]
We get a compliance rate of 0.74, and again, the confidence interval contains the “truth” (0.7).
If we divide the estimated ITT by the estimated compliance rate (1.59 / 0.744), we get 2.14. That’s pretty close to the true CACE of 2.
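In code, using the itt_est and pi_est objects we stored above:

# IV (Wald) estimate of the CACE: ratio of the ITT to the compliance rate
itt_est / pi_est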
In fact, if we simulate it lots of times, the average across those simulations is almost exactly 2:
v_cace <- c()

for (i in 1:n_sims) {

  exp <- dta |>
    mutate(draft = complete_ra(N, prob=0.5),
           veteran = ifelse(draft == 1, veteran_draft1, veteran_draft0),
           y = ifelse(veteran == 1, y_veteran1, y_veteran0))

  # estimate and store the itt
  temp_itt <- summary(lm(y ~ draft, data=exp))$coefficients[2]

  # estimate and store the compliance rate
  temp_pi <- summary(lm(veteran ~ draft, data=exp))$coefficients[2]

  # store the cace
  v_cace[i] <- temp_itt / temp_pi

}
# average across our experiments
mean(v_cace)
[1] 1.994525
Estimation using ivreg()
There’s just one problem with the approach we just took. In real life, we only have one experiment.
We can separately estimate the ITT and the compliance rate, and put them together to get the CACE, but how do we get a standard error for the CACE estimate?
Actually, there are lots of different ways to do this. We’ll use the ivreg()
function from the AER
package.
library(AER)
# again, creating our single experiment
set.seed(1)
exp <- dta |>
  mutate(draft = complete_ra(N, prob=0.5),
         veteran = ifelse(draft == 1, veteran_draft1, veteran_draft0),
         y = ifelse(veteran == 1, y_veteran1, y_veteran0)) |>
  select(y, draft, veteran)
# ivreg
tidy(ivreg(y ~ veteran | draft, data=exp), conf.int=TRUE)
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.58 0.0884 51.8 1.40e-202 4.40 4.75
2 veteran 2.14 0.148 14.5 7.96e- 40 1.85 2.43
OK so now we get an estimate of the CACE, as well as a standard error.
Two Stage Least Squares (2SLS) Explained
What did ivreg() actually do?
It implements a method called “two-stage least squares” (2SLS) estimation. The name comes from the fact that we are estimating two regression models:
\[ Veteran_i = \gamma_0 + \gamma_1 \; Draft_i + \omega_i \] \[ Y_i = \beta_0 + \beta_1 \; \widehat{Veteran_i} + \epsilon_i \]
The first-stage model is just the model for the compliance rate.
After estimating it, we then predict veteran status, and use our predictions as regressors in the second-stage model.
The basic logic is captured in the following DAG:
In this setup, part of the treatment (veteran status) is endogenously driven by self-selection, and part of it is exogenously (i.e. randomly) driven by draft status.
The 2sls procedure uses only the exogenously determined variation in veteran status to explain variation in the outcome.
To get a feel for this, let’s do it “by hand”:
# first stage regression
firststage <- lm(veteran ~ draft, data = exp)
tidy(firststage)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0960 0.0211 4.56 6.59e- 6
2 draft 0.744 0.0298 25.0 8.39e-90
# grabbing the predicted values
exp <- exp |>
  mutate(vet_hat = predict(firststage))

# second stage regression
secondstage <- lm(y ~ vet_hat, data = exp)
tidy(secondstage)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.58 0.118 38.9 6.71e-153
2 vet_hat 2.14 0.197 10.9 8.80e- 25
Notice that the coefficients are the same!
The standard errors, though, are slightly different depending on whether you use ivreg()
or do it in two steps by hand.
And that’s because we didn’t actually measure \(vet\_hat\)…it’s an estimate, not data.
But the lm()
command in the second-stage regression doesn’t know that.
So we have to adjust the standard errors accordingly, which ivreg
and other packages automatically do for you.
Continuous Instruments and Treatments
So far, we have worked with binary instruments and binary treatments.
And that’s because it keeps the explanations simple.
But there’s no reason why the same logic cannot apply to continuous variables.
Think about instruments as random amounts of encouragement to take the treatment.
And think about the treatment as existing in different “dosages”.
So, in a continuous variable context, receiving more encouragement (higher values of \(Z\)) causes an increase in dosage (higher values of \(D\)) amongst compliers, but no increase amongst non-compliers.
Importantly, more encouragement can never cause a decrease in dosage. This shows why the “no defiers” assumption is also called “monotonicity”.
Notice that there are now different “degrees” of compliance. Some people may be strong compliers, and others may be weak compliers.
Conceptually, the CACE is now a weighted average of individual treatment effects, weighted by how responsive each individual is to the instrument.
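Here is a minimal simulated sketch of that continuous case (all names and numbers are made up; the true effect of the dosage d on the outcome y is 2):

library(AER)
set.seed(2)

n <- 1000
u <- rnorm(n)                    # unobserved confounder
z <- rnorm(n)                    # continuous instrument: random "encouragement"
d <- 0.8 * z + u + rnorm(n)      # dosage: responds to encouragement, but also to u
y <- 2 * d + 3 * u + rnorm(n)    # outcome: true effect of d is 2; u is a backdoor

coef(lm(y ~ d))["d"]             # naive OLS: pulled away from 2 by the confounder
coef(ivreg(y ~ d | z))["d"]      # IV: close to 2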
Is the Instrument (As-If) Randomly Assigned?
You may encounter IV in two different contexts:
- randomized experiments with non-compliance
- natural experiments, where an “exogenous” source of variation is used to instrument an “endogenous” treatment
A famous example of the latter is provided by the work of Acemoglu, Johnson, and Robinson (AJR), who won the Nobel Prize in Economics in 2024:
These authors are interested in the effects of institutions (e.g. protection of property rights, limited government intervention in free markets) on economic growth.
The problem is, richer places may be able to “afford” better institutions, so there’s endogeneity between X and Y.
To instrument for institutions, they look at a sample of ex-colonies and use the colonial disease environment. Here’s the argument in brief:
- Different colonization policies: “extractive states” (Belgian Congo) vs. “Neo-Europes” (colonial New England) \(\Rightarrow\) variation in institutions.
- The colonization strategy depended on the feasibility of European settlement, as measured by settler mortality.
- Early institutions persisted even after colonial independence, setting the stage for “modern” economic growth.
The question is, since settler mortality is not randomly assigned, are there any “backdoors” which can bias the IV estimates?
Here’s the DAG:
Of course, it’s possible to control for such backdoors (you just have to stick these covariates in both the first stage and second stage regressions).
The assumption is that, conditional upon these controls, the instrument is randomly assigned.
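With ivreg(), exogenous controls go on both sides of the | in the formula. Here is a sketch using our simulated experiment, where x is just random noise attached to the data purely to illustrate the syntax:

# x is a hypothetical covariate: it appears in the second stage (left of |)
# and in the instrument set (right of |)
exp$x <- rnorm(nrow(exp))
tidy(ivreg(y ~ veteran + x | draft + x, data = exp), conf.int = TRUE)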
But this is where IV starts to break down. How do we know that we have controlled for all possible sources of omitted variable bias?
We are on much safer ground when the instrument is truly random (e.g. in the military draft lottery case).
Exclusion Restriction
A second threat to validity comes from violations of something called the exclusion restriction.
The main idea is that the instrument can affect the outcome only through the treatment, and no other channel.
In our example, suppose being drafted also caused people a great deal of stress, and stress itself is related to political attitudes:
In this case, we would be (mis)attributing the part of the effect that runs through the stress pathway to military service.
If you read IV papers, you will find that the authors pay a lot of attention to defending the assumption that the exclusion restriction holds.
But in the end, it’s still an assumption.
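To see what a violation does to the estimate, here is a small simulated sketch (all numbers made up): the instrument raises the outcome directly through a stress channel, not just through the treatment, and the IV estimate absorbs that direct effect.

set.seed(3)
n <- 5000
z <- rbinom(n, 1, 0.5)               # randomly assigned instrument (e.g. the draft)
d <- rbinom(n, 1, 0.2 + 0.6 * z)     # treatment take-up responds to the instrument
stress <- 0.5 * z + rnorm(n)         # the instrument also raises stress...
y <- 1 * d + 1 * stress + rnorm(n)   # ...and stress shifts the outcome directly

coef(ivreg(y ~ d | z))["d"]          # well above the true treatment effect of 1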
Weak Instruments
With IV, we need to also worry about whether our instrument is “strong.” One intuitive way to think about this is to ask: “do we have enough compliers”?
Think about the CACE formula:
\[ CACE = \frac{ITT}{\pi_c} \]
At the extreme, if we have no compliers, the CACE does not exist!
But even if we just had a few compliers, our estimate for the CACE is going to be very noisy (large standard errors).
What’s worse, suppose we have a natural experiment, and our ITT is just slightly biased (maybe due to a small violation of the independence assumption). In this case, since we are dividing our biased ITT by a very small \(\pi_c\), the amount of bias is going to blow up!
Finally, and this is a bit subtle, we care about the actual number of compliers, not just the compliance rate. Even if the instrument is randomly assigned in the population, in a finite sample the relationship between the instrument and the non-instrumented parts of Y is going to be at least a little nonzero, just by random chance.
The smaller the sample, the more often this “nonzero by random chance” relationship will be fairly large, making the instrument not quite valid in that particular sample and giving you a biased estimate.
The solution, of course, is not only a higher compliance rate but also a larger sample, so that you literally have more compliers.
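Here is a small sketch of that blow-up (all numbers made up): the same slight violation of the exclusion restriction does far more damage when the compliance rate is tiny.

set.seed(4)
n <- 5000
z <- rbinom(n, 1, 0.5)
y_direct <- 0.05 * z                       # a small exclusion-restriction violation

iv_est <- function(pi_c) {
  d <- rbinom(n, 1, 0.1 + pi_c * z)        # compliance rate of roughly pi_c
  y <- 1 * d + y_direct + rnorm(n)         # true treatment effect is 1
  coef(ivreg(y ~ d | z))["d"]
}

iv_est(0.7)    # strong instrument: bias is roughly 0.05 / 0.7
iv_est(0.05)   # weak instrument: the same violation is divided by ~0.05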
You can test for weak instruments by looking at the first-stage regression:
\[ Veteran_i = \gamma_0 + \gamma_1 \; Draft_i + \omega_i \]
summary(lm(veteran ~ draft, data= exp))
Call:
lm(formula = veteran ~ draft, data = exp)
Residuals:
Min 1Q Median 3Q Max
-0.840 -0.096 -0.096 0.160 0.904
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.09600 0.02107 4.555 6.59e-06 ***
draft 0.74400 0.02980 24.963 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3332 on 498 degrees of freedom
Multiple R-squared: 0.5558, Adjusted R-squared: 0.5549
F-statistic: 623.1 on 1 and 498 DF, p-value: < 2.2e-16
And look at the F-statistic. If this number is larger than 10, then you are probably OK.
Note that the F-stat increases both when (i) the compliance rate is higher and (ii) the sample size increases.