Two-Way Fixed Effects and Difference-in-Differences
How would you vote if your vote weren’t secret?
So this actually happened in Spain[1]: in 2019, voting in the European Parliamentary elections in the municipality of Ciutadella de Menorca was conducted in public (the polling stations ran out of private voting booths).
Spain has a stigmatized right-wing political party, the Partido Popular (PP).
Did the public voting procedure reduce PP support?
To answer this question, we can potentially make two comparisons:
- Compare PP vote share in the municipality of Ciutadella de Menorca to PP vote share in other municipalities (in the same region) where private voting booths were in use.
- Compare PP vote share in Ciutadella de Menorca in 2019 to PP vote share at the same voting stations in 2014 (when private voting booths were used).
Difference-in-Differences (DiD)
As it turns out, we should do both comparisons simultaneously.
Comparing Ciutadella to non-Ciutadella municipalities in 2019 is perhaps the most straightforward, but it introduces a lot of potential confounds. For example, for reasons completely unrelated to the availability of private voting booths, maybe the PP is simply less stigmatized in Ciutadella than in other municipalities (even in the same region).
It’s not guaranteed that we can control for all possible confounders, which can lead to bias.
Comparing Ciutadella in 2019 vs 2014 also makes sense. However, maybe the change in PP vote share is due to the absence of private voting booths, but maybe it’s also due to other inter-temporal factors – for example, maybe the PP had a corruption scandal in 2018 that lowered its popularity across the board?
Now suppose we also compare non-Ciutadella municipalities between 2014 and 2019. This allows us to estimate the effect of the corruption scandal on its own (since it presumably affected all regions).
We then subtract off this temporal change from the over-time difference in Ciutadella to isolate the effect of public voting.
That’s difference-in-differences (DiD)!
Data Generating Process
To solidify our understanding, let’s simulate some data:
Suppose the PP vote share in 2014 at voting stations located outside Ciutadella is given by:
\[ Y_{rest, 2014} \sim \mathcal{N}(15, 2) \] Further, suppose that the effect of the corruption scandal is to lower PP vote share by 3 percentage points on average:
\[ \Delta_{corrupt} \sim \mathcal{N}(-3, 0.5) \] We can therefore write the 2019 outcomes in the non-Ciutadella regions of Spain as:
\[ Y_{rest, 2019} = Y_{rest, 2014} + \Delta_{corrupt} \]
Now suppose that the PP vote share in 2014 at voting stations located inside Ciutadella is given by:
\[ Y_{ciu, 2014} \sim \mathcal{N}(12, 2) \]
Next, imagine that the effect of public voting is to lower PP vote share by 2 percentage points on average:
\[ \Delta_{public} \sim \mathcal{N}(-2, 0.2) \]
So we can write the outcome for Ciutadella in 2019 as:
\[ Y_{ciu, 2019} = Y_{ciu, 2014} + \Delta_{public} + \Delta_{corrupt} \]
Let’s see what these data look like:
library(tidyverse)
set.seed(1)
# parameters
N_ciu <- 25
N_rest <- 500

# creating some "wide" data
delta_c1 <- rnorm(N_rest, -3, 0.5)
y_rest_2014 <- rnorm(N_rest, 15, 2)
y_rest_2019 <- y_rest_2014 + delta_c1

delta_c2 <- rnorm(N_ciu, -3, 0.5)
delta_p <- rnorm(N_ciu, -2, 0.2)
y_ciu_2014 <- rnorm(N_ciu, 12, 2)
y_ciu_2019 <- y_ciu_2014 + delta_c2 + delta_p
# reshaping to "long" data
rest <- tibble(y_rest_2014, y_rest_2019) |>
  mutate(id = row_number(),
         region = "rest") |>
  pivot_longer(
    cols = starts_with("y_rest_"),
    names_prefix = "y_rest_",
    names_to = "year",
    values_to = "pp_vote"
  )

ciu <- tibble(y_ciu_2014, y_ciu_2019) |>
  mutate(id = row_number() + N_rest,
         region = "ciutadella") |>
  pivot_longer(
    cols = starts_with("y_ciu_"),
    names_prefix = "y_ciu_",
    names_to = "year",
    values_to = "pp_vote"
  )

dta <- rbind(rest, ciu) |>
  mutate(year = as.numeric(year))
Now let’s visualize the means by year and group:
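One way to draw such a plot is to compute the mean of pp_vote by region and year and plot the four means; here is a minimal sketch (the plotting choices are mine, not the original figure's):

```r
# group means of PP vote share by region and year
dta |>
  group_by(region, year) |>
  summarise(pp_vote = mean(pp_vote), .groups = "drop") |>
  ggplot(aes(x = year, y = pp_vote, color = region)) +
  geom_point(size = 3) +
  geom_line() +
  labs(x = "Year", y = "Mean PP vote share", color = "Region") +
  theme_minimal()
```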
Recall that we want to estimate the effect of public voting in Ciutadella in 2019. The observed PP vote share in Ciutadella in 2019 (the red dot) is our observed treated outcome (\(Y_i(1)\)).
We also want to use the blue line, i.e. what actually did happen in the rest of Spain, to simulate what would have happened in Ciutadella had private voting booths been available.
In other words, the blue line helps us to estimate the counterfactual untreated potential outcome (\(Y_i(0)\)).
The vertical distance between the red dot and the grey dot is our causal effect of interest.
Technically, it’s an average treatment effect on the treated (ATT).
Key assumption: Parallel Trends
Notice that our DiD effect depends on the slope of the grey line (which is, by construction, identical to the slope of the blue line).
This illustrates the key assumption underlying DiD: namely that time trends for the untreated units provide a good counterfactual for what would have happened in the treated units in the absence of treatment.
In other words, the observed untreated unit trend and the unobserved counterfactual trend are parallel. This is called the parallel trends assumption.
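In potential-outcomes notation, and using the two groups in our example, the assumption can be written as:
\[ E[Y_{ciu, 2019}(0) - Y_{ciu, 2014}(0)] = E[Y_{rest, 2019}(0) - Y_{rest, 2014}(0)] \]
That is, had private voting booths been available in Ciutadella, its PP vote share would (in expectation) have changed between 2014 and 2019 by the same amount as in the rest of Spain.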
Here’s what a violation might look like:
Here, we see that the solid grey line is not parallel with the blue line.
As a result, our assumed counterfactual is “off” from the “real” counterfactual, and the DiD estimate is biased (in this case, it’s too negative).
DiD Estimation
We can estimate a simple 2x2 DiD using the following interaction model:
\[ PP\;vote_{it} = \beta_0 + \beta_1 * Ciutadella_i + \beta_2 * Year2019_t + \beta_3 * Ciutadella_i * Year2019_t + e_{it} \]
Importantly, \(\beta_3\) tells us how much bigger the drop in PP voteshare between 2014 and 2019 is in Ciutadella, compared to the drop in the rest of Spain. Difference-in-differences!
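Because this 2x2 model is fully saturated, \(\beta_3\) is exactly the difference-in-differences of the four group means:
\[ \beta_3 = \left( \bar{Y}_{ciu, 2019} - \bar{Y}_{ciu, 2014} \right) - \left( \bar{Y}_{rest, 2019} - \bar{Y}_{rest, 2014} \right) \]
In our simulation this is roughly \((7 - 12) - (12 - 15) = -2\), the true effect of public voting.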
Let’s estimate this model. Note that we need to cluster standard errors within polling stations using estimatr::lm_robust(). That’s because PP vote share is likely to be highly correlated over time within the same polling station.[2]
library(broom)
library(estimatr)
library(knitr) # for kable()

# change the region and year variables to dummies
dta <- dta |>
  mutate(ciutadella = ifelse(region == "ciutadella", 1, 0),
         year = factor(year),
         treated = ifelse(ciutadella == 1 & year == 2019, 1, 0))

# estimate the model
did_mod <- lm_robust(pp_vote ~ ciutadella*year,
                     data = dta,
                     se_type = "stata",
                     clusters = id)

kable(tidy(did_mod)[, 1:5], digits = 2)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 14.91 | 0.09 | 157.43 | 0 |
ciutadella | -2.96 | 0.49 | -6.07 | 0 |
year2019 | -2.99 | 0.02 | -131.90 | 0 |
ciutadella:year2019 | -2.07 | 0.12 | -17.76 | 0 |
TWFE Estimation
Note that in the simple 2x2 case, the DiD model is actually equivalent to the following two-way fixed effects (TWFE) model:
\[ PP\;vote_{it} = \alpha_{id} + \alpha_{year} + \delta_{twfe} * Treated_{it} + e_{it} \]
\[ Treated_{it} = \begin{cases} 1 & \text{if a polling station is in Ciutadella AND the year is 2019} \\ 0 & \text{otherwise} \end{cases} \]
Note that the TWFE model includes a set of fixed effects for polling station (\(\alpha_{id}\)), as well as a set of fixed effects for time period (\(\alpha_{year}\)).[3]
Thinking through what’s going on:
- \(\alpha_{id}\) removes the influence of baseline differences between units that stay constant over time. Think of this as controlling for all time-invariant differences across polling stations.
- \(\alpha_{year}\) removes the influence of temporal disturbances that affect all units equally within the same time period. Think of this as controlling for temporal shocks that affect all polling stations equally.
- Finally, the parallel trends assumption (PTA) implies that there are no time-varying confounders missing from the model.
Thus, \(\delta_{twfe}\) gives us our estimate of the causal effect of the policy change on the treated (it’s an ATT).
Here’s another way of looking at it:
\[ \begin{cases} Y_{it}(0) & = \alpha_i + \alpha_t + \epsilon \\ Y_{it}(1) & = Y_{it}(0) + \delta \end{cases} \]
We can estimate the model using fixest::feols():
library(fixest)

twfe_mod <- feols(pp_vote ~ treated | id + year, data = dta)

kable(tidy(twfe_mod), digits = 2)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
treated | -2.07 | 0.12 | -17.77 | 0 |
Note that feols() automatically clusters your standard errors within polling stations.
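If you want to state (or change) the clustering explicitly, feols() also accepts a cluster argument; a minimal sketch that reproduces the default behavior:

```r
# same model, with the clustering variable spelled out explicitly
twfe_mod2 <- feols(pp_vote ~ treated | id + year,
                   data = dta,
                   cluster = ~id)
```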
Here, the TWFE model gives the same answer as the DiD model, but we will see below that it’s also more flexible.
Between vs. Within Effects
How do fixed effects work again?
Suppose we are interested in the relationship between hours of exercise per week and the number of colds a person gets per year.
We collect the following data on two people:
Note that we can actually “decompose” these data into two sources of variation, between and within (the code sketch after the table shows one way to compute this):
person | year | exercise | colds | mean_colds | mean_exercise | within_colds | within_exercise |
---|---|---|---|---|---|---|---|
Andy | 2022 | 1 | 4 | 2.33 | 2.33 | 1.67 | -1.33 |
Andy | 2023 | 4 | 1 | 2.33 | 2.33 | -1.33 | 1.67 |
Andy | 2024 | 2 | 2 | 2.33 | 2.33 | -0.33 | -0.33 |
Bob | 2022 | 5 | 7 | 5.67 | 6.33 | 1.33 | -1.33 |
Bob | 2023 | 8 | 4 | 5.67 | 6.33 | -1.67 | 1.67 |
Bob | 2024 | 6 | 6 | 5.67 | 6.33 | 0.33 | -0.33 |
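Here is a sketch of that computation in R; the values simply reproduce the table above, and the object name colds_dta is just for illustration:

```r
# toy exercise/colds data for Andy and Bob
colds_dta <- tribble(
  ~person, ~year, ~exercise, ~colds,
  "Andy",  2022,  1, 4,
  "Andy",  2023,  4, 1,
  "Andy",  2024,  2, 2,
  "Bob",   2022,  5, 7,
  "Bob",   2023,  8, 4,
  "Bob",   2024,  6, 6
)

# de-meaning: subtract each person's own averages
colds_dta <- colds_dta |>
  group_by(person) |>
  mutate(mean_colds      = mean(colds),
         mean_exercise   = mean(exercise),
         within_colds    = colds - mean_colds,
         within_exercise = exercise - mean_exercise) |>
  ungroup()
```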
Here’s how that looks graphically:
We can see that even though the cloud of points is upward sloping, that’s mostly due to the between-variation.
Bob exercises more than Andy, and Bob also seems to get sick more easily.
But what’s the relationship between exercise and colds, once we strip out the average between-person differences?
So here’s what the within-variation looks like:
Here, you can see that the sign has even flipped!
Key idea: fixed effects achieve the exact same result as de-meaning.
Everything that stays constant within units (but which can vary a lot across units) – so stuff like “mean_exercise” – is “absorbed” by the fixed effects.[4]
So the only variation which is left to exploit is the within-variation.
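As a quick check of that claim, using the toy colds_dta sketched above, a regression on the de-meaned variables and a person fixed-effects regression on the raw variables return the same slope on exercise:

```r
# within (de-meaned) regression...
lm(within_colds ~ within_exercise, data = colds_dta)

# ...gives the same slope as a person fixed-effects regression
feols(colds ~ exercise | person, data = colds_dta)
```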
Now back to voting: we are only looking at within-variation in PP vote share within polling stations.
Everything which is constant within a polling station is captured by the unit fixed effects.
This also goes for variables that we didn’t / couldn’t measure (such as how much the PP is stigmatized by neighborhood residents).
And everything which is constant within time periods (like our hypothetical corruption scandal) is captured by the time fixed effects.
And that’s the power of fixed effects…we can control for stuff that we don’t even have to measure!
Checking Prior Trends
Importantly, parallel trends is an assumption. We can never know if it holds or not (since we cannot observe counterfactual untreated outcomes).
However, provided we have enough pre-treatment data, we can examine prior trends.
If prior trends appear to be parallel, then this gives us assurance that what is happening in our untreated units provides a good counterfactual for what would have happened in our treated units.
To see how to analyze prior trends, let’s make up some additional data for the years 2004 and 2009 (plus 2024, which we will use later).
For each of these years, we will draw outcomes randomly from the following distributions:
\[ Y_{rest, year} \sim \mathcal{N}(15, 2) \]
\[ Y_{ciu, year} \sim \mathcal{N}(12, 2) \]
Let’s make up that data now:
set.seed(2)
# parameters
N_ciu <- 25
N_rest <- 500

# creating some more "wide" data
delta_c1 <- rnorm(N_rest, -3, 0.5)
y_rest_2004 <- rnorm(N_rest, 15, 2)
y_rest_2009 <- rnorm(N_rest, 15, 2)
y_rest_2014 <- rnorm(N_rest, 15, 2)
y_rest_2019 <- y_rest_2014 + delta_c1
y_rest_2024 <- rnorm(N_rest, 15, 2) # we'll use this later

delta_c2 <- rnorm(N_ciu, -3, 0.5)
delta_p <- rnorm(N_ciu, -2, 0.2)
y_ciu_2004 <- rnorm(N_ciu, 12, 2)
y_ciu_2009 <- rnorm(N_ciu, 12, 2)
y_ciu_2014 <- rnorm(N_ciu, 12, 2)
y_ciu_2019 <- y_ciu_2014 + delta_c2 + delta_p
y_ciu_2024 <- rnorm(N_ciu, 12, 2) # we'll use this later

# reshaping to "long" data
rest <- tibble(y_rest_2004, y_rest_2009, y_rest_2014, y_rest_2019, y_rest_2024) |>
  mutate(id = row_number(),
         region = "rest") |>
  pivot_longer(
    cols = starts_with("y_rest_"),
    names_prefix = "y_rest_",
    names_to = "year",
    values_to = "pp_vote"
  )

ciu <- tibble(y_ciu_2004, y_ciu_2009, y_ciu_2014, y_ciu_2019, y_ciu_2024) |>
  mutate(id = row_number() + N_rest,
         region = "ciutadella") |>
  pivot_longer(
    cols = starts_with("y_ciu_"),
    names_prefix = "y_ciu_",
    names_to = "year",
    values_to = "pp_vote"
  )

dta <- rbind(rest, ciu) |>
  mutate(year = as.numeric(year),
         ciutadella = ifelse(region == "ciutadella", 1, 0),
         treated = ifelse(ciutadella == 1 & year == 2019, 1, 0))
Here’s what the data look like:
The first check on prior trends is simply to run the following regression on the pre-treatment data:
\[ PP \; vote_{it} = \beta_0 + \beta_1 * Ciutadella_i + \beta_2 * year_t + \beta_3 * Ciutadella_i * year_t + e_{it} \]
This looks like the same regression as before, but now we are treating \(year\) as a continuous variable.
So \(\beta_2\) tells us the slope of the time trend in the rest of Spain from 2004-2014, and \(\beta_3\) tells us whether the slope in Ciutadella over the same period is significantly different.
Let’s run the regression:
m1 <- lm_robust(pp_vote ~ ciutadella*year,
                data = filter(dta, year < 2019),
                se_type = "stata",
                clusters = id)

kable(tidy(m1)[, 1:5], digits = 2)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 34.97 | 25.19 | 1.39 | 0.17 |
ciutadella | 104.03 | 127.95 | 0.81 | 0.42 |
year | -0.01 | 0.01 | -0.79 | 0.43 |
ciutadella:year | -0.05 | 0.06 | -0.84 | 0.40 |
Placebo Tests
We could also run a placebo test: are the sizes of the 2004 and 2009 gaps between the red and blue lines significantly different from the size of the 2014 gap?
If the gap sizes are changing, this would also be evidence that the prior trends are not parallel.
To conduct this test, we basically run the same model as above, only now we treat \(year\) as a factor variable. Conventionally, we also use the last period prior to treatment (so: 2014) as the baseline.
\[ PP \; vote_{it} = \beta_0 + \beta_1 * Ciutadella_i + \beta_2 * Y2004_t + \beta_3 * Y2009_t + \] \[ \beta_4 * Ciutadella_i * Y2004_t + \beta_5 * Ciutadella_i * Y2009_t + e_{it} \]
# year as a factor variable
# reference category is 2014
dta <- dta |>
  mutate(year = factor(year),
         year = relevel(year, 3))

# model
m2 <- lm_robust(pp_vote ~ ciutadella*year,
                data = filter(dta, year %in% c(2004, 2009, 2014)),
                se_type = "stata",
                clusters = id)

kable(tidy(m2)[, 1:5], digits = 2)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 15.03 | 0.09 | 164.25 | 0.00 |
ciutadella | -3.59 | 0.37 | -9.77 | 0.00 |
year2004 | 0.10 | 0.13 | 0.79 | 0.43 |
year2009 | 0.03 | 0.13 | 0.26 | 0.80 |
ciutadella:year2004 | 0.53 | 0.64 | 0.84 | 0.40 |
ciutadella:year2009 | 0.57 | 0.51 | 1.12 | 0.26 |
Both sets of analyses indicate that prior trends are indeed parallel.
Dynamic TWFEs
Often, we also have data on multiple post-treatment periods. Thus, we can examine how treatment effects evolve, e.g. whether there are “carry-over” and/or “cumulative” effects.
To do so, we just estimate the same model as above, except now we use all of the (pre-treatment and post-treatment) data.
# model
m3 <- lm_robust(pp_vote ~ ciutadella*year,
                data = dta,
                se_type = "stata",
                clusters = id)

kable(tidy(m3)[, 1:5], digits = 2)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 15.03 | 0.09 | 164.23 | 0.00 |
ciutadella | -3.59 | 0.37 | -9.77 | 0.00 |
year2004 | 0.10 | 0.13 | 0.79 | 0.43 |
year2009 | 0.03 | 0.13 | 0.26 | 0.80 |
year2019 | -2.97 | 0.02 | -128.41 | 0.00 |
year2024 | 0.14 | 0.13 | 1.06 | 0.29 |
ciutadella:year2004 | 0.53 | 0.64 | 0.84 | 0.40 |
ciutadella:year2009 | 0.57 | 0.51 | 1.12 | 0.26 |
ciutadella:year2019 | -1.77 | 0.10 | -17.72 | 0.00 |
ciutadella:year2024 | 0.66 | 0.53 | 1.24 | 0.21 |
And here’s the same model using fixest::feols():
# model
m4 <- feols(pp_vote ~ ciutadella*year | id + year,
            data = dta)

kable(tidy(m4), digits = 2)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
ciutadella:year2004 | 0.53 | 0.64 | 0.84 | 0.40 |
ciutadella:year2009 | 0.57 | 0.51 | 1.12 | 0.26 |
ciutadella:year2019 | -1.77 | 0.10 | -17.72 | 0.00 |
ciutadella:year2024 | 0.66 | 0.53 | 1.24 | 0.21 |
Of course, it’s better if we plot the coefficients, rather than reporting this ugly table.
# plot data
toplot <- tidy(m4) |>
  mutate(election = c(2004, 2009, 2019, 2024)) |>
  select(election, estimate, std.error) |>
  mutate(cil = estimate - 1.96*std.error,
         cih = estimate + 1.96*std.error)

# creating the reference category
ref_data <- c(2014, 0, NA, NA, NA)

toplot <- toplot |>
  rbind(ref_data)

# plotting
ggplot(toplot, aes(x = election, y = estimate)) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin = cil, ymax = cih),
                width = 0) +
  geom_hline(yintercept = 0, color = "red") +
  geom_vline(xintercept = 2017, linetype = "dashed") +
  labs(x = "Election",
       y = "Difference in PP Vote Share: \n Exposed vs Unexposed Islands") +
  theme_minimal(base_size = 14) +
  scale_x_continuous(breaks = c(2004, 2009, 2014, 2019, 2024))
Notice that this graph combines a (pre-treatment) placebo test with a test of post-treatment differences.
We can also see what might be called “long term” effects (or lack thereof).
While not important here, long-term effects may manifest in cases where the effects of a treatment only kick in after some time.
More “advanced” cases
Remember when we said that TWFE models were more general?
That comes in once we move beyond the 2x2 case.
Here are two examples of what that might look like:
The first case is called “staggered adoption.” Here, different units get treated at different times. There are also “never-treated” and “always-treated” units.
Importantly, once a unit gets treated, it stays treated.
That’s in contrast to the second case, where units can also “flip” in and out of treatment:
Both cases can be easily estimated with TWFE models. Just include a “currently in treatment” indicator that takes the value of 1 for each of the dark blue squares.
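Here’s a hedged sketch of how that indicator could be built for a made-up staggered-adoption panel (panel_dta, first_treated, and y are all hypothetical and not part of the voting example):

```r
# made-up staggered-adoption panel: 6 units observed 2010-2020;
# first_treated is the adoption year, NA = never treated
panel_dta <- expand_grid(id = 1:6, year = 2010:2020) |>
  mutate(first_treated = c(2012, 2015, 2015, 2018, NA, NA)[id],
         treated = ifelse(!is.na(first_treated) & year >= first_treated, 1, 0),
         y = rnorm(n()) + 2 * treated) # fake outcome with a true effect of 2

# standard TWFE regression on the "currently in treatment" indicator
feols(y ~ treated | id + year, data = panel_dta)
```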
However, recent research has revealed that the TWFE estimator can be biased in some cases.
The basic problem is that the TWFE is a combination of several different DiDs:
- switch into treatment vs. never treated
- early adopter vs. late adopter
- switch into treatment vs. always treated
- late adopter vs. early adopter
The first two comparisons are fine, but the last two are “forbidden”.
Moreover, in the case of treatment reversal, we might think that the effect of going “into” treatment is not simply the mirror image of the effect of a “treatment reversal”. But that’s not captured in the standard TWFE estimator.
The basic solution is to break the panel up into “small” DiDs, estimate each “small” DiD separately, and then put these estimates back together, dropping the “forbidden” comparisons.
These techniques are beyond the scope of this course. But I just want to flag the general issue for you here.
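If you do run into these settings, one convenient starting point for staggered adoption is the Sun and Abraham (2021) interaction-weighted estimator, which fixest implements via sunab(). A hedged sketch, reusing the hypothetical panel_dta from above (never-treated units are given a cohort value that never occurs in the sample, e.g. 10000):

```r
# Sun & Abraham interaction-weighted event-study estimator
sa_dta <- panel_dta |>
  mutate(cohort = replace_na(first_treated, 10000)) # 10000 = "never treated"

sa_mod <- feols(y ~ sunab(cohort, year) | id + year, data = sa_dta)
iplot(sa_mod) # event-study plot of period-specific effects
```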
A final word
DiD / TWFEs are powerful tools, but by themselves, they do not guarantee that the effects we estimate are causal.
Claims to causality must rest on the assumption of parallel trends.
How plausible is this assumption?
Think about the data generating process: does the allocation of treatment to some units and not others resemble an experiment?
If so, you are probably good!
Footnotes
1. This example is inspired by Vicente Valentim’s paper “Political Stigma and Preference Falsification,” Journal of Politics (2024).
2. In general, you should always cluster your standard errors within units when estimating TWFE models.
3. In other words, we are including a separate dummy variable (leaving one out) for each id, and a separate dummy variable (leaving one out) for each year.
4. Technically, these things are collinear with the unit dummies.