Regression Discontinuity

How much is it worth to hold political office?

Elected officials who leave government often land in “cushy” jobs – as lobbyists, consultants, or on corporate boards.

This raises the question: what is the “value added” of holding political office for an individual’s lifetime earnings?1

Suppose we had data on how much wealth political candidates bequeath to their heirs when they die.

This seems like a pretty good measure of lifetime earnings.

What if we just compared the wealth (at death) of successful vs. unsuccessful political candidates?

Would this constitute an unbiased estimate of the causal effect of holding political office?


Comparing close winners and losers

While “winners” and “losers” may differ in lots of ways (besides holding political office), the comparison is likely to be much better if we focus on the group of people who barely won vs. barely lost. In fact, if we could somehow find two people – one who won with 50% + 1 votes and one who lost with 50% − 1 votes – these two people should, in expectation, have very similar potential outcomes.

At such close margins, the determination of “winner” vs. “loser” is as-if random.

This now starts to look very much like an experiment.

\[ \tau_{RDD} \; \equiv \; \mathbb{E}[Y_i(1) - Y_i(0) \; | \; voteshare_i = 0.5] = \] \[ \mathbb{E}[Y_i(1) \; | \; voteshare_i = 0.5] - \mathbb{E}[Y_i(0) \; | \; voteshare_i = 0.5] \]

This is the basic idea behind the regression discontinuity design (RDD).

Of course, we never observe anyone at exactly \(voteshare_i = 0.5\), and even elections decided by a handful of votes are rare.

Thus, technically, there are no data with which to derive \(\tau_{RDD}\).

However, we can extrapolate potential outcomes from above and below the 50% threshold, and try to estimate the point where these two limits hit the threshold.
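In the notation of potential outcomes, this means estimating the difference between two one-sided limits at the cutoff:

\[ \tau_{RDD} = \lim_{v \downarrow 0.5} \mathbb{E}[Y_i \mid voteshare_i = v] \; - \; \lim_{v \uparrow 0.5} \mathbb{E}[Y_i \mid voteshare_i = v] \]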


Data Simulation

Let’s make up some data so we can see what’s going on graphically.

Here’s the data generating process:

Suppose individuals have an unobservable characteristic like \(Ability_i\) which influences both their lifetime earnings (\(Money_i\)) as well as whether they win political office (\(Win_i\)) through achieving a voteshare (\(Votes_i\)) > 50%.

Let’s draw \(Ability_i\) from a normal distribution with mean = 50 and sd = 15.

Suppose that \(Ability_i\) influences \(Votes_i\) via the following equation:

\[ Votes_i = \frac{Ability_i + \omega_i}{100} \] \[ \omega_i \sim \mathcal{N}(0,5) \]

Potential untreated outcomes – i.e. lifetime earnings if an individual does not win political office – are given by:

\[ Y_i(0) = 400 * Ability_i + 100 * Ability_i^2 + e_i \] \[ e_i \sim \mathcal{N}(50{,}000, \; 20{,}000) \]

Finally, let’s set \(\tau_{rdd}\) to some fixed amount (100,000), plus some function of \(Ability_i\), such that the more “able” you are, the more you gain from winning office:

\[ \tau_{rdd} = 100{,}000 + 10{,}000 * Ability_i \]

And potential treated outcomes are simply:

\[ Y_i(1) = Y_i(0) + \tau_{rdd} \]

Now we can make up the data:

# parameters
library(tidyverse)

set.seed(1)

N <- 500
ability <- rnorm(N, 50, 15)
y0 <- 400 * ability + 100 * ability^2 + rnorm(N, 50000, 20000)
y1 <- y0 + 100000 + 10000 * ability

# putting together the data
dta <- tibble(ability, y0, y1) |> 
  mutate(votes = (ability + rnorm(N, 0, 5)) / 100,
         votes = ifelse(votes > 1, 1, votes),
         votes = ifelse(votes < 0, 0, votes),
         win = ifelse(votes > 0.5, TRUE, FALSE),
         y = ifelse(win, y1, y0)) # observed outcome: y1 for winners, y0 for losers

Note that treatment assignment (winning) is a deterministic function of voteshare. It’s not random!

Let’s visualize the data:
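Here’s a minimal ggplot sketch of such a plot (the original figure may differ):

# observed earnings vs. voteshare, with a linear fit on each side of the cutoff
ggplot(dta, aes(x = votes, y = y, color = win)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) + # the two solid lines
  geom_vline(xintercept = 0.5,
             color = "black",
             lty = "dashed") +
  theme_minimal() +
  labs(x = "Voteshare",
       y = "Lifetime Earnings",
       color = "Won Election?")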

There are a couple of things to notice in this graph.

  1. We are trying to extrapolate the two solid lines until they hit the cutoff point (threshold) of 0.5. It’s only at this single point where \(\tau_{rdd}\) is identified.
  2. \(\tau_{rdd}\) is extremely local. For this reason, it’s often called the local average treatment effect (LATE). We cannot assume that the same treatment effect will obtain at other places along the x-axis.
  3. The red and blue lines are continuous at the cutpoint. This is the key assumption behind RDD: nothing else is changing discontinuously (jumping) at the cutpoint except the treatment itself. For instance, if some other factor \(X\) also caused \(\mathbb{E}(Y_i(0))\) to jump at the cutpoint, then we would be conflating the effect of treatment with the effect of \(X\).

Sorting across the threshold (self-selection into treatment)?

Depending on your specific application, another thing to check for is self-sorting across the cutoff.

Another way to think about this: you want to make sure that units close to the cutoff are not able to manipulate their own treatment status.

What does a violation look like?

Suppose that political candidates from wealthy backgrounds knew that the election was going to be close. These candidates might therefore spend a lot of their own personal money on the campaign in order to get just “above” the threshold.

In this case, we would be comparing close winners – who are disproportionately independently wealthy – against close losers who had little family wealth to begin with.

Thus our comparisons will be biased!

Another way to think about this problem: family wealth at the time the politician ran for office is no longer continuous at the threshold.


Estimation

The basic idea is to fit the solid red and blue lines, and measure the height of the “gap” where they hit the solid black line.

Here’s how to do it:

First, we center the rating / forcing variable – in our case, \(voteshare_i\) – at the cutpoint.

\[ voteshare\_centered_i = voteshare_i - 0.5 \]

Then we estimate the following regression model:

\[ Y_i = \beta_0 + \beta_1 * voteshare\_centered_i + \beta_2 * win_i + e_i \]

library(broom)

# centering the rating variable
dta <- dta |> 
  mutate(votes_cent = votes - 0.5)

# fitting the model
m1 <- lm(y ~ votes_cent + win, data = dta)

tidy(m1)
# A tibble: 3 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)  398949.     9880.      40.4 4.79e-159
2 votes_cent  1413004.    54372.      26.0 1.13e- 94
3 winTRUE      602293.    17110.      35.2 4.31e-137

Here’s the model graphically:
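One way to draw it (a sketch using the fitted values from m1, which imposes a common slope on both sides):

# plot fitted values from the single-slope model
dta |> 
  mutate(pred = predict(m1)) |> 
  ggplot(aes(x = votes_cent, color = win)) +
  geom_point(aes(y = y), alpha = 0.3) +
  geom_line(aes(y = pred), linewidth = 1) +
  geom_vline(xintercept = 0,
             color = "black",
             lty = "dashed") +
  theme_minimal()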

Quiz:

What do the coefficients represent in the graph?


Note that our lines don’t fit super well. And that’s because we constrained the model to have only a single slope \(\beta_1\).

Quiz:

But what if we wanted the red and the blue lines to have different slopes?

Write out the model.

Here’s the model:

\[ Y_i = \beta_0 + \beta_1 * voteshare\_centered_i + \beta_2 * win_i + \beta_3 * voteshare\_centered_i * win_i + e_i \]

# fitting the model
m2 <- lm(y ~ votes_cent*win, data = dta)

tidy(m2)
# A tibble: 4 × 5
  term               estimate std.error statistic   p.value
  <chr>                 <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)         306670.     9281.      33.0 2.07e-127
2 votes_cent          651841.    60123.      10.8 1.01e- 24
3 winTRUE             597586.    13359.      44.7 3.36e-176
4 votes_cent:winTRUE 1517472.    84891.      17.9 1.58e- 55

And here’s the fit:

That looks pretty good.


Checking for sorting

One way to look for sorting is just to consider the number of observations on each side of the cutoff:

Fig. 18 from Cattaneo et al. 2019

Let’s check our data:
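A simple sketch: a histogram of the running variable, with bin edges forced to break exactly at the cutoff (the binwidth of 0.02 is an arbitrary choice):

# histogram of the running variable; boundary = 0 forces a break at the cutoff
ggplot(dta, aes(x = votes_cent, fill = win)) +
  geom_histogram(binwidth = 0.02, boundary = 0, color = "white") +
  geom_vline(xintercept = 0,
             color = "black",
             lty = "dashed") +
  theme_minimal() +
  labs(x = "Voteshare Centered",
       y = "Count",
       fill = "Won Election?")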

Maybe it looks a little suspicious?

For statistical tests, the rddensity package has some nice functionality here:

library(rddensity)

# X stands for the rating variable
density_test_cat <- rddensity(X = dta$votes_cent)

# statistical tests
summary(density_test_cat)

Manipulation testing using local polynomial density estimation.

Number of obs =       500
Model =               unrestricted
Kernel =              triangular
BW method =           estimated
VCE method =          jackknife

c = 0                 Left of c           Right of c          
Number of obs         251                 249                 
Eff. Number of obs    151                 125                 
Order est. (p)        2                   2                   
Order bias (q)        3                   3                   
BW est. (h)           0.127               0.108               

Method                T                   P > |T|             
Robust                -0.5383             0.5904              


P-values of binomial tests (H0: p=0.5).

Window Length              <c     >=c    P>|T|
0.017     + 0.017          23      20    0.7608
0.029     + 0.027          43      32    0.2480
0.041     + 0.037          59      43    0.1371
0.054     + 0.047          70      56    0.2467
0.066     + 0.057          81      66    0.2481
0.078     + 0.068          96      77    0.1710
0.090     + 0.078         119      87    0.0305
0.103     + 0.088         135     101    0.0315
0.115     + 0.098         145     114    0.0621
0.127     + 0.108         151     125    0.1322

Here, it’s basically dividing the data up into little windows of increasing size, and testing against the null hypothesis that the share of observations on either side of the cutoff is 50%.

We can do the same thing “by hand” using a binomial test. For instance, in the first row of the table above, we observe 20 “successes” in 20 + 23 = 43 trials.

This is like tossing a coin 43 times, and seeing that we “only” got 20 heads. What’s the probability of that (or something more skewed) happening? This is given by the p-value (0.7608).

Let’s calculate it in R:

binom.test(20, 43, p = 0.5, alternative = "two.sided")

    Exact binomial test

data:  20 and 43
number of successes = 20, number of trials = 43, p-value = 0.7608
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.3117619 0.6234527
sample estimates:
probability of success 
             0.4651163 

Placebo outcomes

Since we made up the data, we can also check that \(Ability_i\) – which you ordinarily would not get to observe – is continuous at the cutoff.

We just re-estimate our model, only now we use \(Ability_i\) as the outcome.
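For example, re-using the interacted specification from above (a sketch – the coefficient on winTRUE should be close to zero):

# placebo regression: ability should not "jump" at the cutoff
m_placebo <- lm(ability ~ votes_cent * win, data = dta)

tidy(m_placebo)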

Here’s what it looks like graphically:

That’s pretty good!


Binning Data

In our example, our data points look pretty nice. We can more or less see the effect even without fitting any lines!

But in real life, your data might look like this:

Here, a common approach is to “bin” the data.

To illustrate, let’s transform our data such that the outcome is:

\[ Rich_i = \begin{cases} 1 & \text{lifetime earnings > 300,000} \\ 0 & \text{otherwise} \end{cases} \]

# generate rich variable
dta <- dta |> 
  mutate(rich = ifelse(y > 300000, 1, 0))

Now if we make the graph, it’s no longer as pretty:
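A sketch of that raw (unbinned) plot:

# raw binary outcome: points pile up on two horizontal lines at 0 and 1
ggplot(dta, aes(x = votes_cent, y = rich, color = win)) +
  geom_point(alpha = 0.3) +
  geom_vline(xintercept = 0,
             color = "black",
             lty = "dashed") +
  theme_minimal()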

But what if we chopped up the x-axis into little bins (say, each 0.05 wide), and took the average of the outcome within those bins?

# Use cut() to create bins, using breaks to make sure it breaks at 0
# (-10:10)*(.5/10) gives 21 breaks (20 bins) from -0.5 to 0.5
binned <- dta |> 
  mutate(bins = cut(votes_cent,
         breaks = (-10:10)*(.5/10))) |> 
  group_by(bins) |> 
  summarize(rich = mean(rich),
            votes_binned= mean(votes_cent)) |> 
  mutate(win = ifelse(votes_binned > 0, TRUE, FALSE))

Now, we can plot the binned data as points, while still fitting lines to the original data:

# plotting
ggplot(mapping = aes(color=win)) +
  geom_point(data=binned,
             mapping=aes(y = rich,
                         x = votes_binned)) +
  geom_smooth(data=dta,
              mapping=aes(y = rich,
                          x = votes_cent),
              method = "loess",
              se = FALSE) + 
  geom_vline(xintercept = 0,
             color = "black",
             lty = "dashed") +
  theme_minimal() +
  labs(x = "Voteshare Centered",
       y = "Rich?",
       color = "Won Election?") 

If you want to make a “quick and dirty” version, you can also use rdrobust::rdplot():

library(rdrobust)

rdplot(y = dta$rich, x = dta$votes_cent)
[1] "Warning: not enough variability in the outcome variable above the threshold"


Dealing with curvature

Another issue you might encounter with real data has to do with curvature.

Fig. 20.5 from Huntington-Klein, theeffectbook.net

Here, the data are clearly non-linear, so a linear fit gives us the wrong estimate of \(\tau_{rdd}\).

We can of course start fitting higher-order polynomials to the data:

\[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 T_i + \beta_3 X_i * T_i + \beta_4 X^2_i + \beta_5 X^2_i * T_i + e_i \]

But when should we stop?
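For concreteness, here is a sketch of fitting the quadratic specification above to our simulated data:

# quadratic terms on each side of the cutoff via interactions with win
m_quad <- lm(y ~ votes_cent * win + I(votes_cent^2) * win, data = dta)

tidy(m_quad)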

Fig. 20.7 from Huntington-Klein, theeffectbook.net

At some point, we will be over-fitting the data.

Also, since we are only interested in observations close to the cutoff, why are we allowing units “far away” to influence the shape of our fitted line?


Non-Parametric Estimation

Modern approaches therefore suggest using a low order polynomial – that is, linear or quadratic – but fitted within a small bandwidth of data around the cutoff.

This is illustrated in Panel D in the previous figure.

Zoom in enough and the curvature problem goes away.
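For instance, here's a sketch of a local linear fit using only observations within \(\pm 0.05\) of the cutoff (the 0.05 bandwidth is an arbitrary choice for illustration):

# keep only observations near the cutoff, then fit separate slopes
m_local <- lm(y ~ votes_cent * win,
              data = filter(dta, abs(votes_cent) < 0.05))

tidy(m_local)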

However, we end up throwing away a lot of data. Compare:

Fig. 20.6 from Huntington-Klein, theeffectbook.net

Therefore, RDD often involves a bias-variance tradeoff: when we reduce bias by zooming in, our estimates get less precise since we are working with less data.


Optimal Bandwidths

As it turns out, people have developed algorithms that resolve this tradeoff optimally – typically by minimizing the mean squared error of the estimate. The details aren’t really important… just know that the software is able to find an optimal bandwidth for your data.

Since your results may change depending on the bandwidth chosen, using the optimal bandwidth at least reduces your ability to p-hack (that is, to select whichever bandwidth best fits your hypothesis).

There are a few algorithms out there, but we will use rdrobust:

# Fitting the RD model using rdrobust
rd_model <- rdrobust(y = dta$y, x = dta$votes_cent, all=TRUE )
summary(rd_model)
Sharp RD estimates using local polynomial regression.

Number of Obs.                  500
BW type                       mserd
Kernel                   Triangular
VCE method                       NN

Number of Obs.                  251          249
Eff. Number of Obs.             135          116
Order est. (p)                    1            1
Order bias  (q)                   2            2
BW est. (h)                   0.102        0.102
BW bias (b)                   0.159        0.159
rho (h/b)                     0.643        0.643
Unique Obs.                     251          249

=============================================================================
        Method        Coef.   Std. Err.        z     P>|z|      [ 95% C.I. ]
=============================================================================
  Conventional   596718.518   22452.295   26.577     0.000   [552712.828, 640724.209]
Bias-Corrected   599080.272   22452.295   26.682     0.000   [555074.582, 643085.962]
        Robust   599080.272   27048.411   22.148     0.000   [546066.360, 652094.184]
=============================================================================
options(scipen = 999) #turning off scientific notation

You can see the optimal bandwidth BW est. (h), as well as the number of observations on each side of the cutoff that are used for estimation.

The recommendation is to report the Conventional estimate (596719) and the Robust standard error (27048). These can be found at rd_model$coef[1] and rd_model$se[3].


Kernels

In addition to using a bandwidth, you could also weight observations by how close they are to the cutoff.

Here are some possible weighting schemes:

rdrobust() uses triangular weights by default, but you can easily change that.
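For instance, to re-fit our model with uniform weights (so every observation inside the bandwidth counts equally):

# same model as before, but with a uniform kernel
rd_uniform <- rdrobust(y = dta$y, x = dta$votes_cent,
                       kernel = "uniform", all = TRUE)

summary(rd_uniform)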


How to present your results

In practice, you will still want to show that your results are robust to a range of different bandwidths.

So let’s estimate \(\tau_{rdd}\) using bandwidths from \(\pm 0.01\) all the way to \(\pm 0.20\).

# empty vectors for storing the results
coefs <- c()
ses <- c()

# looping over bandwidths 0.01, 0.02, ..., 0.20
for (i in 1:20) {
  
  # estimate the model, fixing the bandwidth at i/100
  # (rather than letting rdrobust choose one)
  rd_model <- rdrobust(y = dta$y, x = dta$votes_cent, h = i / 100, all = TRUE)

  # store the conventional estimate and the robust SE
  coefs[i] <- rd_model$coef[1]
  ses[i] <- rd_model$se[3]
  
}

We could of course display these as a table, but let’s make a nice graph instead:

# putting together the plot data
toplot <- tibble(coefs, ses) |> 
  mutate(cih = coefs + 1.96*ses,
         cil = coefs - 1.96*ses,
         bw = row_number() / 100)

# plotting
ggplot(toplot, aes(x=bw)) +
  geom_point(aes(y=coefs),
             size=2) +
  geom_line(aes(y=cil),
             linewidth=0.2) +
  geom_line(aes(y=cih),
             linewidth=0.2) +
  geom_hline(yintercept = 0,
             color="red",
             lty="dashed") +
  theme_minimal() +
  scale_y_continuous(labels = scales::dollar) +
  coord_cartesian(ylim=c(-10, 1500000)) +
  labs(x = "Bandwidth",
       y = "Estimated Treatment Effect")


Local Randomization Approach

Sometimes, you won’t have enough data “near” the threshold, or data won’t be continuous “enough” for the techniques we discussed.

Suppose that we didn’t have the raw voting results, but only post-processed data that rounded the vote shares:

# create the variable
rounded <- dta |> 
  mutate(votes_rounded = round(votes_cent, 2)) |> 
  filter(votes_cent > -0.15 & votes_cent < 0.15,
         votes_rounded != 0)

# plot the data
ggplot(rounded, aes(x=votes_rounded, y = y)) +
  geom_point() +
  theme_minimal()

Now, when you run rdrobust, it yells at you:

rd_model <- rdrobust(y = rounded$y, x = rounded$votes_rounded, all=TRUE )
Warning in rdrobust(y = rounded$y, x = rounded$votes_rounded, all = TRUE): Mass
points detected in the running variable.
summary(rd_model)
Sharp RD estimates using local polynomial regression.

Number of Obs.                  326
BW type                       mserd
Kernel                   Triangular
VCE method                       NN

Number of Obs.                  163          163
Eff. Number of Obs.              43           36
Order est. (p)                    1            1
Order bias  (q)                   2            2
BW est. (h)                   0.038        0.038
BW bias (b)                   0.064        0.064
rho (h/b)                     0.595        0.595
Unique Obs.                      15           15

=============================================================================
        Method        Coef.   Std. Err.        z     P>|z|      [ 95% C.I. ]
=============================================================================
  Conventional   610844.119   58280.873   10.481     0.000   [496615.708, 725072.531]
Bias-Corrected   624241.110   58280.873   10.711     0.000   [510012.699, 738469.522]
        Robust   624241.110   75077.085    8.315     0.000   [477092.728, 771389.493]
=============================================================================

Actually, this example isn’t even so bad. Here’s a much worse case:

Fig. 1 from de Kadt 2017.

Here, rather than taking the continuity approach to RDD, we can instead use the local randomization approach.

The key assumption is that, within some window around the cutoff, who gets treated is determined exactly as in a random experiment.

Here’s what potential outcomes look like:

Fig. 2.2 from Cattaneo et al. 2024

If you are ready to assume (and defend) that treatment assignment is “as-if random” within the window, then you can just use the techniques we have already learned (e.g. t-tests, regression, randomization inference) to estimate the LATE – as sketched below.
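For instance, a simple difference-in-means test within a narrow window – here \(\pm 0.02\), an arbitrary choice that you would need to defend:

# treat the window around the cutoff as a mini-experiment
window <- filter(rounded, abs(votes_rounded) <= 0.02)

t.test(y ~ win, data = window)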

Footnotes

  1. This example draws inspiration from Eggers and Hainmueller, “MPs for Sale?” APSR 2009.