Probability and Statistics Review
Probability – or the “doctrine of chances” – describes how often different kinds of events will happen:
- What are the chances of a fair coin coming up heads 10 times in a row?
- If I roll two six sided dice, how likely is it that I’ll roll two sixes?
- How likely is it that five cards drawn from a perfectly shuffled deck will all be hearts?
Notice that, in each case, we start with a known model of the world.
For instance, even though we don’t know the exact outcome of a single coin toss, we know that we are equally likely to get heads or tails.
Or, even though we don’t know the exact cards that I will get in poker, we know (i) exactly what cards are in the deck, and (ii) that the cards were randomly shuffled.
We can write down the model as:
\[ Pr(\text{heads}) = 0.5 \]
or
\[ Pr(5 \; \text{hearts}) = \frac{13}{52} \times \frac{12}{51} \times \frac{11}{50} \times \frac{10}{49} \times \frac{9}{48} \]
This known model of the world is called a data generating process.
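As a quick sanity check, we can evaluate the five-hearts product directly in R (base R arithmetic, nothing package-specific):
# probability that 5 cards drawn without replacement are all hearts
prod(13:9 / 52:48)
[1] 0.0004951981
So we should expect an all-hearts hand about 5 times in every 10,000 draws from a perfectly shuffled deck.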
Probability Distributions
Suppose you roll two dice – say, one red and one white – and add up the two numbers.
How many possible outcomes are there? And how likely is each outcome?
One way to see this is a table of all 36 (equally likely) combinations, where rows show the red die, columns show the white die, and each cell shows the resulting sum:

Red \ White | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | 6 | 7 |
2 | 3 | 4 | 5 | 6 | 7 | 8 |
3 | 4 | 5 | 6 | 7 | 8 | 9 |
4 | 5 | 6 | 7 | 8 | 9 | 10 |
5 | 6 | 7 | 8 | 9 | 10 | 11 |
6 | 7 | 8 | 9 | 10 | 11 | 12 |

Here’s another way of representing the same information: a bar chart in which the x-axis displays each of the possible sums, and the y-axis displays the probability that this sum will occur.
This is called a probability distribution.
Note that the total probabilities (i.e. the sum of the bar heights) must add up to 1.
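If you want to reproduce such a chart yourself, here is one way to do it with ggplot2 – a sketch, not necessarily how the original figure was made:
# load the tidyverse package
library(tidyverse)

# enumerate all 36 equally likely outcomes of the two dice
sums <- expand_grid(die1 = 1:6, die2 = 1:6) |>
  mutate(total = die1 + die2) |>
  count(total) |>
  mutate(prob = n / sum(n))

# bar chart of the probability distribution
ggplot(data = sums, mapping = aes(x = total, y = prob)) +
  geom_col(fill = "cornflowerblue") +
  scale_x_continuous(breaks = 2:12) +
  theme_minimal() +
  labs(x = "Sum of Two Dice", y = "Probability")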
Expected Values
Rolling dice or tossing coins are random data generating processes.
And although we won’t know the exact outcome of a random process (before carrying it out), we can know its expected value, which we denote with the \(\mathbb{E}\) operator. For a discrete random variable, the expected value is the probability-weighted average of all possible outcomes:
\[ \mathbb{E}[X] = \sum_{x} x \cdot Pr(X = x) \]
That’s very abstract, so let’s do a concrete example. Suppose we are tossing a coin over and over again, and we record the proportion of heads that we have obtained so far:
# load the tidyverse package
library(tidyverse)

# simulating tossing a coin 1000 times
set.seed(2)
n_tosses <- 1000
coins_faces <- c("heads", "tails")
coin_tosses <- sample(coins_faces, n_tosses, replace = TRUE)

# saving the outcomes as a data frame
dta <- tibble(coin_tosses)

# create additional variables as columns in the data frame:
# N counts the number of coin tosses so far
# Y converts heads into 1 (and tails into 0)
# C computes the number of heads so far
# P is the proportion of heads so far
dta <- dta |>
  mutate(N = 1:n_tosses,
         Y = ifelse(coin_tosses == "heads", 1, 0),
         C = cumsum(Y),
         P = C / N)

# plotting
ggplot(data = dta, mapping = aes(x = N, y = P)) +
  geom_point(size = 0.5) +
  geom_hline(yintercept = 0.5,
             color = "red",
             lty = "dashed") +
  theme_minimal() +
  labs(x = "Number of Coin Tosses",
       y = "Proportion of Heads")
We observe that the proportion of heads initially jumps all over the place, but then converges to P = 0.5 as we toss the coin over and over again ad infinitum.
This is what we mean when we say that \(\mathbb{E}[Y_i]\) = 0.5.
Conditional Expectations
Let’s go back to our table:
Suppose I told you that you rolled a 6 on the red die. What would you now expect the sum of the two dice to be?
This question is asking about a conditional expectation:
\[ \mathbb{E}[\text{Die1}_i + \text{Die2}_i \; | \; \text{Die1}_i = 6] \]
We can also look at the difference between two conditional expectations:
\[ \mathbb{E}[\text{Die1}_i + \text{Die2}_i \; | \; \text{Die1}_i = 6] - \mathbb{E}[\text{Die1}_i + \text{Die2}_i \; | \; \text{Die1}_i = 1] \]
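We can compute both of these directly. The two dice are independent, so whatever the first die shows, the second die still has expected value \(\frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = 3.5\). Therefore:
\[ \mathbb{E}[\text{Die1}_i + \text{Die2}_i \; | \; \text{Die1}_i = 6] = 6 + 3.5 = 9.5 \]
\[ \mathbb{E}[\text{Die1}_i + \text{Die2}_i \; | \; \text{Die1}_i = 1] = 1 + 3.5 = 4.5 \]
so the difference between the two conditional expectations is \(9.5 - 4.5 = 5\).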
Conditional expectations are important once we start talking about experiments.
In experiments, we want to compare the expected value of the outcome for subjects in the treatment group (\(D=1\)) vs. subjects in the control group (\(D=0\)).
More precisely, if we want to learn about treatment effects, we will want to estimate:
\[ \mathbb{E}[Y_i | D_i = 1] - \mathbb{E}[Y_i | D_i = 0] \]
That looks complicated, but it’s basically just the same thing we just did.
More precisely, it’s the difference-in-means between the two groups.
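In R, this difference-in-means is a one-liner. Here is a minimal sketch, assuming a hypothetical data frame df with an outcome column Y and a 0/1 treatment indicator D (these names are made up for illustration):
# difference-in-means between treatment (D = 1) and control (D = 0)
# df, Y, and D are hypothetical names
mean(df$Y[df$D == 1]) - mean(df$Y[df$D == 0])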
The notation will come up again and again in your readings, so get comfortable with it!
Inferential Statistics
In statistics we do not know the “true model” of the world.
Instead, all we have are a sample of data, and we are trying to use this information to make an inference about the true data generating process.
For example:
- If I flip a coin 10 times and get 10 heads, is the coin fair?
- If five cards off the top of the deck are all hearts, how likely is it that the deck was shuffled?
In each case, I am trying to guess what model of the world (e.g. “fair” vs. “unfair”, or “shuffled” vs. “unshuffled”) is most likely to have produced the data that I have.
Hopefully, you can see why inferential statistics was once called “inverse probability”…
Now let’s start building the bridge from probability to statistics.
We’ll use a simple example from the ebook moderndive.
Imagine we have a large bowl of red and white marbles.
We want to find out: what proportion of the marbles are red?
Or, to put this in more formal math terms:
\[ Y_i = \begin{cases} 1 & \text{if marble is red} \\ 0 & \text{if marble is white} \end{cases} \] and we want to find out \(\mathbb{E}[Y_i]\). Note that because \(Y_i\) only takes the values 0 and 1, \(\mathbb{E}[Y_i]\) is exactly \(Pr(\text{red})\): the proportion of red marbles in the bowl.
Sampling
From before, we know that if we repeatedly draw marbles from the bowl (and put the marble back), and keep a running tally of the proportion of red marbles, we will converge on the truth.
But this is rarely how sampling works in practice. Think of an opinion poll: pollsters who want to estimate the share of SPD supporters in the population cannot keep calling people ad infinitum (and if they tried, they might call the same person twice).
Rather, they draw a sample of fixed size (without replacement), calculate the proportion of SPD supporters in that sample, and use that information to estimate the population parameter that they are interested in.
We will do the same thing here.
We have a shovel with which to scoop out 50 marbles from the bowl.
Suppose we take one shovelful of marbles from the bowl, and observe that 17 of them are red.
Our natural guess for the proportion of red marbles in the bowl is then \(17/50 = 0.34\).
Unbiasedness
As it turns out, our guess actually has a really nice property: it is right “on average”.
To see what that means, suppose we use the shovel over and over again (which we of course don’t get to do in real life).
To get a feel for what happens, let’s program a simulation:
# loading packages
library(tidyverse)

# the moderndive package comes with a preloaded bowl of marbles
# this object is called "bowl"
library(moderndive)

marbles <- tibble(bowl) |>
  mutate(y = ifelse(color == "red", 1, 0))

# parameters of simulation
shovel_size <- 50

# a single draw
# and calculate the mean
sample(marbles$y, shovel_size, replace = FALSE) |>
  mean()
[1] 0.36
You’ll see that if you run the sample() function over and over again, you’ll get a different answer every time.
Suppose we draw lots of samples from the bowl and write down the proportion of red marbles from each draw.
Some of these draws will give us “too many” red marbles, and some of them will give us “too few.”
But just as in our coin-tossing example above, when we average across our large number of samples, we should approach the “truth”.
# setting seed for reproducibility
set.seed(1)

# define the number of draws
n_draws <- 1000

# an empty vector to store the results from each draw
dta <- c()

# using a "for" loop to do something lots of times
for (i in 1:n_draws) {
  # draw the marbles
  draw <- sample(marbles$y, shovel_size, replace = FALSE)

  # storing the proportion red in our vector
  dta[i] <- mean(draw)
}

# calculating the average of all of our draws
avg_shovel <- mean(dta)

# making a histogram of the distribution of our draws
# first, we need to save our vector of results to a data frame
toplot <- as_tibble(dta) |>
  rename(prop_red = value)

hist_50 <- ggplot(data = toplot, mapping = aes(x = prop_red)) +
  geom_histogram(color = "white",
                 binwidth = 0.02,
                 fill = "cornflowerblue") +
  geom_vline(xintercept = avg_shovel,
             lty = "dashed",
             color = "red") +
  theme_minimal() +
  labs(x = "Proportion Red Marbles",
       y = "Number of Shovels",
       subtitle = "Shovel Size 50") +
  coord_cartesian(xlim = c(0.15, 0.65),
                  ylim = c(0, 175))

hist_50
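As an aside, the same repeated-sampling simulation can be written more compactly with base R’s replicate() function; this is just an equivalent alternative to the "for" loop above:
# repeat the draw-and-average step n_draws times, collecting the results
dta <- replicate(n_draws,
                 mean(sample(marbles$y, shovel_size, replace = FALSE)))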
We observe that across all of our 1000 draws, the “average” shovel contained 37.92% red marbles.
We calculated this above using the code mean(dta).
But we also see from the histogram that some shovels had less than 30% red marbles, and some had over 50%.
The nice thing about simulations is that we also know the “true” proportion of red marbles in the bowl. It’s 37.5%. We calculate this using the code mean(marbles$y).
So we see that our “average shovel” is not far off.
In other words, even though each of our individual estimates could be far from the mark, the average of our estimates approaches the “right” answer.
This property is called unbiasedness.
More formally, if we denote the sample average from a single draw as \(\bar{Y}\), unbiasedness means that
\[ \mathbb{E}[\bar{Y}] = \mathbb{E}[Y_i] \]
Or, in English: the average of repeated samples from the bowl is equal to the average of all of the marbles in the bowl.
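For the curious, there is a one-line argument for why the sample mean is unbiased under random sampling: expectations are linear, and each randomly drawn marble has the same expected value as a single marble from the whole bowl. Writing \(n\) for the sample size:
\[ \mathbb{E}[\bar{Y}] = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} Y_i\right] = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}[Y_i] = \mathbb{E}[Y_i] \]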
An Example with Bias
Above, we got an unbiased estimate because we drew a random sample every time.
But there are lots of estimation procedures that will produce biased estimates instead.
Suppose we did the following:
- draw a sample of 50 marbles and count the proportion of red ones
- if the proportion is \(\geq 0.4\), record the data
- if the proportion is \(< 0.4\), throw away the sample and redraw it
Let’s see this in action:
# an empty vector to store the results from each draw
dta_biased <- c()

# using a "for" loop to do something lots of times
for (i in 1:n_draws) {
  # draw the marbles
  draw <- sample(marbles$y, shovel_size, replace = FALSE)

  # checking the proportion red
  check <- mean(draw)

  # using a "while" loop to keep doing something until a condition is met
  while (check < 0.4) {
    # draw the marbles again
    draw <- sample(marbles$y, shovel_size, replace = FALSE)

    # checking the proportion red
    check <- mean(draw)
  }

  # storing the proportion red from the final kept draw in our vector
  dta_biased[i] <- mean(draw)
}

# calculating the average of all of our draws
avg_biased_shovel <- mean(dta_biased)

# making a histogram of the distribution of our draws
toplot_biased <- as_tibble(dta_biased) |>
  rename(prop_red = value)

hist_biased <- ggplot(data = toplot_biased, mapping = aes(x = prop_red)) +
  geom_histogram(color = "white",
                 binwidth = 0.02,
                 fill = "cornflowerblue") +
  geom_vline(xintercept = avg_biased_shovel,
             lty = "dashed",
             color = "red") +
  geom_vline(xintercept = mean(marbles$y),
             lty = "solid",
             color = "green") +
  theme_minimal() +
  labs(x = "Proportion Red Marbles from Biased Procedure",
       y = "Number of Shovels") +
  coord_cartesian(xlim = c(0.15, 0.65),
                  ylim = c(0, 175))

hist_biased
Here, the average across all of our samples is 43.84%.
Thus, even though we drew lots of samples, the average of our estimates does not converge to the truth.
\[ \mathbb{E}[\bar{Y}] \neq \mathbb{E}[Y_i] \]
We will come back to the problem of bias throughout the course.
Uncertainty
Let’s look again at our unbiased distribution of estimates (the shovel-size-50 histogram above).
Even though the histogram “averages out” to the truth, there’s a lot of variance in our estimates.
You may recall that variance provides a way to describe how “noisy” a list of numbers is:
\[ \text{Var}(X) = \frac{1}{N}\sum_{i=1}^{N}{(X_i - \bar{X})^2} \]
For example, consider these two lists of numbers, each with a mean of 3:

X | Y |
---|---|
1 | 2 |
1 | 2 |
3 | 3 |
3 | 3 |
5 | 4 |
5 | 4 |
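Plugging each list into the variance formula shows that X is much “noisier” than Y, even though both have the same mean:
\[ \text{Var}(X) = \frac{(-2)^2 + (-2)^2 + 0^2 + 0^2 + 2^2 + 2^2}{6} = \frac{16}{6} \approx 2.67 \]
\[ \text{Var}(Y) = \frac{(-1)^2 + (-1)^2 + 0^2 + 0^2 + 1^2 + 1^2}{6} = \frac{4}{6} \approx 0.67 \]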
In addition to the variance, we often use its square root: the standard deviation.
Here’s the standard deviation of our estimated proportion of red marbles from each of our 1000 unbiased samples:
round(sd(toplot$prop_red),4)
[1] 0.0695
You may also recall that, for (approximately) normally distributed data, 95% of the values are supposed to fall within \(\pm 1.96\) standard deviations of the mean.
In other words, out of our 1000 estimates, about 950 of them should fall between 37.92% \(\pm 1.96 \times\) 6.95%, i.e. in the interval [24.3%, 51.53%].
# calculating the actual number of shovels which fall within the 95% range
hi <- mean(toplot$prop_red) + 1.96 * sd(toplot$prop_red)
lo <- mean(toplot$prop_red) - 1.96 * sd(toplot$prop_red)

toplot <- toplot |>
  mutate(in_ci = ifelse(prop_red > lo & prop_red < hi, 1, 0))

# counting up the shovels that fall within this range:
sum(toplot$in_ci)
[1] 940
In fact, 940 draws actually fall within this range.
Precision
Why do we care about this?
Because in real life, we cannot keep repeating studies ad infinitum and then average up our results.
Rather, we usually run a study (“use the shovel”) once.
Thus, we only have one estimate of the “truth”.
And even though this estimate may be unbiased (that is, our estimation procedure yields the correct answer “on average”), we still would prefer our estimate to be closer to the truth than farther away.
Put differently, we want our histograms of estimates to be “skinnier” rather than “fatter”.
Skinnier histograms are composed of more precise estimates.
A Larger Shovel
As it turns out, there are lots of different ways to achieve this goal.
We’ll talk about a few throughout this course, but one intuitive method is to increase the size (N) of our study.
Think of this as equivalent to using a larger shovel.
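A standard result from sampling theory (stated here without derivation) explains why this works: if we take \(N\) draws to estimate a proportion \(p\), the standard deviation of the sample proportion is approximately
\[ \sqrt{\frac{p(1-p)}{N}} \]
With \(p = 0.375\), this gives roughly 0.068 for \(N = 50\) (close to the 0.0695 we simulated above) and roughly 0.048 for \(N = 100\): doubling the shovel size shrinks the spread by a factor of \(\sqrt{2}\).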
# an empty vector to store the results from each draw
dta <- c()

# parameters of simulation
shovel_size <- 100

# using a "for" loop to do something lots of times
for (i in 1:n_draws) {
  # draw the marbles
  draw <- sample(marbles$y, shovel_size, replace = FALSE)

  # storing the proportion red in our vector
  dta[i] <- mean(draw)
}

# calculating the average of all of our draws
avg_shovel_large <- mean(dta)

# making a histogram of the distribution of our draws
# first, we need to save our vector of results to a data frame
toplot_large <- as_tibble(dta) |>
  rename(prop_red = value)

hist_100 <- ggplot(data = toplot_large, mapping = aes(x = prop_red)) +
  geom_histogram(color = "white",
                 binwidth = 0.02,
                 fill = "cornflowerblue") +
  geom_vline(xintercept = avg_shovel_large,
             lty = "dashed",
             color = "red") +
  theme_minimal() +
  labs(x = "Proportion Red Marbles",
       y = "Number of Shovels",
       subtitle = "Shovel Size 100") +
  coord_cartesian(xlim = c(0.15, 0.65),
                  ylim = c(0, 175))

# placing the two histograms on top of each other
library(patchwork)
hist_50 / hist_100 +
  plot_layout(axes = "collect")

Comparing the two histograms, the distribution of estimates from the larger shovel is visibly narrower: larger samples yield more precise estimates.
Summary of Key Ideas
Statistics is about using a limited set of data to make inferences about a larger population.
We want our inferences to be unbiased: that is, if we ran our study a large number of times, we want our estimate to be right “on average”.
But, in reality, we can only run our study once. Our one estimate therefore forms our “best guess” about the true data generating process in the population.
Provided we designed and implemented our study correctly (and we’ll talk about what “correctly” means throughout the whole course), our estimate will be unbiased, meaning that it recovers the “truth” on average.
However, even though our estimate may be unbiased, it is still subject to uncertainty.
Of course, we want that uncertainty to be as small as possible – that is, we would prefer our estimates to be more precise.
In all cases though, it is important to understand and communicate not only the estimate itself, but also the uncertainty surrounding it.
Note that precision and bias should be thought of as separate concepts. Here’s a diagram illustrating this distinction: