Problem for identification: endogeneity
Problem for inference: randomness


OLS estimators (^β0 and ^β1) are computed from a finite (specific) sample of data
Our OLS model contains 2 sources of randomness:
Modeled randomness: u includes all factors affecting Y other than X
Sampling randomness: different samples will generate different OLS estimators
Sample →(statistical inference)→ Population →(causal identification)→ Unobserved Parameters
We want to identify causal relationships between population variables
We'll use sample statistics to infer something about population parameters

Population

Population

Population relationship
Yi=3.24+0.44Xi+ui
Yi=β0+β1Xi+ui

Sample 1: 30 random individuals

Sample 1: 30 random individuals

Population relationship
Yi=3.24+0.44Xi+ui
Sample relationship
ˆYi=3.19+0.47Xi

Sample 2: 30 random individuals

Population relationship
Yi=3.24+0.44Xi+ui
Sample relationship
ˆYi=4.26+0.25Xi

Sample 3: 30 random individuals

Population relationship
Yi=3.24+0.44Xi+ui
Sample relationship
ˆYi=2.91+0.46Xi
Let's repeat this process 10,000 times!
This exercise is called a (Monte Carlo) simulation
We'll do this with the infer package
On average, the estimated regression lines from our hypothetical samples provide an unbiased estimate of the true population regression line: E[^β1]=β1
However, any individual line (any one sample) can miss the mark
This leads to uncertainty about our estimated regression line
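This repeated-sampling exercise can be sketched directly in R. This is a minimal sketch, not the slides' own code: the distributions assumed for X and u are hypothetical, chosen only to match the population line Yi=3.24+0.44Xi+ui.

```r
# Monte Carlo sketch (hypothetical setup): draw many samples of n = 30
# from the population Y = 3.24 + 0.44 X + u and refit OLS each time
set.seed(42)
beta_1_hats <- replicate(10000, {
  X <- rnorm(30, mean = 5, sd = 2)   # assumed distribution of X
  u <- rnorm(30, mean = 0, sd = 1)   # modeled randomness
  Y <- 3.24 + 0.44 * X + u
  coef(lm(Y ~ X))[["X"]]             # keep the slope estimate
})
mean(beta_1_hats)  # centers near the true beta_1 = 0.44 (unbiasedness)
```

Each replication plays the role of one of the 30-person samples above; the spread of `beta_1_hats` is the sampling uncertainty.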

Sample →(statistical inference)→ Population →(causal identification)→ Unobserved Parameters
^Yi = ^β0 + ^β1Xi  —🤞 hopefully 🤞→  Yi = β0 + β1Xi + ui
Our problem with uncertainty is we don’t know whether our sample estimate is close or far from the unknown population parameter
But we can use our errors to learn how well our model statistics likely estimate the true parameters
Use ^β1 and its standard error, se(^β1) for statistical inference about true β1
We have two options...


Point estimate
Use our ^β1 and se(^β1) to determine if there is statistically significant evidence to reject a hypothesized β1
A single reported value (^β1) is unlikely to exactly equal the true population parameter (β1)

Confidence interval

We can generate our confidence interval by generating a “bootstrap” sampling distribution
This takes our sample data, and resamples it by selecting random observations with replacement
This allows us to approximate the sampling distribution of ^β1 by simulation!
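The resampling step can be written "by hand" before reaching for infer. A sketch, assuming the CASchool data used below is already loaded:

```r
# Bootstrap sketch: resample rows of CASchool with replacement,
# refit OLS each time, and keep the slope on str
set.seed(42)
boot_slopes <- replicate(1000, {
  rows <- sample(nrow(CASchool), replace = TRUE)  # resample with replacement
  resample <- CASchool[rows, ]
  coef(lm(testscr ~ str, data = resample))[["str"]]
})
quantile(boot_slopes, c(0.025, 0.975))  # middle 95% = confidence interval
```

The infer pipeline below automates exactly this loop.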


The infer package allows you to do statistical inference in a tidy way, following the philosophy of the tidyverse

# install first!
install.packages("infer")

# load
library(infer)

infer allows you to run through these steps manually to understand the process:
specify() a model
generate() a bootstrap distribution
calculate() the statistic of interest (here, the slope)
visualize() with a histogram (optional)





| term <chr> | estimate <dbl> | std.error <dbl> |
|---|---|---|
| (Intercept) | 698.932952 | 9.4674914 |
| str | -2.279808 | 0.4798256 |

| term <chr> | estimate <dbl> | std.error <dbl> |
|---|---|---|
| (Intercept) | 708.270835 | 9.5041448 |
| str | -2.797334 | 0.4802065 |

👆 Bootstrapped from Our Sample

data %>%
  specify(y ~ x)

The specify() function is essentially a lm() function for regression (for our purposes):

CASchool %>%
  specify(testscr ~ str)
| testscr <dbl> | str <dbl> |
|---|---|
| 690.80 | 17.88991 |
| 661.20 | 21.52466 |
| 643.60 | 18.69723 |
| 647.70 | 17.35714 |
| 640.85 | 18.67133 |

%>% generate(reps = n, type = "bootstrap")

Now the magic starts, as we run a number of simulated samples
Set the number of reps and set type to "bootstrap"

CASchool %>%
  specify(testscr ~ str) %>%
  generate(reps = 1000, type = "bootstrap")
replicate: the “sample” number (1–1000)
generate() creates resampled x and y values (data points)
%>% calculate(stat = "slope")

CASchool %>%
  specify(testscr ~ str) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "slope")
For each of the 1,000 replicates, calculate slope in lm(testscr ~ str)
infer calls this column the stat
%>% calculate(stat = "slope")

boot <- CASchool %>%  # save this
  specify(testscr ~ str) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "slope")
boot is (our simulated) sampling distribution of ^β1!
We can now use this to estimate the confidence interval from our ^β1=−2.28
And visualize it
sampling_dist <- ggplot(data = boot)+
  aes(x = stat)+
  geom_histogram(color = "white", fill = "#e64173")+
  labs(x = expression(hat(beta[1])))+
  theme_pander(base_family = "Fira Sans Condensed", base_size = 20)

sampling_dist

ci <- boot %>%
  summarize(lower = quantile(stat, 0.025),
            upper = quantile(stat, 0.975))

ci

| lower <dbl> | upper <dbl> |
|---|---|
| -3.340545 | -1.238815 |
sampling_dist+
  geom_vline(data = ci, aes(xintercept = lower), size = 1, linetype = "dashed")+
  geom_vline(data = ci, aes(xintercept = upper), size = 1, linetype = "dashed")

%>% get_confidence_interval()

CASchool %>%
  specify(testscr ~ str) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "slope") %>%
  get_confidence_interval(level = 0.95, type = "se", point_estimate = -2.28)

| lower_ci <dbl> | upper_ci <dbl> |
|---|---|
| -3.273376 | -1.286624 |
tidy_reg <- school_reg %>%
  tidy(conf.int = TRUE)

tidy_reg

| term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> | conf.low <dbl> | conf.high <dbl> |
|---|---|---|---|---|---|---|
| (Intercept) | 698.932952 | 9.4674914 | 73.824514 | 6.569925e-242 | 680.32313 | 717.542779 |
| str | -2.279808 | 0.4798256 | -4.751327 | 2.783307e-06 | -3.22298 | -1.336637 |
# save and extract confidence interval
our_CI <- tidy_reg %>%
  filter(term == "str") %>%
  select(conf.low, conf.high)

our_CI

| conf.low <dbl> | conf.high <dbl> |
|---|---|
| -3.22298 | -1.336637 |
%>% visualize()

CASchool %>%
  specify(testscr ~ str) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "slope") %>%
  visualize()

visualize() is just a wrapper for ggplot()

%>% visualize()

CASchool %>%
  specify(testscr ~ str) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "slope") %>%
  visualize()+
  shade_ci(endpoints = our_CI)

With our confidence interval saved (our_CI), we can shade it using shade_ci() in infer's visualize() function

A confidence interval takes the form ([estimate − MOE], [estimate + MOE]), where MOE is the margin of error

(1−α) is the confidence level of our confidence interval
Typical levels: 90%, 95%, 99%

Depending on our confidence level, we are essentially looking for the middle (1−α)% of the sampling distribution
This puts α in the tails; α/2 in each tail

Recall the 68-95-99.7% empirical rule for (standard) normal distributions!†
95% of data falls within 2 standard deviations of the mean
Thus, in 95% of samples, the true parameter will fall within about 2 standard errors of the sample estimate
† I’m playing fast and loose here: we can’t actually use the normal distribution; we use the Student’s t-distribution with n−k−1 degrees of freedom. But there’s no need to complicate things you don’t need to know about. See today’s class notes for more.
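The margin-of-error arithmetic can be checked by hand from the regression output above. A sketch; the sample size n = 420 districts is an assumption here, not stated on the slide:

```r
# 95% CI by hand: estimate +/- t* x se, using the str row of the output
beta_1_hat <- -2.279808
se         <- 0.4798256
t_star     <- qt(0.975, df = 420 - 1 - 1)  # t critical value, ~1.97
c(beta_1_hat - t_star * se, beta_1_hat + t_star * se)
# approximately (-3.22, -1.34), matching tidy()'s conf.low and conf.high
```

This is the same interval confint() and tidy(conf.int = TRUE) report below.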

❌ 95% of the time, the true effect of class size on test score will be between -3.22 and -1.33
❌ We are 95% confident that a randomly selected school district will have an effect of class size on test score between -3.22 and -1.33
❌ The effect of class size on test score is -2.28 95% of the time.
✅ We are 95% confident that in similarly constructed samples, the true effect is between -3.22 and -1.33
Confidence intervals without infer: lm's summary() output does not report them; you need the confint() command

confint(school_reg)

##                  2.5 %     97.5 %
## (Intercept) 680.32313 717.542779
## str          -3.22298  -1.336637

broom can include confidence intervals:

school_reg %>%
  tidy(conf.int = TRUE)

| term <chr> | estimate <dbl> | std.error <dbl> | conf.low <dbl> | conf.high <dbl> |
|---|---|---|---|---|
| (Intercept) | 698.932952 | 9.4674914 | 680.32313 | 717.542779 |
| str | -2.279808 | 0.4798256 | -3.22298 | -1.336637 |