Problem for identification: endogeneity
Problem for inference: randomness
OLS estimators (^β0 and ^β1) are computed from a finite (specific) sample of data
Our OLS model contains 2 sources of randomness:
Modeled randomness: u includes all factors affecting Y other than X
Sampling randomness: different samples will generate different OLS estimators
Sample —statistical inference→ Population —causal identification→ Unobserved Parameters
We want to identify causal relationships between population variables
We'll use sample statistics to infer something about population parameters
Population relationship:
Yi = β0 + β1Xi + ui
Yi = 3.24 + 0.44Xi + ui
Sample 1: 30 random individuals
Population relationship: Yi = 3.24 + 0.44Xi + ui
Sample relationship: ^Yi = 3.19 + 0.47Xi

Sample 2: 30 random individuals
Population relationship: Yi = 3.24 + 0.44Xi + ui
Sample relationship: ^Yi = 4.26 + 0.25Xi

Sample 3: 30 random individuals
Population relationship: Yi = 3.24 + 0.44Xi + ui
Sample relationship: ^Yi = 2.91 + 0.46Xi
Let's repeat this process 10,000 times! This exercise is called a (Monte Carlo) simulation, which we can run with the infer package.

On average, estimated regression lines from our hypothetical samples provide an unbiased estimate of the true population regression line: E[^β1] = β1
However, any individual line (any one sample) can miss the mark
This leads to uncertainty about our estimated regression line
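The repeated-sampling (Monte Carlo) exercise above can be sketched in a few lines of base R. This is a minimal sketch: the population intercept 3.24 and slope 0.44 come from the slides, but the distributions of X and u, the seed, and the sample size of 30 per draw are illustrative assumptions.

```r
set.seed(42)

# draw one sample of 30 from the assumed population Y = 3.24 + 0.44X + u
one_sample_slope <- function(n = 30) {
  x <- rnorm(n, mean = 5, sd = 2)   # hypothetical X distribution
  u <- rnorm(n, mean = 0, sd = 1)   # modeled randomness
  y <- 3.24 + 0.44 * x + u
  coef(lm(y ~ x))[2]                # the OLS slope estimate from this sample
}

# repeat 10,000 times: each draw gives a different slope estimate
slopes <- replicate(10000, one_sample_slope())

mean(slopes)   # centers on 0.44: unbiasedness, E[slope-hat] = beta-1
sd(slopes)     # sampling variation: any one sample can miss the mark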
Sample —statistical inference→ Population —causal identification→ Unobserved Parameters
^Yi = ^β0 + ^β1Xi   🤞 hopefully 🤞→   Yi = β0 + β1Xi + ui
The problem with uncertainty is that we don’t know whether our sample estimate is close to or far from the unknown population parameter
But we can use our errors to learn how well our model statistics likely estimate the true parameters
Use ^β1 and its standard error, se(^β1), for statistical inference about the true β1
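As a concrete instance of using ^β1 and se(^β1) together, we can recompute the t-statistic against the null hypothesis β1 = 0 from the class-size regression output reported later in the deck (estimate −2.279808, standard error 0.4798256):

```r
beta_hat <- -2.279808   # estimated slope on str
se_hat   <-  0.4798256  # its standard error

# t-statistic for H0: beta-1 = 0
t_stat <- (beta_hat - 0) / se_hat
t_stat   # ≈ -4.75, matching the statistic column in the regression output
```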
We have two options...
Point estimate
Use our ^β1 & se(^β1) to determine whether there is statistically significant evidence to reject a hypothesized β1
Reporting a single value (^β1) that is often not going to be exactly the true population parameter (β1)
Confidence interval
We can construct a confidence interval by generating a “bootstrap” sampling distribution
This takes our sample data, and resamples it by selecting random observations with replacement
This allows us to approximate the sampling distribution of ^β1 by simulation!
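The idea behind the bootstrap can be sketched by hand before turning to infer. This is a minimal base-R sketch of the concept (not the infer implementation), assuming the CASchool data frame used throughout the deck is loaded:

```r
set.seed(42)

# resample the rows of our data WITH replacement, re-estimate the slope
boot_slope <- function(data) {
  rows <- sample(nrow(data), replace = TRUE)  # random observations, with replacement
  resample <- data[rows, ]
  coef(lm(testscr ~ str, data = resample))[2]
}

# 1,000 bootstrap replicates approximate the sampling distribution of the slope
boot_slopes <- replicate(1000, boot_slope(CASchool))

# middle 95% of the bootstrap distribution: a percentile confidence interval
quantile(boot_slopes, c(0.025, 0.975))
```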
The infer package allows you to do statistical inference in a tidy way, following the philosophy of the tidyverse.

```r
# install first!
install.packages("infer")

# load
library(infer)
```
infer allows you to run through these steps manually to understand the process:
1. specify() a model
2. generate() a bootstrap distribution
3. calculate() the confidence interval
4. visualize() with a histogram (optional)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 698.932952 | 9.4674914 | 73.824514 | 6.569925e-242 |
| str | -2.279808 | 0.4798256 | -4.751327 | 2.783307e-06 |

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 708.270835 | 9.5041448 | 74.522311 | 1.698400e-243 |
| str | -2.797334 | 0.4802065 | -5.825274 | 1.139188e-08 |

👆 Bootstrapped from Our Sample
```r
data %>%
  specify(y ~ x)
```

The specify() function is essentially a lm() function for regression (for our purposes):

```r
CASchool %>%
  specify(testscr ~ str)
```
| testscr | str |
|---|---|
| 690.80 | 17.88991 |
| 661.20 | 21.52466 |
| 643.60 | 18.69723 |
| 647.70 | 17.35714 |
| 640.85 | 18.67133 |
```r
%>% generate(reps = n,
             type = "bootstrap")
```

Now the magic starts, as we run a number of simulated samples. Set the number of reps and set type to "bootstrap":

```r
CASchool %>%
  specify(testscr ~ str) %>%
  generate(reps = 1000, type = "bootstrap")
```
generate() creates:
- replicate: the “sample” number (1–1000)
- x and y values (data points)
```r
%>% calculate(stat = "slope")
```

For each of the 1,000 replicates, calculate the slope in lm(testscr ~ str); infer calls it the stat:

```r
boot <- CASchool %>%   # save this
  specify(testscr ~ str) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "slope")
```
boot is (our simulated) sampling distribution of ^β1! We can now use this to estimate the confidence interval from our ^β1 = −2.28, and visualize it.
```r
sampling_dist <- ggplot(data = boot) +
  aes(x = stat) +
  geom_histogram(color = "white", fill = "#e64173") +
  labs(x = expression(hat(beta[1]))) +
  theme_pander(base_family = "Fira Sans Condensed", base_size = 20)
sampling_dist
```

```r
ci <- boot %>%
  summarize(lower = quantile(stat, 0.025),
            upper = quantile(stat, 0.975))
ci
```

| lower | upper |
|---|---|
| -3.340545 | -1.238815 |

```r
sampling_dist +
  geom_vline(data = ci, aes(xintercept = lower), size = 1, linetype = "dashed") +
  geom_vline(data = ci, aes(xintercept = upper), size = 1, linetype = "dashed")
```
```r
%>% get_confidence_interval()
```

```r
CASchool %>%
  specify(testscr ~ str) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "slope") %>%
  get_confidence_interval(level = 0.95, type = "se", point_estimate = -2.28)
```

| lower_ci | upper_ci |
|---|---|
| -3.273376 | -1.286624 |
```r
tidy_reg <- school_reg %>%
  tidy(conf.int = TRUE)
tidy_reg
```

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 698.932952 | 9.4674914 | 73.824514 | 6.569925e-242 | 680.32313 | 717.542779 |
| str | -2.279808 | 0.4798256 | -4.751327 | 2.783307e-06 | -3.22298 | -1.336637 |

```r
# save and extract confidence interval
our_CI <- tidy_reg %>%
  filter(term == "str") %>%
  select(conf.low, conf.high)
our_CI
```

| conf.low | conf.high |
|---|---|
| -3.22298 | -1.336637 |
```r
%>% visualize()
```

```r
CASchool %>%
  specify(testscr ~ str) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "slope") %>%
  visualize()
```

visualize() is just a wrapper for ggplot().

With our saved confidence interval (our_CI) we can shade_ci() in infer’s visualize() function:

```r
CASchool %>%
  specify(testscr ~ str) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "slope") %>%
  visualize() +
  shade_ci(endpoints = our_CI)
```

A confidence interval takes the form ([ estimate − MOE ], [ estimate + MOE ])
(1−α) is the confidence level of our confidence interval
Typical levels: 90%, 95%, 99%
Depending on our confidence level, we are essentially looking for the middle (1−α)% of the sampling distribution
This puts α in the tails; α/2 in each tail
Recall the 68-95-99.7% empirical rule for (standard) normal distributions!†
95% of data falls within 2 standard deviations of the mean
Thus, in 95% of samples, the true parameter is likely to fall within about 2 standard deviations of the sample estimate
† I’m playing fast and loose here, we can’t actually use the normal distribution, we use the Student’s t-distribution with n-k-1 degrees of freedom. But there’s no need to complicate things you don’t need to know about. Look at today’s class notes for more.
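To connect the footnote to the numbers above, a quick sketch reproduces tidy()'s interval from just the estimate and standard error, assuming the CASchool sample of n = 420 districts, so n − k − 1 = 418 degrees of freedom:

```r
# critical t-value for a 95% CI with 418 degrees of freedom
t_crit <- qt(0.975, df = 418)   # slightly larger than the normal's 1.96

beta_hat <- -2.279808   # estimated slope on str
se_hat   <-  0.4798256  # its standard error

# CI = estimate ± t* × se
c(lower = beta_hat - t_crit * se_hat,
  upper = beta_hat + t_crit * se_hat)
# ≈ (-3.22, -1.34), matching conf.low and conf.high from tidy()
```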
❌ 95% of the time, the true effect of class size on test score will be between -3.22 and -1.33
❌ We are 95% confident that a randomly selected school district will have an effect of class size on test score between -3.22 and -1.33
❌ The effect of class size on test score is -2.28 95% of the time.
✅ We are 95% confident that in similarly constructed samples, the true effect is between -3.22 and -1.33
Without infer: confidence intervals are not in the lm summary() output; we need the confint command:

```r
confint(school_reg)
```

```
##                  2.5 %     97.5 %
## (Intercept) 680.32313 717.542779
## str          -3.22298  -1.336637
```
broom's tidy() can include confidence intervals:

```r
school_reg %>%
  tidy(conf.int = TRUE)
```

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 698.932952 | 9.4674914 | 73.824514 | 6.569925e-242 |
| str | -2.279808 | 0.4798256 | -4.751327 | 2.783307e-06 |