class: center, middle, inverse, title-slide # 2.5 — OLS: Precision and Diagnostics ## ECON 480 • Econometrics • Fall 2021 ### Ryan Safner
Assistant Professor of Economics
safner@hood.edu
ryansafner/metricsF21
metricsF21.classes.ryansafner.com
--- class: inverse # Outline ### [Variation in `\(\hat{\beta_1}\)`](#5) ### [Presenting Regression Results](#10) ### [Diagnostics about Regression](#23) ### [Problem: Heteroskedasticity](#33) ### [Outliers](#54) --- # The Sampling Distribution of `\(\hat{\beta_1}\)` .pull-left[ .smaller[ `$$\hat{\beta_1} \sim N(E[\hat{\beta_1}], \sigma_{\hat{\beta_1}})$$` 1. .hi-purple[Center] of the distribution (last class) - `\(E[\hat{\beta_1}]=\beta_1\)`<sup>.magenta[†]</sup> ] ] .pull-right[ <img src="2.5-slides_files/figure-html/unnamed-chunk-1-1.png" width="504" /> ] --- # The Sampling Distribution of `\(\hat{\beta_1}\)` .pull-left[ .smaller[ `$$\hat{\beta_1} \sim N(E[\hat{\beta_1}], \sigma_{\hat{\beta_1}})$$` 1. .hi-purple[Center] of the distribution (last class) - `\(E[\hat{\beta_1}]=\beta_1\)`<sup>.magenta[†]</sup> 2. How .hi-purple[precise] is our estimate? (today) - <span class="hi">Variance `\(\sigma^2_{\hat{\beta_1}}\)`</span> or <span class="hi">standard error<sup>.magenta[‡]</sup> `\(\sigma_{\hat{\beta_1}}\)`</span> ] .footnote[ .quitesmall[ <sup>.magenta[†]</sup> Under the 4 assumptions about `\(u\)` (particularly, `\(cor(X,u)=0)\)`. <sup>.magenta[‡]</sup> Standard **“error”** is the analog of standard *deviation* when talking about <br> the *sampling distribution* of a sample statistic (such as `\(\bar{X}\)` or `\(\hat{\beta_1})\)`. ] ] ] .pull-right[ ![](2.5-slides_files/figure-html/unnamed-chunk-2-1.png)<!-- --> ] --- class: inverse, center, middle # Variation in `\(\hat{\beta_1}\)` --- # What Affects Variation in `\(\hat{\beta_1}\)` .pull-left[ `$$var(\hat{\beta_1})=\frac{(SER)^2}{n \times var(X)}$$` `$$se(\hat{\beta_1})=\sqrt{var(\hat{\beta_1})} = \frac{SER}{\sqrt{n} \times sd(X)}$$` ] .pull-right[ - Variation in `\(\hat{\beta_1}\)` is affected by 3 things: 1. .hi-turquoise[Goodness of fit of the model (SER)]<sup>.magenta[†]</sup> - Larger `\(SER\)` `\(\rightarrow\)` larger `\(var(\hat{\beta_1})\)` 2. .hi-turquoise[Sample size, n] - Larger `\(n\)` `\(\rightarrow\)` smaller `\(var(\hat{\beta_1})\)` 3. .hi-turquoise[Variance of X] - Larger `\(var(X)\)` `\(\rightarrow\)` smaller `\(var(\hat{\beta_1})\)` ] .footnote[.quitesmall[ <sup>.magenta[†]</sup> Recall from last class, the **S**tandard **E**rror of the **R**egression `\(\hat{\sigma_u} = \sqrt{\frac{\sum \hat{u_i}^2}{n-2}}\)`] ] --- # Variation in `\(\hat{\beta_1}\)`: Goodness of Fit .pull-left[ ![](2.5-slides_files/figure-html/unnamed-chunk-3-1.png)<!-- --> ] .pull-right[ ![](2.5-slides_files/figure-html/unnamed-chunk-4-1.png)<!-- --> ] --- # Variation in `\(\hat{\beta_1}\)`: Sample Size .pull-left[ ![](2.5-slides_files/figure-html/unnamed-chunk-5-1.png)<!-- --> ] .pull-right[ ![](2.5-slides_files/figure-html/unnamed-chunk-6-1.png)<!-- --> ] --- # Variation in `\(\hat{\beta_1}\)`: Variation in `\(X\)` .pull-left[ ![](2.5-slides_files/figure-html/unnamed-chunk-7-1.png)<!-- --> ] .pull-right[ ![](2.5-slides_files/figure-html/unnamed-chunk-8-1.png)<!-- --> ] --- class: inverse, center, middle # Presenting Regression Results --- # Our Class Size Regression: Base R .pull-left[ - How can we present all of this information in a tidy way? ] .pull-right[ .code50[ ```r summary(school_reg) # get full summary ``` ``` ## ## Call: ## lm(formula = testscr ~ str, data = CASchool) ## ## Residuals: ## Min 1Q Median 3Q Max ## -47.727 -14.251 0.483 12.822 48.540 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 698.9330 9.4675 73.825 < 2e-16 *** ## str -2.2798 0.4798 -4.751 2.78e-06 *** ## --- ## Signif. 
codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 18.58 on 418 degrees of freedom ## Multiple R-squared: 0.05124, Adjusted R-squared: 0.04897 ## F-statistic: 22.58 on 1 and 418 DF, p-value: 2.783e-06 ``` ] ] --- # Our Class Size Regression: Broom I .left-column[ .center[ ![](../images/rbroom.png) ] ] .right-column[ .smaller[ - `broom`'s `tidy()` function creates a tidy tibble of regression output .code50[ ```r # load broom library(broom) # tidy regression output tidy(school_reg) ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 699. 9.47 73.8 6.57e-242 ## 2 str -2.28 0.480 -4.75 2.78e- 6 ``` ] ] ] --- # Our Class Size Regression: Broom II - `broom`'s `glance()` gives us summary statistics about the regression ```r glance(school_reg) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.0512 0.0490 18.6 22.6 0.00000278 1 -1822. 3650. 3663. ## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int> ``` --- # Presenting Regressions in a Table .pull-left[ .smaller[ - Professional journals and papers often have a .hi-purple[regression table], including: - Estimates of `\(\hat{\beta_0}\)` and `\(\hat{\beta_1}\)` - Standard errors of `\(\hat{\beta_0}\)` and `\(\hat{\beta_1}\)` (often below, in parentheses) - Indications of statistical significance (often with asterisks) - Measures of regression fit: `\(R^2\)`, `\(SER\)`, etc - Later: multiple rows & columns for multiple variables & models ] ] .pull-right[ .smallest[ .regtable[
|           | Test Score  |
|:----------|:------------|
| Intercept | 698.93 ***  |
|           | (9.47)      |
| STR       | -2.28 ***   |
|           | (0.48)      |
| N         | 420         |
| R-Squared | 0.05        |
| SER       | 18.58       |

*** p < 0.001; ** p < 0.01; * p < 0.05.
] ] ] --- # Regression Output with huxtable I .pull-left[ .smallest[ - You will need to first `install.packages("huxtable")` - Load with `library(huxtable)` - Command: `huxreg()` - Main argument is the name of your `lm` object - Default output is fine, but often we want to customize a bit ] .code60[ ```r # install.packages("huxtable") library(huxtable) huxreg(school_reg) ``` ] ] .pull-right[ .smallest[ .regtable[
|             | (1)         |
|:------------|:------------|
| (Intercept) | 698.933 *** |
|             | (9.467)     |
| str         | -2.280 ***  |
|             | (0.480)     |
| N           | 420         |
| R2          | 0.051       |
| logLik      | -1822.250   |
| AIC         | 3650.499    |

*** p < 0.001; ** p < 0.01; * p < 0.05.
] ] ] --- # Regression Output with huxtable II .smallest[ - Can give title to each column ] .code60[ ```r "Test Score" = school_reg ``` ] -- .smallest[ - Can change name of coefficients from default ] .code60[ ```r coefs = c("Intercept" = "(Intercept)", "STR" = "str") ``` ] -- .smallest[ - Decide what statistics to include, and rename them ] .code60[ ```r statistics = c("N" = "nobs", "R-Squared" = "r.squared", "SER" = "sigma") ``` ] -- .smallest[ - Choose how many decimal places to round to ] .code60[ ```r number_format = 2 ``` ] --- # Regression Output with huxtable III .pull-left[ .code60[ ```r huxreg("Test Score" = school_reg, coefs = c("Intercept" = "(Intercept)", "STR" = "str"), statistics = c("N" = "nobs", "R-Squared" = "r.squared", "SER" = "sigma"), number_format = 2) ``` ] ] -- .pull-left[ .font70[ .regtable[
|           | Test Score  |
|:----------|:------------|
| Intercept | 698.93 ***  |
|           | (9.47)      |
| STR       | -2.28 ***   |
|           | (0.48)      |
| N         | 420         |
| R-Squared | 0.05        |
| SER       | 18.58       |

*** p < 0.001; ** p < 0.01; * p < 0.05.
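Since `huxreg()` just returns a `huxtable` object, you can also write the finished table straight to a file. A minimal sketch (the file name is hypothetical) using `huxtable`'s `quick_docx()` helper:

```r
# export the same customized table as a Word document;
# quick_latex(), quick_html(), and quick_pdf() work the same way
quick_docx(huxreg("Test Score" = school_reg, number_format = 2),
           file = "school_reg_table.docx") # hypothetical file name
```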
] ] ] --- # Regression Outputs - `huxtable` is one package you can use - See [here for more options](https://cran.r-project.org/web/packages/huxtable/vignettes/huxtable.html) - I used to only use [`stargazer`](https://cran.r-project.org/web/packages/stargazer/index.html), but as it was originally meant for Stata, it has limits and problems - A great [cheatsheet](http://jakeruss.com/cheatsheets/stargazer.html) by my friend Jake Russ --- class: inverse, center, middle # Diagnostics about Regression --- # Diagnostics: Residuals I - We often look at the residuals of a regression to get more insight about its **goodness of fit** and its **bias** - Recall `broom`'s `augment` creates some useful new variables - `.fitted` are fitted (predicted) values from the model, i.e. `\(\hat{Y}_i\)` - `.resid` are residuals (errors) from the model, i.e. `\(\hat{u}_i\)` --- # Diagnostics: Residuals II - It's often a good idea to store the output in a new object (so we can make some plots) .code60[ .smallest[ ```r aug_reg <- augment(school_reg) aug_reg %>% head() ```
| testscr | str  | .fitted | .resid | .hat    | .sigma | .cooksd  | .std.resid |
|--------:|-----:|--------:|-------:|--------:|-------:|---------:|-----------:|
| 691     | 17.9 | 658     | 32.7   | 0.00442 | 18.5   | 0.00689  | 1.76       |
| 661     | 21.5 | 650     | 11.3   | 0.00475 | 18.6   | 0.000893 | 0.612      |
| 644     | 18.7 | 656     | -12.7  | 0.00297 | 18.6   | 0.0007   | -0.685     |
| 648     | 17.4 | 659     | -11.7  | 0.00586 | 18.6   | 0.00117  | -0.629     |
| 641     | 18.7 | 656     | -15.5  | 0.00301 | 18.6   | 0.00105  | -0.836     |
| 606     | 21.4 | 650     | -44.6  | 0.00446 | 18.5   | 0.013    | -2.4       |
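Just to convince yourself what `augment()` is storing: since `.resid` is `\(Y_i - \hat{Y}_i\)`, you can rebuild it by hand (a quick sketch, not part of the original code):

```r
# sanity check: .resid should equal testscr minus .fitted, row by row
aug_reg %>%
  mutate(resid_by_hand = testscr - .fitted) %>%
  select(testscr, .fitted, .resid, resid_by_hand) %>%
  head()
```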
] ] --- # Recap: Assumptions about Errors .pull-left[ .smallest[ - We make .hi[4 critical **assumptions** about `\\(u\\)`]: 1. The expected value of the residuals is 0 `$$E[u]=0$$` 2. The variance of the residuals over `\(X\)` is constant: `$$var(u|X)=\sigma^2_{u}$$` 3. Errors are not correlated across observations: `$$cor(u_i,u_j)=0 \quad \forall i \neq j$$` 4. There is no correlation between `\(X\)` and the error term: `$$cor(X, u)=0 \text{ or } E[u|X]=0$$` ] ] .pull-right[ .center[ ![](../images/error.png) ] ] --- # Assumptions 1 and 2: Errors are i.i.d. - Assumptions 1 and 2 assume that errors are coming from the same (*normal*) distribution `$$u \sim N(0, \sigma_u)$$` - Assumption 1: `\(E[u]=0\)` - Assumption 2: `\(sd(u|X)=\sigma_u\)` - virtually always unknown... - We often can visually check by plotting a **histogram** of `\(u\)` --- # Plotting Residuals .pull-left[ .code50[ ```r ggplot(data = aug_reg)+ aes(x = .resid)+ geom_histogram(color="white", fill = "pink")+ labs(x = expression(paste("Residual, ", hat(u))))+ theme_pander(base_family = "Fira Sans Condensed", base_size=20) ``` ] ] -- .pull-right[ <img src="2.5-slides_files/figure-html/unnamed-chunk-22-1.png" width="504" /> ] --- # Plotting Residuals .pull-left[ .code50[ ```r ggplot(data = aug_reg)+ aes(x = .resid)+ geom_histogram(color="white", fill = "pink")+ labs(x = expression(paste("Residual, ", hat(u))))+ theme_pander(base_family = "Fira Sans Condensed", base_size=20) ``` ] - Just to check: ```r aug_reg %>% summarize(E_u = mean(.resid), sd_u = sd(.resid)) ```
| E_u     | sd_u |
|--------:|-----:|
| 3.7e-13 | 18.6 |
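To push the visual check one step further (a sketch I'm adding, not from the original deck), overlay the `\(N(0, \hat{\sigma_u})\)` density these numbers imply on the same histogram:

```r
# histogram on the density scale with the implied normal curve on top
ggplot(data = aug_reg)+
  aes(x = .resid)+
  geom_histogram(aes(y = after_stat(density)),
                 color = "white", fill = "pink")+
  stat_function(fun = dnorm,
                args = list(mean = 0, sd = sd(aug_reg$.resid)),
                color = "blue")
```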
] .pull-right[ <img src="2.5-slides_files/figure-html/unnamed-chunk-24-1.png" width="504" /> ] --- # Residual Plot .pull-left[ - We often plot a .hi[residual plot] to see any odd patterns about residuals - `\(x\)`-axis are `\(X\)` values (`str`) - `\(y\)`-axis are `\(u\)` values (`.resid`) .code60[ ```r ggplot(data = aug_reg)+ * aes(x = str, * y = .resid)+ geom_point(color="blue")+ geom_hline(aes(yintercept = 0), color="red")+ labs(x = "Student to Teacher Ratio", y = expression(paste("Residual, ", hat(u))))+ theme_pander(base_family = "Fira Sans Condensed", base_size=20) ``` ] ] -- .pull-right[ <img src="2.5-slides_files/figure-html/unnamed-chunk-25-1.png" width="504" /> ] --- class: inverse, center, middle # Problem: Heteroskedasticity --- # Homoskedasticity .pull-left[ - ".hi[Homoskedasticity]:" variance of the residuals over `\(X\)` is constant, written: `$$var(u|X)=\sigma^2_{u}$$` - Knowing the value of `\(X\)` does not affect the variance (spread) of the errors ] .pull-right[ .center[ ![](../images/error.png) ] ] --- # Heteroskedasticity I .pull-left[ - ".hi[Heteroskedasticity]:" variance of the residuals over `\(X\)` is *NOT* constant: `$$var(u|X) \neq \sigma^2_{u}$$` - **This does not cause `\(\hat{\beta_1}\)` to be biased**, but it does cause the standard error of `\(\hat{\beta_1}\)` to be incorrect - This **does** cause a problem for .hi-purple[inference]! ] .pull-right[ .center[ ![](../images/error.png) ] ] --- # Heteroskedasticity II - Recall the formula for the standard error of `\(\hat{\beta_1}\)`: `$$se(\hat{\beta_1})=\sqrt{var(\hat{\beta_1})} = \frac{SER}{\sqrt{n} \times sd(X)}$$` - This actually *assumes* homoskedasticity --- # Heteroskedasticity III - Under heteroskedasticity, the standard error of `\(\hat{\beta_1}\)` mutates to: `$$se(\hat{\beta_1})=\sqrt{\frac{\displaystyle\sum^n_{i=1}(X_i-\bar{X})^2\hat{u}^2}{\big[\displaystyle\sum^n_{i=1}(X_i-\bar{X})^2\big]^2}}$$` - This is a .hi[heteroskedasticity-robust] (or just .hi["robust"]) method of calculating `\(se(\hat{\beta_1})\)` - Don't learn formula, **do learn what heteroskedasticity is and how it affects our model!** --- # Visualizing Heteroskedasticity I .pull-left[ - Our original scatterplot with regression line ] .pull-right[ <img src="2.5-slides_files/figure-html/unnamed-chunk-26-1.png" width="504" /> ] --- # Visualizing Heteroskedasticity I .pull-left[ - Our original scatterplot with regression line - Does the spread of the errors change over different values of `\(str\)`? - No: homoskedastic - Yes: heteroskedastic ] .pull-right[ ![](2.5-slides_files/figure-html/unnamed-chunk-27-1.png)<!-- --> ] --- # Visualizing Heteroskedasticity I .pull-left[ - Our original scatterplot with regression line - Does the spread of the errors change over different values of `\(str\)`? - No: homoskedastic - Yes: heteroskedastic ] .pull-right[ ![](2.5-slides_files/figure-html/unnamed-chunk-28-1.png)<!-- --> ![](2.5-slides_files/figure-html/unnamed-chunk-29-1.png)<!-- --> ] --- # Heteroskedasticity: Another View .pull-left[ - Using the `ggridges` package - Plotting the (conditional) distribution of errors by STR - See that the variation in errors `\((\hat{u})\)` changes across class sizes! 
] .pull-right[ ![](2.5-slides_files/figure-html/unnamed-chunk-30-1.png)<!-- --> ] --- # More Obvious Heteroskedasticity .pull-left[ - Visual cue: data is "fan-shaped" - Data points are closer to line in some areas - Data points are more spread from line in other areas ] .pull-right[ <img src="2.5-slides_files/figure-html/unnamed-chunk-31-1.png" width="504" /> ] --- # More Obvious Heteroskedasticity .pull-left[ - Visual cue: data is "fan-shaped" - Data points are closer to line in some areas - Data points are more spread from line in other areas ] .pull-right[ <img src="2.5-slides_files/figure-html/unnamed-chunk-32-1.png" width="504" /><img src="2.5-slides_files/figure-html/unnamed-chunk-32-2.png" width="504" /> ] --- # Heteroskedasticity: Another View .pull-left[ - Using the `ggridges` package - Plotting the (conditional) distribution of errors by x ] .pull-right[ ![](2.5-slides_files/figure-html/unnamed-chunk-33-1.png)<!-- --> ] --- # What Might Cause Heteroskedastic Errors? .pull-left[ .smallest[ `$$\widehat{wage_i}=\hat{\beta_0}+\hat{\beta_1}educ_i$$`
|                    | Wage     |
|:-------------------|:---------|
| Intercept          | -0.90    |
|                    | (0.68)   |
| Years of Schooling | 0.54 *** |
|                    | (0.05)   |
| N                  | 526      |
| R-Squared          | 0.16     |
| SER                | 3.38     |

*** p < 0.001; ** p < 0.01; * p < 0.05.
] ] .pull-right[ <img src="2.5-slides_files/figure-html/unnamed-chunk-35-1.png" width="504" /> ] --- # What Might Cause Heteroskedastic Errors? .pull-left[ .smallest[ `$$\widehat{wage_i}=\hat{\beta_0}+\hat{\beta_1}educ_i$$`
|                    | Wage     |
|:-------------------|:---------|
| Intercept          | -0.90    |
|                    | (0.68)   |
| Years of Schooling | 0.54 *** |
|                    | (0.05)   |
| N                  | 526      |
| R-Squared          | 0.16     |
| SER                | 3.38     |

*** p < 0.001; ** p < 0.01; * p < 0.05.
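One way to see *why* these errors are heteroskedastic in numbers (a sketch of mine, assuming the regressor column is named `educ` and the model object is `wage_reg`, as used on the later detection slide): compute the spread of the residuals within each level of schooling and watch it grow:

```r
# sd of wage-regression residuals by years of schooling:
# a rising sd_u across educ is heteroskedasticity, quantified
augment(wage_reg) %>%
  group_by(educ) %>%
  summarize(sd_u = sd(.resid),
            n_obs = n())
```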
] ] .pull-right[ <img src="2.5-slides_files/figure-html/unnamed-chunk-37-1.png" width="504" /><img src="2.5-slides_files/figure-html/unnamed-chunk-37-2.png" width="504" /> ] --- # Heteroskedasticity: Another View .pull-left[ - Using the `ggridges` package - Plotting the (conditional) distribution of errors by education ] .pull-right[ ![](2.5-slides_files/figure-html/unnamed-chunk-38-1.png)<!-- --> ] --- # Detecting Heteroskedasticity I - Several tests to check if data is heteroskedastic - One common test is **Breusch-Pagan test** - Can use `bptest()` with `lmtest` package in `R` - `\(H_0\)`: homoskedastic - If `\(p\)`-value < 0.05, reject `\(H_0\implies\)` heteroskedastic -- .pull-left[ ```r # install.packages("lmtest") library("lmtest") bptest(school_reg) ``` ] .pull-right[ ``` ## ## studentized Breusch-Pagan test ## ## data: school_reg ## BP = 5.7936, df = 1, p-value = 0.01608 ``` ] --- # Detecting Heteroskedasticity II - How about our wage regression? .pull-left[ ```r # install.packages("lmtest") library("lmtest") bptest(wage_reg) ``` ] .pull-right[ ``` ## ## studentized Breusch-Pagan test ## ## data: wage_reg ## BP = 15.306, df = 1, p-value = 9.144e-05 ``` ] --- # Fixing Heteroskedasticity I .smallest[ - Heteroskedasticity is easy to fix with software that can calculate .hi[robust] standard errors (using the more complicated formula above) - Easiest method is to use `estimatr` package - `lm_robust()` command (instead of `lm`) to run regression - set `se_type="stata"` to calculate robust SEs using the formula above ] .code60[ ```r #install.packages("estimatr") library(estimatr) school_reg_robust <-lm_robust(testscr ~ str, data = CASchool, se_type = "stata") school_reg_robust ``` ``` ## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper ## (Intercept) 698.932952 10.3643599 67.436191 9.486678e-227 678.560192 719.305713 ## str -2.279808 0.5194892 -4.388557 1.446737e-05 -3.300945 -1.258671 ## DF ## (Intercept) 418 ## str 418 ``` ] --- # Fixing Heteroskedasticity II .pull-left[ ```r library(huxtable) huxreg("Normal" = school_reg, "Robust" = school_reg_robust, coefs = c("Intercept" = "(Intercept)", "STR" = "str"), statistics = c("N" = "nobs", "R-Squared" = "r.squared", "SER" = "sigma"), number_format = 2) ``` ] .pull-right[ .font80[ .regtable[
|           | Normal     | Robust     |
|:----------|:-----------|:-----------|
| Intercept | 698.93 *** | 698.93 *** |
|           | (9.47)     | (10.36)    |
| STR       | -2.28 ***  | -2.28 ***  |
|           | (0.48)     | (0.52)     |
| N         | 420        | 420        |
| R-Squared | 0.05       | 0.05       |
| SER       | 18.58      |            |

*** p < 0.001; ** p < 0.01; * p < 0.05.
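If you'd rather keep the original `lm` object instead of re-estimating, an equivalent route (a sketch; the slides themselves use `estimatr`) is the `sandwich` + `lmtest` pair, where `type = "HC1"` matches the Stata-style formula that `se_type = "stata"` uses:

```r
# robust SEs for the existing lm object, no re-estimation needed
library(sandwich) # vcovHC() builds the robust variance matrix
library(lmtest)   # coeftest() redoes inference with that matrix
coeftest(school_reg, vcov = vcovHC(school_reg, type = "HC1"))
```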
] ] ] --- # Assumption 3: No Serial Correlation .pull-left[ .smallest[ - Errors are not correlated across observations: `$$cor(u_i,u_j)=0 \quad \forall i \neq j$$` - For simple cross-sectional data, this is rarely an issue - Time-series & panel data nearly always contain .hi[serial correlation] or .hi[autocorrelation] between errors - Errors may be .hi[clustered] - **by group**: e.g. all observations from Maryland, all observations from Virginia, etc. - **by time**: GDP in 2006 around the world, GDP in 2008 around the world, etc. - We'll deal with these fixes when we talk about panel data (or time-series if necessary) ] ] .pull-right[ .center[ ![](../images/error.png) ] ] --- class: inverse, center, middle # Outliers --- # Outliers Can Bias OLS! I .pull-left[ - .hi[Outliers] can affect the slope (and intercept) of the line and add .hi[bias] - May be the result of human error (measurement, transcription, etc.) - May be meaningful and accurate - In any case, compare how including/dropping outliers affects the regression, and always discuss outliers! ] .pull-right[ <img src="2.5-slides_files/figure-html/outlier-plot-1.png" width="504" /> ] --- # Outliers Can Bias OLS! II .pull-left[ .quitesmall[ .regtable[ .code60[ ```r huxreg("No Outliers" = school_reg, "Outliers" = school_outlier_reg, coefs = c("Intercept" = "(Intercept)", "STR" = "str"), statistics = c("N" = "nobs", "R-Squared" = "r.squared", "SER" = "sigma"), number_format = 2) ```
|           | No Outliers | Outliers   |
|:----------|:------------|:-----------|
| Intercept | 698.93 ***  | 641.40 *** |
|           | (9.47)      | (11.21)    |
| STR       | -2.28 ***   | 0.71       |
|           | (0.48)      | (0.57)     |
| N         | 420         | 423        |
| R-Squared | 0.05        | 0.00       |
| SER       | 18.58       | 23.76      |

*** p < 0.001; ** p < 0.01; * p < 0.05.
] ] ] ] .pull-right[ ![](2.5-slides_files/figure-html/unnamed-chunk-48-1.png)<!-- --> ] --- # Detecting Outliers .quitesmall[ - The `car` package has an `outlierTest` command to run on the regression .code60[ ```r library("car") # Use Bonferroni test outlierTest(school_outlier_reg) # will point out which obs #s seem outliers ``` ``` ## rstudent unadjusted p-value Bonferroni p ## 422 8.822768 3.0261e-17 1.2800e-14 ## 423 7.233470 2.2493e-12 9.5147e-10 ## 421 6.232045 1.1209e-09 4.7414e-07 ``` ```r # find these observations CA.outlier %>% slice(c(422,423,421)) ```
| observat | district       | testscr | str |
|---------:|:---------------|--------:|----:|
| 422      | Crazy School 2 | 850     | 28  |
| 423      | Crazy School 3 | 820     | 29  |
| 421      | Crazy School 1 | 800     | 30  |
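Once the test flags specific rows, the natural follow-up (a sketch, assuming the `CA.outlier` data and `observat` column shown above) is to drop them and re-estimate, so you can make exactly the comparison recommended earlier:

```r
# drop the flagged observations and re-run OLS for comparison
CA_no_outliers <- CA.outlier %>%
  filter(!observat %in% c(421, 422, 423))
lm(testscr ~ str, data = CA_no_outliers)
```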
] ]