class: center, middle, inverse, title-slide # 2.5 — OLS: Precision and Diagnostics ## ECON 480 • Econometrics • Fall 2021 ### Ryan Safner
Assistant Professor of Economics
safner@hood.edu
ryansafner/metricsF21
metricsF21.classes.ryansafner.com
--- class: inverse # Outline ### [Variation in `\(\hat{\beta_1}\)`](#5) ### [Presenting Regression Results](#10) ### [Diagnostics about Regression](#23) ### [Problem: Heteroskedasticity](#33) ### [Outliers](#54) --- # The Sampling Distribution of `\(\hat{\beta_1}\)` .pull-left[ .smaller[ `$$\hat{\beta_1} \sim N(E[\hat{\beta_1}], \sigma_{\hat{\beta_1}})$$` 1. .hi-purple[Center] of the distribution (last class) - `\(E[\hat{\beta_1}]=\beta_1\)`<sup>.magenta[†]</sup> ] ] .pull-right[ <img src="2.5-slides_files/figure-html/unnamed-chunk-1-1.png" width="504" /> ] --- # The Sampling Distribution of `\(\hat{\beta_1}\)` .pull-left[ .smaller[ `$$\hat{\beta_1} \sim N(E[\hat{\beta_1}], \sigma_{\hat{\beta_1}})$$` 1. .hi-purple[Center] of the distribution (last class) - `\(E[\hat{\beta_1}]=\beta_1\)`<sup>.magenta[†]</sup> 2. How .hi-purple[precise] is our estimate? (today) - <span class="hi">Variance `\(\sigma^2_{\hat{\beta_1}}\)`</span> or <span class="hi">standard error<sup>.magenta[‡]</sup> `\(\sigma_{\hat{\beta_1}}\)`</span> ] .footnote[ .quitesmall[ <sup>.magenta[†]</sup> Under the 4 assumptions about `\(u\)` (particularly, `\(cor(X,u)=0)\)`. <sup>.magenta[‡]</sup> Standard **“error”** is the analog of standard *deviation* when talking about <br> the *sampling distribution* of a sample statistic (such as `\(\bar{X}\)` or `\(\hat{\beta_1})\)`. ] ] ] .pull-right[ ![](2.5-slides_files/figure-html/unnamed-chunk-2-1.png)<!-- --> ] --- class: inverse, center, middle # Variation in `\(\hat{\beta_1}\)` --- # What Affects Variation in `\(\hat{\beta_1}\)` .pull-left[ `$$var(\hat{\beta_1})=\frac{(SER)^2}{n \times var(X)}$$` `$$se(\hat{\beta_1})=\sqrt{var(\hat{\beta_1})} = \frac{SER}{\sqrt{n} \times sd(X)}$$` ] .pull-right[ - Variation in `\(\hat{\beta_1}\)` is affected by 3 things: 1. .hi-turquoise[Goodness of fit of the model (SER)]<sup>.magenta[†]</sup> - Larger `\(SER\)` `\(\rightarrow\)` larger `\(var(\hat{\beta_1})\)` 2. .hi-turquoise[Sample size, n] - Larger `\(n\)` `\(\rightarrow\)` smaller `\(var(\hat{\beta_1})\)` 3. .hi-turquoise[Variance of X] - Larger `\(var(X)\)` `\(\rightarrow\)` smaller `\(var(\hat{\beta_1})\)` ] .footnote[.quitesmall[ <sup>.magenta[†]</sup> Recall from last class, the **S**tandard **E**rror of the **R**egression `\(\hat{\sigma_u} = \sqrt{\frac{\sum \hat{u_i}^2}{n-2}}\)`] ] --- # Variation in `\(\hat{\beta_1}\)`: Goodness of Fit .pull-left[ ![](2.5-slides_files/figure-html/unnamed-chunk-3-1.png)<!-- --> ] .pull-right[ ![](2.5-slides_files/figure-html/unnamed-chunk-4-1.png)<!-- --> ] --- # Variation in `\(\hat{\beta_1}\)`: Sample Size .pull-left[ ![](2.5-slides_files/figure-html/unnamed-chunk-5-1.png)<!-- --> ] .pull-right[ ![](2.5-slides_files/figure-html/unnamed-chunk-6-1.png)<!-- --> ] --- # Variation in `\(\hat{\beta_1}\)`: Variation in `\(X\)` .pull-left[ ![](2.5-slides_files/figure-html/unnamed-chunk-7-1.png)<!-- --> ] .pull-right[ ![](2.5-slides_files/figure-html/unnamed-chunk-8-1.png)<!-- --> ] --- class: inverse, center, middle # Presenting Regression Results --- # Our Class Size Regression: Base R .pull-left[ - How can we present all of this information in a tidy way? ] .pull-right[ .code50[ ```r summary(school_reg) # get full summary ``` ``` ## ## Call: ## lm(formula = testscr ~ str, data = CASchool) ## ## Residuals: ## Min 1Q Median 3Q Max ## -47.727 -14.251 0.483 12.822 48.540 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 698.9330 9.4675 73.825 < 2e-16 *** ## str -2.2798 0.4798 -4.751 2.78e-06 *** ## --- ## Signif. 
codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 18.58 on 418 degrees of freedom ## Multiple R-squared: 0.05124, Adjusted R-squared: 0.04897 ## F-statistic: 22.58 on 1 and 418 DF, p-value: 2.783e-06 ``` ] ] --- # Our Class Size Regression: Broom I .left-column[ .center[ ![](../images/rbroom.png) ] ] .right-column[ .smaller[ - `broom`'s `tidy()` function creates a tidy tibble of regression output .code50[ ```r # load broom library(broom) # tidy regression output tidy(school_reg) ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 699. 9.47 73.8 6.57e-242 ## 2 str -2.28 0.480 -4.75 2.78e- 6 ``` ] ] ] --- # Our Class Size Regression: Broom II - `broom`'s `glance()` gives us summary statistics about the regression ```r glance(school_reg) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.0512 0.0490 18.6 22.6 0.00000278 1 -1822. 3650. 3663. ## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int> ``` --- # Presenting Regressions in a Table .pull-left[ .smaller[ - Professional journals and papers often have a .hi-purple[regression table], including: - Estimates of `\(\hat{\beta_0}\)` and `\(\hat{\beta_1}\)` - Standard errors of `\(\hat{\beta_0}\)` and `\(\hat{\beta_1}\)` (often below, in parentheses) - Indications of statistical significance (often with asterisks) - Measures of regression fit: `\(R^2\)`, `\(SER\)`, etc - Later: multiple rows & columns for multiple variables & models ] ] .pull-right[ .smallest[ .regtable[
|           | Test Score  |
|:----------|:------------|
| Intercept | 698.93 ***  |
|           | (9.47)      |
| STR       | -2.28 ***   |
|           | (0.48)      |
| N         | 420         |
| R-Squared | 0.05        |
| SER       | 18.58       |

*** p < 0.001; ** p < 0.01; * p < 0.05.
] ] ] --- # Regression Output with huxtable I .pull-left[ .smallest[ - You will need to first `install.packages("huxtable")` - Load with `library(huxtable)` - Command: `huxreg()` - Main argument is the name of your `lm` object - Default output is fine, but often we want to customize a bit ] .code60[ ```r # install.packages("huxtable") library(huxtable) huxreg(school_reg) ``` ] ] .pull-right[ .smallest[ .regtable[
|             | (1)         |
|:------------|:------------|
| (Intercept) | 698.933 *** |
|             | (9.467)     |
| str         | -2.280 ***  |
|             | (0.480)     |
| N           | 420         |
| R2          | 0.051       |
| logLik      | -1822.250   |
| AIC         | 3650.499    |

*** p < 0.001; ** p < 0.01; * p < 0.05.
] ] ] --- # Regression Output with huxtable II .smallest[ - Can give title to each column ] .code60[ ```r "Test Score" = school_reg ``` ] -- .smallest[ - Can change name of coefficients from default ] .code60[ ```r coefs = c("Intercept" = "(Intercept)", "STR" = "str") ``` ] -- .smallest[ - Decide what statistics to include, and rename them ] .code60[ ```r statistics = c("N" = "nobs", "R-Squared" = "r.squared", "SER" = "sigma") ``` ] -- .smallest[ - Choose how many decimal places to round to ] .code60[ ```r number_format = 2 ``` ] --- # Regression Output with huxtable III .pull-left[ .code60[ ```r huxreg("Test Score" = school_reg, coefs = c("Intercept" = "(Intercept)", "STR" = "str"), statistics = c("N" = "nobs", "R-Squared" = "r.squared", "SER" = "sigma"), number_format = 2) ``` ] ] -- .pull-left[ .font70[ .regtable[
|           | Test Score  |
|:----------|:------------|
| Intercept | 698.93 ***  |
|           | (9.47)      |
| STR       | -2.28 ***   |
|           | (0.48)      |
| N         | 420         |
| R-Squared | 0.05        |
| SER       | 18.58       |

*** p < 0.001; ** p < 0.01; * p < 0.05.
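Since `huxreg()` just returns a `huxtable` object, you can also write the finished table straight to a file. A minimal sketch (the file name is hypothetical) using `huxtable`'s `quick_docx()` helper:

```r
# export the same customized table as a Word document;
# quick_latex(), quick_html(), and quick_pdf() work the same way
quick_docx(huxreg("Test Score" = school_reg, number_format = 2),
           file = "school_reg_table.docx") # hypothetical file name
```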
] ] ] --- # Regression Outputs - `huxtable` is one package you can use - See [here for more options](https://cran.r-project.org/web/packages/huxtable/vignettes/huxtable.html) - I used to only use [`stargazer`](https://cran.r-project.org/web/packages/stargazer/index.html), but as it was originally meant for Stata, it has limits and problems - A great [cheatsheet](http://jakeruss.com/cheatsheets/stargazer.html) by my friend Jake Russ --- class: inverse, center, middle # Diagnostics about Regression --- # Diagnostics: Residuals I - We often look at the residuals of a regression to get more insight about its **goodness of fit** and its **bias** - Recall `broom`'s `augment` creates some useful new variables - `.fitted` are fitted (predicted) values from the model, i.e. `\(\hat{Y}_i\)` - `.resid` are residuals (errors) from the model, i.e. `\(\hat{u}_i\)` --- # Diagnostics: Residuals II - It's often a good idea to store the output in a new object (so we can make some plots) .code60[ .smallest[ ```r aug_reg <- augment(school_reg) aug_reg %>% head() ```
| testscr | str  | .fitted | .resid | .hat    | .sigma | .cooksd  | .std.resid |
|--------:|-----:|--------:|-------:|--------:|-------:|---------:|-----------:|
| 691     | 17.9 | 658     | 32.7   | 0.00442 | 18.5   | 0.00689  | 1.76       |
| 661     | 21.5 | 650     | 11.3   | 0.00475 | 18.6   | 0.000893 | 0.612      |
| 644     | 18.7 | 656     | -12.7  | 0.00297 | 18.6   | 0.0007   | -0.685     |
| 648     | 17.4 | 659     | -11.7  | 0.00586 | 18.6   | 0.00117  | -0.629     |
| 641     | 18.7 | 656     | -15.5  | 0.00301 | 18.6   | 0.00105  | -0.836     |
| 606     | 21.4 | 650     | -44.6  | 0.00446 | 18.5   | 0.013    | -2.4       |
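Just to convince yourself what `augment()` is storing: since `.resid` is `\(Y_i - \hat{Y}_i\)`, you can rebuild it by hand (a quick sketch, not part of the original code):

```r
# sanity check: .resid should equal testscr minus .fitted, row by row
aug_reg %>%
  mutate(resid_by_hand = testscr - .fitted) %>%
  select(testscr, .fitted, .resid, resid_by_hand) %>%
  head()
```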
] ] --- # Recap: Assumptions about Errors .pull-left[ .smallest[ - We make .hi[4 critical **assumptions** about `\\(u\\)`]: 1. The expected value of the residuals is 0 `$$E[u]=0$$` 2. The variance of the residuals over `\(X\)` is constant: `$$var(u|X)=\sigma^2_{u}$$` 3. Errors are not correlated across observations: `$$cor(u_i,u_j)=0 \quad \forall i \neq j$$` 4. There is no correlation between `\(X\)` and the error term: `$$cor(X, u)=0 \text{ or } E[u|X]=0$$` ] ] .pull-right[ .center[ ![](../images/error.png) ] ] --- # Assumptions 1 and 2: Errors are i.i.d. - Assumptions 1 and 2 assume that errors are coming from the same (*normal*) distribution `$$u \sim N(0, \sigma_u)$$` - Assumption 1: `\(E[u]=0\)` - Assumption 2: `\(sd(u|X)=\sigma_u\)` - virtually always unknown... - We often can visually check by plotting a **histogram** of `\(u\)` --- # Plotting Residuals .pull-left[ .code50[ ```r ggplot(data = aug_reg)+ aes(x = .resid)+ geom_histogram(color="white", fill = "pink")+ labs(x = expression(paste("Residual, ", hat(u))))+ theme_pander(base_family = "Fira Sans Condensed", base_size=20) ``` ] ] -- .pull-right[ <img src="2.5-slides_files/figure-html/unnamed-chunk-22-1.png" width="504" /> ] --- # Plotting Residuals .pull-left[ .code50[ ```r ggplot(data = aug_reg)+ aes(x = .resid)+ geom_histogram(color="white", fill = "pink")+ labs(x = expression(paste("Residual, ", hat(u))))+ theme_pander(base_family = "Fira Sans Condensed", base_size=20) ``` ] - Just to check: ```r aug_reg %>% summarize(E_u = mean(.resid), sd_u = sd(.resid)) ```
| E_u     | sd_u |
|--------:|-----:|
| 3.7e-13 | 18.6 |
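To push the visual check one step further (a sketch I'm adding, not from the original deck), overlay the `\(N(0, \hat{\sigma_u})\)` density these numbers imply on the same histogram:

```r
# histogram on the density scale with the implied normal curve on top
ggplot(data = aug_reg)+
  aes(x = .resid)+
  geom_histogram(aes(y = after_stat(density)),
                 color = "white", fill = "pink")+
  stat_function(fun = dnorm,
                args = list(mean = 0, sd = sd(aug_reg$.resid)),
                color = "blue")
```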
] .pull-right[ <img src="2.5-slides_files/figure-html/unnamed-chunk-24-1.png" width="504" /> ] --- # Residual Plot .pull-left[ - We often plot a .hi[residual plot] to see any odd patterns about residuals - `\(x\)`-axis are `\(X\)` values (`str`) - `\(y\)`-axis are `\(u\)` values (`.resid`) .code60[ ```r ggplot(data = aug_reg)+ * aes(x = str, * y = .resid)+ geom_point(color="blue")+ geom_hline(aes(yintercept = 0), color="red")+ labs(x = "Student to Teacher Ratio", y = expression(paste("Residual, ", hat(u))))+ theme_pander(base_family = "Fira Sans Condensed", base_size=20) ``` ] ] -- .pull-right[ <img src="2.5-slides_files/figure-html/unnamed-chunk-25-1.png" width="504" /> ] --- class: inverse, center, middle # Problem: Heteroskedasticity --- # Homoskedasticity .pull-left[ - ".hi[Homoskedasticity]:" variance of the residuals over `\(X\)` is constant, written: `$$var(u|X)=\sigma^2_{u}$$` - Knowing the value of `\(X\)` does not affect the variance (spread) of the errors ] .pull-right[ .center[ ![](../images/error.png) ] ] --- # Heteroskedasticity I .pull-left[ - ".hi[Heteroskedasticity]:" variance of the residuals over `\(X\)` is *NOT* constant: `$$var(u|X) \neq \sigma^2_{u}$$` - **This does not cause `\(\hat{\beta_1}\)` to be biased**, but it does cause the standard error of `\(\hat{\beta_1}\)` to be incorrect - This **does** cause a problem for .hi-purple[inference]! ] .pull-right[ .center[ ![](../images/error.png) ] ] --- # Heteroskedasticity II - Recall the formula for the standard error of `\(\hat{\beta_1}\)`: `$$se(\hat{\beta_1})=\sqrt{var(\hat{\beta_1})} = \frac{SER}{\sqrt{n} \times sd(X)}$$` - This actually *assumes* homoskedasticity --- # Heteroskedasticity III - Under heteroskedasticity, the standard error of `\(\hat{\beta_1}\)` mutates to: `$$se(\hat{\beta_1})=\sqrt{\frac{\displaystyle\sum^n_{i=1}(X_i-\bar{X})^2\hat{u}^2}{\big[\displaystyle\sum^n_{i=1}(X_i-\bar{X})^2\big]^2}}$$` - This is a .hi[heteroskedasticity-robust] (or just .hi["robust"]) method of calculating `\(se(\hat{\beta_1})\)` - Don't learn formula, **do learn what heteroskedasticity is and how it affects our model!** --- # Visualizing Heteroskedasticity I .pull-left[ - Our original scatterplot with regression line ] .pull-right[ <img src="2.5-slides_files/figure-html/unnamed-chunk-26-1.png" width="504" /> ] --- # Visualizing Heteroskedasticity I .pull-left[ - Our original scatterplot with regression line - Does the spread of the errors change over different values of `\(str\)`? - No: homoskedastic - Yes: heteroskedastic ] .pull-right[ ![](2.5-slides_files/figure-html/unnamed-chunk-27-1.png)<!-- --> ] --- # Visualizing Heteroskedasticity I .pull-left[ - Our original scatterplot with regression line - Does the spread of the errors change over different values of `\(str\)`? - No: homoskedastic - Yes: heteroskedastic ] .pull-right[ ![](2.5-slides_files/figure-html/unnamed-chunk-28-1.png)<!-- --> ![](2.5-slides_files/figure-html/unnamed-chunk-29-1.png)<!-- --> ] --- # Heteroskedasticity: Another View .pull-left[ - Using the `ggridges` package - Plotting the (conditional) distribution of errors by STR - See that the variation in errors `\((\hat{u})\)` changes across class sizes! 
] .pull-right[ ![](2.5-slides_files/figure-html/unnamed-chunk-30-1.png)<!-- --> ] --- # More Obvious Heteroskedasticity .pull-left[ - Visual cue: data is "fan-shaped" - Data points are closer to line in some areas - Data points are more spread from line in other areas ] .pull-right[ <img src="2.5-slides_files/figure-html/unnamed-chunk-31-1.png" width="504" /> ] --- # More Obvious Heteroskedasticity .pull-left[ - Visual cue: data is "fan-shaped" - Data points are closer to line in some areas - Data points are more spread from line in other areas ] .pull-right[ <img src="2.5-slides_files/figure-html/unnamed-chunk-32-1.png" width="504" /><img src="2.5-slides_files/figure-html/unnamed-chunk-32-2.png" width="504" /> ] --- # Heteroskedasticity: Another View .pull-left[ - Using the `ggridges` package - Plotting the (conditional) distribution of errors by x ] .pull-right[ ![](2.5-slides_files/figure-html/unnamed-chunk-33-1.png)<!-- --> ] --- # What Might Cause Heteroskedastic Errors? .pull-left[ .smallest[ `$$\widehat{wage_i}=\hat{\beta_0}+\hat{\beta_1}educ_i$$`
|                    | Wage     |
|:-------------------|:---------|
| Intercept          | -0.90    |
|                    | (0.68)   |
| Years of Schooling | 0.54 *** |
|                    | (0.05)   |
| N                  | 526      |
| R-Squared          | 0.16     |
| SER                | 3.38     |

*** p < 0.001; ** p < 0.01; * p < 0.05.
] ] .pull-right[ <img src="2.5-slides_files/figure-html/unnamed-chunk-35-1.png" width="504" /> ] --- # What Might Cause Heteroskedastic Errors? .pull-left[ .smallest[ `$$\widehat{wage_i}=\hat{\beta_0}+\hat{\beta_1}educ_i$$`
|                    | Wage     |
|:-------------------|:---------|
| Intercept          | -0.90    |
|                    | (0.68)   |
| Years of Schooling | 0.54 *** |
|                    | (0.05)   |
| N                  | 526      |
| R-Squared          | 0.16     |
| SER                | 3.38     |

*** p < 0.001; ** p < 0.01; * p < 0.05.
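One way to see *why* these errors are heteroskedastic in numbers (a sketch of mine, assuming the regressor column is named `educ` and the model object is `wage_reg`, as used on the later detection slide): compute the spread of the residuals within each level of schooling and watch it grow:

```r
# sd of wage-regression residuals by years of schooling:
# a rising sd_u across educ is heteroskedasticity, quantified
augment(wage_reg) %>%
  group_by(educ) %>%
  summarize(sd_u = sd(.resid),
            n_obs = n())
```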
] ] .pull-right[ <img src="2.5-slides_files/figure-html/unnamed-chunk-37-1.png" width="504" /><img src="2.5-slides_files/figure-html/unnamed-chunk-37-2.png" width="504" /> ] --- # Heteroskedasticity: Another View .pull-left[ - Using the `ggridges` package - Plotting the (conditional) distribution of errors by education ] .pull-right[ ![](2.5-slides_files/figure-html/unnamed-chunk-38-1.png)<!-- --> ] --- # Detecting Heteroskedasticity I - Several tests to check if data is heteroskedastic - One common test is **Breusch-Pagan test** - Can use `bptest()` with `lmtest` package in `R` - `\(H_0\)`: homoskedastic - If `\(p\)`-value < 0.05, reject `\(H_0\implies\)` heteroskedastic -- .pull-left[ ```r # install.packages("lmtest") library("lmtest") bptest(school_reg) ``` ] .pull-right[ ``` ## ## studentized Breusch-Pagan test ## ## data: school_reg ## BP = 5.7936, df = 1, p-value = 0.01608 ``` ] --- # Detecting Heteroskedasticity II - How about our wage regression? .pull-left[ ```r # install.packages("lmtest") library("lmtest") bptest(wage_reg) ``` ] .pull-right[ ``` ## ## studentized Breusch-Pagan test ## ## data: wage_reg ## BP = 15.306, df = 1, p-value = 9.144e-05 ``` ] --- # Fixing Heteroskedasticity I .smallest[ - Heteroskedasticity is easy to fix with software that can calculate .hi[robust] standard errors (using the more complicated formula above) - Easiest method is to use `estimatr` package - `lm_robust()` command (instead of `lm`) to run regression - set `se_type="stata"` to calculate robust SEs using the formula above ] .code60[ ```r #install.packages("estimatr") library(estimatr) school_reg_robust <-lm_robust(testscr ~ str, data = CASchool, se_type = "stata") school_reg_robust ``` ``` ## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper ## (Intercept) 698.932952 10.3643599 67.436191 9.486678e-227 678.560192 719.305713 ## str -2.279808 0.5194892 -4.388557 1.446737e-05 -3.300945 -1.258671 ## DF ## (Intercept) 418 ## str 418 ``` ] --- # Fixing Heteroskedasticity II .pull-left[ ```r library(huxtable) huxreg("Normal" = school_reg, "Robust" = school_reg_robust, coefs = c("Intercept" = "(Intercept)", "STR" = "str"), statistics = c("N" = "nobs", "R-Squared" = "r.squared", "SER" = "sigma"), number_format = 2) ``` ] .pull-right[ .font80[ .regtable[
|           | Normal     | Robust     |
|:----------|:-----------|:-----------|
| Intercept | 698.93 *** | 698.93 *** |
|           | (9.47)     | (10.36)    |
| STR       | -2.28 ***  | -2.28 ***  |
|           | (0.48)     | (0.52)     |
| N         | 420        | 420        |
| R-Squared | 0.05       | 0.05       |
| SER       | 18.58      |            |

*** p < 0.001; ** p < 0.01; * p < 0.05.
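If you'd rather keep the original `lm` object instead of re-estimating, an equivalent route (a sketch; the slides themselves use `estimatr`) is the `sandwich` + `lmtest` pair, where `type = "HC1"` matches the Stata-style formula that `se_type = "stata"` uses:

```r
# robust SEs for the existing lm object, no re-estimation needed
library(sandwich) # vcovHC() builds the robust variance matrix
library(lmtest)   # coeftest() redoes inference with that matrix
coeftest(school_reg, vcov = vcovHC(school_reg, type = "HC1"))
```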
] ] ] --- # Assumption 3: No Serial Correlation .pull-left[ .smallest[ - Errors are not correlated across observations: `$$cor(u_i,u_j)=0 \quad \forall i \neq j$$` - For simple cross-sectional data, this is rarely an issue - Time-series & panel data nearly always contain .hi[serial correlation] or .hi[autocorrelation] between errors - Errors may be .hi[clustered] - **by group**: e.g. all observations from Maryland, all observations from Virginia, etc. - **by time**: GDP in 2006 around the world, GDP in 2008 around the world, etc. - We'll deal with these fixes when we talk about panel data (or time-series if necessary) ] ] .pull-right[ .center[ ![](../images/error.png) ] ] --- class: inverse, center, middle # Outliers --- # Outliers Can Bias OLS! I .pull-left[ - .hi[Outliers] can affect the slope (and intercept) of the line and add .hi[bias] - May be the result of human error (measurement, transcription, etc.) - May be meaningful and accurate - In any case, compare how including/dropping outliers affects the regression, and always discuss outliers! ] .pull-right[ <img src="2.5-slides_files/figure-html/outlier-plot-1.png" width="504" /> ] --- # Outliers Can Bias OLS! II .pull-left[ .quitesmall[ .regtable[ .code60[ ```r huxreg("No Outliers" = school_reg, "Outliers" = school_outlier_reg, coefs = c("Intercept" = "(Intercept)", "STR" = "str"), statistics = c("N" = "nobs", "R-Squared" = "r.squared", "SER" = "sigma"), number_format = 2) ```
|           | No Outliers | Outliers   |
|:----------|:------------|:-----------|
| Intercept | 698.93 ***  | 641.40 *** |
|           | (9.47)      | (11.21)    |
| STR       | -2.28 ***   | 0.71       |
|           | (0.48)      | (0.57)     |
| N         | 420         | 423        |
| R-Squared | 0.05        | 0.00       |
| SER       | 18.58       | 23.76      |

*** p < 0.001; ** p < 0.01; * p < 0.05.
] ] ] ] .pull-right[ ![](2.5-slides_files/figure-html/unnamed-chunk-48-1.png)<!-- --> ] --- # Detecting Outliers .quitesmall[ - The `car` package has an `outlierTest` command to run on the regression .code60[ ```r library("car") # Use Bonferroni test outlierTest(school_outlier_reg) # will point out which obs #s seem outliers ``` ``` ## rstudent unadjusted p-value Bonferroni p ## 422 8.822768 3.0261e-17 1.2800e-14 ## 423 7.233470 2.2493e-12 9.5147e-10 ## 421 6.232045 1.1209e-09 4.7414e-07 ``` ```r # find these observations CA.outlier %>% slice(c(422,423,421)) ```
| observat | district       | testscr | str |
|---------:|:---------------|--------:|----:|
| 422      | Crazy School 2 | 850     | 28  |
| 423      | Crazy School 3 | 820     | 29  |
| 421      | Crazy School 1 | 800     | 30  |
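Once the test flags specific rows, the natural follow-up (a sketch, assuming the `CA.outlier` data and `observat` column shown above) is to drop them and re-estimate, so you can make exactly the comparison recommended earlier:

```r
# drop the flagged observations and re-run OLS for comparison
CA_no_outliers <- CA.outlier %>%
  filter(!observat %in% c(421, 422, 423))
lm(testscr ~ str, data = CA_no_outliers)
```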
] ]