Overview
Today we start looking at associations between variables, which we will first attempt to quantify with measures like covariance and correlation. Then we turn to fitting a line to data via linear regression. We overview the basic regression model, the parameters and how they are derived, and see how to work with regressions in R with `lm()` and the tidyverse package `broom`.
We consider an extended example about class sizes and test scores, which comes from a (Stata) dataset from an old textbook that I used to use, Stock and Watson (2007). Download and follow along with the data from today's example.¹
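As a quick preview of the workflow we will build toward (a minimal sketch: the file name `caschool.dta` and the variable names `testscr` and `str` are placeholders for whatever the downloaded file actually contains), loading the data and running a first regression looks like this:

```r
library(haven) # for read_dta()
library(broom) # for tidy()

# read the Stata .dta file into a dataframe (placeholder file name)
school <- read_dta("caschool.dta")

# regress test scores on student-teacher ratio (placeholder variable names)
school_reg <- lm(testscr ~ str, data = school)

# view the regression output as a tidy dataframe
tidy(school_reg)
```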
Slides
Below, you can find the slides in two formats. Clicking the image will bring you to the html version of the slides in a new tab. Note that while going through the slides, you can type h to see a special list of viewing options, and type o for an outline view of all the slides.
The lower button will allow you to download a PDF version of the slides. I suggest printing the slides beforehand and using them to take additional notes in class (not everything is in the slides)!
Assignments
Problem Set 2
Problem Set 2 is due by Tuesday, September 21. Please see the instructions for more information on how to submit your assignment (there are multiple ways!).
Math Appendix
Variance
Recall the variance of a discrete random variable $X$, denoted $var(X)$ or $\sigma^2_X$, is the expected value (probability-weighted average) of the squared deviations of $X$ from its mean (or expected value), $\mu_X$ or $E[X]$.²

$$var(X) = E\big[(X - \mu_X)^2\big]$$
For observed data (where all possible values of $X$ are equally likely, or we don't know the probabilities), we can write variance as a simple average of squared deviations from the mean:

$$var(X) = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n}$$
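To see this formula in action (a minimal sketch with made-up numbers), we can compute the simple average of squared deviations by hand and compare it to R's built-in `var()`, which divides by $n-1$ rather than $n$ (see footnote 3):

```r
x <- c(1, 2, 4, 9) # made-up data

# simple average of squared deviations (divides by n)
sum((x - mean(x))^2) / length(x) # 9.5

# R's var() computes the *sample* variance (divides by n - 1)
var(x) # 12.67
```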
Variance has some useful properties:
Property 1: The variance of a constant is 0

$$var(c) = 0$$

If a random variable $X$ takes the same value (e.g. 2) with probability 1.00, then $\mu_X = 2$, so the average squared deviation from the mean is 0, because there are never any values other than 2.
Property 2: The variance is unchanged for a random variable plus/minus a constant

$$var(X \pm c) = var(X)$$

since the variance of a constant is 0.
Property 3: The variance of a scaled random variable is scaled by the square of the coefficient

$$var(cX) = c^2 \, var(X)$$
Property 4: The variance of a linear transformation of a random variable is scaled by the square of the coefficient on the variable

$$var(aX + b) = a^2 \, var(X)$$
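All four properties are easy to verify numerically (a quick sketch using arbitrary made-up values):

```r
x <- c(1, 2, 4, 9) # made-up data

var(rep(2, 4))  # Property 1: a constant has variance 0
var(x + 2)      # Property 2: same as var(x)
var(3 * x)      # Property 3: 3^2 = 9 times var(x)
var(3 * x + 2)  # Property 4: also 9 times var(x)
var(x)          # for comparison: 12.67
```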
Covariance
For two random variables, $X$ and $Y$, we can measure their covariance (denoted $cov(X,Y)$ or $\sigma_{X,Y}$)³ to quantify how they vary together. A good way to think about this is: when $X$ is above its mean, would we expect $Y$ to also be above its mean (and covary positively), or below its mean (and covary negatively)? Remember, this is describing the joint probability distribution for two random variables:

$$cov(X, Y) = E\big[(X - \mu_X)(Y - \mu_Y)\big]$$
Again, in the case of equally probable values for both $X$ and $Y$, covariance is sometimes written:

$$cov(X, Y) = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n}$$
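As with variance, we can compute this by hand for a pair of made-up variables and compare it to R's built-in `cov()`, which (like `var()`) divides by $n-1$ rather than $n$:

```r
x <- c(1, 2, 4, 9) # made-up data
y <- c(2, 1, 5, 8)

# simple average of the products of deviations (divides by n)
sum((x - mean(x)) * (y - mean(y))) / length(x) # 8

# R's cov() computes the *sample* covariance (divides by n - 1)
cov(x, y) # 10.67
```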
Covariance also has a number of useful properties:
Property 1: The covariance of a random variable and a constant is 0

$$cov(X, c) = 0$$
Property 2: The covariance of a random variable and itself is the variable's variance

$$cov(X, X) = var(X)$$
Property 3: The covariance of two random variables $X$ and $Y$, each scaled by constants $a$ and $b$, is the product of the constants and the covariance

$$cov(aX, bY) = ab \, cov(X, Y)$$
Property 4: If two random variables are independent, their covariance is 0

$$cov(X, Y) = 0 \quad \text{if } X \text{ and } Y \text{ are independent}$$
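Again, these properties can be checked numerically (a quick sketch, with $a = 2$ and $b = 3$ as the scaling constants):

```r
x <- c(1, 2, 4, 9) # made-up data
y <- c(2, 1, 5, 8)

cov(x, rep(5, 4)) # Property 1: covariance with a constant is 0
cov(x, x)         # Property 2: same as var(x), 12.67
cov(2 * x, 3 * y) # Property 3: same as 2 * 3 * cov(x, y)...
6 * cov(x, y)     # ...which is 64
```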
Correlation
Covariance, like variance, is often cumbersome, and the numerical value of the covariance of two random variables does not really mean much. It is often convenient to normalize the covariance to a decimal between $-1$ and $1$. We do this by dividing by the product of the standard deviations of $X$ and $Y$. This is known as the correlation coefficient between $X$ and $Y$, denoted $corr(X,Y)$, or $\rho_{X,Y}$ (for populations) or $r_{X,Y}$ (for samples):

$$r_{X,Y} = \frac{cov(X,Y)}{s_X s_Y}$$
Note this also means that covariance is the product of the standard deviations of $X$ and $Y$ and their correlation coefficient:

$$cov(X, Y) = r_{X,Y} \, s_X s_Y$$
Another way to reach the (sample) correlation coefficient is by finding the average joint $z$-score of each pair of $(x_i, y_i)$:

$$r_{X,Y} = \frac{1}{n-1} \sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s_X}\right) \left(\frac{y_i - \bar{y}}{s_Y}\right)$$
Correlation has some useful properties that should be familiar to you:
- Correlation is always between $-1$ and $1$
- A correlation of $-1$ is a downward-sloping straight line
- A correlation of $1$ is an upward-sloping straight line
- A correlation of $0$ implies no *linear* relationship (see the sketch below)
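A quick illustration of these properties (a sketch with made-up data); note especially the last case, where $y$ is perfectly determined by $x$ but the relationship is not linear, so the correlation is 0:

```r
x <- c(-2, -1, 0, 1, 2) # made-up data

cor(x, 3 * x + 1)  #  1: an upward-sloping straight line
cor(x, -2 * x + 5) # -1: a downward-sloping straight line
cor(x, x^2)        #  0: a strong relationship, but not a linear one
```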
Calculating Correlation Example
To show how correlation is calculated, we can compute the correlation of a simple data set (of 4 observations) in R. We will use the $z$-score method. Begin with a simple set of data in $(x, y)$ points:
```r
library(tidyverse)

corr_example <- tibble(x = c(1, 2, 3, 4),
                       y = c(1, 2, 4, 9))

ggplot(corr_example, aes(x = x, y = y)) +
  geom_point()
```
```r
corr_example %>%
  summarize(mean_x = mean(x), # find mean of x: 2.5
            sd_x = sd(x),     # find sd of x: 1.291
            mean_y = mean(y), # find mean of y: 4
            sd_y = sd(y))     # find sd of y: 3.559
```
```
## # A tibble: 1 × 4
##   mean_x  sd_x mean_y  sd_y
##    <dbl> <dbl>  <dbl> <dbl>
## 1    2.5  1.29      4  3.56
```
```r
# take the z-score of x and y for each pair, and multiply them together
corr_example <- corr_example %>%
  mutate(z_product = ((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y)))
```
```r
corr_example %>%
  summarize(avg_z_product = sum(z_product) / (n() - 1), # average the z products over n-1
            actual_corr = cor(x, y), # compare our answer to the actual cor() command!
            covariance = cov(x, y))  # just for kicks, what's the covariance?
```
```
## # A tibble: 1 × 3
##   avg_z_product actual_corr covariance
##           <dbl>       <dbl>      <dbl>
## 1         0.943       0.943       4.33
```
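Finally, as a check on the relationship between covariance and correlation noted above, multiplying the correlation coefficient by both standard deviations recovers the covariance:

```r
corr_example %>%
  summarize(cov_from_corr = cor(x, y) * sd(x) * sd(y), # 0.943 * 1.291 * 3.559
            covariance = cov(x, y))                    # both are 4.33
```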
1. Note this is a `.dta` Stata file. You will need to (install and) load the package `haven` to `read_dta()` Stata files into a dataframe.↩︎
2. Note there will be a difference in notation depending on whether we refer to a population (e.g. $\mu_X$) or to a sample (e.g. $\bar{x}$). As the overwhelming majority of cases we will deal with are samples, I will use sample notation for means.↩︎
3. Again, to be technically correct, $\rho$ refers to populations and $r$ refers to samples, in line with population vs. sample variance and standard deviation. Recall also that sample estimates of variance and standard deviation divide by $n-1$, rather than $n$. In large sample sizes, this difference is negligible.↩︎