class: center, middle, inverse, title-slide # 2.3 — OLS Linear Regression ## ECON 480 • Econometrics • Fall 2021 ### Ryan Safner
Assistant Professor of Economics
safner@hood.edu
ryansafner/metricsF21
metricsF21.classes.ryansafner.com
--- class: inverse, center, middle # Exploring Relationships --- # Bivariate Data and Relationships .pull-left[ .smaller[ - We looked at single variables for descriptive statistics - Most uses of statistics in economics and business investigate relationships *between* variables ] .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .smallest[ .green[**Examples**] - \# of police & crime rates - healthcare spending & life expectancy - government spending & GDP growth - carbon dioxide emissions & temperatures ] ] ] .pull-right[ .center[ ![](../images/statsgraphs.jpg) ] ] --- # Bivariate Data and Relationships .pull-left[ - We will begin with .hi-purple[bivariate] data for relationships between `\(X\)` and `\(Y\)` - Immediate aim is to explore .hi[associations] between variables, quantified with .hi[correlation] and .hi[linear regression] - Later we want to develop more sophisticated tools to argue for .hi[causation] ] .pull-right[ .center[ ![](../images/statsgraphs.jpg) ] ] --- # Bivariate Data: Spreadsheets I .pull-left[ ```r econfreedom <- read_csv("econfreedom.csv") head(econfreedom) ``` ] .pull-right[ ``` ## # A tibble: 6 × 6 ## ...1 Country ISO ef gdp continent ## <dbl> <chr> <chr> <dbl> <dbl> <chr> ## 1 1 Albania ALB 7.4 4543. Europe ## 2 2 Algeria DZA 5.15 4784. Africa ## 3 3 Angola AGO 5.08 4153. Africa ## 4 4 Argentina ARG 4.81 10502. Americas ## 5 5 Australia AUS 7.93 54688. Oceania ## 6 6 Austria AUT 7.56 47604. Europe ``` ] - **Rows** are individual observations (countries) - **Columns** are variables on all individuals --- # Bivariate Data: Spreadsheets II ```r econfreedom %>% glimpse() ``` ``` ## Rows: 112 ## Columns: 6 ## $ ...1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1… ## $ Country <chr> "Albania", "Algeria", "Angola", "Argentina", "Australia", "A… ## $ ISO <chr> "ALB", "DZA", "AGO", "ARG", "AUS", "AUT", "BHR", "BGD", "BEL… ## $ ef <dbl> 7.40, 5.15, 5.08, 4.81, 7.93, 7.56, 7.60, 6.35, 7.51, 6.22, … ## $ gdp <dbl> 4543.0880, 4784.1943, 4153.1463, 10501.6603, 54688.4459, 476… ## $ continent <chr> "Europe", "Africa", "Africa", "Americas", "Oceania", "Europe… ``` --- # Bivariate Data: Spreadsheets III ```r source("summaries.R") # use my summary_table function econfreedom %>% summary_table(ef, gdp) ``` <table> <thead> <tr> <th style="text-align:left;"> Variable </th> <th style="text-align:right;"> Obs </th> <th style="text-align:right;"> Min </th> <th style="text-align:right;"> Q1 </th> <th style="text-align:right;"> Median </th> <th style="text-align:right;"> Q3 </th> <th style="text-align:right;"> Max </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> Std. Dev. </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> ef </td> <td style="text-align:right;"> 112 </td> <td style="text-align:right;"> 4.81 </td> <td style="text-align:right;"> 6.42 </td> <td style="text-align:right;"> 7.0 </td> <td style="text-align:right;"> 7.40 </td> <td style="text-align:right;"> 8.71 </td> <td style="text-align:right;"> 6.86 </td> <td style="text-align:right;"> 0.78 </td> </tr> <tr> <td style="text-align:left;"> gdp </td> <td style="text-align:right;"> 112 </td> <td style="text-align:right;"> 206.71 </td> <td style="text-align:right;"> 1307.46 </td> <td style="text-align:right;"> 5123.3 </td> <td style="text-align:right;"> 17302.66 </td> <td style="text-align:right;"> 89590.81 </td> <td style="text-align:right;"> 14488.49 </td> <td style="text-align:right;"> 19523.54 </td> </tr> </tbody> </table> --- # Bivariate Data: Scatterplots .pull-left[ - The best way to visualize an association between two variables is with a .hi[scatterplot] - Each point: pair of variable values `\((x_i,y_i) \in X, Y\)` for observation `\(i\)` .code50[ ```r ggplot(data = econfreedom)+ aes(x = ef, y = gdp)+ geom_point(aes(color = continent), size = 2)+ labs(x = "Economic Freedom Index (2014)", y = "GDP per Capita (2014 USD)", color = "")+ scale_y_continuous(labels = scales::dollar)+ theme_pander(base_family = "Fira Sans Condensed", base_size=20)+ theme(legend.position = "bottom") ``` ] ] .pull-right[ <img src="2.3-slides_files/figure-html/econfreedom-scatter-out-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Associations .pull-left[ - Look for .hi-purple[association] between independent and dependent variables 1. .hi-turquoise[Direction]: is the trend positive or negative? 2. .hi-turquoise[Form]: is the trend linear, quadratic, something else, or no pattern? 3. .hi-turquoise[Strength]: is the association strong or weak? 4. .hi-turquoise[Outliers]: do any observations break the trends above? ] .pull-right[ <img src="2.3-slides_files/figure-html/unnamed-chunk-5-1.png" width="504" style="display: block; margin: auto;" /> ] --- class: inverse, center, middle # Quantifying Relationships --- # Covariance - For any two variables, we can measure their .hi[sample covariance, `\\(cov(X,Y)\\)`] or .hi[`\\(s_{X,Y}\\)`] to quantify how they vary *together*<sup>.magenta[†]</sup> -- `$$s_{X,Y}=E\big[(X-\bar{X})(Y-\bar{Y}) \big]$$` -- - Intuition: if `\(x_i\)` is above the mean of `\(X\)`, would we expect the associated `\(y_i\)`: - to be **above** the mean of `\(Y\)` also `\((X\)` and `\(Y\)` covary **positively**) - to be **below** the mean of `\(Y\)` `\((X\)` and `\(Y\)` covary **negatively**) -- - Covariance is a common measure, but the units are meaningless, thus we rarely need to use it so **don't worry about learning the formula** .footnote[<sup>.magenta[†]</sup> Henceforth we limit all measures to *samples*, for convenience. Population covariance is denoted `\\(\sigma_{X,Y}\\)`] --- # Covariance, in R ```r # base R cov(econfreedom$ef,econfreedom$gdp) ``` ``` ## [1] 8922.933 ``` ```r # tidyverse econfreedom %>% summarize(cov = cov(ef,gdp)) ``` ``` ## # A tibble: 1 × 1 ## cov ## <dbl> ## 1 8923. ``` -- 8923 what, exactly? --- # Correlation .smallest[ - More convenient to *standardize* covariance into a more intuitive concept: .hi[correlation, `\\(\rho\\)`] or .hi[`\\(r\\)`] `\(\in [-1, 1]\)` ] -- .smallest[ `$$r_{X,Y}=\frac{s_{X,Y}}{s_X s_Y}=\frac{cov(X,Y)}{sd(X)sd(Y)}$$` ] -- .smallest[ - Simply weight covariance by the product of the standard deviations of `\(X\)` and `\(Y\)` ] -- .smallest[ - Alternatively, take the average<sup>.magenta[†]</sup> of the product of standardized `\((Z\)`-scores for) each `\((x_i,y_i)\)` pair:<sup>.magenta[‡]</sup> ] -- .smallest[ `$$\begin{align*} r&=\frac{1}{n-1}\sum^n_{i=1}\bigg(\frac{x_i-\bar{X}}{s_X}\bigg)\bigg(\frac{y_i-\bar{Y}}{s_Y}\bigg)\\ r&=\frac{1}{n-1}\sum^n_{i=1}Z_XZ_Y\\ \end{align*}$$` ] .footnote[<sup>.magenta[†]</sup> Over n-1, a *sample* statistic! <sup>.magenta[‡]</sup> See today's [class notes page](/content/2.3-content#math-appendix) for example code to calculate correlation "by hand" in R using the second method.] --- # Correlation: Interpretation .pull-left[ - Correlation is standardized to `$$-1 \leq r \leq 1$$` .smallest[ - Negative values `\(\implies\)` negative association - Positive values `\(\implies\)` positive association - Correlation of 0 `\(\implies\)` no association - As `\(|r| \rightarrow 1 \implies\)` the stronger the association - Correlation of `\(|r|=1 \implies\)` perfectly linear ] ] .pull-right[ <img src="2.3-slides_files/figure-html/unnamed-chunk-8-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Guess the Correlation! .center[ ![:scale 25%](https://www.dropbox.com/s/70opitsw7jjpfko/guessthecorrelation.png?raw=1) [Guess the Correlation Game](http://guessthecorrelation.com/index.html) ] --- # Correlation and Covariance in R .pull-left[ ```r # Base r: cov or cor(df$x, df$y) cov(econfreedom$ef, econfreedom$gdp) ``` ``` ## [1] 8922.933 ``` ```r cor(econfreedom$ef, econfreedom$gdp) ``` ``` ## [1] 0.5867018 ``` ] -- .pull-right[ ```r # tidyverse method econfreedom %>% summarize(covariance = cov(ef, gdp), correlation = cor(ef, gdp)) ``` ``` ## # A tibble: 1 × 2 ## covariance correlation ## <dbl> <dbl> ## 1 8923. 0.587 ``` ] --- # Correlation and Covariance in R I - `corrplot` is a great package (install and then load) to **visualize** correlations in data ```r library(corrplot) # see more at https://github.com/taiyun/corrplot library(RColorBrewer) # for color scheme used here library(gapminder) # for gapminder data # need to make a corelation matrix with cor(); can only include numeric variables gapminder_cor<- gapminder %>% dplyr::select(gdpPercap, pop, lifeExp) # make a correlation table with cor (base R) gapminder_cor_table<-cor(gapminder_cor) # view it gapminder_cor_table ``` ``` ## gdpPercap pop lifeExp ## gdpPercap 1.00000000 -0.02559958 0.58370622 ## pop -0.02559958 1.00000000 0.06495537 ## lifeExp 0.58370622 0.06495537 1.00000000 ``` --- # Correlation and Covariance in R II .pull-left[ ```r corrplot(gapminder_cor_table, type="upper", method = "circle", order = "alphabet", col = viridis::viridis(100)) # custom color ``` ] -- .pull-right[ <img src="2.3-slides_files/figure-html/unnamed-chunk-14-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Correlation and Endogeneity .pull-left[ - Your Occasional Reminder: .hi[Correlation does not imply causation!] - I'll show you the difference in a few weeks (when we can actually talk about causation) - If `\(X\)` and `\(Y\)` are strongly correlated, `\(X\)` can still be .hi-purple[endogenous]! - See [today's class notes page](/content/2.3-content#math-appendix) for more on Covariance and Correlation ] .pull-right[ .center[ ![](../images/causality.jpg) ] ] --- # Always Plot Your Data! .center[ ![](https://d2f99xq7vri1nk.cloudfront.net/DinoSequentialSmaller.gif) ] --- class: inverse, center, middle # Linear Regression --- # Fitting a Line to Data .pull-left[ .smallest[ - If an association appears linear, we can estimate the equation of a line that would “fit” the data `$$Y = a + bX$$` - Recall a linear equation describing a line contains: - `\(a\)`: vertical intercept - `\(b\)`: slope ] ] .pull-right[ <img src="2.3-slides_files/figure-html/unnamed-chunk-15-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Fitting a Line to Data .pull-left[ .smallest[ - If an association appears linear, we can estimate the equation of a line that would “fit” the data `$$Y = a + bX$$` - Recall a linear equation describing a line contains: - `\(a\)`: vertical intercept - `\(b\)`: slope - How do we choose the equation that **best** fits the data? ] ] .pull-right[ <img src="2.3-slides_files/figure-html/unnamed-chunk-16-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Fitting a Line to Data .pull-left[ .smallest[ - If an association appears linear, we can estimate the equation of a line that would “fit” the data `$$Y = a + bX$$` - Recall a linear equation describing a line contains: - `\(a\)`: vertical intercept - `\(b\)`: slope - How do we choose the equation that **best** fits the data? - This process is called .hi[linear regression] ] ] .pull-right[ <img src="2.3-slides_files/figure-html/unnamed-chunk-17-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Population Linear Regression Model - Linear regression lets us estimate the slope of the .hi-purple[population] regression line between `\(X\)` and `\(Y\)` using .hi-purple[sample] data - We can make .hi-purple[statistical inferences] about the population slope coefficient - eventually & hopefully: a .hi-purple[*causal* inference] - `\(\text{slope}=\frac{\Delta Y}{\Delta X}\)`: for a 1-unit change in `\(X\)`, how many units will this *cause* `\(Y\)` to change? --- # Class Size Example .pull-left[ .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: What is the relationship between class size and educational performance? ] ] .pull-right[ .center[ ![](https://www.dropbox.com/s/80sb1duv6e2tazd/smallclass.jpg?raw=1) ] ] --- # Class Size Example: Load the Data ```r # install.packages("haven") # install for first use library("haven") # load for importing .dta files CASchool<-read_dta("../data/caschool.dta") ``` --- # Class Size Example: Look at the Data I ```r glimpse(CASchool) ``` ``` ## Rows: 420 ## Columns: 21 ## $ observat <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18… ## $ dist_cod <dbl> 75119, 61499, 61549, 61457, 61523, 62042, 68536, 63834, 62331… ## $ county <chr> "Alameda", "Butte", "Butte", "Butte", "Butte", "Fresno", "San… ## $ district <chr> "Sunol Glen Unified", "Manzanita Elementary", "Thermalito Uni… ## $ gr_span <chr> "KK-08", "KK-08", "KK-08", "KK-08", "KK-08", "KK-08", "KK-08"… ## $ enrl_tot <dbl> 195, 240, 1550, 243, 1335, 137, 195, 888, 379, 2247, 446, 987… ## $ teachers <dbl> 10.90, 11.15, 82.90, 14.00, 71.50, 6.40, 10.00, 42.50, 19.00,… ## $ calw_pct <dbl> 0.5102, 15.4167, 55.0323, 36.4754, 33.1086, 12.3188, 12.9032,… ## $ meal_pct <dbl> 2.0408, 47.9167, 76.3226, 77.0492, 78.4270, 86.9565, 94.6237,… ## $ computer <dbl> 67, 101, 169, 85, 171, 25, 28, 66, 35, 0, 86, 56, 25, 0, 31, … ## $ testscr <dbl> 690.80, 661.20, 643.60, 647.70, 640.85, 605.55, 606.75, 609.0… ## $ comp_stu <dbl> 0.34358975, 0.42083332, 0.10903226, 0.34979424, 0.12808989, 0… ## $ expn_stu <dbl> 6384.911, 5099.381, 5501.955, 7101.831, 5235.988, 5580.147, 5… ## $ str <dbl> 17.88991, 21.52466, 18.69723, 17.35714, 18.67133, 21.40625, 1… ## $ avginc <dbl> 22.690001, 9.824000, 8.978000, 8.978000, 9.080333, 10.415000,… ## $ el_pct <dbl> 0.000000, 4.583333, 30.000002, 0.000000, 13.857677, 12.408759… ## $ read_scr <dbl> 691.6, 660.5, 636.3, 651.9, 641.8, 605.7, 604.5, 605.5, 608.9… ## $ math_scr <dbl> 690.0, 661.9, 650.9, 643.5, 639.9, 605.4, 609.0, 612.5, 616.1… ## $ aowijef <dbl> 35.77982, 43.04933, 37.39445, 34.71429, 37.34266, 42.81250, 3… ## $ es_pct <dbl> 1.000000, 3.583333, 29.000002, 1.000000, 12.857677, 11.408759… ## $ es_frac <dbl> 0.01000000, 0.03583334, 0.29000002, 0.01000000, 0.12857677, 0… ``` --- # Class Size Example: Look at the Data II <table class="table" style="font-size: 8px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> observat </th> <th style="text-align:right;"> dist_cod </th> <th style="text-align:left;"> county </th> <th style="text-align:left;"> district </th> <th style="text-align:left;"> gr_span </th> <th style="text-align:right;"> enrl_tot </th> <th style="text-align:right;"> teachers </th> <th style="text-align:right;"> calw_pct </th> <th style="text-align:right;"> meal_pct </th> <th style="text-align:right;"> computer </th> <th style="text-align:right;"> testscr </th> <th style="text-align:right;"> comp_stu </th> <th style="text-align:right;"> expn_stu </th> <th style="text-align:right;"> str </th> <th style="text-align:right;"> avginc </th> <th style="text-align:right;"> el_pct </th> <th style="text-align:right;"> read_scr </th> <th style="text-align:right;"> math_scr </th> <th style="text-align:right;"> aowijef </th> <th style="text-align:right;"> es_pct </th> <th style="text-align:right;"> es_frac </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 75119 </td> <td style="text-align:left;"> Alameda </td> <td style="text-align:left;"> Sunol Glen Unified </td> <td style="text-align:left;"> KK-08 </td> <td style="text-align:right;"> 195 </td> <td style="text-align:right;"> 10.90 </td> <td style="text-align:right;"> 0.5102 </td> <td style="text-align:right;"> 2.0408 </td> <td style="text-align:right;"> 67 </td> <td style="text-align:right;"> 690.80 </td> <td style="text-align:right;"> 0.3435898 </td> <td style="text-align:right;"> 6384.911 </td> <td style="text-align:right;"> 17.88991 </td> <td style="text-align:right;"> 22.690001 </td> <td style="text-align:right;"> 0.000000 </td> <td style="text-align:right;"> 691.6 </td> <td style="text-align:right;"> 690.0 </td> <td style="text-align:right;"> 35.77982 </td> <td style="text-align:right;"> 1.000000 </td> <td style="text-align:right;"> 0.0100000 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 61499 </td> <td style="text-align:left;"> Butte </td> <td style="text-align:left;"> Manzanita Elementary </td> <td style="text-align:left;"> KK-08 </td> <td style="text-align:right;"> 240 </td> <td style="text-align:right;"> 11.15 </td> <td style="text-align:right;"> 15.4167 </td> <td style="text-align:right;"> 47.9167 </td> <td style="text-align:right;"> 101 </td> <td style="text-align:right;"> 661.20 </td> <td style="text-align:right;"> 0.4208333 </td> <td style="text-align:right;"> 5099.381 </td> <td style="text-align:right;"> 21.52466 </td> <td style="text-align:right;"> 9.824000 </td> <td style="text-align:right;"> 4.583334 </td> <td style="text-align:right;"> 660.5 </td> <td style="text-align:right;"> 661.9 </td> <td style="text-align:right;"> 43.04933 </td> <td style="text-align:right;"> 3.583334 </td> <td style="text-align:right;"> 0.0358333 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 61549 </td> <td style="text-align:left;"> Butte </td> <td style="text-align:left;"> Thermalito Union Elementary </td> <td style="text-align:left;"> KK-08 </td> <td style="text-align:right;"> 1550 </td> <td style="text-align:right;"> 82.90 </td> <td style="text-align:right;"> 55.0323 </td> <td style="text-align:right;"> 76.3226 </td> <td style="text-align:right;"> 169 </td> <td style="text-align:right;"> 643.60 </td> <td style="text-align:right;"> 0.1090323 </td> <td style="text-align:right;"> 5501.955 </td> <td style="text-align:right;"> 18.69723 </td> <td style="text-align:right;"> 8.978000 </td> <td style="text-align:right;"> 30.000002 </td> <td style="text-align:right;"> 636.3 </td> <td style="text-align:right;"> 650.9 </td> <td style="text-align:right;"> 37.39445 </td> <td style="text-align:right;"> 29.000002 </td> <td style="text-align:right;"> 0.2900000 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 61457 </td> <td style="text-align:left;"> Butte </td> <td style="text-align:left;"> Golden Feather Union Elementary </td> <td style="text-align:left;"> KK-08 </td> <td style="text-align:right;"> 243 </td> <td style="text-align:right;"> 14.00 </td> <td style="text-align:right;"> 36.4754 </td> <td style="text-align:right;"> 77.0492 </td> <td style="text-align:right;"> 85 </td> <td style="text-align:right;"> 647.70 </td> <td style="text-align:right;"> 0.3497942 </td> <td style="text-align:right;"> 7101.831 </td> <td style="text-align:right;"> 17.35714 </td> <td style="text-align:right;"> 8.978000 </td> <td style="text-align:right;"> 0.000000 </td> <td style="text-align:right;"> 651.9 </td> <td style="text-align:right;"> 643.5 </td> <td style="text-align:right;"> 34.71429 </td> <td style="text-align:right;"> 1.000000 </td> <td style="text-align:right;"> 0.0100000 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 61523 </td> <td style="text-align:left;"> Butte </td> <td style="text-align:left;"> Palermo Union Elementary </td> <td style="text-align:left;"> KK-08 </td> <td style="text-align:right;"> 1335 </td> <td style="text-align:right;"> 71.50 </td> <td style="text-align:right;"> 33.1086 </td> <td style="text-align:right;"> 78.4270 </td> <td style="text-align:right;"> 171 </td> <td style="text-align:right;"> 640.85 </td> <td style="text-align:right;"> 0.1280899 </td> <td style="text-align:right;"> 5235.988 </td> <td style="text-align:right;"> 18.67133 </td> <td style="text-align:right;"> 9.080333 </td> <td style="text-align:right;"> 13.857677 </td> <td style="text-align:right;"> 641.8 </td> <td style="text-align:right;"> 639.9 </td> <td style="text-align:right;"> 37.34266 </td> <td style="text-align:right;"> 12.857677 </td> <td style="text-align:right;"> 0.1285768 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 62042 </td> <td style="text-align:left;"> Fresno </td> <td style="text-align:left;"> Burrel Union Elementary </td> <td style="text-align:left;"> KK-08 </td> <td style="text-align:right;"> 137 </td> <td style="text-align:right;"> 6.40 </td> <td style="text-align:right;"> 12.3188 </td> <td style="text-align:right;"> 86.9565 </td> <td style="text-align:right;"> 25 </td> <td style="text-align:right;"> 605.55 </td> <td style="text-align:right;"> 0.1824818 </td> <td style="text-align:right;"> 5580.147 </td> <td style="text-align:right;"> 21.40625 </td> <td style="text-align:right;"> 10.415000 </td> <td style="text-align:right;"> 12.408759 </td> <td style="text-align:right;"> 605.7 </td> <td style="text-align:right;"> 605.4 </td> <td style="text-align:right;"> 42.81250 </td> <td style="text-align:right;"> 11.408759 </td> <td style="text-align:right;"> 0.1140876 </td> </tr> </tbody> </table> --- # Class Size Example: Scatterplot .pull-left[ .code60[ ```r scatter <- ggplot(data = CASchool)+ aes(x = str, y = testscr)+ geom_point(color = "blue")+ labs(x = "Student to Teacher Ratio", y = "Test Score")+ theme_pander(base_family = "Fira Sans Condensed", base_size = 20) scatter ``` ] ] .pull-right[ <img src="2.3-slides_files/figure-html/school-scatter-out-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Class Size Example: Slope I .pull-left[ .smaller[ - If we *change* `\((\Delta)\)` the class size by an amount, what would we expect the *change* in test scores to be? `$$\beta = \frac{\text{change in test score}}{\text{change in class size}} = \frac{\Delta \text{test score}}{\Delta \text{class size}}$$` - If we knew `\(\beta\)`, we could say that changing class size by 1 student will change test scores by `\(\beta\)` ] ] .pull-right[ .center[ ![](https://www.dropbox.com/s/80sb1duv6e2tazd/smallclass.jpg?raw=1) ] ] --- # Class Size Example: Slope II .pull-left[ .smaller[ - Rearranging: `$$\Delta \text{test score} = \beta \times \Delta \text{class size}$$` ] ] .pull-right[ .center[ ![](https://www.dropbox.com/s/80sb1duv6e2tazd/smallclass.jpg?raw=1) ] ] --- # Class Size Example: Slope II .pull-left[ .smalelr[ - Rearranging: `$$\Delta \text{test score} = \beta \times \Delta \text{class size}$$` - Suppose `\(\beta=-0.6\)`. If we shrank class size by 2 students, our model predicts: `$$\begin{align*} \Delta \text{test score} &= -2 \times \beta\\ \Delta \text{test score} &= -2 \times -0.6\\ \Delta \text{test score}&= 1.2 \\ \end{align*}$$` ] ] .pull-right[ .center[ ![](https://www.dropbox.com/s/80sb1duv6e2tazd/smallclass.jpg?raw=1) ] ] --- # Class Size Example: Slope and Average Effect .pull-left[ .smaller[ `$$\text{test score} = \beta_0 + \beta_{1} \times \text{class size}$$` - The line relating class size and test scores has the above equation - `\(\beta_0\)` is the .hi[vertical-intercept], test score where class size is 0 - `\(\beta_{1}\)` is the .hi[slope] of the regression line - This relationship only holds .hi-purple[on average] for all districts in the population, *individual* districts are also affected by other factors ] ] .pull-right[ .center[ ![](https://www.dropbox.com/s/80sb1duv6e2tazd/smallclass.jpg?raw=1) ] ] --- # Class Size Example: Marginal Effects .pull-left[ .smaller[ - To get an equation that holds for *each* district, we need to include other factors `$$\text{test score} = \beta_0 + \beta_1 \text{class size}+\text{other factors}$$` - For now, we will ignore these until Unit III - Thus, `\(\beta_0 + \beta_1 \text{class size}\)` gives the .hi-purple[average effect] of class sizes on scores - Later, we will want to estimate the .hi[marginal effect] (.hi[causal effect]) of each factor on an individual district's test score, holding all other factors constant ] ] .pull-right[ .center[ ![](https://www.dropbox.com/s/80sb1duv6e2tazd/smallclass.jpg?raw=1) ] ] --- # Econometric Models Overview `$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + u$$` -- .smallest[ - `\(Y\)` is the .hi[dependent variable] of interest - AKA “response variable,” “regressand,” “Left-hand side (LHS) variable” ] -- .smallest[ - `\(X_1\)` and `\(X_2\)` are .hi[independent variables] - AKA “explanatory variables”, “regressors,” “Right-hand side (RHS) variables”, “covariates” ] -- .smallest[ - Our data consists of a spreadsheet of observed values of `\((X_{1i}, X_{2i}, Y_i)\)` ] -- .smallest[ - To model, we .hi-turquoise[“regress Y on `\\(X_1\\)` and `\\(X_2\\)`”] ] -- .smallest[ - `\(\beta_0\)` and `\(\beta_1\)` are .hi-purple[parameters] that describe the population relationships between the variables - unknown! to be estimated ] -- .smallest[ - `\(u\)` is a random .hi[error term] - **‘U’nobservable**, we can't measure it, and must model with assumptions about it ] --- # The Population Regression Model .pull-left[ - How do we draw a line through the scatterplot? We do not know the **“true”** `\(\beta_0\)` or `\(\beta_1\)` - We do have data from a .hi-turquoise[sample] of class sizes and test scores<sup>.magenta[†]</sup> - So the real question is, .hi-purple[how can we estimate `\\(\beta_0\\)` and `\\(\beta_1\\)`?] .quitesmall[ <sup>.magenta[†]</sup> Data are student-teacher-ratio and average test scores on Stanford 9 Achievement Test for 5th grade students for 420 K-6 and K-8 school districts in California in 1999, (Stock and Watson, 2015: p. 141)] ] .pull-right[ <img src="2.3-slides_files/figure-html/unnamed-chunk-21-1.png" width="504" style="display: block; margin: auto;" /> ] --- class: inverse, center, middle # Deriving OLS --- # Deriving OLS .pull-left[ .smallest[ - Suppose we have some data points ] ] .pull-right[ <img src="2.3-slides_files/figure-html/unnamed-chunk-22-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Deriving OLS .pull-left[ .smallest[ - Suppose we have some data points - We add a line ] ] .pull-right[ <img src="2.3-slides_files/figure-html/unnamed-chunk-23-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Deriving OLS .pull-left[ .smallest[ - Suppose we have some data points - We add a line - The .hi[residual], `\(\hat{u_i}\)` of each data point is the difference between the .hi-purple[actual] and the .hi-purple[predicted] value of `\(Y\)` given `\(X\)`: `$$\hat{u_i} = Y_i - \hat{Y_i}$$` ] ] .pull-right[ <img src="2.3-slides_files/figure-html/unnamed-chunk-24-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Deriving OLS .pull-left[ .smallest[ - Suppose we have some data points - We add a line - The .hi[residual], `\(\hat{u_i}\)` of each data point is the difference between the .hi-purple[actual] and the .hi-purple[predicted] value of `\(Y\)` given `\(X\)`: `$$\hat{u_i} = Y_i - \hat{Y_i}$$` - We square each residual ] ] .pull-right[ <img src="2.3-slides_files/figure-html/unnamed-chunk-25-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Deriving OLS .pull-left[ .smallest[ - Suppose we have some data points - We add a line - The .hi[residual], `\(\hat{u_i}\)` of each data point is the difference between the .hi-purple[actual] and the .hi-purple[predicted] value of `\(Y\)` given `\(X\)`: `$$\hat{u_i} = Y_i - \hat{Y_i}$$` - We square each residual - Add all of these up: .hi[Sum of Squared Errors (SSE)] `$$SSE = \sum^n_{i=1} \hat{u_i}^2$$` ] ] .pull-right[ <img src="2.3-slides_files/figure-html/unnamed-chunk-26-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Deriving OLS .pull-left[ .smallest[ - Suppose we have some data points - We add a line - The .hi[residual], `\(\hat{u_i}\)` of each data point is the difference between the .hi-purple[actual] and the .hi-purple[predicted] value of `\(Y\)` given `\(X\)`: `$$\hat{u_i} = Y_i - \hat{Y_i}$$` - We square each residual - Add all of these up: .hi[Sum of Squared Errors (SSE)] `$$SSE = \sum^n_{i=1} \hat{u_i}^2$$` - .hi-purple[The line of best fit *minimizes* SSE] ] ] .pull-right[ <img src="2.3-slides_files/figure-html/unnamed-chunk-27-1.png" width="504" style="display: block; margin: auto;" /> ] --- # *O* rdinary *L* east *S* quares Estimators - The .hi[Ordinary Least Squares (OLS) estimators] of the unknown population parameters `\(\beta_0\)` and `\(\beta_1\)`, solve the calculus problem: `$$\min_{\beta_0, \beta_1} \sum^n_{i=1}[\underbrace{Y_i-(\underbrace{\beta_0+\beta_1 X_i}_{\hat{Y_i}})}_{\hat{u_i}}]^2$$` -- - Intuitively, OLS estimators .hi-purple[minimize the average squared distance between the actual values `\\((Y_i)\\)` and the predicted values `\\((\hat{Y}_i)\\)` along the estimated regression line] --- # The OLS Regression Line - The .hi[OLS regression line] or .hi[sample regression line] is the linear function constructed using the OLS estimators: `$$\hat{Y_i}=\hat{\beta_0}+\hat{\beta_1}X_i$$` -- - `\(\hat{\beta_0}\)` and `\(\hat{\beta_1}\)` (“beta 0 hat” & “beta 1 hat”) are the .hi[OLS estimators] of population parameters `\(\beta_0\)` and `\(\beta_1\)` using sample data -- - The .hi[predicted value] of Y given X, based on the regression, is `\(E(Y_i|X_i)=\hat{Y_i}\)` -- - The .hi-purple[residual] or .hi-purple[prediction error] for the `\(i^{th}\)` observation is the difference between observed `\(Y_i\)` and its predicted value, `\(\hat{u_i}=Y_i-\hat{Y_i}\)` --- # The OLS Regression Estimators - The solution to the SSE minimization problem yields:<sup>.magenta[†]</sup> -- `$$\hat{\beta}_0=\bar{Y}-\hat{\beta_1}\bar{X}$$` -- `$$\hat{\beta}_1=\frac{\displaystyle\sum^n_{i=1}(X_i-\bar{X})(Y_i-\bar{Y})}{\displaystyle\sum^n_{i=1}(X_i-\bar{X})^2}=\frac{s_{XY}}{s^2_X}= \frac{cov(X,Y)}{var(X)}$$` .footnote[<sup>.magenta[†]</sup> See [next class’ notes page](/content/2.4-content) for proofs.] --- class: inverse, center, middle # Our Class Size Example in R --- # Class Size Scatterplot (Again) .pull-left[ ```r scatter ``` - There is some true (unknown) population relationship: `$$\text{test score}=\beta_0+\beta_1 \times str$$` - `\(\beta_1=\frac{\Delta \text{test score}}{\Delta \text{str}}= ??\)` ] .pull-right[ <img src="2.3-slides_files/figure-html/unnamed-chunk-28-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Class SIze Scatterplot with Regression Line .pull-left[ ```r scatter+ * geom_smooth(method = "lm", color = "red") ``` ] .pull-right[ <img src="2.3-slides_files/figure-html/school-scatter-line-out-1.png" width="504" style="display: block; margin: auto;" /> ] --- # OLS in R .left-code[ ```r # run regression of testscr on str school_reg <- lm(testscr ~ str, data = CASchool) ``` ] .right-plot[ .smaller[ Format for regression is `lm(y ~ x, data = df)` - `y` is dependent variable (listed first!) - `~` means “is modeled by” or “is by” - `x` is the independent variable - `df` is name of dataframe where data is stored This is `Base R` (there's no good `tidyverse` way to do this yet...ish) ] ] --- # OLS in R II .left-code[ ```r # look at reg object school_reg ``` ``` ## ## Call: ## lm(formula = testscr ~ str, data = CASchool) ## ## Coefficients: ## (Intercept) str ## 698.93 -2.28 ``` ] .right-plot[ - Stored as an `lm` object called `school_reg`, a type of `list` object ] --- # OLS in R III .pull-left[ .smaller[ - Looking at the `summary`, there's a lot of information here! - These objects are cumbersome, come from a much older, pre-`tidyverse` epoch of `base R` - Luckily, we now have `tidy` ways of working with regression *output*! ] ] .pull-right[ .code50[ ```r summary(school_reg) # get full summary ``` ``` ## ## Call: ## lm(formula = testscr ~ str, data = CASchool) ## ## Residuals: ## Min 1Q Median 3Q Max ## -47.727 -14.251 0.483 12.822 48.540 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 698.9330 9.4675 73.825 < 2e-16 *** ## str -2.2798 0.4798 -4.751 2.78e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 18.58 on 418 degrees of freedom ## Multiple R-squared: 0.05124, Adjusted R-squared: 0.04897 ## F-statistic: 22.58 on 1 and 418 DF, p-value: 2.783e-06 ``` ] ] --- # Tidy OLS in R: broom I .left-column[ .center[ ![](https://www.dropbox.com/s/qpe1604cf7s9r0g/rbroom.png?raw=1) ] .footnote[<sup>.magenta[†]</sup> See more at [broom.tidyverse.org](https://broom.tidyverse.org/).] ] .right-column[ .smaller[ - The `broom` package allows us to *tidy* up regression objects<sup>.magenta[†]</sup> - The `tidy()` function creates a *tidy* `tibble` of regression output ] .code50[ ```r # load packages library(broom) # tidy regression output tidy(school_reg) ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 699. 9.47 73.8 6.57e-242 ## 2 str -2.28 0.480 -4.75 2.78e- 6 ``` ] ] --- # Tidy OLS in R: broom II .left-column[ .center[ ![](https://www.dropbox.com/s/qpe1604cf7s9r0g/rbroom.png?raw=1) ] .footnote[<sup>.magenta[†]</sup> See more at [broom.tidyverse.org](https://broom.tidyverse.org/).] ] .right-column[ .smaller[ - The `broom` package allows us to *tidy* up regression objects<sup>.magenta[†]</sup> - The `tidy()` function creates a *tidy* `tibble` of regression output ] .code50[ ```r # load packages library(broom) # tidy regression output (with confidence intervals!) tidy(school_reg, * conf.int = TRUE) ``` ``` ## # A tibble: 2 × 7 ## term estimate std.error statistic p.value conf.low conf.high ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 699. 9.47 73.8 6.57e-242 680. 718. ## 2 str -2.28 0.480 -4.75 2.78e- 6 -3.22 -1.34 ``` ] ] --- # More broom Tools: glance - `glance()` shows us a lot of overall regression statistics and diagnostics - We'll interpret these in the next lecture and beyond ```r # look at regression statistics and diagnostics glance(school_reg) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.0512 0.0490 18.6 22.6 0.00000278 1 -1822. 3650. 3663. ## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int> ``` --- # More broom Tools: augment .pull-left[ - `augment()` creates useful new variables in the stored `lm` object - `.fitted` are fitted (predicted) values from model, i.e. `\(\hat{Y}_i\)` - `.resid` are residuals (errors) from model, i.e. `\(\hat{u}_i\)` ] .pull-right[ .code40[ ```r # add regression-based values to data augment(school_reg) ``` ``` ## # A tibble: 420 × 8 ## testscr str .fitted .resid .hat .sigma .cooksd .std.resid ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 691. 17.9 658. 32.7 0.00442 18.5 0.00689 1.76 ## 2 661. 21.5 650. 11.3 0.00475 18.6 0.000893 0.612 ## 3 644. 18.7 656. -12.7 0.00297 18.6 0.000700 -0.685 ## 4 648. 17.4 659. -11.7 0.00586 18.6 0.00117 -0.629 ## 5 641. 18.7 656. -15.5 0.00301 18.6 0.00105 -0.836 ## 6 606. 21.4 650. -44.6 0.00446 18.5 0.0130 -2.40 ## 7 607. 19.5 654. -47.7 0.00239 18.5 0.00794 -2.57 ## 8 609 20.9 651. -42.3 0.00343 18.5 0.00895 -2.28 ## 9 612. 19.9 653. -41.0 0.00244 18.5 0.00597 -2.21 ## 10 613. 20.8 652. -38.9 0.00329 18.5 0.00723 -2.09 ## # … with 410 more rows ``` ] ] --- # Class Size Regression Result I - Using OLS, we find: `$$\widehat{\text{test score}}=689.9-2.28 \times str$$` --- # Class Size Regression Result II - There's a great package called `equatiomatic` that prints this equation in `markdown` or `\(\LaTeX\)`. $$ \operatorname{\widehat{testscr}} = 698.93 - 2.28(\operatorname{str}) $$ -- Here was my code: ```r # install.packages("equatiomatic") # install for first use library(equatiomatic) # load it extract_eq(school_reg, # regression lm object use_coefs = TRUE, # use names of variables coef_digits = 2, # round to 2 digits fix_signs = TRUE) # fix negatives (instead of + -) ``` $$ \operatorname{\widehat{testscr}} = 698.93 - 2.28(\operatorname{str}) $$ - In `R` chunk in `R markdown`, set `{r, results="asis"}` to print this raw output to be rendered --- # Class Size Regression: A Data Point .pull-left[ .smaller[ - One district in our sample is Richmond, CA: ] .code60[ ```r CASchool %>% filter(district=="Richmond Elementary") %>% dplyr::select(district, testscr, str) ``` ``` ## # A tibble: 1 × 3 ## district testscr str ## <chr> <dbl> <dbl> ## 1 Richmond Elementary 672. 22 ``` ] .smallest[ - Predicted value: `$$\widehat{\text{Test Score}}_{\text{Richmond}}=698-2.28(22) \approx 648$$` - Residual `$$\hat{u}_{Richmond}=672-648 \approx 24$$` ] ] .pull-right[ <img src="2.3-slides_files/figure-html/unnamed-chunk-39-1.png" width="504" style="display: block; margin: auto;" /> ]