class: center, middle, inverse, title-slide # 3.6 — Regression with Categorical Data ## ECON 480 • Econometrics • Fall 2021 ### Ryan Safner
Assistant Professor of Economics
safner@hood.edu
ryansafner/metricsF21
metricsF21.classes.ryansafner.com
--- class: inverse # Outline ### [Working with `Factor` Variables in R](#3) ### [Regression with Dummy Variables](#12) ### [Recoding Dummies](#36) ### [Categorical Variables (More than 2 Categories)](#47) --- # Categorical Data .pull-left[ - .hi[Categorical data] place an individual into one of several possible *categories* - e.g. sex, season, political party - may be responses to survey questions - can be quantitative (e.g. age, zip code) - `R` calls these `factors` ] .pull-right[ ![](../images/categoricaldata.png) ] --- class: inverse, center, middle # Working with `factor` Variables in `R` --- # Factors in R .quitesmall[ - `factor` is a special type of `character` object class that indicates membership in a category (called a `level`) - Suppose I have data on students: ```r students %>% head(n = 5) ``` <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> {"columns":[{"label":["ID"],"name":[1],"type":["dbl"],"align":["right"]},{"label":["Rank"],"name":[2],"type":["chr"],"align":["left"]},{"label":["Grade"],"name":[3],"type":["dbl"],"align":["right"]}],"data":[{"1":"1","2":"Sophomore","3":"77"},{"1":"2","2":"Senior","3":"72"},{"1":"3","2":"Freshman","3":"73"},{"1":"4","2":"Senior","3":"73"},{"1":"5","2":"Junior","3":"84"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> ] -- .quitesmall[ - See that `Rank` is a `character` (`<chr>`) variable, just a string of text ] --- # Factors in R .quitesmall[ - We can make `Rank` a `factor` variable, to indicate a student is a member of one of the possible categories: (freshman, sophomore, junior, senior) ```r students <- students %>% mutate(Rank = as.factor(Rank)) # overwrite and change class of Rank to factor students %>% head(n = 5) ``` <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> 
{"columns":[{"label":["ID"],"name":[1],"type":["dbl"],"align":["right"]},{"label":["Rank"],"name":[2],"type":["fct"],"align":["left"]},{"label":["Grade"],"name":[3],"type":["dbl"],"align":["right"]}],"data":[{"1":"1","2":"Sophomore","3":"77"},{"1":"2","2":"Senior","3":"72"},{"1":"3","2":"Freshman","3":"73"},{"1":"4","2":"Senior","3":"73"},{"1":"5","2":"Junior","3":"84"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> .quitesmall[ - See now it’s a `factor` (`<fct>`) ] ] --- # Factors in R .smallest[ ```r # what are the categories? students %>% group_by(Rank) %>% count() ``` <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> {"columns":[{"label":["Rank"],"name":[1],"type":["fct"],"align":["left"]},{"label":["n"],"name":[2],"type":["int"],"align":["right"]}],"data":[{"1":"Freshman","2":"1"},{"1":"Junior","2":"4"},{"1":"Senior","2":"2"},{"1":"Sophomore","2":"3"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> ```r # note the order is arbitrary! 
This is an "unordered" factor ``` ] --- # Ordered Factors in R .quitesmall[ - If there is a rank order you wish to preserve, you can make an `ordered` (`factor`) variable - list the `levels` from 1st to last ```r students <- students %>% mutate(Rank = ordered(Rank, # overwrite and change class of Rank to ordered # next, specify the levels, in order levels = c("Freshman", "Sophomore", "Junior", "Senior") ) ) students %>% head(n = 5) ``` <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> {"columns":[{"label":["ID"],"name":[1],"type":["dbl"],"align":["right"]},{"label":["Rank"],"name":[2],"type":["ord"],"align":["right"]},{"label":["Grade"],"name":[3],"type":["dbl"],"align":["right"]}],"data":[{"1":"1","2":"Sophomore","3":"77"},{"1":"2","2":"Senior","3":"72"},{"1":"3","2":"Freshman","3":"73"},{"1":"4","2":"Senior","3":"73"},{"1":"5","2":"Junior","3":"84"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> ] --- # Ordered Factors in R .quitesmall[ ```r students %>% group_by(Rank) %>% count() ``` <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> {"columns":[{"label":["Rank"],"name":[1],"type":["ord"],"align":["right"]},{"label":["n"],"name":[2],"type":["int"],"align":["right"]}],"data":[{"1":"Freshman","2":"1"},{"1":"Sophomore","2":"3"},{"1":"Junior","2":"4"},{"1":"Senior","2":"2"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> ] --- # Example Research Question .pull-left[ .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: How much higher wages, on average, do men earn compared to women? 
] ] .pull-right[ .center[ ![:scale 100%](../images/genderpaygap.jpg) ] ] --- # The Pure Statistics of Comparing Group Means .pull-left[ .smallest[ - Basic statistics: we can test for a statistically significant difference in group means with a **t-test**<sup>.magenta[†]</sup>, let: - .blue[`\\(\bar{Y}_M\\)`]: average earnings of a sample of .blue[`\\(n_M\\)`] men - .magenta[`\\(\bar{Y}_W\\)`]: average earnings of a sample of .magenta[`\\(n_W\\)`] women - **Difference** in group averages: `\(d=\)` .blue[`\\(\bar{Y}_M\\)`] `\(-\)` .magenta[`\\(\bar{Y}_W\\)`] - The hypothesis test is: - `\(H_0: d=0\)` - `\(H_1: d \neq 0\)` ] ] .pull-right[ .center[ ![:scale 100%](../images/genderpaygap.jpg) ] ] .footnote[<sup>.magenta[†]</sup> See [today’s class page](/content/3.6-content) for this example] --- # Plotting Factors in R .pull-left[ - If I plot a `factor` variable, e.g. `Gender` (which is either `Male` or `Female`), the scatterplot with `wage` looks like this - effectively `R` treats values of a factor variable as integers - in this case, `"Female"` = 0, `"Male"` = 1 - Let’s make this more explicit by making a .hi[dummy variable] to stand in for Gender ] .pull-right[ <img src="3.6-slides_files/figure-html/unnamed-chunk-7-1.png" width="504" /> ] --- class: inverse, center, middle # Regression with Dummy Variables --- # Comparing Groups with Regression .smallest[ - In a regression, we can easily compare across groups via a .hi[dummy variable]<sup>.magenta[†]</sup> - Dummy variable *only* `\(=0\)` or `\(=1\)`, if a condition is `TRUE` vs. 
`FALSE` - Signifies whether an observation belongs to a category or not ] .footnote[<sup>.magenta[†]</sup> Also called a .hi[binary variable] or .hi[dichotomous variable]] -- .smallest[ .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: `$$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Female_i \quad \quad \text{ where } Female_i = \begin{cases} 1 & \text{if individual } i \text{ is } Female \\ 0 & \text{if individual } i \text{ is } Male\\ \end{cases}$$` ] ] -- .smallest[ - Again, `\(\hat{\beta_1}\)` makes less sense as the “slope” of a line in this context ] --- # Comparing Groups in Regression: Scatterplot .pull-left[ - `Female` is our dummy `\(x\)`-variable - Hard to see relationships because of **overplotting** ] .pull-right[ <img src="3.6-slides_files/figure-html/unnamed-chunk-8-1.png" width="504" /> ] --- # Comparing Groups in Regression: Scatterplot .pull-left[ - `Female` is our dummy `\(x\)`-variable - Hard to see relationships because of **overplotting** - Tip: use `geom_jitter()` instead of `geom_point()` to *randomly* nudge points to see them better! - Only used for *plotting*, does not affect actual data, regression, etc. ] .pull-right[ <img src="3.6-slides_files/figure-html/unnamed-chunk-9-1.png" width="504" /> ] --- # Comparing Groups in Regression: Scatterplot .pull-left[ - `Female` is our dummy `\(x\)`-variable - Hard to see relationships because of **overplotting** - Use `geom_jitter()` instead of `geom_point()` to *randomly* nudge points - *Only* for plotting purposes, does not affect actual data, regression, etc. 
] .pull-right[ <img src="3.6-slides_files/figure-html/unnamed-chunk-10-1.png" width="504" /> ] --- # Dummy Variables as Group Means .smallest[ `$$\hat{Y_i}=\hat{\beta_0}+\hat{\beta_1} D_i \quad \text{ where }D_i=\{\color{#6A5ACD}{0},\color{#e64173}{1}\}$$` ] -- .smallest[ - .purple[When `\\(D_i=0\\)` (“Control group”):] - `\(\hat{Y_i}=\hat{\beta_0}\)` - `\(\color{#6A5ACD}{E[Y_i|D_i=0]}=\hat{\beta_0}\)` `\(\iff\)` the mean of `\(Y\)` when `\(D_i=0\)` ] -- .smallest[ - .hi[When `\\(D_i=1\\)` (“Treatment group”):] - `\(\hat{Y_i}=\hat{\beta_0}+\hat{\beta_1} D_i\)` - `\(\color{#e64173}{E[Y_i|D_i=1]}=\hat{\beta_0}+\hat{\beta_1}\)` `\(\iff\)` the mean of `\(Y\)` when `\(D_i=1\)` ] -- .smallest[ - So the **difference** in group means: `$$\begin{align*} &=\color{#e64173}{E[Y_i|D_i=1]}-\color{#6A5ACD}{E[Y_i|D_i=0]}\\ &=(\hat{\beta_0}+\hat{\beta_1})-(\hat{\beta_0})\\ &=\hat{\beta_1}\\ \end{align*}$$` ] --- # Dummy Variables as Group Means: Our Example .pull-left[ .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: `$$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Female_i$$` `$$\text{where } Female_i = \begin{cases} 1 & \text{if } i \text{ is }Female \\ 0 & \text{if } i \text{ is } Male\\ \end{cases}$$` ] ] .pull-right[ - Mean wage for men: ] --- # Dummy Variables as Group Means: Our Example .pull-left[ .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: `$$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Female_i$$` `$$\text{where } Female_i = \begin{cases} 1 & \text{if } i \text{ is }Female \\ 0 & \text{if } i \text{ is } Male\\ \end{cases}$$` ] ] .pull-right[ - Mean wage for men: `$$E[Wage|Female=0]=\hat{\beta_0}$$` ] --- # Dummy Variables as Group Means: Our Example .pull-left[ .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: `$$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Female_i$$` `$$\text{where } Female_i = \begin{cases} 1 & \text{if } i \text{ is }Female \\ 0 & \text{if } i \text{ is } 
Male\\ \end{cases}$$` ] ] .pull-right[ - Mean wage for men: `$$E[Wage|Female=0]=\hat{\beta_0}$$` - Mean wage for women: ] --- # Dummy Variables as Group Means: Our Example .pull-left[ .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: `$$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Female_i$$` `$$\text{where } Female_i = \begin{cases} 1 & \text{if } i \text{ is }Female \\ 0 & \text{if } i \text{ is } Male\\ \end{cases}$$` ] ] .pull-right[ - Mean wage for men: `$$E[Wage|Female=0]=\hat{\beta_0}$$` - Mean wage for women: `$$E[Wage|Female=1]=\hat{\beta_0}+\hat{\beta_1}$$` ] --- # Dummy Variables as Group Means: Our Example .pull-left[ .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: `$$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Female_i$$` `$$\text{where } Female_i = \begin{cases} 1 & \text{if } i \text{ is }Female \\ 0 & \text{if } i \text{ is } Male\\ \end{cases}$$` ] ] .pull-right[ - Mean wage for men: `$$E[Wage|Female=0]=\hat{\beta_0}$$` - Mean wage for women: `$$E[Wage|Female=1]=\hat{\beta_0}+\hat{\beta_1}$$` - Difference in wage between men & women: ] --- # Dummy Variables as Group Means: Our Example .pull-left[ .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: `$$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Female_i$$` `$$\text{where } Female_i = \begin{cases} 1 & \text{if } i \text{ is }Female \\ 0 & \text{if } i \text{ is } Male\\ \end{cases}$$` ] ] .pull-right[ - Mean wage for men: `$$E[Wage|Female=0]=\hat{\beta_0}$$` - Mean wage for women: `$$E[Wage|Female=1]=\hat{\beta_0}+\hat{\beta_1}$$` - Difference in wage between men & women: `$$\hat{\beta_1}$$` ] --- # Comparing Groups in Regression: Scatterplot .pull-left[ `$$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Female_i$$` `$$\text{where } Female_i = \begin{cases} 1 & \text{if } i \text{ is }Female \\ 0 & \text{if } i \text{ is } Male\\ \end{cases}$$` ] .pull-right[ <img 
src="3.6-slides_files/figure-html/unnamed-chunk-11-1.png" width="504" /> ] --- # The Data .quitesmall[ ```r # comes from wooldridge package # install.packages("wooldridge") library(wooldridge) # data is called "wage1", save as a dataframe I'll call "wages" wages <- wage1 wages %>% head() ``` ] --- # Get Group Averages & Std. Devs. .pull-left[ .smallest[ ```r # Summarize for Men wages %>% filter(female==0) %>% summarize(mean = mean(wage), sd = sd(wage)) ``` <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> {"columns":[{"label":["mean"],"name":[1],"type":["dbl"],"align":["right"]},{"label":["sd"],"name":[2],"type":["dbl"],"align":["right"]}],"data":[{"1":"7.099489","2":"4.160858"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> ] ] .pull-right[ .smallest[ ```r # Summarize for Women wages %>% filter(female==1) %>% summarize(mean = mean(wage), sd = sd(wage)) ``` <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> {"columns":[{"label":["mean"],"name":[1],"type":["dbl"],"align":["right"]},{"label":["sd"],"name":[2],"type":["dbl"],"align":["right"]}],"data":[{"1":"4.587659","2":"2.529363"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> ] ] --- # Visualize Differences <img src="3.6-slides_files/figure-html/unnamed-chunk-15-1.png" width="1008" /> --- # The Regression I .pull-left[ .tiny[ .code90[ ```r femalereg <- lm(wage ~ female, data = wages) summary(femalereg) ``` ``` ## ## Call: ## lm(formula = wage ~ female, data = wages) ## ## Residuals: ## Min 1Q Median 3Q Max ## -5.5995 -1.8495 -0.9877 1.4260 17.8805 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 7.0995 0.2100 33.806 < 2e-16 *** ## female -2.5118 0.3034 -8.279 1.04e-15 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 ## ## Residual standard error: 3.476 on 524 degrees of freedom ## Multiple R-squared: 0.1157, Adjusted R-squared: 0.114 ## F-statistic: 68.54 on 1 and 524 DF, p-value: 1.042e-15 ``` ] ] ] -- .pull-right[ .tiny[ ```r library(broom) tidy(femalereg) ``` <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> {"columns":[{"label":["term"],"name":[1],"type":["chr"],"align":["left"]},{"label":["estimate"],"name":[2],"type":["dbl"],"align":["right"]},{"label":["std.error"],"name":[3],"type":["dbl"],"align":["right"]},{"label":["statistic"],"name":[4],"type":["dbl"],"align":["right"]},{"label":["p.value"],"name":[5],"type":["dbl"],"align":["right"]}],"data":[{"1":"(Intercept)","2":"7.099489","3":"0.2100082","4":"33.805777","5":"8.971839e-134"},{"1":"female","2":"-2.511830","3":"0.3034092","4":"-8.278688","5":"1.041764e-15"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> ] ] --- # Dummy Regression vs. Group Means .pull-left[ .smallest[ From tabulation of group means | Gender | Avg. Wage | Std. Dev. 
| `\(n\)` | |--------|-------------|-----------|-------| | Female | `\(4.59\)` | `\(2.53\)` | `\(252\)` | | Male | `\(7.10\)` | `\(4.16\)` | `\(274\)` | | Difference | `\(2.51\)` | `\(0.30\)` | `\(-\)` | From `\(t\)`-test of difference in group means ] ] .pull-right[ .quitesmall[ <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> {"columns":[{"label":["term"],"name":[1],"type":["chr"],"align":["left"]},{"label":["estimate"],"name":[2],"type":["dbl"],"align":["right"]},{"label":["std.error"],"name":[3],"type":["dbl"],"align":["right"]},{"label":["statistic"],"name":[4],"type":["dbl"],"align":["right"]},{"label":["p.value"],"name":[5],"type":["dbl"],"align":["right"]}],"data":[{"1":"(Intercept)","2":"7.099489","3":"0.2100082","4":"33.805777","5":"8.971839e-134"},{"1":"female","2":"-2.511830","3":"0.3034092","4":"-8.278688","5":"1.041764e-15"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> ] `$$\widehat{\text{Wages}_i}=7.10-2.51 \, \text{Female}_i$$` ] --- class: inverse, center, middle # Recoding Dummies --- # Recoding Dummies .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: - Suppose instead of `\(female\)` we had used: `$$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Male_i \quad \quad \text{ where } Male_i = \begin{cases} 1 & \text{if person } i \text{ is } Male \\ 0 & \text{if person } i \text{ is } Female\\ \end{cases}$$` ] --- # Recoding Dummies with Data .quitesmall[ ```r wages<-wages %>% mutate(male = ifelse(female == 0, # condition: is female equal to 0? 
yes = 1, # if true: code as "1" no = 0)) # if false: code as "0" # verify it worked wages %>% select(wage, female, male) %>% head() ``` <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> {"columns":[{"label":[""],"name":["_rn_"],"type":[""],"align":["left"]},{"label":["wage"],"name":[1],"type":["dbl"],"align":["right"]},{"label":["female"],"name":[2],"type":["int"],"align":["right"]},{"label":["male"],"name":[3],"type":["dbl"],"align":["right"]}],"data":[{"1":"3.10","2":"1","3":"0","_rn_":"1"},{"1":"3.24","2":"1","3":"0","_rn_":"2"},{"1":"3.00","2":"0","3":"1","_rn_":"3"},{"1":"6.00","2":"0","3":"1","_rn_":"4"},{"1":"5.30","2":"0","3":"1","_rn_":"5"},{"1":"8.75","2":"0","3":"1","_rn_":"6"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> ] --- # Scatterplot with Male .pull-left[ <img src="3.6-slides_files/figure-html/unnamed-chunk-20-1.png" width="504" /> ] -- .pull-right[ <img src="3.6-slides_files/figure-html/unnamed-chunk-21-1.png" width="504" /> ] --- # Dummy Variables as Group Means: With Male .pull-left[ .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: `$$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Male_i$$` `$$\text{where } Male_i = \begin{cases} 1 & \text{if } i \text{ is } Male \\ 0 & \text{if } i \text{ is } Female\\ \end{cases}$$` ] ] .pull-right[ - Mean wage for men: `$$E[Wage|Male=1]=\hat{\beta_0}+\hat{\beta_1}$$` - Mean wage for women: `$$E[Wage|Male=0]=\hat{\beta_0}$$` - Difference in wage between men & women: `$$\hat{\beta_1}$$` ] --- # Scatterplot with Male .pull-left[ <img src="3.6-slides_files/figure-html/unnamed-chunk-22-1.png" width="504" /> ] -- .pull-right[ <img src="3.6-slides_files/figure-html/unnamed-chunk-23-1.png" width="504" /> ] --- # The Regression with Male I .pull-left[ .tiny[ .code90[ ```r malereg <- lm(wage ~ male, data = wages) summary(malereg) ``` ``` ## ## Call: ## lm(formula = wage ~ male, data = wages) ## 
## Residuals: ## Min 1Q Median 3Q Max ## -5.5995 -1.8495 -0.9877 1.4260 17.8805 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 4.5877 0.2190 20.950 < 2e-16 *** ## male 2.5118 0.3034 8.279 1.04e-15 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.476 on 524 degrees of freedom ## Multiple R-squared: 0.1157, Adjusted R-squared: 0.114 ## F-statistic: 68.54 on 1 and 524 DF, p-value: 1.042e-15 ``` ] ] ] -- .pull-right[ .tiny[ ```r library(broom) tidy(malereg) ``` <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> {"columns":[{"label":["term"],"name":[1],"type":["chr"],"align":["left"]},{"label":["estimate"],"name":[2],"type":["dbl"],"align":["right"]},{"label":["std.error"],"name":[3],"type":["dbl"],"align":["right"]},{"label":["statistic"],"name":[4],"type":["dbl"],"align":["right"]},{"label":["p.value"],"name":[5],"type":["dbl"],"align":["right"]}],"data":[{"1":"(Intercept)","2":"4.587659","3":"0.2189834","4":"20.949802","5":"3.012371e-71"},{"1":"male","2":"2.511830","3":"0.3034092","4":"8.278688","5":"1.041764e-15"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> ] ] --- # The Dummy Regression: Male or Female .pull-left[ .quitesmall[
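
| | (1) | (2) |
|-----------|----------|-----------|
| Constant | 4.59 *** | 7.10 *** |
| | (0.22) | (0.21) |
| Female | | -2.51 *** |
| | | (0.30) |
| Male | 2.51 *** | |
| | (0.30) | |
| N | 526 | 526 |
| R-Squared | 0.12 | 0.12 |
| SER | 3.48 | 3.48 |

*** p < 0.001; ** p < 0.01; * p < 0.05.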
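
A minimal sketch of why the two codings of the dummy carry the same information. It uses simulated data (an assumption for illustration, not the `wage1` sample), and the names `sim`, `reg_f` and `reg_m` are hypothetical. In each regression the intercept should match the mean of the group coded 0, and the slope should match the difference in group means with the sign flipped between the two codings.

```r
# simulated (hypothetical) data, NOT the wage1 sample from these slides
set.seed(1)
n <- 500
female <- rbinom(n, size = 1, prob = 0.5)
wage <- 7 - 2.5 * female + rnorm(n, sd = 3)
sim <- data.frame(wage, female, male = 1 - female)

# the same regression with each coding of the dummy
reg_f <- lm(wage ~ female, data = sim)
reg_m <- lm(wage ~ male, data = sim)

# group means computed directly, for comparison
mean_men <- mean(subset(sim, female == 0)$wage)
mean_women <- mean(subset(sim, female == 1)$wage)

coef(reg_f) # intercept: mean for men; slope: women minus men
coef(reg_m) # intercept: mean for women; slope: men minus women
c(men = mean_men, women = mean_women)
```

Comparing the three printed vectors shows that only the reference group changes, not the estimated gap.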
] ] .pull-right[ - Note it doesn't matter if we use `male` or `female`: either way, men earn $2.51 more than women on average - Compare the constant (average for the `\(D=0\)` group) - Should you use `male` AND `female`? We'll come to that... ] --- class: inverse, center, middle # Categorical Variables (More than 2 Categories) --- # Categorical Variables with More than 2 Categories - A .hi[categorical variable] expresses membership in a category, where there is no ranking or hierarchy of the categories - So far we've only looked at categorical variables with 2 categories (e.g. Male/Female) - Many have more than 2: e.g. Spring/Summer/Fall/Winter, Democratic/Republican/Independent -- - A variable might instead be an .hi[ordinal variable], which expresses a rank or an ordering of data, but not necessarily their relative magnitude - e.g. Order of finalists in a competition (1st, 2nd, 3rd) - e.g. Highest education attained (1=elementary school, 2=high school, 3=bachelor's degree, 4=graduate degree) --- # Using Categorical Variables in Regression I .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: How do wages vary by region of the country? Let `\(Region_i=\{Northeast, \, Midwest, \, South, \, West\}\)` ] -- - Can we run the following regression? `$$\widehat{Wages_i}=\hat{\beta_0}+\hat{\beta_1}Region_i$$` --- # Using Categorical Variables in Regression II .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: How do wages vary by region of the country? ] Code region numerically: `$$Region_i= \begin{cases}1 & \text{if } i \text{ is in }Northeast\\ 2 & \text{if } i \text{ is in } Midwest\\ 3 & \text{if } i \text{ is in } South \\ 4 & \text{if } i \text{ is in } West\\ \end{cases}$$` -- - Can we run the following regression? `$$\widehat{Wages_i}=\hat{\beta_0}+\hat{\beta_1}Region_i$$` --- # Using Categorical Variables in Regression III .smallest[ .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: How do wages vary by region of the country? 
Create a dummy variable for *each* region: - `\(Northeast_i = 1\)` if `\(i\)` is in Northeast, otherwise `\(=0\)` - `\(Midwest_i = 1\)` if `\(i\)` is in Midwest, otherwise `\(=0\)` - `\(South_i = 1\)` if `\(i\)` is in South, otherwise `\(=0\)` - `\(West_i = 1\)` if `\(i\)` is in West, otherwise `\(=0\)` ] ] -- .smallest[ - Can we run the following regression? `$$\widehat{Wages_i}=\hat{\beta_0}+\hat{\beta_1}Northeast_i+\hat{\beta_2}Midwest_i+\hat{\beta_3}South_i+\hat{\beta_4}West_i$$` ] -- .smallest[ - For every `\(i: \, Northeast_i+Midwest_i+South_i+West_i=1\)`! ] --- # The Dummy Variable Trap .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: `\(\widehat{Wages_i}=\hat{\beta_0}+\hat{\beta_1}Northeast_i+\hat{\beta_2}Midwest_i+\hat{\beta_3}South_i+\hat{\beta_4}West_i\)` ] - If we include *all* possible categories, they are .hi-purple[perfectly multicollinear], an exact linear function of one another: `$$Northeast_i+Midwest_i+South_i+West_i=1 \quad \forall i$$` - This is known as the .hi[dummy variable trap], a common source of perfect multicollinearity --- # The Reference Category - To avoid the dummy variable trap, always omit one category from the regression, known as the .hi[“reference category”] - It does not matter which category we omit! 
- .hi-purple[Coefficients on each dummy variable measure the *difference* between the *reference* category and each category dummy] --- # The Reference Category: Example .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: `\(\widehat{Wages_i}=\hat{\beta_0}+\hat{\beta_1}Northeast_i+\hat{\beta_2}Midwest_i+\hat{\beta_3}South_i\)` ] - `\(West_i\)` is omitted (arbitrarily chosen) -- - `\(\hat{\beta_0}\)`: --- # The Reference Category: Example .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: `\(\widehat{Wages_i}=\hat{\beta_0}+\hat{\beta_1}Northeast_i+\hat{\beta_2}Midwest_i+\hat{\beta_3}South_i\)` ] - `\(West_i\)` is omitted (arbitrarily chosen) - `\(\hat{\beta_0}\)`: average wage for `\(i\)` in the West -- - `\(\hat{\beta_1}\)`: --- # The Reference Category: Example .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: `\(\widehat{Wages_i}=\hat{\beta_0}+\hat{\beta_1}Northeast_i+\hat{\beta_2}Midwest_i+\hat{\beta_3}South_i\)` ] - `\(West_i\)` is omitted (arbitrarily chosen) - `\(\hat{\beta_0}\)`: average wage for `\(i\)` in the West - `\(\hat{\beta_1}\)`: difference between West and Northeast -- - `\(\hat{\beta_2}\)`: --- # The Reference Category: Example .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: `\(\widehat{Wages_i}=\hat{\beta_0}+\hat{\beta_1}Northeast_i+\hat{\beta_2}Midwest_i+\hat{\beta_3}South_i\)` ] - `\(West_i\)` is omitted (arbitrarily chosen) - `\(\hat{\beta_0}\)`: average wage for `\(i\)` in the West - `\(\hat{\beta_1}\)`: difference between West and Northeast - `\(\hat{\beta_2}\)`: difference between West and Midwest -- - `\(\hat{\beta_3}\)`: --- # The Reference Category: Example .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: `\(\widehat{Wages_i}=\hat{\beta_0}+\hat{\beta_1}Northeast_i+\hat{\beta_2}Midwest_i+\hat{\beta_3}South_i\)` ] - `\(West_i\)` is omitted (arbitrarily chosen) - `\(\hat{\beta_0}\)`: 
average wage for `\(i\)` in the West - `\(\hat{\beta_1}\)`: difference between West and Northeast - `\(\hat{\beta_2}\)`: difference between West and Midwest - `\(\hat{\beta_3}\)`: difference between West and South --- # Dummy Variable Trap in R .smallest[ .code50[ ```r lm(wage ~ noreast + northcen + south + west, data = wages) %>% summary() ``` ``` ## ## Call: ## lm(formula = wage ~ noreast + northcen + south + west, data = wages) ## ## Residuals: ## Min 1Q Median 3Q Max ## -6.083 -2.387 -1.097 1.157 18.610 ## ## Coefficients: (1 not defined because of singularities) ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 6.6134 0.3891 16.995 < 2e-16 *** ## noreast -0.2436 0.5154 -0.473 0.63664 ## northcen -0.9029 0.5035 -1.793 0.07352 . ## south -1.2265 0.4728 -2.594 0.00974 ** ## west NA NA NA NA ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.671 on 522 degrees of freedom ## Multiple R-squared: 0.0175, Adjusted R-squared: 0.01185 ## F-statistic: 3.099 on 3 and 522 DF, p-value: 0.02646 ``` ] ] --- # Using Different Reference Categories in R .code50[ ```r # let's run 4 regressions, each one we omit a different region no_noreast_reg <- lm(wage ~ northcen + south + west, data = wages) no_northcen_reg <- lm(wage ~ noreast + south + west, data = wages) no_south_reg <- lm(wage ~ noreast + northcen + west, data = wages) no_west_reg <- lm(wage ~ noreast + northcen + south, data = wages) # now make an output table library(huxtable) huxreg(no_noreast_reg, no_northcen_reg, no_south_reg, no_west_reg, coefs = c("Constant" = "(Intercept)", "Northeast" = "noreast", "Midwest" = "northcen", "South" = "south", "West" = "west"), statistics = c("N" = "nobs", "R-Squared" = "r.squared", "SER" = "sigma"), number_format = 3) ``` ] --- # Using Different Reference Categories in R II .pull-left[ .tiny[
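
| | (1) | (2) | (3) | (4) |
|-----------|-----------|-----------|-----------|-----------|
| Constant | 6.370 *** | 5.710 *** | 5.387 *** | 6.613 *** |
| | (0.338) | (0.320) | (0.268) | (0.389) |
| Northeast | | 0.659 | 0.983 * | -0.244 |
| | | (0.465) | (0.432) | (0.515) |
| Midwest | -0.659 | | 0.324 | -0.903 |
| | (0.465) | | (0.417) | (0.504) |
| South | -0.983 * | -0.324 | | -1.226 ** |
| | (0.432) | (0.417) | | (0.473) |
| West | 0.244 | 0.903 | 1.226 ** | |
| | (0.515) | (0.504) | (0.473) | |
| N | 526 | 526 | 526 | 526 |
| R-Squared | 0.017 | 0.017 | 0.017 | 0.017 |
| SER | 3.671 | 3.671 | 3.671 | 3.671 |

*** p < 0.001; ** p < 0.01; * p < 0.05.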
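
A hedged sketch of an alternative to building each region dummy by hand: if the region is stored as a single `factor`, `lm()` creates the dummies itself and omits the first level as the reference category, and `relevel()` changes which level that is. The data below are simulated (an assumption for illustration, not `wage1`), and `sim`, `region_w`, `reg_ne` and `reg_w` are hypothetical names.

```r
# simulated (hypothetical) data, NOT the wage1 sample from these slides
set.seed(2)
n <- 400
regions <- c("Northeast", "Midwest", "South", "West")
region <- factor(sample(regions, n, replace = TRUE), levels = regions)
# assumed regional averages, chosen only for illustration
wage <- 6.4 - 0.7 * (region == "Midwest") - 1.0 * (region == "South") +
  0.2 * (region == "West") + rnorm(n, sd = 3)
sim <- data.frame(wage, region)

# lm() turns the factor into dummies and omits the first level (Northeast)
reg_ne <- lm(wage ~ region, data = sim)

# relevel() makes West the reference category instead
sim$region_w <- relevel(sim$region, ref = "West")
reg_w <- lm(wage ~ region_w, data = sim)

coef(reg_ne)
coef(reg_w)
```

As in the table above, the West coefficient in the first model and the Northeast coefficient in the second should be equal in size with opposite signs, and the fit of the two models is identical.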
] ] .pull-right[ .smallest[ - Constant is always the average wage for the reference (omitted) region - Compare coefficients between Midwest in (1) and Northeast in (2)... - Compare coefficients between West in (3) and South in (4)... - Does not matter which region we omit! - Same `\(R^2\)` and SER; the coefficients imply the same differences between regions ] ] --- # Dummy *Dependent* (Y) Variables .smallest[ - In many contexts, we will want to have our *dependent* `\((Y)\)` variable be a dummy variable ] -- .smallest[ .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: `$$\widehat{Admitted_i}=\hat{\beta_0}+\hat{\beta_1}GPA_i \quad \text{ where } Admitted_i = \begin{cases} 1 & \text{if } i \text{ is Admitted} \\ 0 & \text{if } i \text{ is Not Admitted}\\ \end{cases}$$` ] ] -- .smallest[ - A model where `\(Y\)` is a dummy is called a .hi[linear probability model], as it measures the .hi-purple[probability of `\\(Y\\)` occurring `\\((=1)\\)` given the X's, i.e. `\\(P(Y_i=1|X_1, \cdots, X_k)\\)`] - e.g. the probability person `\(i\)` is Admitted to a program with a given GPA ] -- .smallest[ - Requires special tools to properly interpret and extend this (**logit**, **probit**, etc.) - Feel free to write papers that have dummy `\(Y\)` variables (but you may have to ask me some more questions)! ]
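
---

# Dummy *Dependent* (Y) Variables: A Sketch in R

.smallest[
- A minimal sketch of a linear probability model, assuming simulated data: the `applicants` data frame, its `gpa` and `admitted` columns, and the admission probabilities are all made up for illustration
- OLS is run exactly as before; the fitted value is read as an estimated probability that `admitted` equals 1
]

```r
# simulated (hypothetical) data: 200 applicants, not a real dataset
set.seed(480)
n <- 200
gpa <- round(runif(n, min = 2, max = 4), 2)
# assumed true admission probability: 0.1 at a 2.0 GPA, rising to 0.9 at 4.0
p_admit <- 0.1 + 0.4 * (gpa - 2)
admitted <- rbinom(n, size = 1, prob = p_admit)
applicants <- data.frame(gpa, admitted)

# linear probability model: OLS with a 0/1 dependent variable
lpm <- lm(admitted ~ gpa, data = applicants)
coef(lpm) # slope: change in P(admitted = 1) per 1-point increase in GPA

# estimated probability of admission at a GPA of 3.5
predict(lpm, newdata = data.frame(gpa = 3.5))
```

.smallest[
- Nothing keeps the fitted values of a linear model between 0 and 1, which is one reason logit and probit are used
]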