Overview
Today we look at how to use categorical data, i.e. variables that indicate an observation’s membership in a particular group or category. We introduce them into regression models as dummy variables that equal either 0 or 1, where 1 indicates membership in a category and 0 indicates non-membership.
We also look at what happens when a categorical variable has more than two values: in a regression, we include a dummy variable for each possible category, but leave out one reference category to avoid the dummy variable trap.
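As a rough illustration of the reference-category idea, the sketch below uses base R's `model.matrix()` on a small made-up data frame. The data, the `region` variable, and the object names are hypothetical and only serve to show how the dummies are built.

```r
# Hypothetical toy data: one categorical variable with three categories
toy <- data.frame(
  wage   = c(10, 12, 9, 15, 14, 11),
  region = factor(c("north", "south", "west", "north", "south", "west"))
)

# model.matrix() returns the design matrix lm() would use for this formula:
# an intercept plus a 0/1 dummy for every category except the reference one
X <- model.matrix(wage ~ region, data = toy)
colnames(X)

# The dummy variable trap: adding a dummy for the reference category as well
# makes the columns perfectly collinear (the three dummies sum to the intercept)
X_trap <- cbind(X, regionnorth = as.numeric(toy$region == "north"))
ncol(X_trap)       # 4 columns
qr(X_trap)$rank    # but only rank 3
```

Here "north" is the reference category simply because it is the first factor level; the coefficients on the other two dummies would be read as differences relative to it.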
Slides
Below, you can find the slides in two formats. Clicking the image will bring you to the HTML version of the slides in a new tab. While going through the slides, you can type h to see a special list of viewing options, and type o for an outline view of all the slides.
The lower button will allow you to download a PDF version of the slides. I suggest printing the slides beforehand and using them to take additional notes in class (not everything is in the slides)!
Assignments
Problem Set 4 Due Tues Nov 9
Problem Set 4 is due by the end of the day on Tuesday, November 9.
Appendix: T-Test for Difference in Group Means
Often we want to compare the means between two groups, and see if the difference is statistically significant. As an example, is there a statistically significant difference in average hourly earnings between men and women? Let:
- $\mu_W$: mean hourly earnings for female college graduates
- $\mu_M$: mean hourly earnings for male college graduates
We want to run a hypothesis test for the difference in these two population means, $\mu_M - \mu_W$:
$$\begin{aligned} H_0&: \mu_M - \mu_W = 0 \\ H_1&: \mu_M - \mu_W \neq 0 \end{aligned}$$
Our null hypothesis is that there is no difference between the two population means. Let’s also have a two-sided alternative hypothesis, simply that there is a difference (positive or negative).
Note a logical one-sided alternative would be $H_1: \mu_M - \mu_W > 0$, i.e. men earn more than women.
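The Sampling Distribution of $\bar{Y}_M - \bar{Y}_W$
The true population means are unknown, so we must estimate them from samples of men and women. Let:
- $\bar{Y}_M$: the average earnings of a sample of $n_M$ men
- $\bar{Y}_W$: the average earnings of a sample of $n_W$ women
We then estimate $(\mu_M - \mu_W)$ with the sample difference $(\bar{Y}_M - \bar{Y}_W)$.
We would then run a $t$-test and calculate the test statistic for the difference in means. The formula for the test statistic is:
$$t = \frac{(\bar{Y}_M - \bar{Y}_W) - 0}{\sqrt{\frac{s_M^2}{n_M} + \frac{s_W^2}{n_W}}}$$
where $s_M$ and $s_W$ are the sample standard deviations of the two groups.
We then compare $t$ against the critical value $t^*$, or calculate the $p$-value as usual, to determine whether we have sufficient evidence to reject $H_0$.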
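The calculation of the test statistic can be sketched as a small R function. The helper name `diff_means_t()` and the numbers passed to it are made up for illustration; the function assumes the unequal-variance form of the standard error, $\sqrt{s_1^2/n_1 + s_2^2/n_2}$.

```r
# Hypothetical helper (not part of the course code): t statistic for a
# difference in means, computed from group summary statistics and
# allowing the two groups to have different variances
diff_means_t <- function(mean1, sd1, n1, mean2, sd2, n2) {
  diff <- mean1 - mean2
  se   <- sqrt(sd1^2 / n1 + sd2^2 / n2)
  list(diff = diff, se = se, t = diff / se)
}

# Made-up numbers, chosen so the arithmetic is easy to follow:
# se = sqrt(9/100 + 16/100) = 0.5, so t = 1 / 0.5 = 2
res <- diff_means_t(mean1 = 10, sd1 = 3, n1 = 100,
                    mean2 = 9,  sd2 = 4, n2 = 100)
res$t

# Compare against a two-sided 5% critical value (normal approximation)
abs(res$t) > qnorm(0.975)
```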
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(wooldridge)
# Our data comes from wage1 in the wooldridge package
wages <- wage1
# look at average wage for men
wages %>%
  filter(female == 0) %>%
  summarize(average = mean(wage),
            sd = sd(wage))
## average sd
## 1 7.099489 4.160858
# look at average wage for women
wages %>%
  filter(female == 1) %>%
  summarize(average = mean(wage),
            sd = sd(wage))
## average sd
## 1 4.587659 2.529363
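The two summaries above can also be produced in a single step that reports the group sizes, which are needed later for the standard error. The sketch below is one possible way to do this in base R, so it runs without extra packages; the `toy` data frame is made up and merely stands in for the wage data.

```r
# Hypothetical toy data standing in for the wage data
toy <- data.frame(
  wage   = c(8, 10, 12, 5, 6, 7),
  female = c(0, 0, 0, 1, 1, 1)
)

# Split the data by group, then build one summary row per group:
# sample size, mean, and standard deviation of wage
group_stats <- do.call(rbind, lapply(split(toy, toy$female), function(g) {
  data.frame(female  = g$female[1],
             n       = nrow(g),
             average = mean(g$wage),
             sd      = sd(g$wage))
}))
group_stats
```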
So our data tell us that male and female hourly earnings have the following sample means and standard deviations:
$$\begin{aligned} \bar{Y}_M &\approx 7.10, & s_M &\approx 4.16 \\ \bar{Y}_W &\approx 4.59, & s_W &\approx 2.53 \end{aligned}$$
We can plot the two distributions to see this visually. There is a lot of overlap, but the male average is higher than the female average. There is also more variation among men than among women; notably, the male distribution has a longer right tail.
wages$female <- as.factor(wages$female)
ggplot(data = wages)+
  aes(x = wage,
      fill = female)+
  geom_density(alpha = 0.5)+
  scale_x_continuous(breaks = seq(0,25,5),
                     name = "Wage",
                     labels = scales::dollar)+
  theme_light()
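Knowing the sample means and standard deviations of male and female hourly earnings, we can estimate the sampling distribution of the difference in group means between men and women as follows.
The mean:
$$\bar{Y}_M - \bar{Y}_W = 7.10 - 4.59 = 2.51$$
The standard error of the difference in means:
$$SE(\bar{Y}_M - \bar{Y}_W) = \sqrt{\frac{s_M^2}{n_M} + \frac{s_W^2}{n_W}} \approx 0.29$$
So the difference in group means is approximately distributed (with the standard error as the second parameter):
$$\bar{Y}_M - \bar{Y}_W \sim N(2.51, 0.29)$$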
ggplot(data = data.frame(x = 0:6))+
  aes(x = x)+
  stat_function(fun = dnorm, args = list(mean = 2.51, sd = 0.29), color = "purple")+
  labs(x = "Wage Difference",
       y = "Density")+
  scale_x_continuous(breaks = seq(0,6,1),
                     labels = scales::dollar)+
  theme_light()
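The mean and standard error of the difference can be recomputed from the unrounded group summaries printed earlier. The sketch below does this by hand. The group sizes (274 men, 252 women) are not shown in the output above and are an assumption here; they can be checked with `table(wages$female)`.

```r
# Group means and standard deviations, copied from the summaries above
mean_m <- 7.099489; sd_m <- 4.160858
mean_w <- 4.587659; sd_w <- 2.529363

# ASSUMPTION: group sizes are not printed above; check with table(wages$female)
n_m <- 274
n_w <- 252

# Difference in sample means and its standard error (unequal variances)
diff_hat <- mean_m - mean_w
se_diff  <- sqrt(sd_m^2 / n_m + sd_w^2 / n_w)

diff_hat            # about 2.51
se_diff             # about 0.298
diff_hat / se_diff  # about 8.44
```

With unrounded inputs the standard error comes out slightly above the 0.29 used in the plot, which is why a hand calculation with 0.29 gives a somewhat larger test statistic than `t.test()` reports.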
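Now we run the $t$-test like any other:
$$t = \frac{(\bar{Y}_M - \bar{Y}_W) - 0}{SE(\bar{Y}_M - \bar{Y}_W)} = \frac{2.51}{0.29} \approx 8.66$$
This is statistically significant. The upper-tail probability computed below is about $4.1 \times 10^{-17}$ (doubling it for the two-sided test gives about $8.2 \times 10^{-17}$), or basically 0.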
pt(8.66,456.33, lower.tail = FALSE)
## [1] 4.102729e-17
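The degrees of freedom passed to `pt()` are not explained in the text. One plausible source is the Welch-Satterthwaite approximation, which is what `t.test()` uses when `var.equal = FALSE`. The sketch below computes it by hand and also doubles the tail probability for a two-sided test; the group sizes are again an assumption, as they are not printed in the notes.

```r
# Group standard deviations, copied from the summaries above
sd_m <- 4.160858
sd_w <- 2.529363

# ASSUMPTION: group sizes are not printed above; check with table(wages$female)
n_m <- 274
n_w <- 252

# Welch-Satterthwaite approximation to the degrees of freedom
a <- sd_m^2 / n_m
b <- sd_w^2 / n_w
df_welch <- (a + b)^2 / (a^2 / (n_m - 1) + b^2 / (n_w - 1))
df_welch   # about 456.3

# Upper-tail probability for the hand-computed statistic, and the
# two-sided p-value obtained by doubling it
t_stat <- 8.66
p_one  <- pt(t_stat, df_welch, lower.tail = FALSE)
p_two  <- 2 * p_one
p_two
```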
The $t$-test in R
t.test(wage ~ female, data = wages, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: wage by female
## t = 8.44, df = 456.33, p-value = 4.243e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 1.926971 3.096690
## sample estimates:
## mean in group 0 mean in group 1
## 7.099489 4.587659
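The object returned by `t.test()` stores each piece of the printout, so the numbers can be pulled out and checked against the formula. The sketch below uses simulated data (the means and standard deviations are arbitrary choices for illustration, not the course data) and confirms that the reported statistic matches the hand formula with unequal variances.

```r
set.seed(1)  # simulated data; parameter values are arbitrary illustrations
toy <- data.frame(
  wage   = c(rnorm(50, mean = 7, sd = 4), rnorm(50, mean = 4.5, sd = 2.5)),
  female = factor(rep(c(0, 1), each = 50))
)

res <- t.test(wage ~ female, data = toy, var.equal = FALSE)

res$estimate    # the two group means
res$statistic   # t statistic
res$parameter   # Welch degrees of freedom
res$p.value     # two-sided p-value
res$conf.int    # 95% confidence interval for the difference

# Recompute the statistic by hand: (mean_0 - mean_1) / sqrt(s0^2/n0 + s1^2/n1)
m <- tapply(toy$wage, toy$female, mean)
v <- tapply(toy$wage, toy$female, var)
t_manual <- (m[["0"]] - m[["1"]]) / sqrt(v[["0"]] / 50 + v[["1"]] / 50)
all.equal(unname(res$statistic), t_manual)
```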
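We can recover the same difference with a regression of wage on the female dummy. The intercept is the average wage for men, and the coefficient on female1 is the difference between the female and male averages. The regression's $t$-statistic differs slightly from the Welch test above because lm() assumes equal variances in the two groups by default.
reg <- lm(wage~female, data = wages)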
summary(reg)
##
## Call:
## lm(formula = wage ~ female, data = wages)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5995 -1.8495 -0.9877 1.4260 17.8805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0995 0.2100 33.806 < 2e-16 ***
## female1 -2.5118 0.3034 -8.279 1.04e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.476 on 524 degrees of freedom
## Multiple R-squared: 0.1157, Adjusted R-squared: 0.114
## F-statistic: 68.54 on 1 and 524 DF, p-value: 1.042e-15
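The link between the regression and the difference in means can be checked on simulated data. In the sketch below (all values are made up), the intercept equals the mean of the group coded 0, the slope equals the difference in group means, and the slope's t value matches a pooled-variance t-test up to sign.

```r
set.seed(42)  # simulated data; parameter values are arbitrary illustrations
toy <- data.frame(female = rep(c(0, 1), times = c(60, 40)))
toy$wage <- 7 - 2.5 * toy$female + rnorm(100, sd = 3)

reg <- lm(wage ~ female, data = toy)
group_means <- tapply(toy$wage, toy$female, mean)

# Intercept = mean of the female == 0 group
all.equal(coef(reg)[["(Intercept)"]], group_means[["0"]])

# Slope = mean of the female == 1 group minus mean of the female == 0 group
all.equal(coef(reg)[["female"]], group_means[["1"]] - group_means[["0"]])

# With equal variances assumed, the two-sample t-test gives the same t value
# as the regression; the sign flips because t.test() takes group 0 minus group 1
pooled <- t.test(wage ~ female, data = toy, var.equal = TRUE)
t_reg  <- summary(reg)$coefficients["female", "t value"]
all.equal(unname(pooled$statistic), -t_reg)
```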