3.6 — Regression with Categorical Data — Class Content

Thursday, November 4, 2021

Overview

Today we look at how to use categorical data (i.e. variables that indicate an observation’s membership in a particular group or category). We introduce these variables into regression models as dummy variables that equal either 0 or 1, where 1 indicates membership in a category and 0 indicates non-membership.

We also look at what happens when categorical variables have more than two values: for regression, we create a dummy variable for each possible category, but include all of them except one in the model. The omitted category serves as the reference category, and leaving it out avoids the dummy variable trap.
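
As a rough illustration of the reference category idea, here is a minimal sketch in base R. The `toy` data frame and its `region` variable are made up for this example and do not come from the class data. It shows the dummies R builds for a three-category factor, and what happens if we force in a dummy for every category alongside the intercept.

```r
# Hypothetical toy data: a categorical variable with three categories
toy <- data.frame(
  wage   = c(10, 12, 9, 15, 11, 14),
  region = factor(c("north", "south", "west", "north", "south", "west"))
)

# model.matrix() shows the design matrix lm() would build from a factor:
# an intercept plus one dummy per category EXCEPT the reference ("north")
X <- model.matrix(~ region, data = toy)
print(X)

# The dummy variable trap: hand-made dummies for ALL three categories
toy$north <- as.numeric(toy$region == "north")
toy$south <- as.numeric(toy$region == "south")
toy$west  <- as.numeric(toy$region == "west")

# north + south + west always sums to 1, which duplicates the intercept,
# so lm() cannot estimate one of the coefficients and reports it as NA
trap <- lm(wage ~ north + south + west, data = toy)
print(coef(trap))
```

By default R treats the first factor level (alphabetically, unless you reorder the levels) as the reference category, so each remaining coefficient is read as a difference relative to that group.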

Readings

Ch. 6.1–6.2 in Bailey, Real Econometrics

Slides

Below, you can find the slides in two formats. Clicking the image will bring you to the html version of the slides in a new tab. Note that while going through the slides, you can type h to see a list of viewing options, and type o for an outline view of all the slides.

The lower button will allow you to download a PDF version of the slides. I suggest printing the slides beforehand and using them to take additional notes in class (not everything is in the slides)!


Assignments

Problem Set 4 Due Tues Nov 9

Problem Set 4 is due by the end of the day on Tuesday, November 9.

Appendix: T-Test for Difference in Group Means

Often we want to compare the means between two groups, and see if the difference is statistically significant. As an example, is there a statistically significant difference in average hourly earnings between men and women? Let:

$μ_{W}$ : mean hourly earnings for female college graduates
$μ_{M}$ : mean hourly earnings for male college graduates

We want to run a hypothesis test for the difference in these two population means, $d = μ_{M} - μ_{W}$, where $d_{0}$ denotes the value of $d$ under the null hypothesis.

Our null hypothesis is that there is no difference between the two population means. Let’s also have a two-sided alternative hypothesis, simply that there is a difference (positive or negative).

$H_{0} : d = 0$
$H_{1} : d \neq 0$

Note that a logical one-sided alternative would be $H_{2} : d > 0$, i.e. men earn more than women on average.

The Sampling Distribution of $d$

The true population means $μ_{M}, μ_{W}$ are unknown, so we must estimate them from samples of men and women. Let:

- ${\bar{Y}}_{M}$ : the average earnings of a sample of $n_{M}$ men
- ${\bar{Y}}_{W}$ : the average earnings of a sample of $n_{W}$ women

We then estimate $(μ_{M} - μ_{W})$ with the sample difference $({\bar{Y}}_{M} - {\bar{Y}}_{W})$ .

We would then run a t-test and calculate the test-statistic for the difference in means. The formula for the test statistic is:

$t = \frac{({\bar{Y}}_{M} - {\bar{Y}}_{W}) - d_{0}}{\sqrt{\frac{s_{M}^{2}}{n_{M}} + \frac{s_{W}^{2}}{n_{W}}}}$

Here $s_{M}$ and $s_{W}$ are the sample standard deviations of earnings for men and women.

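We then compare $t$ against the critical value $t^{*}$ , or calculate the $p$ -value as usual, to determine if we have sufficient evidence to reject $H_{0}$ . For the two-sided alternative $H_{1}$ , the $p$ -value is $2 \times P (T > |t|)$ ; for the one-sided alternative $H_{2}$ , it is $P (T > t)$ .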
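
The two decision rules (critical value versus $p$ -value) can be sketched in base R as follows. The test statistic, degrees of freedom, and significance level below are hypothetical values chosen only for illustration; they are not taken from the wage data.

```r
# Hypothetical inputs for illustration only (not from the wage data)
t_stat <- 2.1    # an assumed test statistic
df     <- 50     # assumed degrees of freedom
alpha  <- 0.05   # significance level

# Two-sided critical value t*: reject H0 if |t| > t*
t_crit <- qt(1 - alpha / 2, df)

# Two-sided p-value: probability in BOTH tails beyond |t|
p_two <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

# One-sided p-value for H2: d > 0 (upper tail only)
p_one <- pt(t_stat, df, lower.tail = FALSE)

# Both rules lead to the same decision for the two-sided test
reject_by_critical_value <- abs(t_stat) > t_crit
reject_by_p_value        <- p_two < alpha

print(c(t_crit = t_crit, p_two = p_two, p_one = p_one))
print(c(reject_by_critical_value, reject_by_p_value))
```

Because the $t$ distribution is symmetric, the one-sided $p$ -value is half of the two-sided one whenever the estimate lies on the side the alternative predicts.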

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(wooldridge)
# Our data comes from wage1 in the wooldridge package

wages <- wage1

# look at average wage for men

wages %>%
  filter(female == 0) %>%
  summarize(average = mean(wage),
            sd = sd(wage))

##    average       sd
## 1 7.099489 4.160858

# look at average wage for women

wages %>%
  filter(female == 1) %>%
  summarize(average = mean(wage),
            sd = sd(wage))

##    average       sd
## 1 4.587659 2.529363

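So our data is telling us the sample mean and standard deviation of hourly earnings for each group. If we treat each group’s wages as roughly normal, writing $N (\text{mean}, \text{standard deviation})$ , they are distributed as such:

$\begin{aligned} Y_{M} & \sim N (7.10, 4.16) \\ Y_{W} & \sim N (4.59, 2.53) \end{aligned}$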

We can plot this to see it visually. There is a lot of overlap in the two distributions, but the male average is higher than the female average. There is also a lot more variation among men than among women; notably, the male distribution extends further to the right.

wages$female <- as.factor(wages$female)

ggplot(data = wages)+
  aes(x = wage,
      fill = female)+
  geom_density(alpha = 0.5)+
  scale_x_continuous(breaks = seq(0,25,5),
                     name = "Wage",
                     labels = scales::dollar)+
  theme_light()

Knowing the distributions of male and female hourly earnings, we can estimate the sampling distribution of the difference in group means between men and women as:

The mean: $\begin{aligned} \bar{d} & = {\bar{Y}}_{M} - {\bar{Y}}_{W} \\ \bar{d} & = 7.10 - 4.59 \\ \bar{d} & = 2.51 \end{aligned}$

The standard error of the mean: $\begin{aligned} S E (\bar{d}) & = \sqrt{\frac{s_{M}^{2}}{n_{M}} + \frac{s_{W}^{2}}{n_{W}}} \\ & = \sqrt{\frac{{4.16}^{2}}{274} + \frac{{2.53}^{2}}{252}} \\ & \approx 0.30 \end{aligned}$

So the sampling distribution of the difference in group means is approximately: $\bar{d} \sim N (2.51, 0.30)$

ggplot(data = data.frame(x = 0:6))+
  aes(x = x)+
  stat_function(fun = dnorm, args = list(mean = 2.51, sd = 0.30), color = "purple")+
  labs(x = "Wage Difference",
       y = "Density")+
  scale_x_continuous(breaks = seq(0,6,1),
                     labels = scales::dollar)+
  theme_light()

Now we run the $t$ -test like any other. To limit rounding error, we carry an extra digit here ( $\bar{d} \approx 2.512$ and $S E (\bar{d}) \approx 0.2976$ ):

$\begin{aligned} t & = \frac{\text{estimate} - \text{null hypothesis}}{\text{standard error of the estimate}} \\ & = \frac{\bar{d} - 0}{S E (\bar{d})} \\ & = \frac{2.512 - 0}{0.2976} \\ & \approx 8.44 \end{aligned}$

This is statistically significant. The two-sided $p$ -value, $2 \times P (T > 8.44)$ , is about $4.2 \times 10^{-16}$ , or basically, 0.

# 456.33 is the Welch degrees of freedom reported by t.test()
2 * pt(8.44, 456.33, lower.tail = FALSE)
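
As a check on the hand calculation, the sketch below recomputes the standard error, the $t$ statistic, and the degrees of freedom directly from the group summary statistics reported earlier (the means and standard deviations from the `summarize()` output, with $n_{M} = 274$ and $n_{W} = 252$ ). The degrees of freedom use the Welch-Satterthwaite approximation, which is where a value like the 456.33 passed to `pt()` comes from. The variable names are my own; the results can be compared with the `t.test()` output in the next section.

```r
# Summary statistics reported in the output above (wage1 data)
mean_m <- 7.099489; sd_m <- 4.160858; n_m <- 274
mean_w <- 4.587659; sd_w <- 2.529363; n_w <- 252

# Each group's contribution to the variance of the difference in means
var_m <- sd_m^2 / n_m
var_w <- sd_w^2 / n_w

d_bar  <- mean_m - mean_w          # estimated difference in means
se_d   <- sqrt(var_m + var_w)      # standard error of the difference
t_stat <- (d_bar - 0) / se_d       # test statistic under H0: d = 0

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (var_m + var_w)^2 /
  (var_m^2 / (n_m - 1) + var_w^2 / (n_w - 1))

# Two-sided p-value
p_two_sided <- 2 * pt(abs(t_stat), df_welch, lower.tail = FALSE)

print(round(c(d_bar = d_bar, se = se_d, t = t_stat, df = df_welch), 3))
print(p_two_sided)
```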

The $t$ -test in `R`

t.test(wage ~ female, data = wages, var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  wage by female
## t = 8.44, df = 456.33, p-value = 4.243e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  1.926971 3.096690
## sample estimates:
## mean in group 0 mean in group 1 
##        7.099489        4.587659
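
The object returned by `t.test()` can also be stored and queried piece by piece, and the one-sided alternative mentioned earlier ( $H_{2} : d > 0$ ) is available through the `alternative` argument. The sketch below uses simulated data so that it runs without the `wooldridge` package; the `sim` data frame is hypothetical (normal draws loosely based on the group means and standard deviations above), so its numbers will not match the real output.

```r
set.seed(42)

# Hypothetical simulated data, NOT the wage1 sample
sim <- data.frame(
  female = factor(rep(c(0, 1), times = c(274, 252))),
  wage   = c(rnorm(274, mean = 7.1, sd = 4.2),
             rnorm(252, mean = 4.6, sd = 2.5))
)

# Two-sided Welch test (H1: d != 0), as in the call above
two_sided <- t.test(wage ~ female, data = sim, var.equal = FALSE)

# One-sided Welch test (H2: d > 0), where d = mean(group 0) - mean(group 1)
one_sided <- t.test(wage ~ female, data = sim, var.equal = FALSE,
                    alternative = "greater")

# Pull individual pieces out of the result
print(two_sided$statistic)   # t statistic
print(two_sided$parameter)   # Welch degrees of freedom
print(two_sided$p.value)     # two-sided p-value
print(two_sided$conf.int)    # 95% confidence interval for d
print(one_sided$p.value)     # one-sided p-value
```

With a formula interface, the difference is taken as the first factor level minus the second, so with `female` coded 0/1 the test is about men minus women.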

reg <- lm(wage~female, data = wages)
summary(reg)

## 
## Call:
## lm(formula = wage ~ female, data = wages)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5995 -1.8495 -0.9877  1.4260 17.8805 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.0995     0.2100  33.806  < 2e-16 ***
## female1      -2.5118     0.3034  -8.279 1.04e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.476 on 524 degrees of freedom
## Multiple R-squared:  0.1157, Adjusted R-squared:  0.114 
## F-statistic: 68.54 on 1 and 524 DF,  p-value: 1.042e-15
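
The regression recovers the same difference in means: the intercept (7.0995) is the male average and the coefficient on `female1` (-2.5118) is the female average minus the male average. Its $t$ value (-8.279) differs slightly from the Welch statistic (8.44) because `lm()` by default assumes a common error variance for both groups, which corresponds to the pooled two-sample $t$ -test ( `var.equal = TRUE` ) rather than the Welch test. The sketch below illustrates this equivalence on hypothetical simulated data (again, not the `wage1` sample), so it runs without extra packages.

```r
set.seed(42)

# Hypothetical simulated data, NOT the wage1 sample
sim <- data.frame(
  female = rep(c(0, 1), times = c(274, 252)),
  wage   = c(rnorm(274, mean = 7.1, sd = 4.2),
             rnorm(252, mean = 4.6, sd = 2.5))
)

# Regression on the dummy variable
reg <- lm(wage ~ female, data = sim)

# Pooled (equal-variance) two-sample t-test
pooled <- t.test(wage ~ female, data = sim, var.equal = TRUE)

# The slope equals mean(women) - mean(men)
slope     <- unname(coef(reg)["female"])
mean_diff <- unname(diff(tapply(sim$wage, sim$female, mean)))
print(c(slope = slope, mean_diff = mean_diff))

# The t statistics agree up to sign: t.test() reports men minus women,
# while the regression slope is women minus men
t_reg    <- summary(reg)$coefficients["female", "t value"]
t_pooled <- unname(pooled$statistic)
print(c(t_reg = t_reg, t_pooled = t_pooled))
```

If the two groups have clearly different variances, as the wage data suggest, the Welch version is generally the safer default for the comparison of means.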

Last updated on Nov 4, 2021