Categorical data place an individual into one of several possible categories
R calls these factors
Factor Variables in R
factor is a special type of character object class that indicates membership in a category (called a level)
Suppose I have data on students:
students %>% head(n = 5)
ID <dbl> | Rank <chr> | Grade <dbl> |
---|---|---|
1 | Sophomore | 77 |
2 | Senior | 72 |
3 | Freshman | 73 |
4 | Senior | 73 |
5 | Junior | 84 |
Rank is a character (<chr>) variable, just a string of text
Make Rank a factor variable, to indicate a student is a member of one of the possible categories: (freshman, sophomore, junior, senior)
students <- students %>%
  mutate(Rank = as.factor(Rank)) # overwrite and change class of Rank to factor
students %>% head(n = 5)
ID <dbl> | Rank <fct> | Grade <dbl> |
---|---|---|
1 | Sophomore | 77 |
2 | Senior | 72 |
3 | Freshman | 73 |
4 | Senior | 73 |
5 | Junior | 84 |
Rank is now a factor (<fct>)
# what are the categories?
students %>%
  group_by(Rank) %>%
  count()
Rank <fct> | n <int> |
---|---|
Freshman | 1 |
Junior | 4 |
Senior | 2 |
Sophomore | 3 |
# note the order is arbitrary (just alphabetical)! This is an "unordered" factor
If the categories have a natural order, make Rank an ordered (factor) variable
Specify the levels from 1st to last
students <- students %>%
  mutate(Rank = ordered(Rank, # overwrite and change class of Rank to ordered
                        # next, specify the levels, in order
                        levels = c("Freshman", "Sophomore", "Junior", "Senior")
                        )
         )
students %>% head(n = 5)
ID <dbl> | Rank <ord> | Grade <dbl> |
---|---|---|
1 | Sophomore | 77 |
2 | Senior | 72 |
3 | Freshman | 73 |
4 | Senior | 73 |
5 | Junior | 84 |
students %>%
  group_by(Rank) %>%
  count()
Rank <ord> | n <int> |
---|---|
Freshman | 1 |
Sophomore | 3 |
Junior | 4 |
Senior | 2 |
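The difference between a character vector, a factor, and an ordered factor can be seen without any packages. The sketch below uses a small made-up vector (not the students data above), and the object names are chosen only for illustration.

```r
# Hypothetical toy data, not the students data from the slides
rank_chr <- c("Sophomore", "Senior", "Freshman", "Senior", "Junior")

# as.factor() sorts the levels alphabetically
rank_fct <- as.factor(rank_chr)
levels(rank_fct)      # "Freshman" "Junior" "Senior" "Sophomore"
as.integer(rank_fct)  # integer codes behind the labels: 4 3 1 3 2

# ordered() lets us state the order of the levels explicitly
rank_ord <- ordered(rank_chr,
                    levels = c("Freshman", "Sophomore", "Junior", "Senior"))
levels(rank_ord)      # "Freshman" "Sophomore" "Junior" "Senior"

# With an ordered factor, comparisons between levels are meaningful
rank_ord < "Junior"   # TRUE FALSE TRUE FALSE FALSE
```

For an unordered factor the same comparison returns NA with a warning, which is one practical reason to declare the order when it exists.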
Example: How much higher wages, on average, do men earn compared to women?
Basic statistics: can test for statistically significant difference in group means with a t-test†, let:
$\bar{Y}_M$: average earnings of a sample of $n_M$ men
$\bar{Y}_W$: average earnings of a sample of $n_W$ women
Difference in group averages: $d = \bar{Y}_M - \bar{Y}_W$
The hypothesis test is: $H_0: d = 0$ versus $H_1: d \neq 0$
† See today’s class page for this example
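As a rough sketch of the test of a difference in group means, the block below computes $d$ and its t-statistic by hand and compares them with base R's t.test(). The wage numbers are invented, and the null of no difference ($d = 0$) with a two-sided alternative is assumed.

```r
# Hypothetical hourly wages, made up for illustration
wage_m <- c(7.5, 9.0, 6.2, 8.8, 10.1, 7.4)
wage_w <- c(5.1, 4.8, 6.0, 5.5, 4.2, 6.4)

# Difference in group averages: d = mean(men) - mean(women)
d <- mean(wage_m) - mean(wage_w)

# Two-sided test of H0: d = 0 (Welch version, the default in t.test())
test <- t.test(wage_m, wage_w)

# The t-statistic is d divided by its standard error
se_d <- sqrt(var(wage_m) / length(wage_m) + var(wage_w) / length(wage_w))
c(d = d, se = se_d, t = d / se_d, p = test$p.value)
```

The hand-computed t-statistic matches the one stored in test$statistic, since Welch's standard error is exactly the square root of the two variance-over-n terms.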
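If I plot a factor variable, e.g. Gender (which is either Male or Female), the scatterplot with wage looks like this
[Figure: scatterplot of wage against Gender]
R treats values of a factor variable as integers
In a regression this works out to "Female" = 0, "Male" = 1 (the stored integer codes are 1 and 2, assigned in alphabetical order of the levels)
Let’s make this more explicit by making a dummy variable to stand in for Gender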
In a regression, we can easily compare across groups via a dummy variable†
A dummy variable only takes the values 0 or 1: it equals 1 if a condition is TRUE and 0 if it is FALSE
It signifies whether an observation belongs to a category or not
† Also called a binary variable or dichotomous variable
Example:
$$\widehat{Wage}_i = \hat{\beta}_0 + \hat{\beta}_1 Female_i \quad \text{where } Female_i = \begin{cases} 1 & \text{if individual } i \text{ is Female} \\ 0 & \text{if individual } i \text{ is Male} \end{cases}$$
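A dummy is just a TRUE/FALSE condition stored as 1/0. Here is a minimal base R sketch, using an invented gender vector rather than the course data:

```r
# Hypothetical data, made up for illustration
gender <- c("Female", "Male", "Male", "Female", "Male")

# A logical condition is TRUE/FALSE; as.integer() turns it into 1/0
female <- as.integer(gender == "Female")
female   # 1 0 0 1 0

# ifelse() does the same thing more explicitly
female2 <- ifelse(gender == "Female", 1, 0)

# Each observation either belongs to the category (1) or does not (0)
table(gender, female)
```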
Female is our dummy x-variable
Hard to see relationships because of overplotting
Tip: use geom_jitter() instead of geom_point() to randomly nudge points to see them better!
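$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 D_i \quad \text{where } D_i = \{0, 1\}$$
When $D_i = 0$: $E[Y_i | D_i = 0] = \hat{\beta}_0$
When $D_i = 1$: $E[Y_i | D_i = 1] = \hat{\beta}_0 + \hat{\beta}_1$
Difference in group means:
$$E[Y_i | D_i = 1] - E[Y_i | D_i = 0] = (\hat{\beta}_0 + \hat{\beta}_1) - (\hat{\beta}_0) = \hat{\beta}_1$$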
Example:
$$\widehat{Wage}_i = \hat{\beta}_0 + \hat{\beta}_1 Female_i$$
where $Female_i = \begin{cases} 1 & \text{if } i \text{ is Female} \\ 0 & \text{if } i \text{ is Male} \end{cases}$
Mean wage for men: $E[Wage | Female = 0] = \hat{\beta}_0$
Mean wage for women: $E[Wage | Female = 1] = \hat{\beta}_0 + \hat{\beta}_1$
Difference in wage between men & women: $\hat{\beta}_1$
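The claim that the intercept is the mean of the $D = 0$ group and the slope is the difference in group means can be checked numerically. This sketch uses eight invented observations, not the wage data used in the slides.

```r
# Hypothetical data, made up for illustration
wage   <- c(5.1, 7.5, 9.0, 4.8, 6.2, 6.0, 8.8, 5.5)
female <- c(1,   0,   0,   1,   0,   1,   0,   1)

reg <- lm(wage ~ female)
b0 <- unname(coef(reg)["(Intercept)"])
b1 <- unname(coef(reg)["female"])

mean_men   <- mean(wage[female == 0])
mean_women <- mean(wage[female == 1])

# Intercept = mean for the D = 0 group; slope = difference in group means
all.equal(b0, mean_men)
all.equal(b1, mean_women - mean_men)
```

Both comparisons return TRUE: with a single dummy regressor, OLS simply reproduces the two group averages.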
# %>%, filter(), and summarize() come from the tidyverse
library(tidyverse)
# the data come from the wooldridge package
# install.packages("wooldridge")
library(wooldridge)
# data is called "wage1", save as a dataframe I'll call "wages"
wages <- wage1
wages %>% head()
# Summarize for Men
wages %>%
  filter(female == 0) %>%
  summarize(mean = mean(wage),
            sd = sd(wage))
mean <dbl> | sd <dbl> |
---|---|
7.099489 | 4.160858 |
# Summarize for Women
wages %>%
  filter(female == 1) %>%
  summarize(mean = mean(wage),
            sd = sd(wage))
mean <dbl> | sd <dbl> |
---|---|
4.587659 | 2.529363 |
femalereg <- lm(wage ~ female, data = wages)
summary(femalereg)
##
## Call:
## lm(formula = wage ~ female, data = wages)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.5995 -1.8495 -0.9877  1.4260 17.8805
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   7.0995     0.2100  33.806  < 2e-16 ***
## female       -2.5118     0.3034  -8.279 1.04e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.476 on 524 degrees of freedom
## Multiple R-squared:  0.1157, Adjusted R-squared:  0.114
## F-statistic: 68.54 on 1 and 524 DF,  p-value: 1.042e-15
library(broom)
tidy(femalereg)
term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
---|---|---|---|---|
(Intercept) | 7.099489 | 0.2100082 | 33.805777 | 8.971839e-134 |
female | -2.511830 | 0.3034092 | -8.278688 | 1.041764e-15 |
From tabulation of group means
Gender | Avg. Wage | Std. Dev. | n |
---|---|---|---|
Female | 4.59 | 2.53 | 252 |
Male | 7.10 | 4.16 | 274 |
Difference | 2.51 | 0.30 | − |
From t-test of difference in group means
term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
---|---|---|---|---|
(Intercept) | 7.099489 | 0.2100082 | 33.805777 | 8.971839e-134 |
female | -2.511830 | 0.3034092 | -8.278688 | 1.041764e-15 |
$$\widehat{Wage}_i = 7.10 - 2.51 \, Female_i$$
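The t-statistic on a dummy in a simple regression is the same number as the classic equal-variance (pooled) two-sample t-test. The sketch below shows this on ten invented observations; var.equal = TRUE is what makes the match exact, since R's default Welch test uses a slightly different standard error.

```r
# Hypothetical data, made up for illustration
wage   <- c(5.1, 7.5, 9.0, 4.8, 6.2, 6.0, 8.8, 5.5, 10.1, 4.2)
female <- c(1,   0,   0,   1,   0,   1,   0,   1,   0,    1)

# t-statistic on the dummy from the regression
reg_t <- summary(lm(wage ~ female))$coefficients["female", "t value"]

# Pooled-variance t-test of mean(female == 1) minus mean(female == 0)
tt <- t.test(wage[female == 1], wage[female == 0], var.equal = TRUE)

c(regression = reg_t, t_test = unname(tt$statistic))
```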
Example:
$$\widehat{Wage}_i = \hat{\beta}_0 + \hat{\beta}_1 Male_i \quad \text{where } Male_i = \begin{cases} 1 & \text{if person } i \text{ is Male} \\ 0 & \text{if person } i \text{ is Female} \end{cases}$$
wages <- wages %>%
  mutate(male = ifelse(female == 0, # condition: is female equal to 0?
                       yes = 1, # if true: code as "1"
                       no = 0)) # if false: code as "0"
# verify it worked
wages %>%
  select(wage, female, male) %>%
  head()
 | wage <dbl> | female <int> | male <dbl> |
---|---|---|---|
1 | 3.10 | 1 | 0 |
2 | 3.24 | 1 | 0 |
3 | 3.00 | 0 | 1 |
4 | 6.00 | 0 | 1 |
5 | 5.30 | 0 | 1 |
6 | 8.75 | 0 | 1 |
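Example:
$$\widehat{Wage}_i = \hat{\beta}_0 + \hat{\beta}_1 Male_i$$
where $Male_i = \begin{cases} 1 & \text{if } i \text{ is Male} \\ 0 & \text{if } i \text{ is Female} \end{cases}$
Mean wage for men: $E[Wage | Male = 1] = \hat{\beta}_0 + \hat{\beta}_1$
Mean wage for women: $E[Wage | Male = 0] = \hat{\beta}_0$
Difference in wage between men & women: $\hat{\beta}_1$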
malereg <- lm(wage ~ male, data = wages)
summary(malereg)
##
## Call:
## lm(formula = wage ~ male, data = wages)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.5995 -1.8495 -0.9877  1.4260 17.8805
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   4.5877     0.2190  20.950  < 2e-16 ***
## male          2.5118     0.3034   8.279 1.04e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.476 on 524 degrees of freedom
## Multiple R-squared:  0.1157, Adjusted R-squared:  0.114
## F-statistic: 68.54 on 1 and 524 DF,  p-value: 1.042e-15
library(broom)
tidy(malereg)
term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
---|---|---|---|---|
(Intercept) | 4.587659 | 0.2189834 | 20.949802 | 3.012371e-71 |
male | 2.511830 | 0.3034092 | 8.278688 | 1.041764e-15 |
 | (1) | (2) |
---|---|---|
Constant | 4.59 *** | 7.10 *** |
 | (0.22) | (0.21) |
Female | | -2.51 *** |
 | | (0.30) |
Male | 2.51 *** | |
 | (0.30) | |
N | 526 | 526 |
R-Squared | 0.12 | 0.12 |
SER | 3.48 | 3.48 |
*** p < 0.001; ** p < 0.01; * p < 0.05.
Note it doesn't matter if we use male or female: either way, males earn $2.51 more than females on average
Compare the constant (average for the D=0 group): 4.59 is the average for women in (1), 7.10 is the average for men in (2)
Should you use male AND female? We'll come to that...
Example: How do wages vary by region of the country? Let $Region_i = \{Northeast, Midwest, South, West\}$
$$\widehat{Wages}_i = \hat{\beta}_0 + \hat{\beta}_1 Region_i$$
Code region numerically: $Region_i = \begin{cases} 1 & \text{if } i \text{ is in Northeast} \\ 2 & \text{if } i \text{ is in Midwest} \\ 3 & \text{if } i \text{ is in South} \\ 4 & \text{if } i \text{ is in West} \end{cases}$
$$\widehat{Wages}_i = \hat{\beta}_0 + \hat{\beta}_1 Region_i$$
Coding region as 1 to 4 treats an arbitrary ordering as a quantity: $\hat{\beta}_1$ would force the same change in wages for each step from one region to the next
Create a dummy variable for each region ($Northeast_i = 1$ if $i$ is in the Northeast and 0 otherwise, and likewise for $Midwest_i$, $South_i$, and $West_i$):
$$\widehat{Wages}_i = \hat{\beta}_0 + \hat{\beta}_1 Northeast_i + \hat{\beta}_2 Midwest_i + \hat{\beta}_3 South_i + \hat{\beta}_4 West_i$$
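Example: $\widehat{Wages}_i = \hat{\beta}_0 + \hat{\beta}_1 Northeast_i + \hat{\beta}_2 Midwest_i + \hat{\beta}_3 South_i + \hat{\beta}_4 West_i$
If we include every category, the dummies are perfectly multicollinear with the constant, because they always sum to one:
$$Northeast_i + Midwest_i + South_i + West_i = 1 \quad \forall i$$
To avoid the dummy variable trap, always omit one category from the regression, known as the “reference category”
It does not matter which category we omit!
The coefficient on each dummy variable measures the difference in the average outcome between that category and the reference category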
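One way to see the dummy variable trap is to check the column rank of the design matrix. The sketch below uses eight invented observations and base R's qr(). With a constant plus all four dummies there are five columns but only four independent ones.

```r
# Hypothetical regions for eight people, made up for illustration
region <- c("Northeast", "Midwest", "South", "West",
            "South", "West", "Northeast", "Midwest")

# One dummy per region
northeast <- as.integer(region == "Northeast")
midwest   <- as.integer(region == "Midwest")
south     <- as.integer(region == "South")
west      <- as.integer(region == "West")

# The four dummies sum to 1 for everyone, which duplicates the constant
northeast + midwest + south + west

# Design matrix with a constant and all four dummies: 5 columns, rank 4
X_all <- cbind(1, northeast, midwest, south, west)
c(columns = ncol(X_all), rank = qr(X_all)$rank)

# Dropping one dummy (here west) restores full column rank: 4 columns, rank 4
X_drop <- cbind(1, northeast, midwest, south)
c(columns = ncol(X_drop), rank = qr(X_drop)$rank)
```

When lm() meets the rank-deficient version, it reports NA for one of the coefficients instead of failing.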
Example: $\widehat{Wages}_i = \hat{\beta}_0 + \hat{\beta}_1 Northeast_i + \hat{\beta}_2 Midwest_i + \hat{\beta}_3 South_i$
$West_i$ is omitted (arbitrarily chosen)
$\hat{\beta}_0$: average wage for $i$ in the West
$\hat{\beta}_1$: difference in average wage between the Northeast and the West (Northeast minus West)
$\hat{\beta}_2$: difference in average wage between the Midwest and the West (Midwest minus West)
$\hat{\beta}_3$: difference in average wage between the South and the West (South minus West)
lm(wage ~ noreast + northcen + south + west, data = wages) %>% summary()
##
## Call:
## lm(formula = wage ~ noreast + northcen + south + west, data = wages)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -6.083 -2.387 -1.097  1.157 18.610
##
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   6.6134     0.3891  16.995  < 2e-16 ***
## noreast      -0.2436     0.5154  -0.473  0.63664
## northcen     -0.9029     0.5035  -1.793  0.07352 .
## south        -1.2265     0.4728  -2.594  0.00974 **
## west              NA         NA      NA       NA
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.671 on 522 degrees of freedom
## Multiple R-squared:  0.0175, Adjusted R-squared:  0.01185
## F-statistic: 3.099 on 3 and 522 DF,  p-value: 0.02646
# let's run 4 regressions, each one we omit a different region
no_noreast_reg <- lm(wage ~ northcen + south + west, data = wages)
no_northcen_reg <- lm(wage ~ noreast + south + west, data = wages)
no_south_reg <- lm(wage ~ noreast + northcen + west, data = wages)
no_west_reg <- lm(wage ~ noreast + northcen + south, data = wages)
# now make an output table
library(huxtable)
huxreg(no_noreast_reg,
       no_northcen_reg,
       no_south_reg,
       no_west_reg,
       coefs = c("Constant" = "(Intercept)",
                 "Northeast" = "noreast",
                 "Midwest" = "northcen",
                 "South" = "south",
                 "West" = "west"),
       statistics = c("N" = "nobs",
                      "R-Squared" = "r.squared",
                      "SER" = "sigma"),
       number_format = 3)
 | (1) | (2) | (3) | (4) |
---|---|---|---|---|
Constant | 6.370 *** | 5.710 *** | 5.387 *** | 6.613 *** |
 | (0.338) | (0.320) | (0.268) | (0.389) |
Northeast | | 0.659 | 0.983 * | -0.244 |
 | | (0.465) | (0.432) | (0.515) |
Midwest | -0.659 | | 0.324 | -0.903 |
 | (0.465) | | (0.417) | (0.504) |
South | -0.983 * | -0.324 | | -1.226 ** |
 | (0.432) | (0.417) | | (0.473) |
West | 0.244 | 0.903 | 1.226 ** | |
 | (0.515) | (0.504) | (0.473) | |
N | 526 | 526 | 526 | 526 |
R-Squared | 0.017 | 0.017 | 0.017 | 0.017 |
SER | 3.671 | 3.671 | 3.671 | 3.671 |
*** p < 0.001; ** p < 0.01; * p < 0.05.
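Constant is always the average wage for the reference (omitted) region
Compare coefficients between Midwest in (1) and Northeast in (2): -0.659 and 0.659, the same gap with the sign flipped
Compare coefficients between West in (3) and South in (4): 1.226 and -1.226
Does not matter which region we omit! The fit (R-Squared, SER) is identical; only the comparison group changes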
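If region is stored as a single factor rather than four hand-made dummies, lm() builds the dummies itself and omits the first level. Base R's relevel() then chooses a different reference category. The sketch below uses invented data, and the names region_w, reg_midwest, and reg_west are illustrative only.

```r
# Hypothetical data, made up for illustration
wage   <- c(6.5, 5.8, 5.2, 6.9, 5.0, 7.1, 6.1, 5.5)
region <- factor(c("Northeast", "Midwest", "South", "West",
                   "South", "West", "Northeast", "Midwest"))

# lm() creates the dummies itself and omits the first level (Midwest)
reg_midwest <- lm(wage ~ region)

# relevel() picks a different reference category
region_w <- relevel(region, ref = "West")
reg_west <- lm(wage ~ region_w)

coef(reg_midwest)
coef(reg_west)

# The fitted values are identical; only the comparison group changes
all.equal(unname(fitted(reg_midwest)), unname(fitted(reg_west)))
```

The West coefficient in the first model and the Midwest coefficient in the second are the same gap with opposite signs, mirroring the pattern in the table above.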
Example: $\widehat{Admitted}_i = \hat{\beta}_0 + \hat{\beta}_1 GPA_i$ where $Admitted_i = \begin{cases} 1 & \text{if } i \text{ is Admitted} \\ 0 & \text{if } i \text{ is Not Admitted} \end{cases}$
Requires special tools to properly interpret and extend this (logit, probit, etc.)
Feel free to write papers that have dummy Y variables (but you may have to ask me some more questions)!
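To give a rough sense of why a dummy Y variable needs extra care, the sketch below fits both an OLS regression and a logit (one of the tools named above) to ten invented applicants. It is an illustration under made-up data, not the course's worked example; glm() with family = binomial is the standard base R call for a logit.

```r
# Hypothetical applicants, made up for illustration
gpa      <- c(2.1, 2.5, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 3.9, 4.0)
admitted <- c(0,   0,   1,   0,   0,   1,   1,   0,   1,   1)

# OLS on a 0/1 outcome: fitted values are read as probabilities,
# but nothing keeps them between 0 and 1
lpm <- lm(admitted ~ gpa)

# Logit: models the probability with a logistic curve instead
logit <- glm(admitted ~ gpa, family = binomial(link = "logit"))

# Predicted probability of admission at a few GPAs under each model
new_gpa <- data.frame(gpa = c(2.0, 3.0, 4.0))
cbind(new_gpa,
      lpm   = predict(lpm, newdata = new_gpa),
      logit = predict(logit, newdata = new_gpa, type = "response"))
```

The logit predictions always fall strictly between 0 and 1, whereas the OLS predictions are a straight line in GPA and can leave that range for extreme values.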