gapminder example:

$$\widehat{\text{Life Expectancy}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{GDP}_i$$

$$\widehat{\text{Life Expectancy}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{GDP}_i + \hat{\beta}_2 \text{GDP}_i^2$$

$$\widehat{\text{Life Expectancy}}_i = \hat{\beta}_0 + \hat{\beta}_1 \ln \text{GDP}_i$$

Another useful model for nonlinear data is the logarithmic model†

The logarithmic model has two additional advantages
† Don’t confuse this with a logistic (logit) model for dependent dummy variables.


The exponential function, $Y = e^X$ or $Y = \exp(X)$, where base $e = 2.71828...$

The natural logarithm is its inverse, $Y = \ln(X)$, with useful properties:

$$\ln\left(\tfrac{1}{x}\right) = -\ln(x)$$
$$\ln(ab) = \ln(a) + \ln(b)$$
$$\ln\left(\tfrac{x}{a}\right) = \ln(x) - \ln(a)$$
$$\ln(x^a) = a\ln(x)$$
$$\frac{d \ln x}{dx} = \frac{1}{x}$$
$$\underbrace{\ln(x + \Delta x) - \ln(x)}_{\text{Difference in logs}} \approx \underbrace{\frac{\Delta x}{x}}_{\text{Relative change}}$$

Example: Let $x = 100$ and $\Delta x = 1$; the relative change is:

$$\frac{\Delta x}{x} = \frac{(101 - 100)}{100} = 0.01 \text{ or } 1\%$$
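A quick numerical check of this approximation (a Python sketch of the arithmetic; the slides' data work is in R):

```python
import math

x, dx = 100, 1
diff_in_logs = math.log(x + dx) - math.log(x)  # ln(101) - ln(100)
relative_change = dx / x                        # 0.01, i.e. 1%
print(diff_in_logs)  # ~0.00995, very close to 0.01
```

The difference in logs (0.00995) and the relative change (0.01) agree to within about half a percent of each other, which is why "difference in logs ≈ percent change" is a safe reading for small changes.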
$$\epsilon_{Y,X} = \frac{\%\Delta Y}{\%\Delta X} = \frac{\left(\frac{\Delta Y}{Y}\right)}{\left(\frac{\Delta X}{X}\right)}$$
One of the (many) reasons why economists love Cobb-Douglas functions: $Y = AL^{\alpha}K^{\beta}$

Taking logs, the relationship becomes linear:

$$\ln(Y) = \ln(A) + \alpha\ln(L) + \beta\ln(K)$$
Example: Cobb-Douglas production function: $Y = 2L^{0.75}K^{0.25}$

$$\ln Y = \ln 2 + 0.75 \ln L + 0.25 \ln K$$

A 1% change in $L$ will yield a 0.75% change in output $Y$

A 1% change in $K$ will yield a 0.25% change in output $Y$
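The log-linearization above can be verified numerically; a small Python check with illustrative values $L = 16$, $K = 81$ (chosen for convenience, not from the slides):

```python
import math

A, alpha, beta = 2, 0.75, 0.25
L, K = 16, 81

Y = A * L**alpha * K**beta  # Cobb-Douglas output, = 2 * 8 * 3 ~ 48
# the log-linearized form gives exactly the same value of ln(Y)
logY = math.log(A) + alpha * math.log(L) + beta * math.log(K)
print(Y, math.log(Y), logY)
```

Both sides agree to floating-point precision, confirming that the multiplicative Cobb-Douglas form and its log-linear form are the same relationship.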
R's `log()` function can easily take the logarithm:

```r
gapminder <- gapminder %>%
  mutate(loggdp = log(gdpPercap)) # log GDP per capita

gapminder %>% head() # look at it
```

| country <fct> | continent <fct> | year <int> | lifeExp <dbl> | pop <int> | gdpPercap <dbl> | loggdp <dbl> |
|---|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 | 6.658583 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 | 6.710344 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 | 6.748878 |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 | 6.728864 |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 | 6.606625 |
| Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 | 6.667101 |
`log()` by default is the natural logarithm $\ln()$, i.e. base $e$; you can set another base with `log(x, base = 5)`, or use the shortcuts `log10()` and `log2()`:

```r
log10(100)
## [1] 2
log2(16)
## [1] 4
log(19683, base = 3)
## [1] 9
```

Regress on the pre-computed log variable:

```r
lm(lifeExp ~ loggdp, data = gapminder) %>% tidy()
```
| term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
|---|---|---|---|---|
| (Intercept) | -9.100889 | 1.227674 | -7.413117 | 1.934812e-13 |
| loggdp | 8.405085 | 0.148762 | 56.500206 | 0.000000e+00 |
Or put `log()` directly around the variable in the regression:

```r
lm(lifeExp ~ log(gdpPercap), data = gapminder) %>% tidy()
```

| term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
|---|---|---|---|---|
| (Intercept) | -9.100889 | 1.227674 | -7.413117 | 1.934812e-13 |
| log(gdpPercap) | 8.405085 | 0.148762 | 56.500206 | 0.000000e+00 |
Linear-log model: $Y_i = \beta_0 + \beta_1 \ln X_i$

Log-linear model: $\ln Y_i = \beta_0 + \beta_1 X_i$

Log-log model: $\ln Y_i = \beta_0 + \beta_1 \ln X_i$
$$Y = \beta_0 + \beta_1 \ln X_i$$

$$\beta_1 = \frac{\Delta Y}{\left(\frac{\Delta X}{X}\right)}$$
```r
lin_log_reg <- lm(lifeExp ~ loggdp, data = gapminder)

library(broom)
lin_log_reg %>% tidy()
```

| term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
|---|---|---|---|---|
| (Intercept) | -9.100889 | 1.227674 | -7.413117 | 1.934812e-13 |
| loggdp | 8.405085 | 0.148762 | 56.500206 | 0.000000e+00 |

$$\widehat{\text{Life Expectancy}}_i = -9.10 + 8.41 \ln \text{GDP}_i$$

A 1% change in GDP → a $\frac{8.41}{100} = 0.0841$ year increase in Life Expectancy

A 25% fall in GDP → a $(-25 \times 0.0841) = 2.1025$ year decrease in Life Expectancy

A 100% rise in GDP → a $(100 \times 0.0841) = 8.41$ year increase in Life Expectancy
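The linear-log interpretations above are just rescalings of the slope; a Python sketch of the arithmetic using the estimate reported above:

```python
b1 = 8.405085  # slope on log GDP from the slides' linear-log regression

per_1pct = b1 / 100            # years of life expectancy per 1% GDP change
fall_25 = -25 * per_1pct       # effect of a 25% GDP fall, in years
print(round(per_1pct, 4))      # 0.0841
print(round(fall_25, 2))       # -2.1
```

Dividing by 100 converts the "per log unit" slope into a "per 1%" effect; multiplying by the percent change then scales it up or down.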
```r
ggplot(data = gapminder)+
  aes(x = gdpPercap, y = lifeExp)+
  geom_point(color = "blue", alpha = 0.5)+
  geom_smooth(method = "lm", formula = y ~ log(x), color = "orange")+
  scale_x_continuous(labels = scales::dollar, breaks = seq(0, 120000, 20000))+
  scale_y_continuous(breaks = seq(0, 100, 10), limits = c(0, 100))+
  labs(x = "GDP per Capita", y = "Life Expectancy (Years)")+
  ggthemes::theme_pander(base_family = "Fira Sans Condensed", base_size = 16)
```

```r
ggplot(data = gapminder)+
  aes(x = loggdp, y = lifeExp)+
  geom_point(color = "blue", alpha = 0.5)+
  geom_smooth(method = "lm", color = "orange")+
  scale_y_continuous(breaks = seq(0, 100, 10), limits = c(0, 100))+
  labs(x = "Log GDP per Capita", y = "Life Expectancy (Years)")+
  ggthemes::theme_pander(base_family = "Fira Sans Condensed", base_size = 16)
```

$$\ln Y_i = \beta_0 + \beta_1 X$$

$$\beta_1 = \frac{\left(\frac{\Delta Y}{Y}\right)}{\Delta X}$$
We will again have very large/small coefficients if we deal with GDP directly, so again let's transform `gdpPercap` into $1,000s, calling it `gdp_t`

Then take the log of `lifeExp`
```r
gapminder <- gapminder %>%
  mutate(gdp_t = gdpPercap / 1000, # first make GDP/capita in $1000s
         loglife = log(lifeExp)) # take the log of lifeExp

gapminder %>% head() # look at it
```

| country <fct> | continent <fct> | year <int> | lifeExp <dbl> | pop <int> | gdpPercap <dbl> | loggdp <dbl> | gdp_t <dbl> | loglife <dbl> |
|---|---|---|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 | 6.658583 | 0.7794453 | 3.360410 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 | 6.710344 | 0.8208530 | 3.412203 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 | 6.748878 | 0.8531007 | 3.465642 |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 | 6.728864 | 0.8361971 | 3.526949 |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 | 6.606625 | 0.7399811 | 3.585960 |
| Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 | 6.667101 | 0.7861134 | 3.649047 |
```r
log_lin_reg <- lm(loglife ~ gdp_t, data = gapminder)
log_lin_reg %>% tidy()
```

| term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
|---|---|---|---|---|
| (Intercept) | 3.966639 | 0.0058345501 | 679.85339 | 0.000000e+00 |
| gdp_t | 0.012917 | 0.0004777072 | 27.03958 | 2.920378e-134 |

$$\widehat{\ln(\text{Life Expectancy})}_i = 3.967 + 0.013 \, \text{GDP}_i$$

A $1 (thousand) change in GDP → a $0.013 \times 100\% = 1.3\%$ increase in Life Expectancy

A $25 (thousand) fall in GDP → a $(-25 \times 1.3\%) = 32.5\%$ decrease in Life Expectancy

A $100 (thousand) rise in GDP → a $(100 \times 1.3\%) = 130\%$ increase in Life Expectancy
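The log-linear interpretations run the other way: multiply the slope by 100 to read it as a percent effect. A Python sketch with the estimate reported above (rounded to 1.3% first, as the slides do):

```python
b1 = 0.012917  # slope on GDP (in $1000s) from the slides' log-linear regression

pct_per_1k = round(b1 * 100, 1)  # % change in life expectancy per $1k of GDP
fall_25 = -25 * pct_per_1k       # effect of a $25k fall, in percent
rise_100 = 100 * pct_per_1k      # effect of a $100k rise, in percent
print(pct_per_1k, fall_25, rise_100)
```

Note that for changes this large the "times 100" reading is only the small-change approximation from the log-difference rule above; the slides use it throughout for simplicity.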
```r
ggplot(data = gapminder)+
  aes(x = gdp_t, y = loglife)+
  geom_point(color = "blue", alpha = 0.5)+
  geom_smooth(method = "lm", color = "orange")+
  scale_x_continuous(labels = scales::dollar, breaks = seq(0, 120, 20))+
  labs(x = "GDP per Capita ($ Thousands)", y = "Log Life Expectancy")+
  ggthemes::theme_pander(base_family = "Fira Sans Condensed", base_size = 16)
```

$$\ln Y_i = \beta_0 + \beta_1 \ln X_i$$

$$\beta_1 = \frac{\left(\frac{\Delta Y}{Y}\right)}{\left(\frac{\Delta X}{X}\right)}$$

Marginal effect of $X \rightarrow Y$: a 1% change in $X$ → a $\beta_1$% change in $Y$

$\beta_1$ is the elasticity of $Y$ with respect to $X$!
```r
log_log_reg <- lm(loglife ~ loggdp, data = gapminder)
log_log_reg %>% tidy()
```

| term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
|---|---|---|---|---|
| (Intercept) | 2.864177 | 0.02328274 | 123.01718 | 0 |
| loggdp | 0.146549 | 0.00282126 | 51.94452 | 0 |

$$\widehat{\ln(\text{Life Expectancy})}_i = 2.864 + 0.147 \ln \text{GDP}_i$$

A 1% change in GDP → a 0.147% increase in Life Expectancy

A 25% fall in GDP → a $(-25 \times 0.147\%) = 3.675\%$ decrease in Life Expectancy

A 100% rise in GDP → a $(100 \times 0.147\%) = 14.7\%$ increase in Life Expectancy
```r
ggplot(data = gapminder)+
  aes(x = loggdp, y = loglife)+
  geom_point(color = "blue", alpha = 0.5)+
  geom_smooth(method = "lm", color = "orange")+
  labs(x = "Log GDP per Capita", y = "Log Life Expectancy")+
  ggthemes::theme_pander(base_family = "Fira Sans Condensed", base_size = 16)
```

| Model | Equation | Interpretation |
|---|---|---|
| Linear-Log | $Y = \beta_0 + \beta_1 \ln X$ | 1% change in $X$ → $\frac{\hat{\beta}_1}{100}$ unit change in $Y$ |
| Log-Linear | $\ln Y = \beta_0 + \beta_1 X$ | 1 unit change in $X$ → $\hat{\beta}_1 \times 100\%$ change in $Y$ |
| Log-Log | $\ln Y = \beta_0 + \beta_1 \ln X$ | 1% change in $X$ → $\hat{\beta}_1$% change in $Y$ |
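The three rules in the table can be checked with quick arithmetic on the (rounded) estimates reported earlier; a Python sketch:

```python
# slopes from the slides' three fitted models (rounded values)
lin_log_b1 = 8.41   # Y  = b0 + b1 ln X
log_lin_b1 = 0.013  # lnY = b0 + b1 X
log_log_b1 = 0.147  # lnY = b0 + b1 ln X

print(lin_log_b1 / 100)  # years of Y per 1% change in X (linear-log)
print(log_lin_b1 * 100)  # % change in Y per unit change in X (log-linear)
print(log_log_b1)        # % change in Y per 1% change in X (log-log)
```

Only the log-log slope is read directly; the other two need the divide-by-100 or times-100 conversion shown in the table.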
```r
library(huxtable)
huxreg("Life Exp." = lin_log_reg,
       "Log Life Exp." = log_lin_reg,
       "Log Life Exp." = log_log_reg,
       coefs = c("Constant" = "(Intercept)",
                 "GDP ($1000s)" = "gdp_t",
                 "Log GDP" = "loggdp"),
       statistics = c("N" = "nobs",
                      "R-Squared" = "r.squared",
                      "SER" = "sigma"),
       number_format = 2)
```

| | Life Exp. | Log Life Exp. | Log Life Exp. |
|---|---|---|---|
| Constant | -9.10 *** | 3.97 *** | 2.86 *** |
| | (1.23) | (0.01) | (0.02) |
| GDP ($1000s) | | 0.01 *** | |
| | | (0.00) | |
| Log GDP | 8.41 *** | | 0.15 *** |
| | (0.15) | | (0.00) |
| N | 1704 | 1704 | 1704 |
| R-Squared | 0.65 | 0.30 | 0.61 |
| SER | 7.62 | 0.19 | 0.14 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
| Linear-Log | Log-Linear | Log-Log |
|---|---|---|
| $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 \ln X_i$ | $\widehat{\ln Y_i} = \hat{\beta}_0 + \hat{\beta}_1 X_i$ | $\widehat{\ln Y_i} = \hat{\beta}_0 + \hat{\beta}_1 \ln X_i$ |
| $R^2 = 0.65$ | $R^2 = 0.30$ | $R^2 = 0.61$ |
$$\hat{Y}_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2$$

We often want to compare coefficients to see which variable, $X_1$ or $X_2$, has a bigger effect on $Y$

What if $X_1$ and $X_2$ are in different units?

Example:
$$\widehat{\text{Salary}}_i = \beta_0 + \beta_1 \text{Batting average}_i + \beta_2 \text{Home runs}_i$$
$$\widehat{\text{Salary}}_i = -2{,}869{,}439.40 + 12{,}417{,}629.72 \, \text{Batting average}_i + 129{,}627.36 \, \text{Home runs}_i$$
$$X_Z = \frac{X_i - \bar{X}}{sd(X)}$$

† Also called "centering" or "scaling."
| Variable | Mean | Std. Dev. |
|---|---|---|
| Salary | $2,024,616 | $2,764,512 |
| Batting Average | 0.267 | 0.031 |
| Home Runs | 12.11 | 10.31 |

$$\widehat{\text{Salary}}_i = -2{,}869{,}439.40 + 12{,}417{,}629.72 \, \text{Batting average}_i + 129{,}627.36 \, \text{Home runs}_i$$
$$\widehat{\text{Salary}}_Z = 0.00 + 0.14 \, \text{Batting average}_Z + 0.48 \, \text{Home runs}_Z$$

Marginal effects on $Y$ (in standard deviations of $Y$) from a 1 standard deviation change in $X$:

$\hat{\beta}_1$: a 1 standard deviation increase in Batting Average increases Salary by 0.14 standard deviations

$0.14 \times \$2{,}764{,}512 = \$387{,}032$

$\hat{\beta}_2$: a 1 standard deviation increase in Home Runs increases Salary by 0.48 standard deviations

$0.48 \times \$2{,}764{,}512 = \$1{,}326{,}966$
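Converting a standardized effect back into dollars is just the coefficient times the outcome's standard deviation; a Python sketch with the table's numbers:

```python
sd_salary = 2_764_512  # standard deviation of Salary from the slides' table
beta_hr_z = 0.48       # standardized Home Runs coefficient

# a 0.48-SD salary effect, expressed in dollars
dollars = beta_hr_z * sd_salary
print(round(dollars))  # 1326966
```

This is why standardized coefficients are comparable across regressors: both the 0.14 and the 0.48 are in the same SD-of-salary units before being rescaled.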
| Variable | Mean | SD |
|---|---|---|
| lifeExp | 59.47 | 12.92 |
| gdpPercap | $7215.32 | $9857.46 |
Use the `scale()` command inside the `mutate()` function to standardize a variable:

```r
gapminder <- gapminder %>%
  mutate(life_Z = scale(lifeExp),
         gdp_Z = scale(gdpPercap))

std_reg <- lm(life_Z ~ gdp_Z, data = gapminder)
tidy(std_reg)
```

```r
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept) 1.10e-16    0.0197  5.57e-15 1.00e+  0
## 2 gdp_Z       5.84e- 1    0.0197  2.97e+ 1 3.57e-156
```

A 1 standard deviation increase in `gdpPercap` will increase `lifeExp` by 0.584 standard deviations $(0.584 \times 12.92 = 7.55$ years$)$
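What `scale()` does is just the z-score formula above; a minimal Python sketch (sample standard deviation with $n-1$, matching R, on a tiny made-up vector), plus the SD-to-years conversion:

```python
def scale(xs):
    """Standardize a list: subtract the mean, divide by the sample SD (n-1, like R's scale())."""
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / (n - 1)) ** 0.5
    return [(x - mean) / sd for x in xs]

z = scale([2, 4, 6, 8])          # illustrative data, not from gapminder
print([round(v, 3) for v in z])  # mean 0, SD 1 by construction

# converting the standardized slope back into years of life expectancy
years = round(0.584 * 12.92, 2)  # 0.584 SDs of lifeExp, SD = 12.92 years
print(years)  # 7.55
```

The standardized regression's slope (0.584) times the outcome's SD (12.92 years) recovers the effect in the outcome's natural units.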
Example: Return again to:

$$\widehat{\text{Wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{Male}_i + \hat{\beta}_2 \text{Northeast}_i + \hat{\beta}_3 \text{Midwest}_i + \hat{\beta}_4 \text{South}_i$$

Maybe region doesn't affect wages at all?

$$H_0: \beta_2 = 0, \, \beta_3 = 0, \, \beta_4 = 0$$
This is a joint hypothesis to test
A joint hypothesis tests against the null hypothesis of a value for multiple parameters: $H_0: \beta_1 = \beta_2 = 0$

Our alternative hypothesis is: $H_1$: either $\beta_1 \neq 0$ or $\beta_2 \neq 0$ or both

Common joint hypotheses:

1) $H_0$: $\beta_1 = \beta_2 = 0$
2) $H_0$: $\beta_1 = \beta_2$
3) $H_0$: ALL $\beta$'s $= 0$
The F-statistic is the test statistic used to test joint hypotheses about regression coefficients with an F-test

This involves comparing two models (restricted vs. unrestricted)

F is an analysis of variance (ANOVA)

F has its own distribution, with two sets of degrees of freedom
Example: Return again to:

$$\widehat{\text{Wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{Male}_i + \hat{\beta}_2 \text{Northeast}_i + \hat{\beta}_3 \text{Midwest}_i + \hat{\beta}_4 \text{South}_i$$

$H_0: \beta_2 = \beta_3 = \beta_4 = 0$

$H_a$: $H_0$ is not true (at least one $\beta_i \neq 0$)

The unrestricted model:

$$\widehat{\text{Wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{Male}_i + \hat{\beta}_2 \text{Northeast}_i + \hat{\beta}_3 \text{Midwest}_i + \hat{\beta}_4 \text{South}_i$$

The restricted model (under $H_0$):

$$\widehat{\text{Wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{Male}_i$$
$$F_{q,(n-k-1)} = \frac{\left(\frac{R^2_u - R^2_r}{q}\right)}{\left(\frac{1 - R^2_u}{n-k-1}\right)}$$

$R^2_u$: the $R^2$ from the unrestricted model (all variables)

$R^2_r$: the $R^2$ from the restricted model (null hypothesis)

$q$: number of restrictions (number of $\beta$'s $= 0$ under the null hypothesis)

$k$: number of $X$ variables in the unrestricted model (all variables)

$F$ has two sets of degrees of freedom: $q$ for the numerator, $n-k-1$ for the denominator

Key takeaway: the bigger the difference $(R^2_u - R^2_r)$, the greater the improvement in fit from adding variables, and the larger the $F$!
This formula is (believe it or not) actually a simplified version (assuming homoskedasticity)
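The same F-statistic can be computed from residual sums of squares, since $R^2 = 1 - RSS/TSS$ and the $TSS$ terms cancel in the formula above. A Python sketch using the (rounded) RSS values that `linearHypothesis()` reports later in these slides:

```python
# restricted and unrestricted RSS from the slides' wage regressions (rounded)
rss_r, rss_u = 6332.2, 6174.8
q, df = 3, 521  # q restrictions; df = n - k - 1 = 526 - 4 - 1

# algebraically equivalent to the R^2 version of the F formula
F = ((rss_r - rss_u) / q) / (rss_u / df)
print(round(F, 2))  # ~4.43, matching linearHypothesis()'s 4.4258 up to rounding
```

The numerator measures how much fit is lost by imposing the restrictions, per restriction; the denominator scales it by the unrestricted model's leftover variance.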
Example: using the `wooldridge` package's `wage1` data again:

```r
# load in data from wooldridge package
library(wooldridge)
wages <- wage1

# run regressions
unrestricted_reg <- lm(wage ~ female + northcen + west + south, data = wages)
restricted_reg <- lm(wage ~ female, data = wages)
```

Unrestricted model:

$$\widehat{\text{Wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{Female}_i + \hat{\beta}_2 \text{Northcen}_i + \hat{\beta}_3 \text{West}_i + \hat{\beta}_4 \text{South}_i$$

Restricted model:

$$\widehat{\text{Wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{Female}_i$$

$H_0: \beta_2 = \beta_3 = \beta_4 = 0$

$q = 3$ restrictions (F numerator df)

$n - k - 1 = 526 - 4 - 1 = 521$ (F denominator df)
Use the `car` package's `linearHypothesis()` command to run an F-test:

```r
# load car package for additional regression tools
library(car)

# F-test
linearHypothesis(unrestricted_reg, c("northcen", "west", "south"))
```

```r
## Linear hypothesis test
## 
## Hypothesis:
## northcen = 0
## west = 0
## south = 0
## 
## Model 1: restricted model
## Model 2: wage ~ female + northcen + west + south
## 
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1    524 6332.2                                
## 2    521 6174.8  3    157.36 4.4258 0.004377 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
Example:

$$\widehat{\text{Wage}}_i = \beta_0 + \beta_1 \text{Adolescent height}_i + \beta_2 \text{Adult height}_i + \beta_3 \text{Male}_i$$

$H_0: \beta_1 = \beta_2$

Restricted model:

$$\widehat{\text{Wage}}_i = \beta_0 + \beta_1 (\text{Adolescent height}_i + \text{Adult height}_i) + \beta_3 \text{Male}_i$$
```r
# load in data
heightwages <- read_csv("../data/heightwages.csv")

# make a "heights" variable as the sum of adolescent (height81) and adult (height85) height
heightwages <- heightwages %>%
  mutate(heights = height81 + height85)

height_reg <- lm(wage96 ~ height81 + height85 + male, data = heightwages)
height_restricted_reg <- lm(wage96 ~ heights + male, data = heightwages)

# F-test
linearHypothesis(height_reg, "height81 = height85")
```

```r
## Linear hypothesis test
## 
## Hypothesis:
## height81 - height85 = 0
## 
## Model 1: restricted model
## Model 2: wage96 ~ height81 + height85 + male
## 
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1   6591 5128243                           
## 2   6590 5127284  1     959.2 1.2328 0.2669
```

Insufficient evidence to reject $H_0$: we cannot conclude that the effects of adolescent and adult height on wages differ.
```r
summary(unrestricted_reg)
```

```r
## 
## Call:
## lm(formula = wage ~ female + northcen + west + south, data = wages)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3269 -2.0105 -0.7871  1.1898 17.4146 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.5654     0.3466  21.827   <2e-16 ***
## female       -2.5652     0.3011  -8.520   <2e-16 ***
## northcen     -0.5918     0.4362  -1.357   0.1755    
## west          0.4315     0.4838   0.892   0.3729    
## south        -1.0262     0.4048  -2.535   0.0115 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.443 on 521 degrees of freedom
## Multiple R-squared:  0.1376, Adjusted R-squared:  0.131 
## F-statistic: 20.79 on 4 and 521 DF,  p-value: 6.501e-16
```

The F-statistic reported at the bottom of `summary()` is the "All" F-test ($H_0$: all $\beta$'s $= 0$); if it is high enough (p-value $< 0.05$), we reject $H_0$.

With `broom` instead of `summary()`, the `glance()` command makes a table of regression summary statistics (`tidy()` only shows coefficients):

```r
library(broom)
glance(unrestricted_reg)
```

```r
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.138         0.131  3.44      20.8 6.50e-16     4 -1394. 2800. 2826.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

`statistic` is the All F-test; `p.value` next to it is the p-value from the F-test.
gapminder example
gapminder example^Life Expectancyi=^β0+^β1GDPi

gapminder example^Life Expectancyi=^β0+^β1GDPi
^Life Expectancyi=^β0+^β1GDPi+^β2GDP2i

gapminder example^Life Expectancyi=^β0+^β1GDPi
^Life Expectancyi=^β0+^β1GDPi+^β2GDP2i
^Life Expectancyi=^β0+^β1lnGDPi

Another useful model for nonlinear data is the logarithmic model†
Logarithmic model has two additional advantages
† Don’t confuse this with a logistic (logit) model for dependent dummy variables.


The exponential function, Y=eX or Y=exp(X), where base e=2.71828...
Natural logarithm is the inverse, Y=ln(X)
ln(1x)=−ln(x)
ln(ab)=ln(a)+ln(b)
ln(xa)=ln(x)−ln(a)
ln(xa)=aln(x)
dlnxdx=1x
ln(x+Δx)−ln(x)⏟Difference in logs≈Δxx⏟Relative change
ln(x+Δx)−ln(x)⏟Difference in logs≈Δxx⏟Relative change
Example: Let x=100 and Δx=1, relative change is:
Δxx=(101−100)100=0.01 or 1%
ln(x+Δx)−ln(x)⏟Difference in logs≈Δxx⏟Relative change
Example: Let x=100 and Δx=1, relative change is:
Δxx=(101−100)100=0.01 or 1%
ϵY,X=%ΔY%ΔX=(ΔYY)(ΔXX)
ϵY,X=%ΔY%ΔX=(ΔYY)(ΔXX)
ϵY,X=%ΔY%ΔX=(ΔYY)(ΔXX)
One of the (many) reasons why economists love Cobb-Douglas functions: Y=ALαKβ
Taking logs, relationship becomes linear:
One of the (many) reasons why economists love Cobb-Douglas functions: Y=ALαKβ
Taking logs, relationship becomes linear:
ln(Y)=ln(A)+αln(L)+βln(K)
One of the (many) reasons why economists love Cobb-Douglas functions: Y=ALαKβ
Taking logs, relationship becomes linear:
ln(Y)=ln(A)+αln(L)+βln(K)
Example: Cobb-Douglas production function: Y=2L0.75K0.25
Example: Cobb-Douglas production function: Y=2L0.75K0.25
lnY=ln2+0.75lnL+0.25lnK
Example: Cobb-Douglas production function: Y=2L0.75K0.25
lnY=ln2+0.75lnL+0.25lnK
A 1% change in L will yield a 0.75% change in output Y
A 1% change in K will yield a 0.25% change in output Y
log() function can easily take the logarithmgapminder <- gapminder %>% mutate(loggdp = log(gdpPercap)) # log GDP per capitagapminder %>% head() # look at it
| ABCDEFGHIJ0123456789 |
country <fct> | continent <fct> | year <int> | lifeExp <dbl> | pop <int> | |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | |
| Afghanistan | Asia | 1977 | 38.438 | 14880372 |
log() by default is the natural logarithm ln(), i.e. base elog(x, base = 5)log10, log2 log10(100)
## [1] 2log2(16)
## [1] 4log(19683, base=3)
## [1] 9log() around a variable in the regressionlm(lifeExp ~ loggdp, data = gapminder) %>% tidy()
| ABCDEFGHIJ0123456789 |
term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
|---|---|---|---|---|
| (Intercept) | -9.100889 | 1.227674 | -7.413117 | 1.934812e-13 |
| loggdp | 8.405085 | 0.148762 | 56.500206 | 0.000000e+00 |
lm(lifeExp ~ log(gdpPercap), data = gapminder) %>% tidy()
| ABCDEFGHIJ0123456789 |
term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
|---|---|---|---|---|
| (Intercept) | -9.100889 | 1.227674 | -7.413117 | 1.934812e-13 |
| log(gdpPercap) | 8.405085 | 0.148762 | 56.500206 | 0.000000e+00 |
Linear-log model: Yi=β0+β1lnXi
Log-linear model: lnYi=β0+β1Xi
Linear-log model: Yi=β0+β1lnXi
Log-linear model: lnYi=β0+β1Xi
Log-log model: lnYi=β0+β1lnXi
Y=β0+β1lnXiβ1=ΔY(ΔXX)
Y=β0+β1lnXiβ1=ΔY(ΔXX)
lin_log_reg <- lm(lifeExp ~ loggdp, data = gapminder)library(broom)lin_log_reg %>% tidy()
| ABCDEFGHIJ0123456789 |
term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
|---|---|---|---|---|
| (Intercept) | -9.100889 | 1.227674 | -7.413117 | 1.934812e-13 |
| loggdp | 8.405085 | 0.148762 | 56.500206 | 0.000000e+00 |
^Life Expectancyi=−9.10+8.41ln GDPi
lin_log_reg <- lm(lifeExp ~ loggdp, data = gapminder)library(broom)lin_log_reg %>% tidy()
| ABCDEFGHIJ0123456789 |
term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
|---|---|---|---|---|
| (Intercept) | -9.100889 | 1.227674 | -7.413117 | 1.934812e-13 |
| loggdp | 8.405085 | 0.148762 | 56.500206 | 0.000000e+00 |
^Life Expectancyi=−9.10+8.41ln GDPi
lin_log_reg <- lm(lifeExp ~ loggdp, data = gapminder)library(broom)lin_log_reg %>% tidy()
| ABCDEFGHIJ0123456789 |
term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
|---|---|---|---|---|
| (Intercept) | -9.100889 | 1.227674 | -7.413117 | 1.934812e-13 |
| loggdp | 8.405085 | 0.148762 | 56.500206 | 0.000000e+00 |
^Life Expectancyi=−9.10+8.41ln GDPi
A 1% change in GDP → a 9.41100= 0.0841 year increase in Life Expectancy
A 25% fall in GDP → a (−25×0.0841)= 2.1025 year decrease in Life Expectancy
lin_log_reg <- lm(lifeExp ~ loggdp, data = gapminder)library(broom)lin_log_reg %>% tidy()
| ABCDEFGHIJ0123456789 |
term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
|---|---|---|---|---|
| (Intercept) | -9.100889 | 1.227674 | -7.413117 | 1.934812e-13 |
| loggdp | 8.405085 | 0.148762 | 56.500206 | 0.000000e+00 |
^Life Expectancyi=−9.10+8.41ln GDPi
A 1% change in GDP → a 9.41100= 0.0841 year increase in Life Expectancy
A 25% fall in GDP → a (−25×0.0841)= 2.1025 year decrease in Life Expectancy
A 100% rise in GDP → a (100×0.0841)= 8.4100 year increase in Life Expectancy
ggplot(data = gapminder)+ aes(x = gdpPercap, y = lifeExp)+ geom_point(color="blue", alpha=0.5)+ geom_smooth(method="lm", formula=y~log(x), color="orange")+ scale_x_continuous(labels=scales::dollar, breaks=seq(0,120000,20000))+ scale_y_continuous(breaks=seq(0,100,10), limits=c(0,100))+ labs(x = "GDP per Capita", y = "Life Expectancy (Years)")+ ggthemes::theme_pander(base_family = "Fira Sans Condensed", base_size=16)

ggplot(data = gapminder)+ aes(x = loggdp, y = lifeExp)+ geom_point(color="blue", alpha=0.5)+ geom_smooth(method="lm", color="orange")+ scale_y_continuous(breaks=seq(0,100,10), limits=c(0,100))+ labs(x = "Log GDP per Capita", y = "Life Expectancy (Years)")+ ggthemes::theme_pander(base_family = "Fira Sans Condensed", base_size=16)

lnYi=β0+β1Xβ1=(ΔYY)ΔX
lnYi=β0+β1Xβ1=(ΔYY)ΔX
We will again have very large/small coefficients if we deal with GDP directly, again let's transform gdpPercap into $1,000s, call it gdp_t
Then log LifeExp
We will again have very large/small coefficients if we deal with GDP directly, again let's transform gdpPercap into $1,000s, call it gdp_t
Then log LifeExp
gapminder <- gapminder %>% mutate(gdp_t = gdpPercap/1000, # first make GDP/capita in $1000s loglife = log(lifeExp)) # take the log of LifeExpgapminder %>% head() # look at it
| ABCDEFGHIJ0123456789 |
country <fct> | continent <fct> | year <int> | lifeExp <dbl> | pop <int> | |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | |
| Afghanistan | Asia | 1977 | 38.438 | 14880372 |
log_lin_reg <- lm(loglife~gdp_t, data = gapminder)log_lin_reg %>% tidy()
| ABCDEFGHIJ0123456789 |
term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
|---|---|---|---|---|
| (Intercept) | 3.966639 | 0.0058345501 | 679.85339 | 0.000000e+00 |
| gdp_t | 0.012917 | 0.0004777072 | 27.03958 | 2.920378e-134 |
^lnLife Expectancyi=3.967+0.013GDPi
log_lin_reg <- lm(loglife~gdp_t, data = gapminder)log_lin_reg %>% tidy()
| ABCDEFGHIJ0123456789 |
term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
|---|---|---|---|---|
| (Intercept) | 3.966639 | 0.0058345501 | 679.85339 | 0.000000e+00 |
| gdp_t | 0.012917 | 0.0004777072 | 27.03958 | 2.920378e-134 |
^lnLife Expectancyi=3.967+0.013GDPi
log_lin_reg <- lm(loglife~gdp_t, data = gapminder)log_lin_reg %>% tidy()
| ABCDEFGHIJ0123456789 |
term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
|---|---|---|---|---|
| (Intercept) | 3.966639 | 0.0058345501 | 679.85339 | 0.000000e+00 |
| gdp_t | 0.012917 | 0.0004777072 | 27.03958 | 2.920378e-134 |
^ln(Life Expectancy)i=3.967+0.013GDPi
A $1 (thousand) change in GDP → a 0.013×100%= 1.3% increase in Life Expectancy
A $25 (thousand) fall in GDP → a (−25×1.3%)= 32.5% decrease in Life Expectancy
log_lin_reg <- lm(loglife~gdp_t, data = gapminder)log_lin_reg %>% tidy()
| ABCDEFGHIJ0123456789 |
term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> |
|---|---|---|---|---|
| (Intercept) | 3.966639 | 0.0058345501 | 679.85339 | 0.000000e+00 |
| gdp_t | 0.012917 | 0.0004777072 | 27.03958 | 2.920378e-134 |
^ln(Life Expectancy)i=3.967+0.013GDPi
A $1 (thousand) change in GDP → a 0.013×100%= 1.3% increase in Life Expectancy
A $25 (thousand) fall in GDP → a (−25×1.3%)= 32.5% decrease in Life Expectancy
A $100 (thousand) rise in GDP → a (100×1.3%)= 130% increase in Life Expectancy
ggplot(data = gapminder)+ aes(x = gdp_t, y = loglife)+ geom_point(color="blue", alpha=0.5)+ geom_smooth(method="lm", color="orange")+ scale_x_continuous(labels=scales::dollar, breaks=seq(0,120,20))+ labs(x = "GDP per Capita ($ Thousands)", y = "Log Life Expectancy")+ ggthemes::theme_pander(base_family = "Fira Sans Condensed", base_size=16)

lnYi=β0+β1lnXiβ1=(ΔYY)(ΔXX)
lnYi=β0+β1lnXiβ1=(ΔYY)(ΔXX)
Marginal effect of X→Y: a 1% change in X→ a β1 % change in Y
β1 is the elasticity of Y with respect to X!
```r
log_log_reg <- lm(loglife ~ loggdp, data = gapminder)
log_log_reg %>% tidy()
```

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 2.864177 | 0.02328274 | 123.01718 | 0 |
| loggdp | 0.146549 | 0.00282126 | 51.94452 | 0 |

$$\widehat{\ln(\text{Life Expectancy})}_i = 2.864 + 0.147 \, \ln GDP_i$$

- A 1% change in GDP → a 0.147% increase in Life Expectancy
- A 25% fall in GDP → a (−25 × 0.147%) = 3.675% decrease in Life Expectancy
- A 100% rise in GDP → a (100 × 0.147%) = 14.7% increase in Life Expectancy
```r
ggplot(data = gapminder)+
  aes(x = loggdp, y = loglife)+
  geom_point(color = "blue", alpha = 0.5)+
  geom_smooth(method = "lm", color = "orange")+
  labs(x = "Log GDP per Capita", y = "Log Life Expectancy")+
  ggthemes::theme_pander(base_family = "Fira Sans Condensed", base_size = 16)
```

(Figure: log life expectancy vs. log GDP per capita, with linear fit)
| Model | Equation | Interpretation |
|---|---|---|
| Linear-Log | $Y = \beta_0 + \beta_1 \ln X$ | 1% change in $X$ → $\frac{\hat{\beta}_1}{100}$ unit change in $Y$ |
| Log-Linear | $\ln Y = \beta_0 + \beta_1 X$ | 1 unit change in $X$ → $\hat{\beta}_1 \times 100$% change in $Y$ |
| Log-Log | $\ln Y = \beta_0 + \beta_1 \ln X$ | 1% change in $X$ → $\hat{\beta}_1$% change in $Y$ |
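The table's interpretations can be checked numerically; a quick sketch using hypothetical coefficient values (illustrative, not estimates from any regression above):

```r
# Log-Linear row: ln(Y) = b0 + b1*X, so a 1-unit change in X changes Y by ~b1*100%
b1_loglin <- 0.013
exp(b1_loglin) - 1        # exact proportional change: ~0.013, i.e. ~1.3%

# Linear-Log row: Y = b0 + b1*ln(X), so a 1% change in X changes Y by ~b1/100 units
b1_linlog <- 8.41
b1_linlog * log(1.01)     # ~0.084 units, close to b1/100 = 0.0841
```

The small gaps between the exact and approximate values come from the log-difference approximation, which is very accurate for 1%-scale changes.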
```r
library(huxtable)
huxreg("Life Exp." = lin_log_reg,
       "Log Life Exp." = log_lin_reg,
       "Log Life Exp." = log_log_reg,
       coefs = c("Constant" = "(Intercept)",
                 "GDP ($1000s)" = "gdp_t",
                 "Log GDP" = "loggdp"),
       statistics = c("N" = "nobs",
                      "R-Squared" = "r.squared",
                      "SER" = "sigma"),
       number_format = 2)
```

| | Life Exp. | Log Life Exp. | Log Life Exp. |
|---|---|---|---|
| Constant | -9.10 *** | 3.97 *** | 2.86 *** |
| | (1.23) | (0.01) | (0.02) |
| GDP ($1000s) | | 0.01 *** | |
| | | (0.00) | |
| Log GDP | 8.41 *** | | 0.15 *** |
| | (0.15) | | (0.00) |
| N | 1704 | 1704 | 1704 |
| R-Squared | 0.65 | 0.30 | 0.61 |
| SER | 7.62 | 0.19 | 0.14 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
| Linear-Log | Log-Linear | Log-Log |
|---|---|---|
| (fitted plot) | (fitted plot) | (fitted plot) |
| $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 \ln X_i$ | $\ln Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$ | $\ln Y_i = \hat{\beta}_0 + \hat{\beta}_1 \ln X_i$ |
| $R^2 = 0.65$ | $R^2 = 0.30$ | $R^2 = 0.61$ |
$$\hat{Y}_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}$$

- We often want to compare coefficients to see which variable, $X_1$ or $X_2$, has a bigger effect on $Y$
- What if $X_1$ and $X_2$ are in different units?

Example:

$$\widehat{\text{Salary}}_i = \beta_0 + \beta_1 \text{Batting average}_i + \beta_2 \text{Home runs}_i$$

$$\widehat{\text{Salary}}_i = -2{,}869{,}439.40 + 12{,}417{,}629.72 \, \text{Batting average}_i + 129{,}627.36 \, \text{Home runs}_i$$
$$X_Z = \frac{X_i - \bar{X}}{sd(X)}$$

† Also called "centering" or "scaling."
| Variable | Mean | Std. Dev. |
|---|---|---|
| Salary | $2,024,616 | $2,764,512 |
| Batting Average | 0.267 | 0.031 |
| Home Runs | 12.11 | 10.31 |
$$\widehat{\text{Salary}}_i = -2{,}869{,}439.40 + 12{,}417{,}629.72 \, \text{Batting average}_i + 129{,}627.36 \, \text{Home runs}_i$$

$$\widehat{\text{Salary}}_Z = 0.00 + 0.14 \, \text{Batting average}_Z + 0.48 \, \text{Home runs}_Z$$
- Marginal effects on $Y$ (in standard deviations of $Y$) from a 1 standard deviation change in $X$:
- $\hat{\beta}_1$: a 1 standard deviation increase in Batting Average increases Salary by 0.14 standard deviations (0.14 × $2,764,512 = $387,032)
- $\hat{\beta}_2$: a 1 standard deviation increase in Home Runs increases Salary by 0.48 standard deviations (0.48 × $2,764,512 = $1,326,966)
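Standardizing by hand makes the formula concrete; a sketch (the salary standard deviation is from the table above, the data vector is made up for illustration):

```r
# z-score a variable by hand: (X - mean) / sd; this is what scale() computes
x <- c(5, 10, 15, 20, 25)
z_manual <- (x - mean(x)) / sd(x)
all.equal(z_manual, as.numeric(scale(x)))   # TRUE: manual and scale() agree

# convert a standardized effect back to original units:
# a 1 sd rise in Home Runs -> 0.48 sd of Salary
0.48 * 2764512                              # ~ $1,326,966
```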
| Variable | Mean | SD |
|---|---|---|
| LifeExp | 59.47 | 12.92 |
| gdpPercap | $7215.32 | $9857.46 |
Use the `scale()` command inside the `mutate()` function to standardize a variable:

```r
gapminder <- gapminder %>%
  mutate(life_Z = scale(lifeExp),
         gdp_Z = scale(gdpPercap))

std_reg <- lm(life_Z ~ gdp_Z, data = gapminder)
tidy(std_reg)
```

```
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept) 1.10e-16    0.0197  5.57e-15 1.00e+  0
## 2 gdp_Z       5.84e- 1    0.0197  2.97e+ 1 3.57e-156
```

A 1 standard deviation increase in `gdpPercap` will increase `lifeExp` by 0.584 standard deviations (0.584 × 12.92 = 7.55 years).

Example: Return again to:
$$\widehat{\text{Wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 Male_i + \hat{\beta}_2 Northeast_i + \hat{\beta}_3 Midwest_i + \hat{\beta}_4 South_i$$

- Maybe region doesn't affect wages at all?
- $H_0: \beta_2 = 0,\ \beta_3 = 0,\ \beta_4 = 0$
This is a joint hypothesis to test
- A joint hypothesis tests a null hypothesis about the values of multiple parameters at once, e.g. $H_0: \beta_1 = \beta_2 = 0$, the hypothesis that multiple regressors are equal to zero (have no causal effect on the outcome)
- Our alternative hypothesis is $H_1$: either $\beta_1 \neq 0$ or $\beta_2 \neq 0$ (or both), or simply that $H_0$ is not true
1) $H_0: \beta_1 = \beta_2 = 0$
2) $H_0: \beta_1 = \beta_2$
3) $H_0:$ ALL $\beta$'s $= 0$
- The F-statistic is the test statistic used to test joint hypotheses about regression coefficients with an F-test
- This involves comparing two models: an unrestricted model (with all regressors) and a restricted model (imposing the null hypothesis)
- F is an analysis of variance (ANOVA)
- F has its own distribution, with two sets of degrees of freedom
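R's built-in F distribution functions make those degrees of freedom concrete; a sketch with illustrative df values (a numerator df of 3 and a denominator df of 521, matching the wage example later):

```r
# critical value: reject H0 at the 5% level if the F-statistic exceeds this
qf(0.95, df1 = 3, df2 = 521)

# p-value for a hypothetical observed F-statistic of 4.43
pf(4.43, df1 = 3, df2 = 521, lower.tail = FALSE)
```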
Example: Return again to:

$$\widehat{\text{Wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 Male_i + \hat{\beta}_2 Northeast_i + \hat{\beta}_3 Midwest_i + \hat{\beta}_4 South_i$$

- $H_0: \beta_2 = \beta_3 = \beta_4 = 0$
- $H_a$: $H_0$ is not true (at least one $\beta_i \neq 0$)
Example: Return again to:

Unrestricted model:
$$\widehat{\text{Wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 Male_i + \hat{\beta}_2 Northeast_i + \hat{\beta}_3 Midwest_i + \hat{\beta}_4 South_i$$

Restricted model:
$$\widehat{\text{Wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 Male_i$$
$$F_{q,\,(n-k-1)} = \frac{\left(\dfrac{R^2_u - R^2_r}{q}\right)}{\left(\dfrac{1 - R^2_u}{n-k-1}\right)}$$

- $R^2_u$: the $R^2$ from the unrestricted model (all variables)
- $R^2_r$: the $R^2$ from the restricted model (under the null hypothesis)
- $q$: number of restrictions (number of $\beta$'s $= 0$ under the null hypothesis)
- $k$: number of $X$ variables in the unrestricted model (all variables)
- $F$ has two sets of degrees of freedom: $q$ in the numerator and $n - k - 1$ in the denominator
- Key takeaway: the bigger the difference $(R^2_u - R^2_r)$, the greater the improvement in fit from adding variables, and the larger the $F$!
- This formula is (believe it or not) actually a simplified version that assumes homoskedasticity
The `wooldridge` package's `wage1` data again:

```r
# load in data from wooldridge package
library(wooldridge)
wages <- wage1

# run regressions
unrestricted_reg <- lm(wage ~ female + northcen + west + south, data = wages)
restricted_reg <- lm(wage ~ female, data = wages)
```

Unrestricted model:
$$\widehat{\text{Wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 Female_i + \hat{\beta}_2 Northcen_i + \hat{\beta}_3 West_i + \hat{\beta}_4 South_i$$

Restricted model:
$$\widehat{\text{Wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 Female_i$$

- $H_0: \beta_2 = \beta_3 = \beta_4 = 0$
- $q = 3$ restrictions (F numerator df)
- $n - k - 1 = 526 - 4 - 1 = 521$ (F denominator df)
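The F-statistic can also be computed by hand. An equivalent residual-sum-of-squares version of the formula is $F = \dfrac{(RSS_r - RSS_u)/q}{RSS_u/(n-k-1)}$; a sketch using the (rounded) RSS values these two wage models produce:

```r
# compute F by hand from residual sums of squares
rss_r <- 6332.2    # restricted model: wage ~ female
rss_u <- 6174.8    # unrestricted model: wage ~ female + northcen + west + south
q <- 3             # number of restrictions
df_denom <- 521    # n - k - 1

F_stat <- ((rss_r - rss_u) / q) / (rss_u / df_denom)
F_stat                                          # ~4.43
pf(F_stat, q, df_denom, lower.tail = FALSE)     # p-value well below 0.01
```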
Use the `car` package's `linearHypothesis()` command to run an F-test:

```r
# load car package for additional regression tools
library(car)

# F-test
linearHypothesis(unrestricted_reg, c("northcen", "west", "south"))
```

```
## Linear hypothesis test
## 
## Hypothesis:
## northcen = 0
## west = 0
## south = 0
## 
## Model 1: restricted model
## Model 2: wage ~ female + northcen + west + south
## 
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1    524 6332.2                                
## 2    521 6174.8  3    157.36 4.4258 0.004377 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
Example:

$$\widehat{\text{wage}}_i = \beta_0 + \beta_1 \text{Adolescent height}_i + \beta_2 \text{Adult height}_i + \beta_3 Male_i$$

$$H_0: \beta_1 = \beta_2$$

Under the null hypothesis, the restricted model imposes a single common height coefficient:

$$\widehat{\text{wage}}_i = \beta_0 + \beta_1 \left(\text{Adolescent height}_i + \text{Adult height}_i\right) + \beta_3 Male_i$$
```r
# load in data
heightwages <- read_csv("../data/heightwages.csv")

# make a "heights" variable as the sum of adolescent (height81) and adult (height85) height
heightwages <- heightwages %>%
  mutate(heights = height81 + height85)

height_reg <- lm(wage96 ~ height81 + height85 + male, data = heightwages)
height_restricted_reg <- lm(wage96 ~ heights + male, data = heightwages)

linearHypothesis(height_reg, "height81 = height85") # F-test
```
```
## Linear hypothesis test
## 
## Hypothesis:
## height81 - height85 = 0
## 
## Model 1: restricted model
## Model 2: wage96 ~ height81 + height85 + male
## 
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1   6591 5128243                           
## 2   6590 5127284  1     959.2 1.2328 0.2669
```

- Insufficient evidence to reject $H_0$!
- We cannot reject that the effect of adolescent and adult height on wages is the same
```r
summary(unrestricted_reg)
```

```
## 
## Call:
## lm(formula = wage ~ female + northcen + west + south, data = wages)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3269 -2.0105 -0.7871  1.1898 17.4146 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.5654     0.3466  21.827   <2e-16 ***
## female       -2.5652     0.3011  -8.520   <2e-16 ***
## northcen     -0.5918     0.4362  -1.357   0.1755    
## west          0.4315     0.4838   0.892   0.3729    
## south        -1.0262     0.4048  -2.535   0.0115 *  
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.443 on 521 degrees of freedom
## Multiple R-squared: 0.1376, Adjusted R-squared: 0.131 
## F-statistic: 20.79 on 4 and 521 DF, p-value: 6.501e-16
```

- The F-statistic reported at the bottom of `summary()` is an "All F-test" ($H_0$: all $\beta$'s $= 0$) that, if high enough, is significant (p-value < 0.05), enough to reject $H_0$
- With `broom` instead of `summary()`: the `glance()` command makes a table of regression summary statistics (`tidy()` only shows coefficients)

```r
library(broom)
glance(unrestricted_reg)
```
```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.138         0.131  3.44      20.8 6.50e-16     4 -1394. 2800. 2826.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

`statistic` is the All F-test; `p.value` next to it is the p-value from the F-test.