The `gapminder` example:

$$\color{red}{\widehat{\text{Life Expectancy}_i}=\hat{\beta_0}+\hat{\beta_1}\text{GDP}_i}$$

$$\color{green}{\widehat{\text{Life Expectancy}_i}=\hat{\beta_0}+\hat{\beta_1}\text{GDP}_i+\hat{\beta_2}\text{GDP}_i^2}$$

$$\color{orange}{\widehat{\text{Life Expectancy}_i}=\hat{\beta_0}+\hat{\beta_1}\ln \text{GDP}_i}$$
Another useful model for nonlinear data is the logarithmic model†
The logarithmic model has two additional advantages
† Don’t confuse this with a logistic (logit) model for dependent dummy variables.
The exponential function, \(Y=e^X\) or \(Y=\exp(X)\), where the base is \(e=2.71828\ldots\)

The natural logarithm is its inverse, \(Y=\ln(X)\)
\(\ln(\frac{1}{x})=-\ln(x)\)
\(\ln(ab)=\ln(a)+\ln(b)\)
\(\ln(\frac{x}{a})=\ln(x)-\ln(a)\)
\(\ln(x^a)=a \, \ln(x)\)
\(\frac{d \, \ln \, x}{d \, x} = \frac{1}{x}\)
$$\underbrace{\ln(x+\Delta x) - \ln(x)}_{\text{Difference in logs}} \approx \underbrace{\frac{\Delta x}{x}}_{\text{Relative change}}$$

Example: Let \(x=100\) and \(\Delta x =1\); the relative change is:

$$\frac{\Delta x}{x} = \frac{(101-100)}{100} = 0.01 \text{ or }1\%$$
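This approximation is easy to check numerically (a minimal sketch using only base R):

```r
x  <- 100
dx <- 1

# difference in logs: ln(101) - ln(100)
log(x + dx) - log(x)  # about 0.00995

# relative change
dx / x                # 0.01
```

The two agree to within about half a percent, and the approximation gets better the smaller \(\Delta x\) is relative to \(x\).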
$$\epsilon_{Y,X}=\frac{\% \Delta Y}{\% \Delta X} =\cfrac{\left(\frac{\Delta Y}{Y}\right)}{\left( \frac{\Delta X}{X}\right)}$$
One of the (many) reasons why economists love Cobb-Douglas functions: $$Y=AL^{\alpha}K^{\beta}$$

Taking logs, the relationship becomes linear:

$$\ln(Y)=\ln(A)+\alpha \ln(L)+ \beta \ln(K)$$
Example: Cobb-Douglas production function: $$Y=2L^{0.75}K^{0.25}$$

$$\ln Y=\ln 2+0.75 \ln L + 0.25 \ln K$$
A 1% change in \(L\) will yield a 0.75% change in output \(Y\)
A 1% change in \(K\) will yield a 0.25% change in output \(Y\)
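We can verify the first elasticity numerically for the production function above (a minimal sketch; the 1% figure is approximate for discrete changes):

```r
# the hypothetical Cobb-Douglas function from the example
Y <- function(L, K) 2 * L^0.75 * K^0.25

base <- Y(100, 100)  # output at L = 100, K = 100
up_L <- Y(101, 100)  # raise L by 1%, hold K fixed

(up_L - base) / base * 100  # about 0.75 (% change in Y)
```

The same check with \(K\) raised by 1% gives roughly 0.25, matching the exponents.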
R's `log()` function can easily take the logarithm:

```r
gapminder <- gapminder %>%
  mutate(loggdp = log(gdpPercap)) # log GDP per capita

gapminder %>% head() # look at it
```

`log()` by default is the natural logarithm \(\ln()\), i.e. base \(e\)

Can set a different base with, e.g., `log(x, base = 5)`

Shortcuts exist for common bases: `log10()`, `log2()`

```r
log10(100)
## [1] 2

log2(16)
## [1] 4

log(19683, base = 3)
## [1] 9
```
We can also wrap `log()` around a variable in the regression itself:

```r
# using the pre-computed log variable
lm(lifeExp ~ loggdp, data = gapminder) %>% tidy()

# equivalently, taking the log inside the regression
lm(lifeExp ~ log(gdpPercap), data = gapminder) %>% tidy()
```
Linear-log model: \(Y_i=\beta_0+\beta_1 \color{#e64173}{\ln X_i}\)

Log-linear model: \(\color{#e64173}{\ln Y_i}=\beta_0+\beta_1X_i\)

Log-log model: \(\color{#e64173}{\ln Y_i}=\beta_0+\beta_1 \color{#e64173}{\ln X_i}\)
$$\begin{align*} Y_i&=\beta_0+\beta_1 \color{#e64173}{\ln X_i}\\ \beta_1&=\cfrac{\Delta Y}{\big(\frac{\Delta X}{X}\big)}\\ \end{align*}$$
```r
lin_log_reg <- lm(lifeExp ~ loggdp, data = gapminder)

library(broom)
lin_log_reg %>% tidy()
```

$$\widehat{\text{Life Expectancy}}_i=-9.10+8.41 \, \text{ln GDP}_i$$

A 1% change in GDP \(\rightarrow\) a \(\frac{8.41}{100}=\) 0.0841 year increase in Life Expectancy

A 25% fall in GDP \(\rightarrow\) a \((-25 \times 0.0841)=\) 2.1025 year decrease in Life Expectancy

A 100% rise in GDP \(\rightarrow\) a \((100 \times 0.0841)=\) 8.41 year increase in Life Expectancy
```r
ggplot(data = gapminder)+
  aes(x = gdpPercap, y = lifeExp)+
  geom_point(color = "blue", alpha = 0.5)+
  geom_smooth(method = "lm", formula = y ~ log(x), color = "orange")+
  scale_x_continuous(labels = scales::dollar, breaks = seq(0, 120000, 20000))+
  scale_y_continuous(breaks = seq(0, 100, 10), limits = c(0, 100))+
  labs(x = "GDP per Capita", y = "Life Expectancy (Years)")+
  ggthemes::theme_pander(base_family = "Fira Sans Condensed", base_size = 16)
```

```r
ggplot(data = gapminder)+
  aes(x = loggdp, y = lifeExp)+
  geom_point(color = "blue", alpha = 0.5)+
  geom_smooth(method = "lm", color = "orange")+
  scale_y_continuous(breaks = seq(0, 100, 10), limits = c(0, 100))+
  labs(x = "Log GDP per Capita", y = "Life Expectancy (Years)")+
  ggthemes::theme_pander(base_family = "Fira Sans Condensed", base_size = 16)
```
$$\begin{align*} \color{#e64173}{\ln Y_i}&=\beta_0+\beta_1 X_i\\ \beta_1&=\cfrac{\big(\frac{\Delta Y}{Y}\big)}{\Delta X}\\ \end{align*}$$
We will again have very large/small coefficients if we deal with GDP directly, so again let's transform `gdpPercap` into $1,000s and call it `gdp_t`

Then take the log of `lifeExp`:

```r
gapminder <- gapminder %>%
  mutate(gdp_t = gdpPercap/1000,  # first make GDP/capita in $1000s
         loglife = log(lifeExp))  # take the log of lifeExp

gapminder %>% head() # look at it
```
```r
log_lin_reg <- lm(loglife ~ gdp_t, data = gapminder)
log_lin_reg %>% tidy()
```

$$\widehat{\ln\text{Life Expectancy}}_i=3.967+0.013 \, \text{GDP}_i$$

A $1 (thousand) change in GDP \(\rightarrow\) a \(0.013 \times 100\%=\) 1.3% increase in Life Expectancy

A $25 (thousand) fall in GDP \(\rightarrow\) a \((-25 \times 1.3\%)=\) 32.5% decrease in Life Expectancy

A $100 (thousand) rise in GDP \(\rightarrow\) a \((100 \times 1.3\%)=\) 130% increase in Life Expectancy
```r
ggplot(data = gapminder)+
  aes(x = gdp_t, y = loglife)+
  geom_point(color = "blue", alpha = 0.5)+
  geom_smooth(method = "lm", color = "orange")+
  scale_x_continuous(labels = scales::dollar, breaks = seq(0, 120, 20))+
  labs(x = "GDP per Capita ($ Thousands)", y = "Log Life Expectancy")+
  ggthemes::theme_pander(base_family = "Fira Sans Condensed", base_size = 16)
```
$$\begin{align*} \color{#e64173}{\ln Y_i}&=\beta_0+\beta_1 \color{#e64173}{\ln X_i}\\ \beta_1&=\cfrac{\big(\frac{\Delta Y}{Y}\big)}{\big(\frac{\Delta X}{X}\big)}\\ \end{align*}$$
Marginal effect of \(\mathbf{X \rightarrow Y}\): a 1% change in \(X \rightarrow\) a \(\beta_1\) % change in \(Y\)
\(\beta_1\) is the elasticity of \(Y\) with respect to \(X\)!
```r
log_log_reg <- lm(loglife ~ loggdp, data = gapminder)
log_log_reg %>% tidy()
```

$$\widehat{\text{ln Life Expectancy}}_i=2.864+0.147 \, \text{ln GDP}_i$$

A 1% change in GDP \(\rightarrow\) a 0.147% increase in Life Expectancy

A 25% fall in GDP \(\rightarrow\) a \((-25 \times 0.147\%)=\) 3.675% decrease in Life Expectancy

A 100% rise in GDP \(\rightarrow\) a \((100 \times 0.147\%)=\) 14.7% increase in Life Expectancy
```r
ggplot(data = gapminder)+
  aes(x = loggdp, y = loglife)+
  geom_point(color = "blue", alpha = 0.5)+
  geom_smooth(method = "lm", color = "orange")+
  labs(x = "Log GDP per Capita", y = "Log Life Expectancy")+
  ggthemes::theme_pander(base_family = "Fira Sans Condensed", base_size = 16)
```
Model | Equation | Interpretation |
---|---|---|
Linear-Log | \(Y=\beta_0+\beta_1 \color{#e64173}{\ln X}\) | 1% change in \(X \rightarrow \frac{\hat{\beta_1}}{100}\) unit change in \(Y\) |
Log-Linear | \(\color{#e64173}{\ln Y}=\beta_0+\beta_1X\) | 1 unit change in \(X \rightarrow \hat{\beta_1}\times 100\)% change in \(Y\) |
Log-Log | \(\color{#e64173}{\ln Y}=\beta_0+\beta_1\color{#e64173}{\ln X}\) | 1% change in \(X \rightarrow \hat{\beta_1}\)% change in \(Y\) |
```r
library(huxtable)
huxreg("Life Exp." = lin_log_reg,
       "Log Life Exp." = log_lin_reg,
       "Log Life Exp." = log_log_reg,
       coefs = c("Constant" = "(Intercept)",
                 "GDP ($1000s)" = "gdp_t",
                 "Log GDP" = "loggdp"),
       statistics = c("N" = "nobs",
                      "R-Squared" = "r.squared",
                      "SER" = "sigma"),
       number_format = 2)
```
|  | Life Exp. | Log Life Exp. | Log Life Exp. |
|---|---|---|---|
| Constant | -9.10 *** | 3.97 *** | 2.86 *** |
|  | (1.23) | (0.01) | (0.02) |
| GDP ($1000s) |  | 0.01 *** |  |
|  |  | (0.00) |  |
| Log GDP | 8.41 *** |  | 0.15 *** |
|  | (0.15) |  | (0.00) |
| N | 1704 | 1704 | 1704 |
| R-Squared | 0.65 | 0.30 | 0.61 |
| SER | 7.62 | 0.19 | 0.14 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
| Linear-Log | Log-Linear | Log-Log |
|---|---|---|
| \(\hat{Y_i}=\hat{\beta_0}+\hat{\beta_1}\color{#e64173}{\ln X_i}\) | \(\color{#e64173}{\ln Y_i}=\hat{\beta_0}+\hat{\beta_1}X_i\) | \(\color{#e64173}{\ln Y_i}=\hat{\beta_0}+\hat{\beta_1}\color{#e64173}{\ln X_i}\) |
| \(R^2=0.65\) | \(R^2=0.30\) | \(R^2=0.61\) |
$$\hat{Y_i}=\beta_0+\beta_1 X_1+\beta_2 X_2 $$
We often want to compare coefficients to see which variable \(X_1\) or \(X_2\) has a bigger effect on \(Y\)
What if \(X_1\) and \(X_2\) are different units?
Example: $$\begin{align*} \widehat{\text{Salary}_i}&=\beta_0+\beta_1\, \text{Batting average}_i+\beta_2\, \text{Home runs}_i\\ \widehat{\text{Salary}_i}&=-\text{2,869,439.40}+\text{12,417,629.72} \, \text{Batting average}_i+\text{129,627.36}\, \text{Home runs}_i\\ \end{align*}$$
$$X_Z=\frac{X_i-\overline{X}}{sd(X)}$$
† Also called “centering” or “scaling.”
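As a quick sanity check (a minimal sketch with made-up numbers), R's built-in `scale()` reproduces the formula above:

```r
x <- c(2, 4, 6, 8)

# standardize by hand: subtract the mean, divide by the standard deviation
manual_z <- (x - mean(x)) / sd(x)

# compare with R's built-in scale()
all.equal(as.numeric(scale(x)), manual_z) # TRUE
```

`scale()` centers and scales by default, which is exactly the \(X_Z\) transformation.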
Variable | Mean | Std. Dev. |
---|---|---|
Salary | $2,024,616 | $2,764,512 |
Batting Average | 0.267 | 0.031 |
Home Runs | 12.11 | 10.31 |
$$\begin{align*}\scriptsize \widehat{\text{Salary}_i}&=-\text{2,869,439.40}+\text{12,417,629.72} \, \text{Batting average}_i+\text{129,627.36} \, \text{Home runs}_i\\ \widehat{\text{Salary}_Z}&=\text{0.00}+\text{0.14} \, \text{Batting average}_Z+\text{0.48} \, \text{Home runs}_Z\\ \end{align*}$$
Marginal effects on \(Y\) (in standard deviations of \(Y\)) from 1 standard deviation change in \(X\):
\(\hat{\beta_1}\): a 1 standard deviation increase in Batting Average increases Salary by 0.14 standard deviations
$$0.14 \times \$2,764,512=\$387,032$$
\(\hat{\beta_2}\): a 1 standard deviation increase in Home Runs increases Salary by 0.48 standard deviations

$$0.48 \times \$2,764,512=\$1,326,966$$
In R:

| Variable | Mean | SD |
|---|---|---|
| `lifeExp` | 59.47 | 12.92 |
| `gdpPercap` | $7215.32 | $9857.46 |

Use the `scale()` command inside `mutate()` to standardize a variable:

```r
gapminder <- gapminder %>%
  mutate(life_Z = scale(lifeExp),
         gdp_Z = scale(gdpPercap))

std_reg <- lm(life_Z ~ gdp_Z, data = gapminder)
tidy(std_reg)
```

```
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept) 1.10e-16    0.0197  5.57e-15 1.00e+  0
## 2 gdp_Z       5.84e- 1    0.0197  2.97e+ 1 3.57e-156
```
A 1 standard deviation increase in `gdpPercap` will increase `lifeExp` by 0.584 standard deviations \((0.584 \times 12.92 = 7.55\) years\()\)

Example: Return again to:
$$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Male_i+\hat{\beta_2}Northeast_i+\hat{\beta_3}Midwest_i+\hat{\beta_4}South_i$$

Maybe region doesn't affect wages at all?

\(H_0: \beta_2=0, \, \beta_3=0, \, \beta_4=0\)

This is a joint hypothesis to test
A joint hypothesis tests against the null hypothesis of a value for multiple parameters: $$\mathbf{H_0: \beta_1= \beta_2=0}$$ the hypotheses that multiple regressors are equal to zero (have no causal effect on the outcome)
Our alternative hypothesis is that: $$H_1: \text{ either } \beta_1\neq0\text{ or } \beta_2\neq0\text{ or both}$$ or simply, that \(H_0\) is not true
1) \(H_0\): \(\beta_1=\beta_2=0\)

2) \(H_0\): \(\beta_1=\beta_2\)

3) \(H_0:\) ALL \(\beta\)'s \(=0\)
The F-statistic is the test-statistic used to test joint hypotheses about regression coefficients with an F-test

This involves comparing two models: an unrestricted model (all variables) and a restricted model (imposing the null hypothesis)

\(F\) is an analysis of variance (ANOVA)

\(F\) has its own distribution, with two sets of degrees of freedom
Example: Return again to:

$$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Male_i+\hat{\beta_2}Northeast_i+\hat{\beta_3}Midwest_i+\hat{\beta_4}South_i$$

\(H_0: \beta_2=\beta_3=\beta_4=0\)

\(H_a\): \(H_0\) is not true (at least one \(\beta_i \neq 0\))

Unrestricted model (all variables):

$$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Male_i+\hat{\beta_2}Northeast_i+\hat{\beta_3}Midwest_i+\hat{\beta_4}South_i$$

Restricted model (under \(H_0\)):

$$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Male_i$$
$$F_{q,(n-k-1)}=\cfrac{\left(\displaystyle\frac{(\color{#e64173}{R^2_u}-\color{#6A5ACD}{R^2_r})}{q}\right)}{\left(\displaystyle\frac{(1-\color{#e64173}{R^2_u})}{(n-k-1)}\right)}$$

\(\color{#e64173}{R^2_u}\): the \(R^2\) from the unrestricted model (all variables)

\(\color{#6A5ACD}{R^2_r}\): the \(R^2\) from the restricted model (under the null hypothesis)

\(q\): number of restrictions (number of \(\beta\)'s \(=0\) under the null hypothesis)

\(k\): number of \(X\) variables in the unrestricted model (all variables)

\(F\) has two sets of degrees of freedom: \(q\) in the numerator and \(n-k-1\) in the denominator
Key takeaway: The bigger the difference between \((R^2_u-R^2_r)\), the greater the improvement in fit by adding variables, the larger the \(F\)!
This formula is (believe it or not) actually a simplified version (assuming homoskedasticity)
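As a sketch of how this works, we could compute \(F\) by hand from the two models' \(R^2\); this assumes the homoskedastic formula above, applied to the `unrestricted_reg` and `restricted_reg` objects estimated from the `wage1` data below:

```r
# by-hand F-statistic from the two models' R-squared values
# (assumes unrestricted_reg and restricted_reg have been estimated)
r2_u <- summary(unrestricted_reg)$r.squared # unrestricted R^2
r2_r <- summary(restricted_reg)$r.squared   # restricted R^2

q <- 3   # number of restrictions
n <- 526 # observations
k <- 4   # regressors in the unrestricted model

F_stat <- ((r2_u - r2_r) / q) / ((1 - r2_u) / (n - k - 1))
F_stat # close to the value linearHypothesis() reports
```

Since both models share the same total sum of squares, this \(R^2\) version is algebraically the same as the sum-of-squares version that `linearHypothesis()` prints.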
Example: Let's use the `wooldridge` package's `wage1` data again

```r
# load in data from wooldridge package
library(wooldridge)
wages <- wage1

# run regressions
unrestricted_reg <- lm(wage ~ female + northcen + west + south, data = wages)
restricted_reg <- lm(wage ~ female, data = wages)
```
$$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Female_i+\hat{\beta_2}Northcen_i+\hat{\beta_3}West_i+\hat{\beta_4}South_i$$
$$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Female_i$$
\(H_0: \beta_2 = \beta_3 = \beta_4 =0\)
\(q = 3\) restrictions (F numerator df)
\(n-k-1 = 526-4-1=521\) (F denominator df)
Use the `car` package's `linearHypothesis()` command to run an \(F\)-test:

```r
# load car package for additional regression tools
library(car)

# F-test
linearHypothesis(unrestricted_reg, c("northcen", "west", "south"))
```

```
## Linear hypothesis test
## 
## Hypothesis:
## northcen = 0
## west = 0
## south = 0
## 
## Model 1: restricted model
## Model 2: wage ~ female + northcen + west + south
## 
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1    524 6332.2                                
## 2    521 6174.8  3    157.36 4.4258 0.004377 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
Example:

$$\widehat{wage_i}=\beta_0+\beta_1 \text{Adolescent height}_i + \beta_2 \text{Adult height}_i + \beta_3 \text{Male}_i$$

$$H_0: \beta_1=\beta_2$$

Under \(H_0\), the restricted model becomes:

$$\widehat{wage_i}=\beta_0+\beta_1(\text{Adolescent height}_i + \text{Adult height}_i )+ \beta_3 \text{Male}_i$$
```r
# load in data
heightwages <- read_csv("../data/heightwages.csv")

# make a "heights" variable as the sum of adolescent (height81) and adult (height85) height
heightwages <- heightwages %>%
  mutate(heights = height81 + height85)

height_reg <- lm(wage96 ~ height81 + height85 + male, data = heightwages)
height_restricted_reg <- lm(wage96 ~ heights + male, data = heightwages)
```
```r
linearHypothesis(height_reg, "height81 = height85") # F-test
```

```
## Linear hypothesis test
## 
## Hypothesis:
## height81 - height85 = 0
## 
## Model 1: restricted model
## Model 2: wage96 ~ height81 + height85 + male
## 
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1   6591 5128243                           
## 2   6590 5127284  1     959.2 1.2328 0.2669
```
Insufficient evidence to reject \(H_0\)!
We cannot reject that adolescent and adult height have the same effect on wages
```r
summary(unrestricted_reg)
```

```
## 
## Call:
## lm(formula = wage ~ female + northcen + west + south, data = wages)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3269 -2.0105 -0.7871  1.1898 17.4146 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.5654     0.3466  21.827   <2e-16 ***
## female       -2.5652     0.3011  -8.520   <2e-16 ***
## northcen     -0.5918     0.4362  -1.357   0.1755    
## west          0.4315     0.4838   0.892   0.3729    
## south        -1.0262     0.4048  -2.535   0.0115 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.443 on 521 degrees of freedom
## Multiple R-squared:  0.1376, Adjusted R-squared:  0.131 
## F-statistic: 20.79 on 4 and 521 DF,  p-value: 6.501e-16
```
The `F-statistic` reported at the bottom of `summary()` is the "All F-test" \((H_0:\) ALL \(\beta\)'s \(=0)\); if it is high enough, it is significant (`p-value` \(<0.05\)) enough to reject \(H_0\)

With `broom` instead of `summary()`:

The `glance()` command makes a table of regression summary statistics

`tidy()` only shows the coefficients

```r
library(broom)
glance(unrestricted_reg)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.138         0.131  3.44      20.8 6.50e-16     4 -1394. 2800. 2826.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```
`statistic` is the All F-test; the `p.value` next to it is the p-value from the F-test
gapminder
examplegapminder
example$$\color{red}{\widehat{\text{Life Expectancy}_i}=\hat{\beta_0}+\hat{\beta_1}\text{GDP}_i}$$
gapminder
example$$\color{red}{\widehat{\text{Life Expectancy}_i}=\hat{\beta_0}+\hat{\beta_1}\text{GDP}_i}$$
$$\color{green}{\widehat{\text{Life Expectancy}_i}=\hat{\beta_0}+\hat{\beta_1}\text{GDP}_i+\hat{\beta_2}\text{GDP}_i^2}$$
gapminder
example$$\color{red}{\widehat{\text{Life Expectancy}_i}=\hat{\beta_0}+\hat{\beta_1}\text{GDP}_i}$$
$$\color{green}{\widehat{\text{Life Expectancy}_i}=\hat{\beta_0}+\hat{\beta_1}\text{GDP}_i+\hat{\beta_2}\text{GDP}_i^2}$$
$$\color{orange}{\widehat{\text{Life Expectancy}_i}=\hat{\beta_0}+\hat{\beta_1}\ln \text{GDP}_i}$$
Another useful model for nonlinear data is the logarithmic model†
Logarithmic model has two additional advantages
† Don’t confuse this with a logistic (logit) model for dependent dummy variables.
The exponential function, \(Y=e^X\) or \(Y=exp(X)\), where base \(e=2.71828...\)
Natural logarithm is the inverse, \(Y=ln(X)\)
\(\ln(\frac{1}{x})=-\ln(x)\)
\(\ln(ab)=\ln(a)+\ln(b)\)
\(\ln(\frac{x}{a})=\ln(x)-\ln(a)\)
\(\ln(x^a)=a \, \ln(x)\)
\(\frac{d \, \ln \, x}{d \, x} = \frac{1}{x}\)
$$\underbrace{\ln(x+\Delta x) - \ln(x)}_{\text{Difference in logs}} \approx \underbrace{\frac{\Delta x}{x}}_{\text{Relative change}}$$
$$\underbrace{\ln(x+\Delta x) - \ln(x)}_{\text{Difference in logs}} \approx \underbrace{\frac{\Delta x}{x}}_{\text{Relative change}}$$
Example: Let \(x=100\) and \(\Delta x =1\), relative change is:
$$\frac{\Delta x}{x} = \frac{(101-100)}{100} = 0.01 \text{ or }1\%$$
$$\underbrace{\ln(x+\Delta x) - \ln(x)}_{\text{Difference in logs}} \approx \underbrace{\frac{\Delta x}{x}}_{\text{Relative change}}$$
Example: Let \(x=100\) and \(\Delta x =1\), relative change is:
$$\frac{\Delta x}{x} = \frac{(101-100)}{100} = 0.01 \text{ or }1\%$$
$$\epsilon_{Y,X}=\frac{\% \Delta Y}{\% \Delta X} =\cfrac{\left(\frac{\Delta Y}{Y}\right)}{\left( \frac{\Delta X}{X}\right)}$$
$$\epsilon_{Y,X}=\frac{\% \Delta Y}{\% \Delta X} =\cfrac{\left(\frac{\Delta Y}{Y}\right)}{\left( \frac{\Delta X}{X}\right)}$$
$$\epsilon_{Y,X}=\frac{\% \Delta Y}{\% \Delta X} =\cfrac{\left(\frac{\Delta Y}{Y}\right)}{\left( \frac{\Delta X}{X}\right)}$$
One of the (many) reasons why economists love Cobb-Douglas functions: $$Y=AL^{\alpha}K^{\beta}$$
Taking logs, relationship becomes linear:
One of the (many) reasons why economists love Cobb-Douglas functions: $$Y=AL^{\alpha}K^{\beta}$$
Taking logs, relationship becomes linear:
$$\ln(Y)=\ln(A)+\alpha \ln(L)+ \beta \ln(K)$$
One of the (many) reasons why economists love Cobb-Douglas functions: $$Y=AL^{\alpha}K^{\beta}$$
Taking logs, relationship becomes linear:
$$\ln(Y)=\ln(A)+\alpha \ln(L)+ \beta \ln(K)$$
Example: Cobb-Douglas production function: $$Y=2L^{0.75}K^{0.25}$$
Example: Cobb-Douglas production function: $$Y=2L^{0.75}K^{0.25}$$
$$\ln Y=\ln 2+0.75 \ln L + 0.25 \ln K$$
Example: Cobb-Douglas production function: $$Y=2L^{0.75}K^{0.25}$$
$$\ln Y=\ln 2+0.75 \ln L + 0.25 \ln K$$
A 1% change in \(L\) will yield a 0.75% change in output \(Y\)
A 1% change in \(K\) will yield a 0.25% change in output \(Y\)
log()
function can easily take the logarithmgapminder <- gapminder %>% mutate(loggdp = log(gdpPercap)) # log GDP per capitagapminder %>% head() # look at it
log()
by default is the natural logarithm \(ln()\), i.e. base e
log(x, base = 5)
log10
, log2
log10(100)
## [1] 2
log2(16)
## [1] 4
log(19683, base=3)
## [1] 9
log()
around a variable in the regressionlm(lifeExp ~ loggdp, data = gapminder) %>% tidy()
lm(lifeExp ~ log(gdpPercap), data = gapminder) %>% tidy()
Linear-log model: \(Y_i=\beta_0+\beta_1 \color{#e64173}{\ln X_i}\)
Log-linear model: \(\color{#e64173}{\ln Y_i}=\beta_0+\beta_1X_i\)
Linear-log model: \(Y_i=\beta_0+\beta_1 \color{#e64173}{\ln X_i}\)
Log-linear model: \(\color{#e64173}{\ln Y_i}=\beta_0+\beta_1X_i\)
Log-log model: \(\color{#e64173}{\ln Y_i}=\beta_0+\beta_1 \color{#e64173}{\ln X_i}\)
$$\begin{align*} Y&=\beta_0+\beta_1 \color{#e64173}{\ln X_i}\\ \beta_1&=\cfrac{\Delta Y}{\big(\frac{\Delta X}{X}\big)}\\ \end{align*}$$
$$\begin{align*} Y&=\beta_0+\beta_1 \color{#e64173}{\ln X_i}\\ \beta_1&=\cfrac{\Delta Y}{\big(\frac{\Delta X}{X}\big)}\\ \end{align*}$$
lin_log_reg <- lm(lifeExp ~ loggdp, data = gapminder)library(broom)lin_log_reg %>% tidy()
$$\widehat{\text{Life Expectancy}}_i=-9.10+8.41 \, \text{ln GDP}_i$$
lin_log_reg <- lm(lifeExp ~ loggdp, data = gapminder)library(broom)lin_log_reg %>% tidy()
$$\widehat{\text{Life Expectancy}}_i=-9.10+8.41 \, \text{ln GDP}_i$$
lin_log_reg <- lm(lifeExp ~ loggdp, data = gapminder)library(broom)lin_log_reg %>% tidy()
$$\widehat{\text{Life Expectancy}}_i=-9.10+8.41 \, \text{ln GDP}_i$$
A 1% change in GDP \(\rightarrow\) a \(\frac{9.41}{100}=\) 0.0841 year increase in Life Expectancy
A 25% fall in GDP \(\rightarrow\) a \((-25 \times 0.0841)=\) 2.1025 year decrease in Life Expectancy
lin_log_reg <- lm(lifeExp ~ loggdp, data = gapminder)library(broom)lin_log_reg %>% tidy()
$$\widehat{\text{Life Expectancy}}_i=-9.10+8.41 \, \text{ln GDP}_i$$
A 1% change in GDP \(\rightarrow\) a \(\frac{9.41}{100}=\) 0.0841 year increase in Life Expectancy
A 25% fall in GDP \(\rightarrow\) a \((-25 \times 0.0841)=\) 2.1025 year decrease in Life Expectancy
A 100% rise in GDP \(\rightarrow\) a \((100 \times 0.0841)=\) 8.4100 year increase in Life Expectancy
ggplot(data = gapminder)+ aes(x = gdpPercap, y = lifeExp)+ geom_point(color="blue", alpha=0.5)+ geom_smooth(method="lm", formula=y~log(x), color="orange")+ scale_x_continuous(labels=scales::dollar, breaks=seq(0,120000,20000))+ scale_y_continuous(breaks=seq(0,100,10), limits=c(0,100))+ labs(x = "GDP per Capita", y = "Life Expectancy (Years)")+ ggthemes::theme_pander(base_family = "Fira Sans Condensed", base_size=16)
ggplot(data = gapminder)+ aes(x = loggdp, y = lifeExp)+ geom_point(color="blue", alpha=0.5)+ geom_smooth(method="lm", color="orange")+ scale_y_continuous(breaks=seq(0,100,10), limits=c(0,100))+ labs(x = "Log GDP per Capita", y = "Life Expectancy (Years)")+ ggthemes::theme_pander(base_family = "Fira Sans Condensed", base_size=16)
$$\begin{align*} \color{#e64173}{\ln Y_i}&=\beta_0+\beta_1 X\\ \beta_1&=\cfrac{\big(\frac{\Delta Y}{Y}\big)}{\Delta X}\\ \end{align*}$$
$$\begin{align*} \color{#e64173}{\ln Y_i}&=\beta_0+\beta_1 X\\ \beta_1&=\cfrac{\big(\frac{\Delta Y}{Y}\big)}{\Delta X}\\ \end{align*}$$
We will again have very large/small coefficients if we deal with GDP directly, again let's transform gdpPercap
into $1,000s, call it gdp_t
Then log LifeExp
We will again have very large/small coefficients if we deal with GDP directly, again let's transform gdpPercap
into $1,000s, call it gdp_t
Then log LifeExp
gapminder <- gapminder %>% mutate(gdp_t = gdpPercap/1000, # first make GDP/capita in $1000s loglife = log(lifeExp)) # take the log of LifeExpgapminder %>% head() # look at it
```r
log_lin_reg <- lm(loglife ~ gdp_t, data = gapminder)
log_lin_reg %>% tidy()
```

$$\widehat{\ln \text{Life Expectancy}}_i=3.967+0.013 \, \text{GDP}_i$$
A $1 (thousand) change in GDP \(\rightarrow\) a \(0.013 \times 100\%=\) 1.3% increase in Life Expectancy
A $25 (thousand) fall in GDP \(\rightarrow\) a \((-25 \times 1.3\%)=\) 32.5% decrease in Life Expectancy
A $100 (thousand) rise in GDP \(\rightarrow\) a \((100 \times 1.3\%)=\) 130% increase in Life Expectancy
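One caveat worth noting: these percentage interpretations rest on the approximation \(\ln(1+\Delta) \approx \Delta\), which is accurate only for small changes. For a large change like the $100 (thousand) rise, the exact change implied by the model is

$$\% \Delta \widehat{Y} = 100 \times \left(e^{\hat{\beta_1} \Delta X}-1\right) = 100 \times \left(e^{0.013 \times 100}-1\right) \approx 267\%$$

so the 130% figure is best read as a rough approximation that deteriorates for large changes in \(X\).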
```r
ggplot(data = gapminder)+
  aes(x = gdp_t, y = loglife)+
  geom_point(color = "blue", alpha = 0.5)+
  geom_smooth(method = "lm", color = "orange")+
  scale_x_continuous(labels = scales::dollar, breaks = seq(0, 120, 20))+
  labs(x = "GDP per Capita ($ Thousands)", y = "Log Life Expectancy")+
  ggthemes::theme_pander(base_family = "Fira Sans Condensed", base_size = 16)
```
$$\begin{align*} \color{#e64173}{\ln Y_i}&=\beta_0+\beta_1 \color{#e64173}{\ln X_i}\\ \beta_1&=\cfrac{\big(\frac{\Delta Y}{Y}\big)}{\big(\frac{\Delta X}{X}\big)}\\ \end{align*}$$
Marginal effect of \(\mathbf{X \rightarrow Y}\): a 1% change in \(X \rightarrow\) a \(\beta_1\) % change in \(Y\)
\(\beta_1\) is the elasticity of \(Y\) with respect to \(X\)!
```r
log_log_reg <- lm(loglife ~ loggdp, data = gapminder)
log_log_reg %>% tidy()
```

$$\widehat{\ln \text{Life Expectancy}}_i=2.864+0.147 \, \ln \text{GDP}_i$$
A 1% change in GDP \(\rightarrow\) a 0.147% increase in Life Expectancy
A 25% fall in GDP \(\rightarrow\) a \((-25 \times 0.147\%)=\) 3.675% decrease in Life Expectancy
A 100% rise in GDP \(\rightarrow\) a \((100 \times 0.147\%)=\) 14.7% increase in Life Expectancy
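The same caveat applies here: the elasticity interpretation is an approximation that weakens for large percent changes. The exact prediction of the log-log model for a 100% (doubling) rise in GDP is

$$\% \Delta \widehat{Y} = 100 \times \left(2^{0.147}-1\right) \approx 10.7\%$$

somewhat less than the 14.7% the approximation suggests.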
```r
ggplot(data = gapminder)+
  aes(x = loggdp, y = loglife)+
  geom_point(color = "blue", alpha = 0.5)+
  geom_smooth(method = "lm", color = "orange")+
  labs(x = "Log GDP per Capita", y = "Log Life Expectancy")+
  ggthemes::theme_pander(base_family = "Fira Sans Condensed", base_size = 16)
```
Model | Equation | Interpretation |
---|---|---|
Linear-Log | \(Y=\beta_0+\beta_1 \color{#e64173}{\ln X}\) | 1% change in \(X \rightarrow \frac{\hat{\beta_1}}{100}\) unit change in \(Y\) |
Log-Linear | \(\color{#e64173}{\ln Y}=\beta_0+\beta_1X\) | 1 unit change in \(X \rightarrow \hat{\beta_1}\times 100\)% change in \(Y\) |
Log-Log | \(\color{#e64173}{\ln Y}=\beta_0+\beta_1\color{#e64173}{\ln X}\) | 1% change in \(X \rightarrow \hat{\beta_1}\)% change in \(Y\) |
```r
library(huxtable)
huxreg("Life Exp." = lin_log_reg,
       "Log Life Exp." = log_lin_reg,
       "Log Life Exp." = log_log_reg,
       coefs = c("Constant" = "(Intercept)",
                 "GDP ($1000s)" = "gdp_t",
                 "Log GDP" = "loggdp"),
       statistics = c("N" = "nobs",
                      "R-Squared" = "r.squared",
                      "SER" = "sigma"),
       number_format = 2)
```
| | Life Exp. | Log Life Exp. | Log Life Exp. |
---|---|---|---|
Constant | -9.10 *** | 3.97 *** | 2.86 *** |
 | (1.23) | (0.01) | (0.02) |
GDP ($1000s) | | 0.01 *** | |
 | | (0.00) | |
Log GDP | 8.41 *** | | 0.15 *** |
 | (0.15) | | (0.00) |
N | 1704 | 1704 | 1704 |
R-Squared | 0.65 | 0.30 | 0.61 |
SER | 7.62 | 0.19 | 0.14 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
Linear-Log | Log-Linear | Log-Log |
---|---|---|
*(scatterplot with fitted curve)* | *(scatterplot with fitted curve)* | *(scatterplot with fitted curve)* |
\(\hat{Y_i}=\hat{\beta_0}+\hat{\beta_1}\color{#e64173}{\ln X_i}\) | \(\color{#e64173}{\ln Y_i}=\hat{\beta_0}+\hat{\beta_1}X_i\) | \(\color{#e64173}{\ln Y_i}=\hat{\beta_0}+\hat{\beta_1}\color{#e64173}{\ln X_i}\) |
\(R^2=0.65\) | \(R^2=0.30\) | \(R^2=0.61\) |
$$\hat{Y_i}=\beta_0+\beta_1 X_1+\beta_2 X_2 $$
We often want to compare coefficients to see which variable \(X_1\) or \(X_2\) has a bigger effect on \(Y\)
What if \(X_1\) and \(X_2\) are different units?
Example: $$\begin{align*} \widehat{\text{Salary}_i}&=\beta_0+\beta_1\, \text{Batting average}_i+\beta_2\, \text{Home runs}_i\\ \widehat{\text{Salary}_i}&=-\text{2,869,439.40}+\text{12,417,629.72} \, \text{Batting average}_i+\text{129,627.36}\, \text{Home runs}_i\\ \end{align*}$$
$$X_Z=\frac{X_i-\overline{X}}{sd(X)}$$
† Also called “centering” or “scaling.”
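A minimal sketch of this formula in R (with hypothetical data): standardizing subtracts the mean and divides by the standard deviation, which is exactly what R's built-in scale() function does.

```r
# standardize ("z-score") a variable by hand: z = (x - mean(x)) / sd(x)
x <- c(10, 20, 30, 40, 50)          # hypothetical data
x_Z <- (x - mean(x)) / sd(x)

# a standardized variable has mean 0 and standard deviation 1
mean(x_Z)
sd(x_Z)

# and matches R's built-in scale()
all.equal(x_Z, as.numeric(scale(x)))
```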
Variable | Mean | Std. Dev. |
---|---|---|
Salary | $2,024,616 | $2,764,512 |
Batting Average | 0.267 | 0.031 |
Home Runs | 12.11 | 10.31 |
$$\begin{align*}\scriptsize \widehat{\text{Salary}_i}&=-\text{2,869,439.40}+\text{12,417,629.72} \, \text{Batting average}_i+\text{129,627.36} \, \text{Home runs}_i\\ \widehat{\text{Salary}_Z}&=\text{0.00}+\text{0.14} \, \text{Batting average}_Z+\text{0.48} \, \text{Home runs}_Z\\ \end{align*}$$
Marginal effects on \(Y\) (in standard deviations of \(Y\)) from 1 standard deviation change in \(X\):
\(\hat{\beta_1}\): a 1 standard deviation increase in Batting Average increases Salary by 0.14 standard deviations of Salary:

$$0.14 \times \$2,764,512=\$387,032$$

\(\hat{\beta_2}\): a 1 standard deviation increase in Home Runs increases Salary by 0.48 standard deviations of Salary:

$$0.48 \times \$2,764,512=\$1,326,966$$
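As a check (a sketch using the values from the table above; the identity itself is standard, not stated in the slides): a standardized coefficient equals the raw coefficient times \(sd(X)/sd(Y)\).

```r
# standardized beta = raw beta * sd(X) / sd(Y), using the reported values
sd_salary <- 2764512
raw_ba <- 12417629.72; sd_ba <- 0.031   # batting average
raw_hr <- 129627.36;   sd_hr <- 10.31   # home runs

round(raw_ba * sd_ba / sd_salary, 2)  # 0.14
round(raw_hr * sd_hr / sd_salary, 2)  # 0.48
```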
In R, use the scale() command inside the mutate() function to standardize a variable:

Variable | Mean | SD |
---|---|---|
lifeExp | 59.47 | 12.92 |
gdpPercap | $7,215.32 | $9,857.46 |

```r
gapminder <- gapminder %>%
  mutate(life_Z = scale(lifeExp),   # standardized life expectancy
         gdp_Z = scale(gdpPercap))  # standardized GDP per capita

std_reg <- lm(life_Z ~ gdp_Z, data = gapminder)
tidy(std_reg)
```

## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept) 1.10e-16    0.0197  5.57e-15 1.00e+  0
## 2 gdp_Z       5.84e- 1    0.0197  2.97e+ 1 3.57e-156
A 1 standard deviation increase in gdpPercap will increase lifeExp by 0.584 standard deviations \((0.584 \times 12.92 = 7.55\) years)
Example: Return again to:
$$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Male_i+\hat{\beta_2}Northeast_i+\hat{\beta_3}Midwest_i+\hat{\beta_4}South_i$$
Maybe region doesn't affect wages at all?
\(H_0: \beta_2=0, \, \beta_3=0, \, \beta_4=0\)
This is a joint hypothesis to test
A joint hypothesis specifies values for multiple parameters at once under the null: $$\mathbf{H_0: \beta_1= \beta_2=0}$$ here, the hypothesis that multiple regressors are all equal to zero (have no causal effect on the outcome)
Our alternative hypothesis is that: $$H_1: \text{ either } \beta_1\neq0\text{ or } \beta_2\neq0\text{ or both}$$ or simply, that \(H_0\) is not true
1) \(H_0\): \(\beta_1=\beta_2=0\)
2) \(H_0\): \(\beta_1=\beta_2\)
3) \(H_0:\) ALL \(\beta\)'s \(=0\)
The F-statistic is the test-statistic used to test joint hypotheses about regression coefficients with an F-test
This involves comparing two models:
\(F\) is an analysis of variance (ANOVA)
\(F\) has its own distribution, with two sets of degrees of freedom
Example: Return again to:

$$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Male_i+\hat{\beta_2}Northeast_i+\hat{\beta_3}Midwest_i+\hat{\beta_4}South_i$$

\(H_0: \beta_2=\beta_3=\beta_4=0\)

\(H_a\): \(H_0\) is not true (at least one \(\beta_i \neq 0\))

Unrestricted model: $$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Male_i+\hat{\beta_2}Northeast_i+\hat{\beta_3}Midwest_i+\hat{\beta_4}South_i$$

Restricted model: $$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Male_i$$
$$F_{q,(n-k-1)}=\cfrac{\left(\displaystyle\frac{(\color{#e64173}{R^2_u}-\color{#6A5ACD}{R^2_r})}{q}\right)}{\left(\displaystyle\frac{(1-\color{#e64173}{R^2_u})}{(n-k-1)}\right)}$$
\(\color{#e64173}{R^2_u}\): the \(R^2\) from the unrestricted model (all variables)
\(\color{#6A5ACD}{R^2_r}\): the \(R^2\) from the restricted model (null hypothesis)
\(q\): number of restrictions (number of \(\beta's=0\) under null hypothesis)
\(k\): number of \(X\) variables in unrestricted model (all variables)
\(F\) has two sets of degrees of freedom:
$$F_{q,(n-k-1)}=\cfrac{\left(\displaystyle\frac{(R^2_u-R^2_r)}{q}\right)}{\left(\displaystyle\frac{(1-R^2_u)}{(n-k-1)}\right)}$$
Key takeaway: The bigger the difference between \((R^2_u-R^2_r)\), the greater the improvement in fit by adding variables, the larger the \(F\)!
This formula is (believe it or not) actually a simplified version (assuming homoskedasticity)
We'll use the wooldridge package's wage1 data again:

```r
# load in data from wooldridge package
library(wooldridge)
wages <- wage1

# run regressions
unrestricted_reg <- lm(wage ~ female + northcen + west + south, data = wages)
restricted_reg <- lm(wage ~ female, data = wages)
```
Unrestricted model: $$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Female_i+\hat{\beta_2}Northcen_i+\hat{\beta_3}West_i+\hat{\beta_4}South_i$$

Restricted model: $$\widehat{Wage_i}=\hat{\beta_0}+\hat{\beta_1}Female_i$$
\(H_0: \beta_2 = \beta_3 = \beta_4 =0\)
\(q = 3\) restrictions (F numerator df)
\(n-k-1 = 526-4-1=521\) (F denominator df)
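With these degrees of freedom we can look up the 5% critical value directly; a quick sketch using base R's qf() quantile function:

```r
# 5% critical value for F with q = 3 numerator and n - k - 1 = 521 denominator df
qf(0.95, df1 = 3, df2 = 521)  # about 2.62; reject H0 if the F-statistic exceeds this
```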
Use the car package's linearHypothesis() command to run an \(F\)-test:

```r
# load car package for additional regression tools
library(car)

# F-test
linearHypothesis(unrestricted_reg, c("northcen", "west", "south"))
```

## Linear hypothesis test
##
## Hypothesis:
## northcen = 0
## west = 0
## south = 0
##
## Model 1: restricted model
## Model 2: wage ~ female + northcen + west + south
##
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)
## 1    524 6332.2
## 2    521 6174.8  3    157.36 4.4258 0.004377 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
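As a sanity check (a sketch, not part of the original output), we can recover this F-statistic by hand from the R-squared formula, using reported values for this regression (unrestricted \(R^2 = 0.1376\); residual sums of squares of 6332.2 restricted and 6174.8 unrestricted):

```r
# rebuild the F-statistic from reported quantities (base R only)
R2_u  <- 0.1376             # unrestricted R-squared
RSS_u <- 6174.8             # unrestricted residual sum of squares
RSS_r <- 6332.2             # restricted residual sum of squares

# both models share the same total sum of squares, so we can back out R2_r
TSS  <- RSS_u / (1 - R2_u)
R2_r <- 1 - RSS_r / TSS

q <- 3; n <- 526; k <- 4
F_stat <- ((R2_u - R2_r) / q) / ((1 - R2_u) / (n - k - 1))
round(F_stat, 2)            # about 4.43, matching linearHypothesis()
```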
Example:
$$\widehat{wage_i}=\beta_0+\beta_1 \text{Adolescent height}_i + \beta_2 \text{Adult height}_i + \beta_3 \text{Male}_i$$
$$H_0: \beta_1=\beta_2$$
$$\widehat{wage_i}=\beta_0+\beta_1(\text{Adolescent height}_i + \text{Adult height}_i )+ \beta_3 \text{Male}_i$$
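To see why the restricted model takes this form, impose the null \(H_0: \beta_1=\beta_2\) on the unrestricted model and factor out the common coefficient:

$$\begin{align*} \widehat{wage_i}&=\beta_0+\beta_1 \text{Adolescent height}_i + \beta_1 \text{Adult height}_i + \beta_3 \text{Male}_i\\ &=\beta_0+\beta_1(\text{Adolescent height}_i + \text{Adult height}_i) + \beta_3 \text{Male}_i\\ \end{align*}$$

This single equality counts as \(q=1\) restriction in the F-test's numerator degrees of freedom.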
```r
# load in data
heightwages <- read_csv("../data/heightwages.csv")

# make a "heights" variable as the sum of adolescent (height81) and adult (height85) height
heightwages <- heightwages %>%
  mutate(heights = height81 + height85)

height_reg <- lm(wage96 ~ height81 + height85 + male, data = heightwages)
height_restricted_reg <- lm(wage96 ~ heights + male, data = heightwages)
```
```r
linearHypothesis(height_reg, "height81 = height85")  # F-test
```

## Linear hypothesis test
##
## Hypothesis:
## height81 - height85 = 0
##
## Model 1: restricted model
## Model 2: wage96 ~ height81 + height85 + male
##
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1   6591 5128243
## 2   6590 5127284  1     959.2 1.2328 0.2669
Insufficient evidence to reject \(H_0\)!

The data are consistent with adolescent height and adult height having the same effect on wages
```r
summary(unrestricted_reg)
```

##
## Call:
## lm(formula = wage ~ female + northcen + west + south, data = wages)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.3269 -2.0105 -0.7871  1.1898 17.4146
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   7.5654     0.3466  21.827   <2e-16 ***
## female       -2.5652     0.3011  -8.520   <2e-16 ***
## northcen     -0.5918     0.4362  -1.357   0.1755
## west          0.4315     0.4838   0.892   0.3729
## south        -1.0262     0.4048  -2.535   0.0115 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.443 on 521 degrees of freedom
## Multiple R-squared:  0.1376, Adjusted R-squared:  0.131
## F-statistic: 20.79 on 4 and 521 DF,  p-value: 6.501e-16
The last line of summary() output reports the "All F-test": an F-statistic that, if high enough (p-value \(<0.05\)), is significant enough to reject \(H_0\) that all coefficients equal zero

With broom instead of summary():

the glance() command makes a table of regression summary statistics

tidy() only shows coefficients

```r
library(broom)
glance(unrestricted_reg)
```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.138         0.131  3.44      20.8 6.50e-16     4 -1394. 2800. 2826.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
In the glance() output, statistic is the All F-test statistic, and the p.value next to it is the p-value from that F-test