Set Up
To minimize confusion, I suggest creating a new R Project
(e.g. nonlinear_practice
) and storing any data in that folder on your computer.
I have already made an R project you can download (as a .zip
), unzip, and open the .Rproj
file in R Studio, or there is an R project you can use on the cloud:
Answers
Question 1
We are returning to the speeding tickets data that we began to explore in R Practice 3.4 on Multivariate Regression and R Practice 3.7 on Interaction Effects. Download and read in (read_csv
) the data below.
This data again comes from a paper by Makowsky and Strattman (2009). Even though state law sets a formula for tickets based on how fast a person was driving, police officers in practice often deviate from that formula. This dataset includes information on all traffic stops. An amount for the fine is given only for observations in which the police officer decided to assess a fine. There are a number of variables in this dataset, but the one’s we’ll look at are:
Variable | Description |
---|---|
Amount |
Amount of fine (in dollars) assessed for speeding |
Age |
Age of speeding driver (in years) |
MPHover |
Miles per hour over the speed limit |
Black |
Dummy \(=1\) if driver was black, \(=0\) if not |
Hispanic |
Dummy \(=1\) if driver was Hispanic, \(=0\) if not |
Female |
Dummy \(=1\) if driver was female, \(=0\) if not |
OutTown |
Dummy \(=1\) if driver was not from local town, \(=0\) if not |
OutState |
Dummy \(=1\) if driver was not from local state, \(=0\) if not |
StatePol |
Dummy \(=1\) if driver was stopped by State Police, \(=0\) if stopped by other (local) |
We again want to explore who gets fines, and how much.
Question 2
Run a regression of Amount
on Age
. Write out the estimated regression equation, and interpret the coefficient on Age.
Question 3
Is the effect of Age
on Amount
nonlinear? Let’s run a quadratic regression.
Part A
Create a new variable for \(Age^2\). Then run a quadratic regression:
\[\widehat{\text{Amount}_i}=\beta_0+\beta_1 \text{Age}_i+\beta_2 \text{Age}_i^2\]
Part B
Try running the same regression using the alternate notation: lm(Y ~ X + I(X^2))
, replacing X
and Y
with our variables. This method allows you to run a quadratric regression without having to create a new variable first. Do you get the same results?
Part C
Write out the estimated regression equation.
Part D
Is this model an improvement from the linear model? Compared \(\bar{R}^2\).
Part E
Is the coefficient on the quadratic term statistically significantly different from zero? i.e. could we reject \(H_0: \beta_2\)?
Part F
Write an equation for the marginal effect of Age
on Amount
.
Part G
Predict the marginal effect on Amount
of being one year older when you are 18. How about when you are 40?
Part H
Our quadratic function is a \(U\)-shape. According to the model, at what age is the amount of the fine minimized?
Part I
Create a scatterplot between Amount
(y
) and Age
(x
). Add a layer with a linear regression (as usual, geom_smooth(method = "lm")
), and an additional layer of with the predicted quadratic regression curve. This additional layer is similar but we need to specify the formula of the curve to be quadratic:
geom_smooth(method = "lm", formula = "y ~ x + I(x^2)")
Part J
It’s quite hard to see the quadratic curve with all those data points. Redo another plot and this time, only keep the quadratic stat_smooth()
layer and leave out the geom_point()
layer. This will only plot the regression curve.
Question 4
Should we use a higher-order polynomial equation? Run a cubic regression, and determine whether it is necessary:
\[\widehat{\text{Amount}_i}=\beta_0+\beta_1 \text{Age}_i+\beta_2 \text{Age}_i^2+\beta_3 \text{Age}_i^3\]
Question 5
Run an \(F\)-test to check if a nonlinear model is appropriate. Use the car
package (which you will need to load, and install if you do not have it).
Your null hypothesis is \(H_0: \beta_2=\beta_3=0\) from the regression in question 4. The command is
library(car)
linearHypothesis(reg_name, # name of your saved regression object
c("var1", "var2")) # name of the variables you are testing
Question 6
Now let’s take a look at speed (MPHover
the speed limit).
Part A
Creating new variables as necessary, run a linear-log model of Amount
on MPHover
. Write down the estimated regression equation, and interpret the coefficient on MPHover
\((\hat{\beta_1})\). Make a scatterplot with the regression line. Hint: The simple geom_smooth(method = "lm")
layer is sufficient, so long as you use the right variables on the plot!
Part B
Creating new variables as necessary, run a log-linear model of Amount
on MPHover
. Write down the estimated regression equation, and interpret the coefficient on MPHover
\((\hat{\beta_1})\). Make a scatterplot with the regression line. Hint: The simple geom_smooth(method = "lm")
is sufficient, so long as you use the right variables on the plot!
Part C
Creating new variables as necessary, run a log-log model of Amount
on MPHover
. Write down the estimated regression equation, and interpret the coefficient on MPHover
\((\hat{\beta_1})\). Make a scatterplot with the regression line. Hint: The simple geom_smooth(method = "lm")
is sufficient, so long as you use the right variables on the plot!
Part D
Which of the three log models has the best fit? Hint: Check \(R^2\)
Question 7
Return to the quadratic model from Question 3. Run a quadratic regression of Amount
on Age
, Age
\(^2\), MPHover
, and all of the race dummy variables. Test the null hypothesis: “the race of the driver has no effect on Amount”
Question 8
Now let’s try standardizing variables. Let’s try running a regression of Amount
on Age
and MPHover
, but standardizing each variable.
Part A
Create new standardized variables for Amount
, Age
, and MPHover
:
data <- data %>% # or whatever your dataframe is called
mutate(Amount_Z = scale(Amount),
Age_Z = scale(Age),
MPHover_Z = scale(MPHover))
Part B
Run a regression of standardized Amount_Z
on standardized Age_Z
and MPHover_Z
. Interpret \(\hat{\beta_1}\) and \(\hat{\beta_2}\). Which variable has a bigger effect on Amount
?
Question 9
Make a regression output table with huxtable
of your regressions in Questions 2, 3, 4, 6a, 6b, 6c, 7 and 8.