Set Up
To minimize confusion, I suggest creating a new R Project
(e.g. interaction_practice
) and storing any data in that folder on your computer.
I have already made an R project you can download (as a .zip
), unzip, and open the .Rproj
file in R Studio, or there is an R project you can use on the cloud:
Answers
Question 1
We are returning to the speeding tickets data that we began to explore in R Practice 3.4 on Multivariate Regression. Download and read in (read_csv
) the data below.
This data again comes from a paper by Makowsky and Strattman (2009). Even though state law sets a formula for tickets based on how fast a person was driving, police officers in practice often deviate from that formula. This dataset includes information on all traffic stops. An amount for the fine is given only for observations in which the police officer decided to assess a fine. There are a number of variables in this dataset, but the one’s we’ll look at are:
Variable | Description |
---|---|
Amount |
Amount of fine (in dollars) assessed for speeding |
Age |
Age of speeding driver (in years) |
MPHover |
Miles per hour over the speed limit |
Black |
Dummy \(=1\) if driver was black, \(=0\) if not |
Hispanic |
Dummy \(=1\) if driver was Hispanic, \(=0\) if not |
Female |
Dummy \(=1\) if driver was female, \(=0\) if not |
OutTown |
Dummy \(=1\) if driver was not from local town, \(=0\) if not |
OutState |
Dummy \(=1\) if driver was not from local state, \(=0\) if not |
StatePol |
Dummy \(=1\) if driver was stopped by State Police, \(=0\) if stopped by other (local) |
We again want to explore who gets fines, and how much.
Question 2
We will have to do a little more cleaning to get some of the data into a more usable form.
Part A
Inspect the data with str()
or head()
or glimpse()
to see what it looks like.
Part B
What class
of variable are Black
, Hispanic
, Female
, OutTown
, and OutState
?
Hint: use the class(df$variable)
command to ask what class something is, where df
is your dataframe, and variable
is the name of a variable. Alternatively, it should also show the class automatically from the commands from Part A.
Part C
Notice that when importing the data from the .csv
file, R
interpreted these variables as numeric
(num
) or double
(dbl
), but we want them to be factor
(fct
) variables, to ensure R
recognizes that there are two groups (categories), 0 and 1. Convert each of these variables into factors by reassigning it according to the format:
df <- df %>% # where df is your data
mutate(my_var = as.factor(my_var) # where my_var is the variable
)
You could do this for each variable, or all at once, using mutate_at()
:
df <- df %>%
mutate_at(c("Black", "Hispanic", "Female", "OutTown", "OutState"), factor)
Recall the class on Data Wrangling for all the mutate possibilities!
Part D
Confirm they are each now factors by checking their class or looking at the data again.
Part E
Get a summary()
of Amount
using:
df %>% # or whatever your dataframe is called
select(Amount) %>%
summary()
Note that there are a lot of NA
’s (these are people that were stopped but did not receive a fine)! Let’s filter()
to use only those observations for which Amount
is a positive number, and save this in your dataframe (assign and overwrite it, or make a new dataframe).
df <- df %>% # overwrite or make a new dataframe
filter(Amount > 0)
# verify it worked
df %>%
select(Amount) %>%
summar()
Question 3
Create a scatterplot between Amount
(as y
) and Female (as x
).
Hint: Use geom_jitter()
instead of geom_point()
to plot the points, and play around with width
settings inside geom_jitter()
Question 4
Now let’s start looking at the distribution conditionally to find the different group means.
Part A
Find the mean and standard deviation of Amount
for male drivers and again for female drivers.
Hint: properly filter()
the data and then use the summarize()
command.
Part B
What is the difference between the average Amount for Males and Females?
Part C
We did not go over how to do this in class, but you can run a t-test for the difference in group means to see if the difference is statistically significant. The syntax is similar for a regression:
t.test(Amount ~ Female,
data = df)
Is there a statistically significant difference between Amount
for male and female drivers? Hint: this is like any hypothesis test. Here \(H_0: \text{difference}=0\). A \(t\)-value needs to be large enough to be greater than a critical value of \(t\). Check the \(p\)-value and see if it is less than our standard of \(\alpha=0.05.\)
Question 5
Now run the following regression to ensure we get the same result as the t-test.
\[\text{Amount}_i=\hat{\beta_0}+\hat{\beta_1}Female_i\]
Part A
Write out the estimated regression equation.
Part B
Use the regression coefficients to find
- the average
Amount
for men
- the average
- the average
Amount
for women
- the average
- the difference in average
Amount
between men and women
- the difference in average
Question 6
Let’s recode the sex variable to Male
instead of Female.
Part A
Make a new variable called Male
and save it in your dataframe using the ifelse()
command:
df <- df %>% # overwrite or save as another dataframe
mutate(Male = ifelse(Female == 0, # test observation to see if Female is 0
yes = 1, # if yes (a Male), code Male as 1
no = 0), # if no (a Female), code Male as 0
)
# Verify it worked
df %>%
select(Female, Male)
Part B
Run the same regression as in question 5, but use Male
instead of Female
.
Part C
Write out the estimated regression equation.
Part D
Use the regression coefficients to find
- the average
Amount
for men
- the average
- the average
Amount
for women
- the average
- the difference in average
Amount
between men and women
- the difference in average
Question 7
Run a regression of Amount
on Male
and Female
. What happens, and why?
Question 8
Age
probably has a lot to do with differences in fines, perhaps also age affects fines differences between males and females.
Part A
Run a regression of Amount
on Age
and Female.
How does the coefficient on Female
change?
Part B
Now let’s see if the difference in fine between men and women are different depending on the driver’s age. Run the regression again, but add an interaction term between Female
and Age
, using Female*Age
or Female:Age
.
Part C
Write out your estimated regression equation.
Part D
Interpret the interaction effect. Is it statistically significant?
Part E
Plugging in 0 or 1 as necessary, rewrite (on your paper) this regression as two separate equations, one for Males and one for Females.
Part F
Let’s try to visualize this. Make a scatterplot of Age
(X) and Amount
(Y) and include a regression line.
Try adding color = Female
inside your original aes()
layer. This will produce two lines and color the points by Female
.
If it isn’t a factor
variable already, we can ensure that it is with as.factor(Female)
. We shouldn’t need to in this case because we already reset Female
as a faction in question 1.
Part G
Add a final facet layer to the plot make two different sub-plots by Sex with facet_wrap( ~ Female)
.
Question 9
Now let’s look at the possible interaction between Sex (Male
or Female
) and whether a driver is from In-State or Out-of-State (OutState
).
Part A
Use R
to examine the data and find the mean for (i) Males In-State, (ii) Males Out-of-State, (iii) Females In-State, and (iv) Females Out-of-State. Hint: do what you did in Question 4A.
Part B
Now run a regression of the following model:
\[\text{Amount}_i=\hat{\beta_0}+\hat{\beta_1}Female_i+\hat{\beta_2}OutState_i+\hat{\beta_3}Female_i*OutState_i\]
Part C
Write out the estimated regression equation.
Part D
What does each coefficient mean?
Part E
Using the regression equation, what are the means for
- Males In-State
- Males Out-of-State
- Females In-State
- Females Out-of-State?
Compare to your answers in part A.
Question 10
Collect your regressions from questions 5, 6b, 8a, 8b, and 9b and output them in a regression table with huxtable()
.