To minimize confusion, I suggest creating a new R Project
(e.g. regression_practice
) and storing any data in that folder on your computer.
I have already made an R project you can download (as a .zip
), unzip, and open the .Rproj
file in R Studio, or there is an R project you can use on the cloud:
Data Cleaning
How did the results of the 2016 election in their district affect how often Members of Congress during the Trump Administration voted with the President?
First, as always, load tidyverse
!
Question 1
Our data comes from fivethirtyeight’s Trump Congress tracker. Fivethirtyeight has a great public archive of all of the data they use for their articles. I have saved a copy to this website so you can download it below. Download and import (read_csv()
) the data into an object.
The variables that we care about are
Variable | Description |
---|---|
congress |
Congressional session |
chamber |
Chamber of Congress |
last_name |
Member of Congress |
party |
Political Party |
state |
U.S. State |
district |
Congressional district |
agree_pct |
Proportion of votes that agree with President Trump (0-1) |
net_trump_vote |
District’s margin of victory in 2016 election (positive is Trump win, negative is Clinton win) |
Question 3
agree_pct
is oddly named, given that its values range from 0 to 1. Make a new variable (you can overwrite agree_pct
) that is a true percentage, from 0 to 100, and use this going forward (it will make interpretation of our results easier).
Regression
Question 6
We want to predict the following model:
Run a regression and save it as an object. Then get a summary()
of the object.
Question 7
We can look at regression outputs in a tidier way, with the broom
package.
Part B
Run the function tidy()
on your regression object (saved in question 6). Save this result as an object and then look at it.
Question 8
Let’s use our broom
results to calculate the goodness of fit statistics you found in question 6 to confirm.
Calculate as by taking the variance of (.fitted
in the augment
ed object you made in Question 6D) over the variance of .
Alternately, you can calculate as by taking 1 minus sum of squared (.resid
in that same augment
ed object) over the variance of .
Optional: If you really want to be fancy, make your own function to calculate the sum of squares of and (instead of variance), as I did, with:
sum_sq = function(x){sum((x - mean(x))^2)}
and then running this function on agree_pct
and .fitted
.
Regression Diagnostics
Question 10
Now let’s start looking at the residuals of the regression.
Question 11
Now let’s check for heteroskedasticity.
Part A
Looking at the scatterplot and residual plots in Questions 3 and 7B, do you think the errors are heteroskedastic or homoskedastic?
Part B
Install and load the lmtest
package and run bptest()
on your saved lm
object from Question 6. According to the test, is the data heteroskedastic or homoskedastic?
Part C
Now let’s make some heteroskedasticity-robust standard errors. Install and load the estimatr
package and use the lm_robust()
command (instead of the lm()
command) to run a new regression (and save it). Make sure you add se_type = "stata"
inside the command to calculate robust SEs (in the same way that the Stata software does…long story). Look at it. What changes?
Question 12
Now let’s check for outliers.
Question 13
(Optional: Flexing your tidyverse skills)
This data is still a bit messy. Let’s check your tidyverse
skills again! For example, we’d probably like to plot our scatterplots with colors for Republican and Democratic party. Or plot by the House and the Senate.
Part A
First, the variable congress
(session of Congress) seems a bit off. Get a count()
of congress
.
Part C
The variable party
is also quite a mess. count()
by party
to see. Then let’s mutate
a variable to make a better measure of political party - just "Republican"
, "Democrat"
, and "Independent"
. Try doing this with the case_when()
command (as your mutate
formula).
The syntax for case_when()
is to have a series of condition ~ "Outcome"
, separated by commas. For example, one condition is to assign both "Democrat"
and "D"
to "Democrat"
, as in party %in% c("Democrat", "D") ~ "Democrat"
. You could also do this with a few ifelse()
commands, but that’s a bit more awkward.] When you’re done count()
by your new party variable to make sure it worked.
Part D
Now plot a scatterplot (same as Question 4) and set color
to your party variable. Notice R
uses its own default colors, which don’t match to the actual colors these political parties use! Make a vector where you define the party colors as follows:
party_colors <- c("Democrat" = "blue",
"Republican" = "red",
"Independent" = "gray")
Then, run your plot again, adding the following layer to customize the colors scale_colour_manual("Parties", values = party_colors)
. "Parties"
is the title that will show up on the legend, feel free to edit it, or remove the legend with another layer +guides(color = F)
.