2.5 — OLS: Precision and Diagnostics — Practice

To minimize confusion, I suggest creating a new R Project (e.g. regression_practice) and storing any data in that folder on your computer.

I have already made an R Project that you can download (as a .zip), unzip, and open by clicking the .Rproj file in RStudio; alternatively, there is an R Project you can use in the cloud:

R Project | RStudio Cloud

Data Cleaning

How did the results of the 2016 election in their district affect how often Members of Congress during the Trump Administration voted with the President?

First, as always, load tidyverse!

Question 1

Our data comes from FiveThirtyEight’s Trump Congress tracker. FiveThirtyEight has a great public archive of all of the data they use for their articles. I have saved a copy to this website so you can download it below. Download and import (read_csv()) the data into an object.
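For example, a minimal sketch (the file name trump_scores.csv and the object name trump_scores are placeholders; use the file you downloaded and any object name you like):

library(tidyverse)

# import the data; replace the file name with the one you downloaded
trump_scores <- read_csv("trump_scores.csv")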

The variables that we care about are

| Variable | Description |
|---|---|
| congress | Congressional session |
| chamber | Chamber of Congress |
| last_name | Last name of the Member of Congress |
| party | Political party |
| state | U.S. state |
| district | Congressional district |
| agree_pct | Proportion of votes that agree with President Trump (0-1) |
| net_trump_vote | District's margin of victory in the 2016 election (positive is a Trump win, negative is a Clinton win) |

Question 2

Look at the data with glimpse(). How many variables are there? How many observations?

Question 3

agree_pct is oddly named, given that its values range from 0 to 1. Make a new variable (you can overwrite agree_pct) that is a true percentage, from 0 to 100, and use this going forward (it will make interpretation of our results easier).
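One way to do this, assuming your data is saved as trump_scores:

trump_scores <- trump_scores %>%
  mutate(agree_pct = agree_pct * 100) # rescale from 0-1 to 0-100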

Question 4

Make a scatterplot with agree_pct on the y-axis and net_trump_vote on the x-axis. Add a regression line by adding an additional layer of geom_smooth(method = "lm").
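A sketch of the plot (again assuming the data object is called trump_scores):

ggplot(data = trump_scores, aes(x = net_trump_vote, y = agree_pct)) +
  geom_point() +
  geom_smooth(method = "lm") # add the fitted regression line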

Question 5

Find the correlation between agree_pct and net_trump_vote. Hint: if using the tidyverse, as with calculating any statistic, we want to summarize() the data.
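For example:

trump_scores %>%
  summarize(correlation = cor(net_trump_vote, agree_pct)) # add use = "pairwise.complete.obs" inside cor() if there are missing values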

Regression

Question 6

We want to estimate the following model:

\[\widehat{\text{agree_pct}_i} = \hat{\beta}_0 + \hat{\beta}_1 \, \text{net_trump_vote}_i\]

Run a regression and save it as an object. Then get a summary() of the object.
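A sketch (the object name reg is just a suggestion):

reg <- lm(agree_pct ~ net_trump_vote, data = trump_scores)
summary(reg)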

Part A

What is \(\hat{\beta_0}\) for this model? What does it mean in the context of our question?

Part B

What is \(\hat{\beta_1}\) for this model? What does it mean in the context of our question?

Part C

What is \(R^2\) for this model? What does it mean in the context of our question?

Part D

What is the \(SER\) for this model? What does it mean in the context of our question?

Question 7

We can look at regression outputs in a tidier way, with the broom package.

Part A

Install (if you have not yet done so) and load the broom package.

Part B

Run the function tidy() on your regression object (saved in question 6). Save this result as an object and then look at it.

Part C

Run the glance() function on your original regression object. What does it show you? Find \(R^2\) and the SER and confirm they are the same as in the base R lm summary output from Question 6.

Part D

Now run the augment() function on your original regression object, and save this as an object. Look at it. What does it show you?
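Putting Parts A through D together, a sketch (assuming your regression from Question 6 is saved as reg):

# install.packages("broom") # if not yet installed
library(broom)

reg_tidied <- tidy(reg)   # coefficient estimates as a data frame
reg_tidied

glance(reg)               # model-level statistics (R-squared, sigma/SER, etc.)

reg_aug <- augment(reg)   # original variables plus .fitted, .resid, and other diagnostics
reg_aug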

Question 8

Let’s use our broom results to calculate the goodness-of-fit statistics you found in Question 6, to confirm them.

Calculate \(R^2\) as \(\frac{ESS}{TSS}\) by taking the variance of \(\widehat{\text{agree_pct}_i}\) (.fitted in the augmented object you made in Question 7D) over the variance of \(\text{agree_pct}\).

Alternately, you can calculate \(R^2\) as \(1-\frac{SSE}{TSS}\) by taking 1 minus the variance of \(\hat{u}_i\) (.resid in that same augmented object) over the variance of \(\text{agree_pct}\).

Optional: If you really want to be fancy, make your own function to calculate the sum of squares of \(\hat{Y}\) and \(Y_i\) (instead of variance), as I did, with:

sum_sq = function(x){sum((x - mean(x))^2)}

and then run this function on agree_pct and .fitted.
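A sketch of both calculations, using the augmented object (here called reg_aug):

reg_aug %>%
  summarize(r_sq_ess = var(.fitted) / var(agree_pct),      # ESS/TSS
            r_sq_sse = 1 - var(.resid) / var(agree_pct))   # 1 - SSE/TSS

# or, with the sum of squares function from above
sum_sq = function(x){sum((x - mean(x))^2)}

reg_aug %>%
  summarize(r_sq = sum_sq(.fitted) / sum_sq(agree_pct))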

Question 9

Now let’s try presenting your results in a regression table. Install and load the huxtable package, and run the huxreg() command. Your main input is the regression object you saved in Question 6. Feel free to customize the output of this table (see the slides).
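For example (the column label here is just one possible customization):

# install.packages("huxtable") # if not yet installed
library(huxtable)

huxreg("Agreement with Trump (%)" = reg)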

Regression Diagnostics

Question 10

Now let’s start looking at the residuals of the regression.

Part A

Take the augmented regression object from Question 7D and use it as the source of your data to create a histogram with ggplot(), setting aes(x = .resid). Does it look roughly normal?
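A sketch, using the augmented object reg_aug:

ggplot(data = reg_aug, aes(x = .resid)) +
  geom_histogram() # adjust bins or binwidth as you like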

Part B

Take the augmented regression object and make a residual plot, which is a scatterplot where x is our usual x-variable (net_trump_vote) and y is .resid. Feel free to add a horizontal line at 0 with geom_hline(yintercept = 0) as an additional layer.
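For example:

ggplot(data = reg_aug, aes(x = net_trump_vote, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0) # reference line at zero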

Question 11

Now let’s check for heteroskedasticity.

Part A

Looking at the scatterplot from Question 4 and the residual plot from Question 10 Part B, do you think the errors are heteroskedastic or homoskedastic?

Part B

Install and load the lmtest package and run bptest() on your saved lm object from Question 6. According to the test, is the data heteroskedastic or homoskedastic?
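A sketch:

# install.packages("lmtest") # if not yet installed
library(lmtest)

bptest(reg) # Breusch-Pagan test; a small p-value suggests heteroskedasticity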

Part C

Now let’s make some heteroskedasticity-robust standard errors. Install and load the estimatr package and use the lm_robust() command (instead of the lm() command) to run a new regression (and save it). Make sure you add se_type = "stata" inside the command to calculate robust SEs (in the same way that the Stata software does…long story). Look at it. What changes?
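A sketch, saving the robust regression as reg_robust:

# install.packages("estimatr") # if not yet installed
library(estimatr)

reg_robust <- lm_robust(agree_pct ~ net_trump_vote,
                        data = trump_scores,
                        se_type = "stata") # heteroskedasticity-robust (HC1) standard errors
reg_robust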

Part D

Now let’s see this in a nice regression table. Use huxreg() again, but add both your original regression and your regression saved in part C. Notice any changes?
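For example:

huxreg("OLS" = reg,
       "Robust SEs" = reg_robust)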

Question 12

Now let’s check for outliers.

Part A

Just looking at the scatterplot in Question 4, do you see any outliers?

Part B

Install and load the car package. Run the outlierTest() command on your regression object. Does it detect any outliers?
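A sketch:

# install.packages("car") # if not yet installed
library(car)

outlierTest(reg) # reports the observation(s) with the largest studentized residuals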

Part C

Look in your original data to match this outlier with an observation. Hint: use the slice() command, as the outlier test gave you an observation (row) number!
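For example (the row number here is a placeholder; use the one outlierTest() reports):

trump_scores %>%
  slice(1) # replace 1 with the row number from outlierTest()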

Question 13

(Optional: Flexing your tidyverse skills)

This data is still a bit messy, so let’s check your tidyverse skills again! For example, we’d probably like to color our scatterplot by political party (Republican vs. Democratic), or facet it by chamber (House vs. Senate).

Part A

First, the variable congress (session of Congress) seems a bit off. Get a count() of congress.

Part B

Let’s get rid of the 0 values for congress (someone made a mistake coding this, probably).
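A sketch of Parts A and B:

trump_scores %>%
  count(congress) # Part A: how many observations in each session?

trump_scores <- trump_scores %>%
  filter(congress != 0) # Part B: drop the mis-coded 0 values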

Part C

The variable party is also quite a mess. count() by party to see. Then let’s mutate a variable to make a better measure of political party - just "Republican", "Democrat", and "Independent". Try doing this with the case_when() command (as your mutate formula).

The syntax for case_when() is to have a series of condition ~ "Outcome", separated by commas. For example, one condition is to assign both "Democrat" and "D" to "Democrat", as in party %in% c("Democrat", "D") ~ "Democrat". You could also do this with a few ifelse() commands, but that’s a bit more awkward. When you’re done, count() by your new party variable to make sure it worked.
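A sketch (the exact party codes in the raw data may differ; adjust the conditions to match what your count() shows):

trump_scores <- trump_scores %>%
  mutate(party_name = case_when(
    party %in% c("Democrat", "D") ~ "Democrat",
    party %in% c("Republican", "R") ~ "Republican",
    TRUE ~ "Independent" # everything else
  ))

trump_scores %>%
  count(party_name) # check that the recoding worked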

Part D

Now plot a scatterplot (same as Question 4) and set color to your party variable. Notice R uses its own default colors, which don’t match the actual colors these political parties use! Make a vector where you define the party colors as follows:

party_colors <- c("Democrat" = "blue",
                  "Republican" = "red",
                  "Independent" = "gray")

Then, run your plot again, adding the following layer to customize the colors: scale_colour_manual("Parties", values = party_colors). "Parties" is the title that will show up on the legend; feel free to edit it, or remove the legend with another layer: guides(color = "none").
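Putting it together, a sketch using the party_name variable from Part C:

ggplot(data = trump_scores, aes(x = net_trump_vote, y = agree_pct, color = party_name)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_colour_manual("Parties", values = party_colors)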

Part E

Now facet your scatterplot by chamber by adding a layer: facet_wrap(~chamber).
