2.7 — Inference for Regression - R Practice

Set Up

To minimize confusion, I suggest creating a new R Project (e.g. infer_practice) and storing any data in that folder on your computer.

I have already made an R project you can download (as a .zip), unzip, and open the .Rproj file in R Studio, or there is an R project you can use on the cloud:

R Project R Studio Cloud

Answers

Answers (html) Answers (R Project) Answers (R Script)

Question 1

Let’s use the diamonds data built into ggplot. Simply load tidyverse and then to be explicit, save this as a tibble (feel free to rename it) with diamonds <- diamonds.

Let’s answer the following questions:

What is the effect of carat size on a diamond’s price?

Part A

Just to see what we’re looking at, make a scatterplot using ggplot(), with x as carat and y as price, and add a regression line.

Question 2

Suppose we want to estimate the following relationship:

\[\widehat{\text{price}}_i = \beta_0 + \beta_1 \text{carat}_i + u_i\]

Run a regression of price on carat using lm() and get a summary() of it. Be sure to save your regression model as an object, we’ll need it later.

Part A

Write out the estimated regression equation.

Part B

Make a regression table of the output (using the huxtable package).

Part C

What is \(\hat{\beta_1}\) for this model? Interpret it in the context of our question.

Part D

Use the broom package’s tidy() command on your regression object, and calculate confidence intervals for your estimates by setting conf.int = T inside tidy().

What is the 95% confidence interval for \(\hat{\beta_1}\), and what does it mean?

Save these endpoints as an object (either by taking your tidy()-ed regression and filter()-ing the term == "carat" and then select()-ing the columns with the confidence interval and then saving this; or simply assigning the two values as a vector, c( , ), to an object).

Part E

Save your estimated \(\hat{\beta_1}\) as an object, we’ll need it later with infer (either by taking your tidy()-ed regression and filter()-ing the term == carat and then select()-ing the column with the estimate and then saving this; or simply assigning the values to an object).

Question 3

Now let’s use infer. Install it if you don’t have it, then load it.

Part A

Let’s generate a confidence interval for \(\hat{\beta_1}\) by simulating the sampling distribution of \(\hat{\beta_1}\). Run the following code, which will specify() the model relationship and generate() 1,000 repetitions using the bootstrap method of resampling data points randomly from our sample, with replacement.

# save our simulations as an object (I called it "boot")
boot <- diamonds %>% # or whatever you named your dataframe!
  specify(carat ~ price) %>% # our regression model
  generate(reps = 1000, # run 1000 simulations
           type = "bootstrap") %>% # using bootstrap method
  calculate(stat = "slope") # estimate slope in each simulation

# look at it
boot

Note this will take a few minutes, its doing a lot of calculations! What does it show you when you look at it?

Part B

Continue by piping your object from Part A into get_confidence_interval(). Set level = 0.95, type = "se" and point_estimate equal to our estimated \(\hat{\beta_1}\) (saved) from Question 2 Part E.

boot %>%
  get_confidence_interval(level = 0.95,
                          type = "se", 
                          point_estimate = beta_1_hat) # or whatever you saved it as

Part C

Now instead of get_confidence_interval(), pipe your object from Part A into visualize() to see the sampling distribution of \(\hat{\beta_1}\) we simulated. You can add + shade_ci(endpoints = ...) setting the argument equal to whatever you called your object containing the confidence interval from Question 2 Part D (I have it named here as CI_endpoints).

boot %>%
  visualize()+
  shade_ci(endpoints = CI_endpoints) # or whatever you saved it as

Compare your simulated confidence interval with the theoretically-constructed confidence interval from the output of summary, and/or of tidy() from Question 2.

Question 4

Now let’s test the following hypothesis:

\[\begin{align*} H_0: \beta_1 &= 0\\ H_1: \beta_1 &\neq 0\\ \end{align*}\]

Part A

What does the output of summary, and/or of tidy() from Question 2 tell you?

Part B

Let’s now do run this hypothesis test with infer, which will simulate the sampling distribution under the null hypothesis that \(\beta_1 = 0\). Run the following code, which will specify() the model relationship and hypothesize() the null hypothesis that there is no relationship between \(X\) and \(Y\) (i.e. \(\beta_1=0)\), and generate() 1,000 repetitions using the permute method, which will center the distribution at 0, and then calculate(stat = "slope").

# save our simulations as an object (I called it "test_sims")
test_sims <- diamonds %>% # or whatever you named your dataframe!
  specify(carat ~ price) %>% # our regression model
  hypothesize(null = "independence") %>% # H_0 is that slope is 0
  generate(reps = 1000, # run 1000 simulations
           type = "permute") %>% # using permute method, centering distr at 0
  calculate(stat = "slope") # estimate slope in each simulation

# look at it
test_sims

Note this may also take a few minutes. What does it show you?

Part C

Pipe your object from the previous part into the following code, which will get_p_value(). Inside this function, we are setting obs_stat equal to our \(\hat{\beta_1}\) we found (from Question 2 part E), and set direction = "both" to run a two-sided test, since our alternative hypothesis is two-sided, \(H_1: \beta_1 \neq 0\).

test_sims %>%
  get_p_value(obs_stat = beta_1_hat,
              direction = "both")

Note the warning message that comes up!

Part D

Instead of get_p_value(), pipe your object from Part B into the following code, which will visualize() the null distribution, and (in the second command), place our finding on it and shade_p_value().

test_sims %>%
  visualize()

test_sims %>%
  visualize()+
  shade_p_value(obs_stat = beta_1_hat, # or whatever you saved it as
                direction = "both") # for two-sided test

Part E

Compare your simulated p-value with the theoretically-constructed p-value from the output of summary, and/or of tidy() from Question 2.

Both summary and tidy() also report the \(t\)-statistic (t value or statistic) on this test for carat (by default, that \(H_0: \beta_1=0)\). What is the estimated test statistic for this model, and what does this number mean? Try to calculate it yourself with the formula:

\[t = \frac{\text{estimate} - \text{null hypothesis value}}{\text{standard error of estimate}}\]

The p-value is the probability of a \(t\) statistic at least as large as ours if the null hypothesis were true. R constructs a \(t\)-distribution with n-k-1 degrees of freedom (n is number of observations, k is number of \(X\)-variables) and calculates the probability in the tails of the distribution beyond this \(t\) value. You can calculate it yourself (for a two-sided test) with:

2 * pt(your_t, # put your t-statistic here
       df = your_df, # put the df number here
       lower.tail = F) # we'll use the right tail

Last updated on Oct 6, 2021