2.7 — Inference for Regression - R Practice

Set Up

To minimize confusion, I suggest creating a new R Project (e.g. infer_practice) and storing any data in that folder on your computer.

I have already made an R Project you can download (as a .zip); unzip it and open the .Rproj file in RStudio. Alternatively, there is an R Project you can use on the cloud:

R Project R Studio Cloud

Question 1

Let’s use the diamonds data built into ggplot2. Simply load tidyverse and then, to be explicit, save this as a tibble (feel free to rename it) with diamonds <- diamonds.

Let’s answer the following questions:

What is the effect of carat size on a diamond’s price?

Part A

Just to see what we’re looking at, make a scatterplot using ggplot(), with x as carat and y as price, and add a regression line.
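One way to write this (a sketch; your aesthetic choices may differ, and geom_smooth(method = "lm") is one common way to add an OLS line):

```r
# a sketch: scatterplot of price on carat, with an OLS regression line
library(tidyverse) # loads ggplot2, which contains the diamonds data

ggplot(data = diamonds)+
  aes(x = carat,
      y = price)+
  geom_point()+
  geom_smooth(method = "lm") # add a linear regression line
```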

Question 2

Suppose we want to estimate the following relationship:

\[\text{price}_i = \beta_0 + \beta_1 \text{carat}_i + u_i\]

Run a regression of price on carat using lm() and get a summary() of it. Be sure to save your regression model as an object; we’ll need it later.
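A minimal sketch (I save the model as reg; name yours whatever you like):

```r
library(tidyverse) # loads ggplot2, which contains the diamonds data
diamonds <- diamonds # save as a tibble, as in Question 1

reg <- lm(price ~ carat, data = diamonds) # regression of price on carat
summary(reg)
```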

Part A

Write out the estimated regression equation.

Part B

Make a regression table of the output (using the huxtable package).
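huxtable’s huxreg() function takes one or more saved model objects and builds a regression table. A sketch, assuming your model from Question 2 is saved as reg:

```r
library(huxtable)
library(ggplot2) # for the diamonds data

reg <- lm(price ~ carat, data = diamonds) # the model from Question 2
huxreg(reg) # default table: estimates with standard errors, N, R^2, etc.
```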

Part C

What is \(\hat{\beta_1}\) for this model? Interpret it in the context of our question.

Part D

Use the broom package’s tidy() command on your regression object, and calculate confidence intervals for your estimates by setting conf.int = T inside tidy().

What is the 95% confidence interval for \(\hat{\beta_1}\), and what does it mean?

Save these endpoints as an object (either by taking your tidy()-ed regression and filter()-ing the term == "carat" and then select()-ing the columns with the confidence interval and then saving this; or simply assigning the two values as a vector, c( , ), to an object).
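A sketch of the first approach, naming the result CI_endpoints (the name used again in Question 3):

```r
library(tidyverse)
library(broom)

reg <- lm(price ~ carat, data = diamonds) # the model from Question 2

reg_tidy <- tidy(reg, conf.int = TRUE) # adds conf.low and conf.high columns

CI_endpoints <- reg_tidy %>%
  filter(term == "carat") %>% # keep only the carat row
  select(conf.low, conf.high) # keep only the CI columns
```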

Part E

Save your estimated \(\hat{\beta_1}\) as an object; we’ll need it later with infer (either by taking your tidy()-ed regression and filter()-ing the term == "carat" and then select()-ing the column with the estimate and then saving this; or simply assigning the value to an object).
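A sketch, naming the object beta_1_hat (the name used in the code below; pull() is a dplyr variant of select() that returns a plain number rather than a tibble):

```r
library(tidyverse)
library(broom)

reg <- lm(price ~ carat, data = diamonds) # the model from Question 2

beta_1_hat <- tidy(reg) %>%
  filter(term == "carat") %>% # keep only the carat row
  pull(estimate) # extract the estimate as a single number
```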

Question 3

Now let’s use infer. Install it if you don’t have it, then load it.

Part A

Let’s generate a confidence interval for \(\hat{\beta_1}\) by simulating the sampling distribution of \(\hat{\beta_1}\). Run the following code, which will specify() the model relationship and generate() 1,000 repetitions using the bootstrap method of resampling data points randomly from our sample, with replacement.

# save our simulations as an object (I called it "boot")
boot <- diamonds %>% # or whatever you named your dataframe!
  specify(price ~ carat) %>% # our regression model (response ~ explanatory)
  generate(reps = 1000, # run 1000 simulations
           type = "bootstrap") %>% # using bootstrap method
  calculate(stat = "slope") # estimate slope in each simulation

# look at it
boot

Note this will take a few minutes; it’s doing a lot of calculations! What does it show you when you look at it?

Part B

Continue by piping your object from Part A into get_confidence_interval(). Set level = 0.95, type = "se" and point_estimate equal to our estimated \(\hat{\beta_1}\) (saved) from Question 2 Part E.

boot %>%
  get_confidence_interval(level = 0.95,
                          type = "se", 
                          point_estimate = beta_1_hat) # or whatever you saved it as

Part C

Now instead of get_confidence_interval(), pipe your object from Part A into visualize() to see the sampling distribution of \(\hat{\beta_1}\) we simulated. You can add + shade_ci(endpoints = ...) setting the argument equal to whatever you called your object containing the confidence interval from Question 2 Part D (I have it named here as CI_endpoints).

boot %>%
  visualize()+
  shade_ci(endpoints = CI_endpoints) # or whatever you saved it as

Compare your simulated confidence interval with the theoretically-constructed confidence interval from the output of summary, and/or of tidy() from Question 2.

Question 4

Now let’s test the following hypothesis:

\[\begin{align*} H_0: \beta_1 &= 0\\ H_1: \beta_1 &\neq 0\\ \end{align*}\]

Part A

What does the output of summary, and/or of tidy() from Question 2 tell you?

Part B

Let’s now run this hypothesis test with infer, which will simulate the sampling distribution under the null hypothesis that \(\beta_1 = 0\). Run the following code, which will specify() the model relationship, hypothesize() the null hypothesis that there is no relationship between \(X\) and \(Y\) (i.e. \(\beta_1=0\)), generate() 1,000 repetitions using the permute method (which will center the distribution at 0), and then calculate(stat = "slope").

# save our simulations as an object (I called it "test_sims")
test_sims <- diamonds %>% # or whatever you named your dataframe!
  specify(price ~ carat) %>% # our regression model (response ~ explanatory)
  hypothesize(null = "independence") %>% # H_0 is that slope is 0
  generate(reps = 1000, # run 1000 simulations
           type = "permute") %>% # using permute method, centering distr at 0
  calculate(stat = "slope") # estimate slope in each simulation

# look at it
test_sims

Note this may also take a few minutes. What does it show you?

Part C

Pipe your object from the previous part into the following code, which will get_p_value(). Inside this function, we set obs_stat equal to the \(\hat{\beta_1}\) we found (from Question 2 Part E), and set direction = "both" to run a two-sided test, since our alternative hypothesis is two-sided, \(H_1: \beta_1 \neq 0\).

test_sims %>%
  get_p_value(obs_stat = beta_1_hat,
              direction = "both")

Note the warning message that comes up!

Part D

Instead of get_p_value(), pipe your object from Part B into the following code, which will visualize() the null distribution, and (in the second command), place our finding on it and shade_p_value().

test_sims %>%
  visualize()
test_sims %>%
  visualize()+
  shade_p_value(obs_stat = beta_1_hat, # or whatever you saved it as
                direction = "both") # for two-sided test

Part E

Compare your simulated p-value with the theoretically-constructed p-value from the output of summary, and/or of tidy() from Question 2.

Both summary and tidy() also report the \(t\)-statistic (the t value or statistic column) for this test on carat (by default, a test of \(H_0: \beta_1=0\)). What is the estimated test statistic for this model, and what does this number mean? Try to calculate it yourself with the formula:

\[t = \frac{\text{estimate} - \text{null hypothesis value}}{\text{standard error of estimate}}\]
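For example, using the tidy()-ed regression from Question 2 (a sketch; the null hypothesis value here is 0):

```r
library(tidyverse)
library(broom)

reg <- lm(price ~ carat, data = diamonds) # the model from Question 2
carat_row <- tidy(reg) %>%
  filter(term == "carat") # keep only the carat row

t_stat <- (carat_row$estimate - 0) / carat_row$std.error # (estimate - null) / se
t_stat # compare to the "statistic" column from tidy()
```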

The p-value is the probability of a \(t\) statistic at least as extreme as ours (in absolute value, for a two-sided test) if the null hypothesis were true. R constructs a \(t\)-distribution with \(n-k-1\) degrees of freedom (\(n\) is the number of observations, \(k\) is the number of \(X\)-variables) and calculates the probability in the tails of the distribution beyond this \(t\) value. You can calculate it yourself (for a two-sided test) with:

2 * pt(your_t, # put your t-statistic here
       df = your_df, # put the df number here
       lower.tail = F) # we'll use the right tail