Set Up
To minimize confusion, I suggest creating a new R Project (e.g. infer_practice) and storing any data in that folder on your computer.
I have already made an R Project you can download (as a .zip): unzip it and open the .Rproj file in RStudio. Alternatively, there is an R Project you can use on the cloud:
Question 1
Let’s use the diamonds data built into ggplot2. Simply load tidyverse and then, to be explicit, save this as a tibble (feel free to rename it) with diamonds <- diamonds.
Let’s answer the following questions:
What is the effect of carat size on a diamond’s price?
Part A
Just to see what we’re looking at, make a scatterplot using ggplot(), with x as carat and y as price, and add a regression line.
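If you get stuck, here is one way to build this plot (a sketch, assuming you have loaded tidyverse and kept the default diamonds name):

```r
library(tidyverse) # loads ggplot2, which contains the diamonds data

scatter <- ggplot(data = diamonds) +
  aes(x = carat, y = price) +
  geom_point(alpha = 0.5) + # semi-transparent points to ease overplotting
  geom_smooth(method = "lm") # overlay an OLS regression line

scatter
```

The alpha level is just a choice; with 50,000+ diamonds, some transparency makes the cloud of points easier to read.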
Question 2
Suppose we want to estimate the following relationship:
\[\text{price}_i = \beta_0 + \beta_1 \text{carat}_i + u_i\]
Run a regression of price on carat using lm() and get a summary() of it. Be sure to save your regression model as an object; we’ll need it later.
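As a sketch (the names diamonds and reg are just my choices; use whatever you named yours):

```r
library(tidyverse) # loads ggplot2, which contains the diamonds data

diamonds <- diamonds # save the data as a tibble

reg <- lm(price ~ carat, data = diamonds) # regress price on carat
summary(reg) # coefficients, standard errors, t-statistics, p-values
```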
Part A
Write out the estimated regression equation.
Part B
Make a regression table of the output (using the huxtable package).
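One way to do this is huxtable’s huxreg() function, which takes one or more saved model objects (a sketch; the column label and the statistics reported here are just choices, not requirements):

```r
library(tidyverse)
library(huxtable)

reg <- lm(price ~ carat, data = diamonds)

# huxreg() turns model objects into a formatted regression table
huxreg("Price" = reg,
       statistics = c("N" = "nobs", "R-Squared" = "r.squared"))
```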
Part C
What is \(\hat{\beta_1}\) for this model? Interpret it in the context of our question.
Part D
Use the broom package’s tidy() command on your regression object, and calculate confidence intervals for your estimates by setting conf.int = TRUE inside tidy().
What is the 95% confidence interval for \(\hat{\beta_1}\), and what does it mean?
Save these endpoints as an object: either take your tidy()-ed regression, filter() for term == "carat", select() the columns with the confidence interval, and save the result; or simply assign the two values as a vector, c( , ), to an object.
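The first approach might look like this (a sketch, assuming your model is saved as reg; CI_endpoints is just my name for the object):

```r
library(tidyverse)
library(broom)

reg <- lm(price ~ carat, data = diamonds)

reg_tidy <- tidy(reg, conf.int = TRUE) # adds conf.low and conf.high columns
reg_tidy

# keep only the carat row and the two CI columns
CI_endpoints <- reg_tidy %>%
  filter(term == "carat") %>%
  select(conf.low, conf.high)

CI_endpoints
```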
Part E
Save your estimated \(\hat{\beta_1}\) as an object; we’ll need it later with infer. Either take your tidy()-ed regression, filter() for term == "carat", select() the column with the estimate, and save this; or simply assign the value to an object.
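For example (a sketch; beta_1_hat is just my name for the saved estimate, matching its use in the infer code later):

```r
library(tidyverse)
library(broom)

reg <- lm(price ~ carat, data = diamonds)

# extract the carat estimate as a single number
beta_1_hat <- tidy(reg) %>%
  filter(term == "carat") %>%
  pull(estimate) # pull() extracts a column as a plain vector

beta_1_hat
```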
Question 3
Now let’s use infer. Install it if you don’t have it, then load it.
Part A
Let’s generate a confidence interval for \(\hat{\beta_1}\) by simulating the sampling distribution of \(\hat{\beta_1}\). Run the following code, which will specify() the model relationship and generate() 1,000 repetitions using the bootstrap method of resampling data points randomly from our sample, with replacement.
# save our simulations as an object (I called it "boot")
boot <- diamonds %>% # or whatever you named your dataframe!
  specify(price ~ carat) %>% # our regression model (response ~ explanatory)
  generate(reps = 1000, # run 1000 simulations
           type = "bootstrap") %>% # using bootstrap method
  calculate(stat = "slope") # estimate slope in each simulation
# look at it
boot
Note this will take a few minutes; it’s doing a lot of calculations! What does it show you when you look at it?
Part B
Continue by piping your object from Part A into get_confidence_interval(). Set level = 0.95, type = "se" and point_estimate equal to our estimated \(\hat{\beta_1}\) (saved) from Question 2 Part E.
boot %>%
  get_confidence_interval(level = 0.95,
                          type = "se",
                          point_estimate = beta_1_hat) # or whatever you saved it as
Part C
Now instead of get_confidence_interval(), pipe your object from Part A into visualize() to see the sampling distribution of \(\hat{\beta_1}\) we simulated. You can add + shade_ci(endpoints = ...) setting the argument equal to whatever you called your object containing the confidence interval from Question 2 Part D (I have it named here as CI_endpoints).
boot %>%
  visualize() +
  shade_ci(endpoints = CI_endpoints) # or whatever you saved it as
Compare your simulated confidence interval with the theoretically-constructed confidence interval from the output of summary, and/or of tidy() from Question 2.
Question 4
Now let’s test the following hypothesis:
\[\begin{align*} H_0: \beta_1 &= 0\\ H_1: \beta_1 &\neq 0\\ \end{align*}\]
Part A
What does the output of summary, and/or of tidy() from Question 2 tell you?
Part B
Let’s now run this hypothesis test with infer, which will simulate the sampling distribution under the null hypothesis that \(\beta_1 = 0\). Run the following code, which will specify() the model relationship, hypothesize() the null hypothesis that there is no relationship between \(X\) and \(Y\) (i.e. \(\beta_1=0\)), generate() 1,000 repetitions using the permute method (which will center the distribution at 0), and then calculate(stat = "slope").
# save our simulations as an object (I called it "test_sims")
test_sims <- diamonds %>% # or whatever you named your dataframe!
  specify(price ~ carat) %>% # our regression model (response ~ explanatory)
  hypothesize(null = "independence") %>% # H_0 is that slope is 0
  generate(reps = 1000, # run 1000 simulations
           type = "permute") %>% # using permute method, centering distr at 0
  calculate(stat = "slope") # estimate slope in each simulation
# look at it
test_sims
Note this may also take a few minutes. What does it show you?
Part C
Pipe your object from the previous part into the following code, which will get_p_value(). Inside this function, we set obs_stat equal to the \(\hat{\beta_1}\) we found (from Question 2 Part E), and direction = "both" to run a two-sided test, since our alternative hypothesis is two-sided: \(H_1: \beta_1 \neq 0\).
test_sims %>%
  get_p_value(obs_stat = beta_1_hat,
              direction = "both")
Note the warning message that comes up!
Part D
Instead of get_p_value(), pipe your object from Part B into the following code, which will visualize() the null distribution, and (in the second command), place our finding on it and shade_p_value().
test_sims %>%
  visualize()
test_sims %>%
  visualize() +
  shade_p_value(obs_stat = beta_1_hat, # or whatever you saved it as
                direction = "both") # for two-sided test
Part E
Compare your simulated p-value with the theoretically-constructed p-value from the output of summary, and/or of tidy() from Question 2.
Both summary and tidy() also report the \(t\)-statistic (t value or statistic) on this test for carat (by default, against \(H_0: \beta_1=0\)). What is the estimated test statistic for this model, and what does this number mean? Try to calculate it yourself with the formula:
\[t = \frac{\text{estimate} - \text{null hypothesis value}}{\text{standard error of estimate}}\]
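As a check on your answer, you can also compute this from the tidy() output (a sketch, assuming your model is saved as reg; the result should match the statistic column that tidy() reports for carat):

```r
library(tidyverse)
library(broom)

reg <- lm(price ~ carat, data = diamonds)
carat_row <- tidy(reg) %>% filter(term == "carat")

# t = (estimate - null hypothesis value) / standard error, with a null of 0
t_stat <- (carat_row$estimate - 0) / carat_row$std.error
t_stat
```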
The p-value is the probability of a \(t\)-statistic at least as large as ours if the null hypothesis were true. R constructs a \(t\)-distribution with \(n-k-1\) degrees of freedom (where \(n\) is the number of observations and \(k\) is the number of \(X\)-variables) and calculates the probability in the tails of the distribution beyond this \(t\) value. You can calculate it yourself (for a two-sided test) with:
2 * pt(your_t, # put your t-statistic here
       df = your_df, # put the df number here
       lower.tail = FALSE) # we'll use the right tail