Set Up
To minimize confusion, I suggest creating a new R Project
(e.g. infer_practice
) and storing any data in that folder on your computer.
I have already made an R project you can download (as a .zip
), unzip, and open the .Rproj
file in R Studio, or there is an R project you can use on the cloud:
Question 1
Let’s use the diamonds
data built into ggplot
. Simply load tidyverse
and then to be explicit, save this as a tibble (feel free to rename it) with diamonds <- diamonds
.
Let’s answer the following questions:
What is the effect of carat size on a diamond’s price?
Part A
Just to see what we’re looking at, make a scatterplot using ggplot()
, with x
as carat
and y
as price
, and add a regression line.
Question 2
Suppose we want to estimate the following relationship:
\[\widehat{\text{price}}_i = \beta_0 + \beta_1 \text{carat}_i + u_i\]
Run a regression of price
on carat
using lm()
and get a summary()
of it. Be sure to save your regression model as an object, we’ll need it later.
Part A
Write out the estimated regression equation.
Part B
Make a regression table of the output (using the huxtable
package).
Part C
What is \(\hat{\beta_1}\) for this model? Interpret it in the context of our question.
Part D
Use the broom
package’s tidy()
command on your regression object, and calculate confidence intervals for your estimates by setting conf.int = T
inside tidy()
.
What is the 95% confidence interval for \(\hat{\beta_1}\), and what does it mean?
Save these endpoints as an object (either by taking your tidy()
-ed regression and filter()
-ing the term == "carat"
and then select()
-ing the columns with the confidence interval and then saving this; or simply assigning the two values as a vector, c( , )
, to an object).
Part E
Save your estimated \(\hat{\beta_1}\) as an object, we’ll need it later with infer
(either by taking your tidy()
-ed regression and filter()
-ing the term == carat
and then select()
-ing the column with the estimate
and then saving this; or simply assigning the values to an object).
Question 3
Now let’s use infer
. Install it if you don’t have it, then load it.
Part A
Let’s generate a confidence interval for \(\hat{\beta_1}\) by simulating the sampling distribution of \(\hat{\beta_1}\). Run the following code, which will specify()
the model relationship and generate()
1,000 repetitions using the bootstrap
method of resampling data points randomly from our sample, with replacement.
# save our simulations as an object (I called it "boot")
boot <- diamonds %>% # or whatever you named your dataframe!
specify(carat ~ price) %>% # our regression model
generate(reps = 1000, # run 1000 simulations
type = "bootstrap") %>% # using bootstrap method
calculate(stat = "slope") # estimate slope in each simulation
# look at it
boot
Note this will take a few minutes, its doing a lot of calculations! What does it show you when you look at it?
Part B
Continue by piping your object from Part A into get_confidence_interval()
. Set level = 0.95, type = "se"
and point_estimate
equal to our estimated \(\hat{\beta_1}\) (saved) from Question 2 Part E.
boot %>%
get_confidence_interval(level = 0.95,
type = "se",
point_estimate = beta_1_hat) # or whatever you saved it as
Part C
Now instead of get_confidence_interval()
, pipe your object from Part A into visualize()
to see the sampling distribution of \(\hat{\beta_1}\) we simulated. You can add + shade_ci(endpoints = ...)
setting the argument equal to whatever you called your object containing the confidence interval from Question 2 Part D (I have it named here as CI_endpoints
).
boot %>%
visualize()+
shade_ci(endpoints = CI_endpoints) # or whatever you saved it as
Compare your simulated confidence interval with the theoretically-constructed confidence interval from the output of summary
, and/or of tidy()
from Question 2.
Question 4
Now let’s test the following hypothesis:
\[\begin{align*} H_0: \beta_1 &= 0\\ H_1: \beta_1 &\neq 0\\ \end{align*}\]
Part A
What does the output of summary
, and/or of tidy()
from Question 2 tell you?
Part B
Let’s now do run this hypothesis test with infer
, which will simulate the sampling distribution under the null hypothesis that \(\beta_1 = 0\). Run the following code, which will specify()
the model relationship and hypothesize()
the null hypothesis that there is no relationship between \(X\) and \(Y\) (i.e. \(\beta_1=0)\), and generate()
1,000 repetitions using the permute
method, which will center the distribution at 0, and then calculate(stat = "slope")
.
# save our simulations as an object (I called it "test_sims")
test_sims <- diamonds %>% # or whatever you named your dataframe!
specify(carat ~ price) %>% # our regression model
hypothesize(null = "independence") %>% # H_0 is that slope is 0
generate(reps = 1000, # run 1000 simulations
type = "permute") %>% # using permute method, centering distr at 0
calculate(stat = "slope") # estimate slope in each simulation
# look at it
test_sims
Note this may also take a few minutes. What does it show you?
Part C
Pipe your object from the previous part into the following code, which will get_p_value()
. Inside this function, we are setting obs_stat
equal to our \(\hat{\beta_1}\) we found (from Question 2 part E), and set direction = "both"
to run a two-sided test, since our alternative hypothesis is two-sided, \(H_1: \beta_1 \neq 0\).
test_sims %>%
get_p_value(obs_stat = beta_1_hat,
direction = "both")
Note the warning message that comes up!
Part D
Instead of get_p_value()
, pipe your object from Part B into the following code, which will visualize()
the null distribution, and (in the second command), place our finding on it and shade_p_value()
.
test_sims %>%
visualize()
test_sims %>%
visualize()+
shade_p_value(obs_stat = beta_1_hat, # or whatever you saved it as
direction = "both") # for two-sided test
Part E
Compare your simulated p-value with the theoretically-constructed p-value from the output of summary
, and/or of tidy()
from Question 2.
Both summary
and tidy()
also report the \(t\)-statistic (t value
or statistic
) on this test for carat
(by default, that \(H_0: \beta_1=0)\). What is the estimated test statistic for this model, and what does this number mean? Try to calculate it yourself with the formula:
\[t = \frac{\text{estimate} - \text{null hypothesis value}}{\text{standard error of estimate}}\]
The p-value is the probability of a \(t\) statistic at least as large as ours if the null hypothesis were true. R
constructs a \(t\)-distribution with n-k-1
degrees of freedom (n
is number of observations, k
is number of \(X\)-variables) and calculates the probability in the tails of the distribution beyond this \(t\) value. You can calculate it yourself (for a two-sided test) with:
2 * pt(your_t, # put your t-statistic here
df = your_df, # put the df number here
lower.tail = F) # we'll use the right tail