Examples:
Note: we can test a lot of hypotheses about a lot of population parameters, e.g.
We will focus on hypotheses about the population regression slope (^β1), i.e., the causal effect† of X on Y
† With a model this simple, it's almost certainly not causal, but this is the ultimate direction we are heading...
Null hypothesis assigns a value (or a range) to a population parameter
Alternative hypothesis must mathematically contradict the null hypothesis
A null hypothesis, H0
An alternative hypothesis, Ha
A test statistic to determine if we reject H0 when the statistic reaches a "critical value"
A conclusion whether or not to reject H0 in favor of Ha
Sample statistic (^β1) will rarely be exactly equal to the hypothesized parameter (β1)
Difference between observed statistic and true parameter could be because:
Parameter is not the hypothesized value
Parameter is truly hypothesized value but sampling variability gave us a different estimate
We cannot distinguish between these two possibilities with any certainty
Type I error (false positive): rejecting H0 when it is in fact true
Type II error (false negative): failing to reject H0 when it is in fact false
William Blackstone
(1723-1780)
"It is better that ten guilty persons escape than that one innocent suffer."
Blackstone, William, 1765-1770, Commentaries on the Laws of England
α=P(Reject H0|H0 is true)
The confidence level is defined as (1−α)
The probability of a Type II error is defined as β:
β=P(Don't reject H0|H0 is false)
Power=1−β=P(Reject H0|H0 is false)
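These definitions can be made concrete with a quick simulation. The sketch below (in Python, purely for illustration; the setup and all names are my own, not from the slides) shows that when H0 is really true and we reject whenever |z| exceeds the 5% critical value, we commit a Type I error about 5% of the time:

```python
import math
import random
import statistics

random.seed(42)

alpha = 0.05
z_crit = 1.96            # two-sided critical value for alpha = 0.05
n, reps = 50, 2000
rejections = 0

for _ in range(reps):
    # Draw a sample from a world where H0 (mu = 0) is TRUE, with known sigma = 1
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = statistics.mean(sample) * math.sqrt(n)   # z-statistic for H0: mu = 0
    if abs(z) > z_crit:
        rejections += 1   # rejecting a true H0: a Type I error (false positive)

type_i_rate = rejections / reps
print(round(type_i_rate, 3))  # should land near alpha = 0.05
```

The rejection rate hovers around α by construction: α is exactly the test's false-positive rate when the null holds.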
p(δ≥δi|H0 is true)
After running our test, we need to make a decision between the competing hypotheses
Compare p-value with pre-determined α (commonly, α=0.05, 95% confidence level)
If p<α: statistically significant evidence sufficient to reject H0 in favor of Ha
If p≥α: insufficient evidence to reject H0
Sir Ronald A. Fisher
(1890-1962)
"The null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis."
1935, The Design of Experiments
Modern philosophy of science is largely based on hypothesis testing and falsifiability, which form the "Scientific Method"†
For something to be "scientific", it must be falsifiable, or at least testable
Hypotheses can be corroborated with evidence, but remain tentative until falsified by data suggesting an alternative hypothesis
"All swans are white" is a hypothesis rejected upon discovery of a single black swan
We will use an R package called infer:
Calculate a statistic, δi†, from a sample of data
Simulate a world where δ is null (H0)
Examine the distribution of δ across the null world
Calculate the probability that δi could exist in the null world
Decide if δi is statistically significant
† δ can stand in for any test statistic in any hypothesis test! For our purposes, δ is the slope of our regression sample, ^β1.
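The five steps above can be sketched in code. Here is a minimal, illustrative Python version with a made-up dataset (the course does this in R with infer; the data and all names here are my own):

```python
import random
import statistics

def slope(x, y):
    """OLS slope: cov(x, y) / var(x)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sum((xi - mx) ** 2 for xi in x)
    return num / den

random.seed(1)
# Toy data loosely inspired by the class-size example (made up, not CASchool)
x = [random.uniform(14, 26) for _ in range(100)]
y = [700 - 2.3 * xi + random.gauss(0, 10) for xi in x]

delta_i = slope(x, y)                     # 1. calculate a statistic from the sample

null_slopes = []
for _ in range(1000):                     # 2. simulate a null world (H0)...
    y_perm = random.sample(y, len(y))     #    ...by permuting y, breaking any x-y link
    null_slopes.append(slope(x, y_perm))  # 3. distribution of delta across the null world

# 4. probability of a statistic at least as extreme as ours (two-sided)
p_value = sum(abs(s) >= abs(delta_i) for s in null_slopes) / 1000

print(delta_i < 0, p_value < 0.05)        # 5. decide if delta_i is statistically significant
```

Permuting y enforces β1 = 0 by construction, which is exactly what `generate(type = "permute")` does in infer below.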
lm() runs this hypothesis test (H0: β1 = 0, Ha: β1 ≠ 0) automatically, but infer allows you to run through these steps manually to understand the process:
specify() a model
hypothesize() the null
generate() simulations of the null world
calculate() the statistic in each simulation (then get_p_value())
visualize() with a histogram (optional)
Test statistic (δ): measures how far what we observed in our sample (^β1) is from what we would expect if the null hypothesis were true (β1=0)
Rejection region: if the test statistic reaches a "critical value" of δ, then we reject the null hypothesis
† Again, see last class's appendix for more on the t-distribution. k is the number of independent variables our model has, in this case, with just one X, k=1. We use two degrees of freedom to calculate ^β0 and ^β1, hence we have n−2 df.
Our world, and a world where β1=0 by assumption.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 698.932952 | 9.4674914 | 73.824514 | 6.569925e-242 |
| str | -2.279808 | 0.4798256 | -4.751327 | 2.783307e-06 |
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 647.8027952 | 9.7147718 | 66.6822452 | 6.997699e-225 |
| str | 0.3235038 | 0.4923581 | 0.6570499 | 5.115104e-01 |
# save as sample_slope
sample_slope <- school_reg_tidy %>%  # this is the regression tidied with broom's tidy()
  filter(term == "str") %>%
  pull(estimate)

# confirm what it is
sample_slope
## [1] -2.279808
data %>% specify(y ~ x)
The specify() function essentially plays the role of lm() for regression (for our purposes):

CASchool %>%
  specify(testscr ~ str)
| testscr | str |
|---|---|
| 690.8 | 17.88991 |
| 661.2 | 21.52466 |
| 643.6 | 18.69723 |
%>% hypothesize(null = "independence")
In infer's language, we are hypothesizing that str and testscr are independent (β1 = 0)†:

CASchool %>%
  specify(testscr ~ str) %>%
  hypothesize(null = "independence")
| testscr | str |
|---|---|
| 690.8 | 17.88991 |
| 661.2 | 21.52466 |
| 643.6 | 18.69723 |
%>% generate(reps = n, type = "permute")
Choose the number of reps and set the type equal to "permute": a permutation (not a bootstrap!) because we are simulating a world where β1 = 0 by construction!

CASchool %>%
  specify(testscr ~ str) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute")
%>% calculate(stat = "")
We calculate() the sample statistic for each of the 1,000 replicate samples. In our case, we calculate the slope (^β1) for each replicate:

CASchool %>%
  specify(testscr ~ str) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "slope")
Other stats available for calculation: "mean", "median", "prop", "diff in means", "diff in props", etc. (see the package information)
%>% get_p_value(obs_stat = "", direction = "both")
We can calculate the p-value of our sample_slope (-2.28) in our simulated null distribution. For the two-sided alternative Ha: β1 ≠ 0, we double the raw p-value:

CASchool %>%
  specify(testscr ~ str) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "slope") %>%
  get_p_value(obs_stat = sample_slope, direction = "both")
| p_value |
|---|
| 0 |
%>% visualize()
CASchool %>%
  specify(testscr ~ str) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "slope") %>%
  visualize()
%>% visualize()
We can add our sample_slope to show our finding on the null distribution:

CASchool %>%
  specify(testscr ~ str) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "slope") %>%
  visualize(obs_stat = sample_slope)
%>% visualize() + shade_p_value()
Add shade_p_value() to see what p is:

CASchool %>%
  specify(testscr ~ str) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "slope") %>%
  visualize(obs_stat = sample_slope) +
  shade_p_value(obs_stat = sample_slope, direction = "two_sided")
infer's visualize() function is just a wrapper function for ggplot(). We can take the simulations tibble and just ggplot() a normal histogram:

simulations %>%
  ggplot(data = .) +
  aes(x = stat) +
  geom_histogram(color = "white", fill = "#e64173") +
  geom_vline(xintercept = sample_slope, color = "blue", size = 2, linetype = "dashed") +
  annotate(geom = "label", x = -2.28, y = 100,
           label = expression(paste("Our ", hat(beta)[1])), color = "blue") +
  scale_y_continuous(lim = c(0, 120), expand = c(0, 0)) +
  labs(x = expression(paste("Sampling distribution of ", hat(beta)[1],
                            " under ", H[0], ": ", beta[1] == 0)),
       y = "Samples") +
  theme_classic(base_family = "Fira Sans Condensed", base_size = 20)
R does things the old-fashioned way, using a theoretical null distribution instead of simulating one
A t-distribution with n−k−1 df†
Calculate a t-statistic for ^β1:
test statistic = (estimate − null hypothesis) / (standard error of estimate)
† k is the number of X variables.
t has the same interpretation as Z: the number of standard deviations away from the sampling distribution's expected value E[^β1]† (if H0 were true)
Compares to a critical value of t∗ (pre-determined by α-level & n−k−1 df)
† The expected value is 0, because our null hypothesis was β1=0
‡ Again, the 68-95-99.7% empirical rule!
t = (^β1 − β1,0) / se(^β1) = (−2.28 − 0) / 0.48 ≈ −4.75
p-value: prob. of a test statistic at least as large (in magnitude) as ours if the null hypothesis were true
2 × p(t418 > |−4.75|) = 0.0000028
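As a check on the arithmetic, here is a small sketch (in Python, my own, using the estimate and standard error from the regression output; with 418 df the t-distribution is close to standard normal, so a normal approximation suffices for illustration):

```python
from statistics import NormalDist

beta1_hat = -2.28    # estimated slope (from the slides)
beta1_null = 0       # hypothesized value under H0
se = 0.48            # standard error of the estimate

t = (beta1_hat - beta1_null) / se
# Two-sided p-value, using the standard normal as a large-df approximation
p_approx = 2 * (1 - NormalDist().cdf(abs(t)))

print(round(t, 2))   # -4.75
print(p_approx)      # on the order of 1e-6 (the exact t-dist value is 2.8e-06)
```

The normal approximation gives a p-value slightly smaller than R's exact t-distribution answer, but both are far below any conventional α.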
Ha:β1<0
p-value: p(t≤ti)
Ha:β1>0
p-value: p(t≥ti)
Ha:β1≠0
p-value: 2×p(t≥|ti|)
pt() calculates probabilities on a t-distribution, with arguments:
df = the degrees of freedom
lower.tail = TRUE if looking at the area to the LEFT of the value, FALSE if looking at the area to the RIGHT of the value

2 * pt(4.75,           # I'll double the right tail
       df = 418,
       lower.tail = F) # right tail
## [1] 2.800692e-06
summary(school_reg)
## 
## Call:
## lm(formula = testscr ~ str, data = CASchool)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.727 -14.251   0.483  12.822  48.540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 698.9330     9.4675  73.825  < 2e-16 ***
## str          -2.2798     0.4798  -4.751 2.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared:  0.05124, Adjusted R-squared:  0.04897 
## F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06
broom's tidy() (with confidence intervals):

tidy(school_reg, conf.int = TRUE)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 698.932952 | 9.4674914 | 73.824514 | 6.569925e-242 |
| str | -2.279808 | 0.4798256 | -4.751327 | 2.783307e-06 |
The p-value for str is 0.00000278.

H0: β1 = 0
Ha: β1 ≠ 0
Because the hypothesis test's p-value < α (0.05)...
We have sufficient evidence to reject H0 in favor of our alternative hypothesis. Our sample suggests that there is a relationship between class size and test scores.
Using the confidence intervals:
We are 95% confident that, from similarly constructed samples, the true marginal effect of class size on test scores is between -3.22 and -1.34.
Confidence intervals are all two-sided by nature: CI0.95 = (^β1 − 2 × se(^β1), ^β1 + 2 × se(^β1)), where 2 × se(^β1) is the margin of error (MOE)
Hypothesis test (t-test) of H0: β1 = 0 computes a t-value1 of t = ^β1 / se(^β1), and p < 0.05 when t ≥ 2 (approximately)
1 Since our null hypothesis is that β1,0 = 0, the test statistic simplifies to this neat fraction.
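To connect the interval to the reported numbers, a small sketch (in Python, my own) reproduces the confidence interval from the estimate and standard error, assuming a critical value t* ≈ 1.97 for 418 df (close to the rule-of-thumb 2):

```python
beta1_hat = -2.279808   # estimate from the regression output
se = 0.4798256          # its standard error
t_star = 1.9657         # assumed approximate 97.5th percentile of t with 418 df

moe = t_star * se                  # margin of error
ci = (beta1_hat - moe, beta1_hat + moe)
print(round(ci[0], 2), round(ci[1], 2))  # roughly (-3.22, -1.34), as on the slide
```

Because 0 lies outside this interval, the two-sided test at α = 0.05 rejects H0: β1 = 0; the CI and the t-test are two views of the same calculation.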
Consider what 95% confident or α=0.05 means
If we repeat a procedure 20 times, we should expect 1 in 20 (5%) to produce a fluke result!
Image source: Seeing Theory
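A quick back-of-the-envelope calculation (in Python, my own sketch) makes the point concrete:

```python
# If each of 20 independent tests has a 5% false-positive rate,
# flukes are not rare at all.
alpha = 0.05
n_tests = 20

expected_flukes = alpha * n_tests                  # expected number of false positives
p_at_least_one_fluke = 1 - (1 - alpha) ** n_tests  # chance of at least one

print(round(expected_flukes, 2))        # 1.0 -> expect about 1 fluke in 20 tests
print(round(p_at_least_one_fluke, 2))   # 0.64
```

This is why running many tests and reporting only the "significant" ones distorts inference.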
“The widespread use of 'statistical significance' (generally interpreted as 'p ≤ 0.05') as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process.”
Wasserstein, Ronald L. and Nicole A. Lazar (2016), "The ASA's Statement on p-Values: Context, Process, and Purpose," The American Statistician 70(2): 129-133
“No economist has achieved scientific success as a result of a statistically significant coefficient. Massed observations, clever common sense, elegant theorems, new policies, sagacious economic reasoning, historical perspective, relevant accounting, these have all led to scientific success. Statistical significance has not,” (p.112).
McCloskey, Deirdre N. and Stephen Ziliak, 1996, The Cult of Statistical Significance
❌ p is the probability that the alternative hypothesis is false
❌ p is the probability that the null hypothesis is true
❌ p is the probability that our observed effects were produced purely by random chance
❌ p tells us how significant our finding is
Again, p-value is the probability that, if the null hypothesis were true, we obtain (by pure random chance) a test statistic at least as extreme as the one we estimated for our sample
A low p-value means either (and we can't distinguish which):
The null hypothesis is false (the parameter is truly not the hypothesized value), or
The null hypothesis is true, but sampling variability gave us an improbable estimate by random chance
|  | Test Score |
|---|---|
| Intercept | 698.93 *** |
|  | (9.47) |
| STR | -2.28 *** |
|  | (0.48) |
| N | 420 |
| R-Squared | 0.05 |
| SER | 18.58 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
Statistical significance is shown by asterisks, common (but not always!) standard:
Rare, but sometimes regression tables include p-values for estimates
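As an illustration of that convention, here is a tiny helper (in Python; my own naming, not from any package) mapping a p-value to the stars shown in the table above:

```python
def significance_stars(p):
    """Map a p-value to the common (but not universal!) star convention."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""  # not statistically significant at conventional levels

print(significance_stars(2.78e-06))  # *** -> matches the stars on str above
print(significance_stars(0.03))      # *
print(significance_stars(0.2))       # (empty: not significant)
```

Note the thresholds are conventions, not laws; some tables use different cutoffs, which is why the legend matters.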