The statistics profession is obstinate that we cannot say anything about causality
But you have to! It's how the human brain works!
We can’t conceive of (spurious) correlation without some causation
Source: British Medical Journal
“Correlation does not imply causation”
“Correlation implies causation”
“Correlation plus exogeneity is causation.”
Correlation: two variables tend to move together (an association in the data)
Causation: changing one variable brings about a change in the other
We will seek to understand what causality is and how we can approach finding it
We will also explore the different common research designs meant to identify causal relationships
These skills, more than supply & demand, constrained optimization models, IS-LM, etc., are the tools and comparative advantage of a modern research economist
Simultaneous “credibility revolution” in econometrics (c.1990s—2000s)
Use clever research designs to approximate natural experiments
Note: major disagreements between Pearl & Angrist/Imbens, etc.!
Example
If X is a light switch, and Y is a light: flipping the switch (X) changes whether the light is on (Y), so X causes Y
Example
The sine qua non of causal claims is counterfactuals: what would Y have been if X had been different?
It is impossible to make a counterfactual claim from data alone!
Need a (theoretical) causal model of the data-generating process!
Again, RCTs are invoked as the gold standard for their ability to make counterfactual claims:
Treatment/intervention (X) is randomly assigned to individuals
If person i who received treatment had not received the treatment, we can predict what his outcome would have been
If person j who did not receive treatment had received treatment, we can predict what her outcome would have been
RCTs are but the best-known method of a large, growing science of causal inference
We need a causal model to describe the data-generating process (DGP)
Requires us to make some assumptions
A visual model of the data-generating process, encodes our understanding of the causal relationships
Requires some common sense/economic intuition
Remember, all models are wrong, we just need them to be useful!
Suppose we have data on three variables:
IP: how much a firm spends on IP lawsuits
tech: whether a firm is in the tech industry
profit: firm profits
They are all correlated with each other, but what are the causal relationships?
We need our own causal model (from theory, intuition, etc.) to sort them out
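For illustration, here is a sketch of how one candidate model for these three variables could be encoded in R with the ggdag package (the arrows chosen here, tech affecting both IP and profit, and IP affecting profit, are just one assumption; your theory may differ):

# one candidate causal model (an assumption, not the only possibility):
# tech -> IP, tech -> profit, IP -> profit
library(ggdag)

ip_dag <- dagify(IP ~ tech,
                 profit ~ tech + IP,
                 exposure = "IP",      # treat IP spending as the X of interest
                 outcome  = "profit")  # and profit as the Y

ggdag(ip_dag) + theme_dag()

Under this particular DAG, tech would be a confounder of IP → profit and should be controlled for; drawing different arrows would change that conclusion.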
Consider all the variables likely to be important to the data-generating process (including variables we can't observe!)
For simplicity, combine some similar ones together or prune those that aren't very important
Consider which variables are likely to affect others, and draw arrows connecting them
Test some testable implications of the model (to see if we have a correct one!)
Drawing an arrow requires a direction - making a statement about causality!
Omitting an arrow makes an equally important statement too!
If two variables are correlated, but neither causes the other, likely they are both caused by another (perhaps unobserved) variable - add it!
There should be no cycles or loops (if so, there’s probably another missing variable, such as time)
Example: what is the effect of education on wages?
Education (X, “treatment” or “exposure”)
Wages (Y, “outcome” or “response”)
In social science and complex systems, 1000s of variables could plausibly be in the DAG!
So simplify:
Background, year of birth, location, and compulsory schooling laws all cause education
Background, year of birth, location, and job connections probably cause wages
Job connections, in fact, are probably caused by education!
Location and background are probably both caused by an unobserved factor (u1)
This is messy, but we have a causal model!
Makes our assumptions explicit, and many of them are testable
DAG suggests certain relationships that will not exist:
The only paths between laws and conx go through educ
So, controlling for educ, cor(laws, conx) should be zero!
Dagitty.net is a great tool to make these and give you testable implications
Click Model -> New Model
Name your "exposure" variable (X of interest) and "outcome" variable (Y)
Click and drag to move nodes around
Add a new variable by double-clicking
Add an arrow by double-clicking one variable and then double-clicking on the target (do again to remove arrow)
Dagitty reports the minimal sufficient adjustment set for estimating the total effect of educ on wage: {background, location, year}
Tells you some testable implications of your model
These are independencies or conditional independencies:
X⊥Y|Z
“X is independent of Y, given Z”
Example: look at the last one listed:
job_connections ⊥ year | educ
“Job connections are independent of year, controlling for education”
Controlling for educ, there should be no correlation between job_connections and year — can test this with data!
By controlling for background, location, and year, we can identify the causal effect of educ → wage.
The ggdag package lets you draw and analyze DAGs in R
In dagify(), the formula Y ~ X + Z means “Y is caused by X and Z”
# install.packages("ggdag")
library(ggdag)
dagify(wage ~ educ + conx + year + bckg + loc,
       educ ~ bckg + year + loc + laws,
       conx ~ educ,
       bckg ~ u1,
       loc ~ u1,
       exposure = "educ", # optional: define X
       outcome = "wage"   # optional: define Y
       ) %>%
  ggdag() +
  theme_dag()
You can also copy the code for a DAG you made on dagitty.net! Use dagitty() from the dagitty package, and paste the code in quotes
library(dagitty)
dagitty('dag {
bb="0,0,1,1"
background [pos="0.413,0.335"]
compulsory_schooling_laws [pos="0.544,0.076"]
educ [exposure,pos="0.185,0.121"]
job_connections [pos="0.302,0.510"]
location [pos="0.571,0.431"]
u1 [pos="0.539,0.206"]
wage [outcome,pos="0.552,0.761"]
year [pos="0.197,0.697"]
background -> educ
background -> wage
compulsory_schooling_laws -> educ
educ -> job_connections
educ -> wage
job_connections -> wage
location -> educ
location -> wage
u1 -> background
u1 -> location
year -> educ
year -> wage
}') %>%
  ggdag() +
  theme_dag()
Adding text = FALSE, use_labels = "name" inside ggdag() makes it easier to read
dagitty('dag {
bb="0,0,1,1"
background [pos="0.413,0.335"]
compulsory_schooling_laws [pos="0.544,0.076"]
educ [exposure,pos="0.185,0.121"]
job_connections [pos="0.302,0.510"]
location [pos="0.571,0.431"]
u1 [pos="0.539,0.206"]
wage [outcome,pos="0.552,0.761"]
year [pos="0.197,0.697"]
background -> educ
background -> wage
compulsory_schooling_laws -> educ
educ -> job_connections
educ -> wage
job_connections -> wage
location -> educ
location -> wage
u1 -> background
u1 -> location
year -> educ
year -> wage
}') %>%
  ggdag(., text = FALSE, use_labels = "name") +
  theme_dag()
If you have defined X (exposure) and Y (outcome), you can use ggdag_paths() to have it show all possible paths between X and Y!
dagify(wage ~ educ + conx + year + bckg + loc,
       educ ~ bckg + year + loc + laws,
       conx ~ educ,
       bckg ~ u1,
       loc ~ u1,
       exposure = "educ",
       outcome = "wage"
       ) %>%
  tidy_dagitty(seed = 2) %>%
  ggdag_paths() +
  theme_dag()
If you have defined X (exposure) and Y (outcome), you can use ggdag_adjustment_set() to have it show you what you need to control for in order to identify X→Y!
dagify(wage ~ educ + conx + year + bckg + loc,
       educ ~ bckg + year + loc + laws,
       conx ~ educ,
       bckg ~ u1,
       loc ~ u1,
       exposure = "educ",
       outcome = "wage"
       ) %>%
  ggdag_adjustment_set(shadow = TRUE) +
  theme_dag()
Use impliedConditionalIndependencies() from the dagitty package to have it show the testable implications (as on dagitty.net)
library(dagitty)
dagify(wage ~ educ + conx + year + bckg + loc,
       educ ~ bckg + year + loc + laws,
       conx ~ educ,
       bckg ~ u1,
       loc ~ u1,
       exposure = "educ",
       outcome = "wage"
       ) %>%
  impliedConditionalIndependencies()
## bckg _||_ conx | educ
## bckg _||_ laws
## bckg _||_ loc | u1
## bckg _||_ year
## conx _||_ laws | educ
## conx _||_ loc | educ
## conx _||_ u1 | bckg, loc
## conx _||_ u1 | educ
## conx _||_ year | educ
## educ _||_ u1 | bckg, loc
## laws _||_ loc
## laws _||_ u1
## laws _||_ wage | bckg, educ, loc, year
## laws _||_ year
## loc _||_ year
## u1 _||_ wage | bckg, loc
## u1 _||_ year
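As a sketch of how to take one of these implications to data (simulated data here purely for illustration; with real data you would use your own data frame), consider conx _||_ year | educ: controlling for educ, year should add nothing for predicting conx. The dagitty package can also run all such checks at once with localTests().

set.seed(42)

# simulate data roughly consistent with the DAG (illustration only)
n    <- 5000
bckg <- rnorm(n); year <- rnorm(n); loc <- rnorm(n); laws <- rnorm(n)
educ <- 0.5*bckg + 0.3*year + 0.4*loc + 0.6*laws + rnorm(n)
conx <- 0.8*educ + rnorm(n)
wage <- 0.7*educ + 0.5*conx + 0.4*bckg + 0.2*year + 0.3*loc + rnorm(n)

# test conx _||_ year | educ: conditional on educ, the coefficient on year
# should be statistically indistinguishable from zero
summary(lm(conx ~ year + educ))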
How do dagitty.net and ggdag
know how to identify effects, or what to control for, or what implications are testable?
Comes from fancy math called “do-calculus”
Typical notation:
X is the independent variable of interest
Y is the dependent or "response" variable
Other variables use other letters
You can of course use words instead of letters!
Arrows indicate causal effect (& direction)
Two types of causal effect:
Direct effects: X→Y
Indirect effects: X→M→Y
You of course might have both!
Z is a “confounder” of X→Y: it causes both X and Y
cor(X,Y) is made up of two parts:
Failing to control for Z will bias our estimate of the causal effect of X→Y!
Y_i = β_0 + β_1 X_i
By leaving out Z_i, this regression is biased
The estimate β̂_1 picks up both:
A causal “front-door” path: X→Y
A non-causal “back-door” path: X←Z→Y
† Regardless of the directions of the arrows!
Ideally, if we ran a randomized controlled trial and randomly assigned different values of X to different individuals, this would delete the arrow between Z and X
This would only leave the front-door, X→Y
But we can rarely run an ideal RCT
Instead of an RCT, if we can just “adjust for” or “control for” Z, we can block the back-door path X←Z→Y
This would only leave the front-door path open, X→Y
“As good as” an RCT!
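A minimal simulation sketch of this logic (all names and coefficient values are invented): Z causes both X and Y, the short regression of Y on X picks up the back-door path, and adding Z as a control closes it.

set.seed(1)
n <- 10000
Z <- rnorm(n)              # confounder
X <- 1.5*Z + rnorm(n)      # Z -> X
Y <- 2*X + 3*Z + rnorm(n)  # X -> Y (true causal effect = 2), Z -> Y

coef(lm(Y ~ X))["X"]       # biased: front-door plus back-door (well above 2)
coef(lm(Y ~ X + Z))["X"]   # back door blocked by controlling for Z: about 2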
Using our terminology from last class, we have an outcome (Y) and some treatment
But there are other unobserved factors (u)
Y_i = β_0 + β_1 Treatment_i + u_i
For β_1 to be causal, we need cor(treatment, u) = 0
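A quick sketch of why random assignment delivers this condition (simulated, with an assumed true effect of 2): a coin-flip treatment is unrelated to u by construction, so the simple regression recovers the causal effect without any controls.

set.seed(2)
n <- 10000
u <- rnorm(n)                         # unobserved factors
treatment <- rbinom(n, 1, 0.5)        # randomly assigned, so independent of u
Y <- 5 + 2*treatment + u              # assumed true treatment effect = 2

cor(treatment, u)                     # approximately 0 by design
coef(lm(Y ~ treatment))["treatment"]  # approximately 2, no controls needed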
Controlling for a single variable along a long causal path is sufficient to block that path!
Causal path: X→Y
Backdoor path: X←A→B→C→Y
It is sufficient to block this back-door by controlling for either A or B or C!
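A sketch checking this claim with the dagitty package (the DAG below is the one just described, X ← A → B → C → Y plus X → Y): adjustmentSets() should report {A}, {B}, and {C} as alternative minimal sufficient sets.

library(dagitty)

g <- dagitty("dag {
  A -> X
  A -> B
  B -> C
  C -> Y
  X -> Y
}")

# minimal sets of controls that close the back-door path X <- A -> B -> C -> Y
adjustmentSets(g, exposure = "X", outcome = "Y")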
To identify the causal effect of X→Y:
“Back-door criterion”: control for the minimal amount of variables sufficient to ensure that no open back-door exists between X and Y
Example: in this DAG, control for Z
1) You only need to control for the variables that keep a back-door open, not all other variables!
Example:
X←A→B→Y (back-door)
Need only control for A or B to block the back-door path
2) Exception: the case of a “collider”
Example:
Example: Are you less likely to get the flu if you are hit by a bus?
Hos: being in the hospital
Both Flu and Bus send you to Hos (arrows)
Conditional on being in Hos, negative correlation between Flu and Bus (spurious!)
In the NBA, players’ height has no relationship to points scored
Naturally, taller people score more points in a basketball game, but if you only look at NBA players, that relationship goes away
A person being in the NBA is a collider! Colliders are another way to see selection bias
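A simulation sketch of the flu/bus collider (all probabilities invented): flu and bus accidents are unrelated in the population, but among hospital patients they become negatively correlated.

set.seed(3)
n   <- 100000
flu <- rbinom(n, 1, 0.10)               # 10% catch the flu
bus <- rbinom(n, 1, 0.01)               # 1% get hit by a bus, independent of flu
hos <- as.numeric(flu == 1 | bus == 1)  # either condition sends you to the hospital

cor(flu, bus)                           # roughly 0 in the whole population
cor(flu[hos == 1], bus[hos == 1])       # negative among hospital patients: collider bias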
Example:
X→M→Y (front-door), X←B→Y (back-door)
Should we control for M?
If we control for M, we would block the front-door!
If we can estimate X→M and M→Y (note, no back-doors to either of these!), we can estimate X→Y
The tobacco industry claimed that cor(smoking, cancer) could be spurious due to a confounding gene that affects both!
gene is unobservable
Suppose smoking causes tar buildup in lungs, which causes cancer
We should not control for tar, it's on the front-door path
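A simulation sketch of the front-door logic for this story (all coefficients invented): even with gene unobserved, chaining the smoking → tar effect with the tar → cancer effect (controlling for smoking) recovers the causal effect of smoking on cancer.

set.seed(4)
n       <- 10000
gene    <- rnorm(n)                     # unobserved confounder
smoking <- 0.8*gene + rnorm(n)          # gene -> smoking
tar     <- 1.5*smoking + rnorm(n)       # smoking -> tar (front-door path)
cancer  <- 2*tar + 1.2*gene + rnorm(n)  # tar -> cancer, gene -> cancer
# true causal effect of smoking on cancer = 1.5 * 2 = 3

coef(lm(cancer ~ smoking))["smoking"]                 # naive estimate: biased by gene
b_sm_tar <- coef(lm(tar ~ smoking))["smoking"]        # smoking -> tar: no open back door
b_tar_ca <- coef(lm(cancer ~ tar + smoking))["tar"]   # tar -> cancer, back door blocked by controlling for smoking
b_sm_tar * b_tar_ca                                   # about 3: the front-door estimate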
Thus, to achieve causal identification, control for the minimal set of variables such that:
No back-door path remains open
No front-door path is closed