class: center, middle, inverse, title-slide # 2.1 — Data 101 & Descriptive Statistics ## ECON 480 • Econometrics • Fall 2021 ### Ryan Safner
Assistant Professor of Economics
safner@hood.edu
ryansafner/metricsF21
metricsF21.classes.ryansafner.com
--- class: inverse # Outline ## [The Two Big Problems with Data](#3) ## [Data 101](#10) ## [Descriptive Statistics](#31) ## [Measures of Center](#39) ## [Measures of Dispersion](#59) --- class: inverse, center, middle # The Two Big Problems with Data --- # Two Big Problems with Data .pull-left[ - We want to use econometrics to .hi[identify] causal relationships and make .hi[inferences] about them 1. Problem for .hi[identification]: .hi-purple[endogeneity] 2. Problem for .hi[inference]: .hi-purple[randomness] ] .pull-right[ .center[ ![:scale 90%](../images/randomimage.jpg)] ] --- # Identification Problem: Endogeneity .pull-left[ - An independent variable `\((X)\)` is .hi-purple[exogenous] if its variation is .hi-turquoise[unrelated] to other factors that affect the dependent variable `\((Y)\)` - An independent variable `\((X)\)` is .hi-purple[endogenous] if its variation is .hi-turquoise[related] to other factors that affect the dependent variable `\((Y)\)` - Note: unfortunately this is different from how economists talk about endogenous vs. exogenous variables in theoretical models... ] .pull-right[ .center[ ![](../images/causality.jpg) ] ] --- # Identification Problem: Endogeneity .pull-left[ - An independent variable `\((X)\)` is .hi-purple[exogenous] if its variation is .hi-turquoise[unrelated] to other factors that affect the dependent variable `\((Y)\)` ] .pull-right[ <img src="2.1-slides_files/figure-html/unnamed-chunk-1-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Identification Problem: Endogeneity .pull-left[ - An independent variable `\((X)\)` is .hi-purple[endogenous] if its variation is .hi-turquoise[related] to other factors that affect the dependent variable `\((Y)\)`, e.g. 
`\(Z\)` ] .pull-right[ <img src="2.1-slides_files/figure-html/unnamed-chunk-2-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Inference Problem: Randomness .pull-left[ .smaller[ - Data are .hi-purple[random] due to .hi-purple[natural sampling variation] - Taking one sample of a population will yield slightly different information than another sample of the same population - Common in statistics, *easy to fix* - .hi[Inferential Statistics]: making claims about a wider population using sample data - We use common tools and techniques to deal with randomness ] ] .pull-right[ .center[ ![:scale 90%](https://www.dropbox.com/s/bsdtuddzjouwzr1/sampling.jpg?raw=1) ] ] --- # The Two Problems: Where We're Heading...Ultimately .center[ .b[Sample] `\(\color{#6A5ACD}{\xrightarrow{\text{statistical inference}}}\)` .b[Population] `\(\color{#e64173}{\xrightarrow{\text{causal identification}}}\)` .b[Unobserved Parameters] ] - We want to .hi[identify] causal relationships between **population** variables - Logically the first thing to consider - .hi-purple[Endogeneity problem] - We'll use **sample** *statistics* to .hi-purple[infer] something about population *parameters* - In practice, we'll only ever have a finite *sample distribution* of data - We *don't* know the *population distribution* of data - .hi-purple[Randomness problem] --- class: inverse, center, middle # Data 101 --- # Data 101 .pull-left[ - .hi[Data] are information with context - .hi[Individuals] are the entities described by a set of data - e.g. persons, households, firms, countries ] .pull-right[ ![](../images/individual1.jpg) ] --- # Data 101 .pull-left[ .smallest[ - .hi[Variables] are particular characteristics of an individual - e.g. age, income, profits, population, GDP, marital status, type of legal institutions - .hi[Observations] or .hi[cases] are the separate individuals described by a collection of variables - e.g. for one individual, we have their age, sex, income, education, etc. 
- individuals and observations are *not necessarily* the same: - e.g. we can have multiple observations on the same individual over time ] ] .pull-right[ ![](../images/individual1.jpg) ] --- # Categorical Data .pull-left[ - .hi[Categorical data] place an individual into one of several possible *categories* - e.g. sex, season, political party - may be responses to survey questions - can be quantitative (e.g. age, zip code) - In `R`: `character` or `factor` type data - `factor` `\(\implies\)` specific possible categories ] .pull-right[ ![](../images/categoricaldata.png) ] --- # Categorical Data: Visualizing I .pull-left[ ```r diamonds %>% count(cut) %>% mutate(frequency = n / sum(n), percent = round(frequency * 100, 2)) ``` <table> <caption>Summary of diamonds by cut</caption> <thead> <tr> <th style="text-align:left;"> cut </th> <th style="text-align:right;"> n </th> <th style="text-align:right;"> frequency </th> <th style="text-align:right;"> percent </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Fair </td> <td style="text-align:right;"> 1610 </td> <td style="text-align:right;"> 0.0298480 </td> <td style="text-align:right;"> 2.98 </td> </tr> <tr> <td style="text-align:left;"> Good </td> <td style="text-align:right;"> 4906 </td> <td style="text-align:right;"> 0.0909529 </td> <td style="text-align:right;"> 9.10 </td> </tr> <tr> <td style="text-align:left;"> Very Good </td> <td style="text-align:right;"> 12082 </td> <td style="text-align:right;"> 0.2239896 </td> <td style="text-align:right;"> 22.40 </td> </tr> <tr> <td style="text-align:left;"> Premium </td> <td style="text-align:right;"> 13791 </td> <td style="text-align:right;"> 0.2556730 </td> <td style="text-align:right;"> 25.57 </td> </tr> <tr> <td style="text-align:left;"> Ideal </td> <td style="text-align:right;"> 21551 </td> <td style="text-align:right;"> 0.3995365 </td> <td style="text-align:right;"> 39.95 </td> </tr> </tbody> </table> ] .pull-right[ - Good way to represent categorical data is 
with a .hi[frequency table] - .hi-purple[Count (n)]: total number of individuals in a category - .hi-purple[Frequency]: **proportion** of a category's occurrence relative to all data - Multiply proportions by 100% to get **percentages** ] --- # Categorical Data: Visualizing II .pull-left[ - .hi-purple[Charts and graphs are *always* better ways to visualize data] - A .hi[bar graph] represents categories as bars, with lengths proportional to the count or relative frequency of each category ```r ggplot(diamonds, aes(x=cut, fill=cut))+ geom_bar()+ guides(fill="none")+ theme_pander(base_family = "Fira Sans Condensed", base_size=20) ``` ] .pull-right[ <img src="2.1-slides_files/figure-html/unnamed-chunk-6-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Categorical Data: Visualizing III .pull-left[ - Avoid pie charts! - People are *not* good at judging 2-d differences (angles, area) - People *are* good at judging 1-d differences (length) ] -- .pull-right[ .center[ ![](../images/piechart.jpg) ] ] --- # Categorical Data: Visualizing IV .pull-left[ - Maybe a *stacked bar chart* ```r diamonds %>% count(cut) %>% ggplot(data = .)+ aes(x = "", y = n)+ geom_col(aes(fill = cut))+ geom_label(aes(label = cut, color = cut), position = position_stack(vjust = 0.5) )+ guides(color = "none", fill = "none")+ theme_void() ``` ] .pull-right[ <img src="2.1-slides_files/figure-html/unnamed-chunk-8-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Categorical Data: Visualizing IV .pull-left[ - Maybe a *lollipop chart* ```r diamonds %>% count(cut) %>% mutate(cut_name = as.factor(cut)) %>% ggplot(., aes(x = cut_name, y = n, color = cut))+ geom_point(stat="identity", fill="black", size=12) + geom_segment(aes(x = cut_name, y = 0, xend = cut_name, yend = n), size = 2)+ geom_text(aes(label = n),color="white", size=3) + coord_flip()+ labs(x = "Cut")+ theme_pander(base_family = "Fira Sans Condensed", base_size=20)+ guides(color = "none") ``` ] .pull-right[ <img 
src="2.1-slides_files/figure-html/unnamed-chunk-10-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Categorical Data: Visualizing IV .pull-left[ - Maybe a *treemap* ```r library(treemapify) diamonds %>% count(cut) %>% ggplot(., aes(area = n, fill = cut)) + geom_treemap() + guides(fill = FALSE) + geom_treemap_text(aes(label = cut), colour = "white", place = "topleft", grow = TRUE) ``` ] .pull-right[ <img src="2.1-slides_files/figure-html/unnamed-chunk-12-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Quantitative Data I .pull-left[ .smallest[ - .hi[Quantitative variables] take on numerical values of equal units that describe an individual - Units: points, dollars, inches - Context: GPA, prices, height - We can mathematically manipulate *only* quantitative data - e.g. sum, average, standard deviation - In `R`: `numeric` type data - `integer` if whole number - `double` if has decimals ] ] .pull-right[ ![:scale 75%](../images/mathoperations.jpg) ] --- # Discrete Data .pull-left[ - .hi[Discrete data] are finite, with a countable number of alternatives - .hi-purple[Categorical]: place data into categories - e.g. letter grades: A, B, C, D, F - e.g. class level: freshman, sophomore, junior, senior - .hi-purple[Quantitative]: integers - e.g. SAT Score, number of children, age (years) ] .pull-right[ ![](../images/buildingblocks.jpeg) ] --- # Continuous Data .pull-left[ - .hi[Continuous data] are infinitely divisible, with an uncountable number of alternatives - e.g. weight, length, temperature, GPA - Many discrete variables may be treated as if they are continuous - e.g. 
SAT scores (whole points), wages (dollars and cents) ] .pull-right[ .center[ ![:scale 90%](../images/continuous.png) ] ] --- # Spreadsheets .pull-left[ <table class="table table-striped table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> ID </th> <th style="text-align:left;"> Name </th> <th style="text-align:right;"> Age </th> <th style="text-align:left;"> Sex </th> <th style="text-align:right;"> Income </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> John </td> <td style="text-align:right;"> 23 </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 41000 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> Emile </td> <td style="text-align:right;"> 18 </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 52600 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Natalya </td> <td style="text-align:right;"> 28 </td> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 48000 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> Lakisha </td> <td style="text-align:right;"> 31 </td> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 60200 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> Cheng </td> <td style="text-align:right;"> 36 </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 81900 </td> </tr> </tbody> </table> ] .pull-right[ - The most common data structure we use is a .hi[spreadsheet] - In *R*: a `data.frame` or `tibble` - A .hi-purple[row] contains data about all variables for a single .hi-purple[individual] - A .hi-purple[column] contains data about a single .hi-purple[variable] across all individuals ] --- # Spreadsheets .pull-left[ <table class="table table-striped table-hover" 
style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> ID </th> <th style="text-align:left;"> Name </th> <th style="text-align:right;"> Age </th> <th style="text-align:left;"> Sex </th> <th style="text-align:right;"> Income </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> John </td> <td style="text-align:right;"> 23 </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 41000 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> Emile </td> <td style="text-align:right;"> 18 </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 52600 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Natalya </td> <td style="text-align:right;"> 28 </td> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 48000 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> Lakisha </td> <td style="text-align:right;"> 31 </td> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 60200 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> Cheng </td> <td style="text-align:right;"> 36 </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 81900 </td> </tr> </tbody> </table> ] .pull-right[ - Each .hi-purple[cell] can be referenced by its row and column (in that order!), `df[row,column]` ```r example[3,2] # value in row 3, column 2 ``` ``` ## # A tibble: 1 × 1 ## Name ## <chr> ## 1 Natalya ``` - Recall [how to “subset” data frames](https://metricsf21.classes.ryansafner.com/slides/1.2-slides#67) from 1.2; though it’s now much easier with `filter()` and `select()`! 
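A rough sketch of both approaches, assuming the `example` tibble is rebuilt from the table on this slide; the filter condition and the selected columns are chosen only for illustration:

```r
library(dplyr)

# Rebuild the example data from the table on this slide
example <- tibble(
  ID     = 1:5,
  Name   = c("John", "Emile", "Natalya", "Lakisha", "Cheng"),
  Age    = c(23, 18, 28, 31, 36),
  Sex    = c("Male", "Male", "Female", "Female", "Male"),
  Income = c(41000, 52600, 48000, 60200, 81900)
)

# Base R: the cell in row 3, column 2
example[3, 2]

# dplyr: keep rows by condition, then keep columns by name
over_25 <- example %>%
  filter(Age > 25) %>%   # illustrative condition
  select(Name, Income)   # illustrative column choice

over_25
```
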
] --- # Spreadsheets II - It is common to use some notation like the following: - Let `\(\{x_1, x_2, \cdots, x_n\}\)` be a simple data series on variable `\(X\)` - `\(n\)` individual observations - `\(x_i\)` is the value of the `\(i\)`<sup>th</sup> observation for `\(i=1,2,\cdots, n\)` -- .content-box-blue[ .blue[**Quick Check**]: Let `\(x\)` represent the score on a homework assignment: `$$75, 100, 92, 87, 79, 0, 95$$` 1. What is `\(n\)`? 2. What is `\(x_1\)`? 3. What is `\(x_6\)`? ] --- # Datasets: Cross-Sectional .pull-left[ <table> <thead> <tr> <th style="text-align:right;"> ID </th> <th style="text-align:left;"> Name </th> <th style="text-align:right;"> Age </th> <th style="text-align:left;"> Sex </th> <th style="text-align:right;"> Income </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> John </td> <td style="text-align:right;"> 23 </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 41000 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> Emile </td> <td style="text-align:right;"> 18 </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 52600 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Natalya </td> <td style="text-align:right;"> 28 </td> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 48000 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> Lakisha </td> <td style="text-align:right;"> 31 </td> <td style="text-align:left;"> Female </td> <td style="text-align:right;"> 60200 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> Cheng </td> <td style="text-align:right;"> 36 </td> <td style="text-align:left;"> Male </td> <td style="text-align:right;"> 81900 </td> </tr> </tbody> </table> ] .pull-right[ - .hi[Cross-sectional data]: observations of individuals at a given point in time - Each 
observation is a unique individual `$$x_i$$` - Simplest and most common data - A .hi-purple["snapshot"] to compare differences across individuals ] --- # Datasets: Time-Series .pull-left[ <table> <thead> <tr> <th style="text-align:right;"> Year </th> <th style="text-align:right;"> GDP </th> <th style="text-align:right;"> Unemployment </th> <th style="text-align:right;"> CPI </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1950 </td> <td style="text-align:right;"> 8.2 </td> <td style="text-align:right;"> 0.06 </td> <td style="text-align:right;"> 100 </td> </tr> <tr> <td style="text-align:right;"> 1960 </td> <td style="text-align:right;"> 9.9 </td> <td style="text-align:right;"> 0.04 </td> <td style="text-align:right;"> 118 </td> </tr> <tr> <td style="text-align:right;"> 1970 </td> <td style="text-align:right;"> 10.2 </td> <td style="text-align:right;"> 0.08 </td> <td style="text-align:right;"> 130 </td> </tr> <tr> <td style="text-align:right;"> 1980 </td> <td style="text-align:right;"> 12.4 </td> <td style="text-align:right;"> 0.08 </td> <td style="text-align:right;"> 190 </td> </tr> <tr> <td style="text-align:right;"> 1985 </td> <td style="text-align:right;"> 13.6 </td> <td style="text-align:right;"> 0.06 </td> <td style="text-align:right;"> 196 </td> </tr> </tbody> </table> ] .pull-right[ - .hi[Time-series data]: observations of the *same* individual(s) over time - Each observation is a time period `$$x_{t}$$` - Often used for macroeconomics, finance, and forecasting - Unique challenges for time series - A .hi-purple["moving picture"] to see how individuals change over time ] --- # Datasets: Panel .pull-left[ <table> <thead> <tr> <th style="text-align:left;"> City </th> <th style="text-align:right;"> Year </th> <th style="text-align:right;"> Murders </th> <th style="text-align:right;"> Population </th> <th style="text-align:right;"> UR </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Philadelphia </td> <td style="text-align:right;"> 
1986 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 3.700 </td> <td style="text-align:right;"> 8.7 </td> </tr> <tr> <td style="text-align:left;"> Philadelphia </td> <td style="text-align:right;"> 1990 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 4.200 </td> <td style="text-align:right;"> 7.2 </td> </tr> <tr> <td style="text-align:left;"> D.C. </td> <td style="text-align:right;"> 1986 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0.250 </td> <td style="text-align:right;"> 5.4 </td> </tr> <tr> <td style="text-align:left;"> D.C. </td> <td style="text-align:right;"> 1990 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 0.275 </td> <td style="text-align:right;"> 5.5 </td> </tr> <tr> <td style="text-align:left;"> New York </td> <td style="text-align:right;"> 1986 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 6.400 </td> <td style="text-align:right;"> 9.6 </td> </tr> </tbody> </table> ] .pull-right[ .smaller[ - .hi[Panel], or .hi[longitudinal] dataset: a time-series for *each* cross-sectional entity - Must be *same* individuals over time - Each obs. is an individual in a time period `$$x_{it}$$` - More common today for serious researchers; unique challenges and benefits - A .hi-purple[combination] of "snapshot" comparisons over time ] ] --- class: inverse, center, middle # Descriptive Statistics --- # Variables and Distributions - Variables take on different values, we can describe a variable's .hi[distribution] (of these values) - We want to *visualize* and *analyze* distributions to search for meaningful patterns using **statistics** --- # Two Branches of Statistics .pull-left[ - Two main branches of statistics: 1. .hi[Descriptive Statistics:] describes or summarizes the properties of a sample 2. 
.hi[Inferential Statistics:] infers properties about a larger population from the properties of a sample<sup>.magenta[†]</sup> ] .pull-right[ .center[ ![](../images/statsgraphs.jpg) ] ] .footnote[<sup>.magenta[†]</sup> We'll encounter inferential statistics mainly in the context of regression later.] --- # Histograms .pull-left[ - A common way to present a *quantitative* variable's distribution is a .hi[histogram] - The quantitative analog to the bar graph for a categorical variable - Divide up values into **bins** of a certain size, and count the number of values falling within each bin, representing them visually as bars ] .pull-right[ <img src="2.1-slides_files/figure-html/unnamed-chunk-15-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Histogram: Example .pull-left[ .smallest[ .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: a class of 13 students takes a quiz (out of 100 points) with the following results: `$$\{ 0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95 \}$$` ] ] ] --- # Histogram: Example .pull-left[ .smallest[ .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: a class of 13 students takes a quiz (out of 100 points) with the following results: `$$\{ 0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95 \}$$` ] ] .code50[ ```r quizzes<-tibble(scores = c(0,62,66,71,71,74,76,79,83,86,88,93,95)) ``` ] ] --- # Histogram: Example .pull-left[ .smallest[ .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: a class of 13 students takes a quiz (out of 100 points) with the following results: `$$\{ 0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95 \}$$` ] ] .code50[ ```r h<-ggplot(quizzes,aes(x=scores))+ geom_histogram(breaks = seq(0,100,10), color = "white", fill = "#56B4E9")+ scale_x_continuous(breaks = seq(0,100,10))+ scale_y_continuous(limits = c(0,6), expand = c(0,0))+ labs(x = "Scores", y = "Number of Students")+ ggthemes::theme_pander(base_family = 
"Fira Sans Condensed", base_size=20) h ``` ] ] .pull-right[ .center[ <img src="2.1-slides_files/figure-html/unnamed-chunk-16-1.png" width="504" style="display: block; margin: auto;" /> ] ] --- # Descriptive Statistics .pull-left[ - We are often interested in the *shape* or *pattern* of a distribution, particularly: - Measures of **center** - Measures of **dispersion** - **Shape** of distribution ] .pull-right[ .center[ ![](../images/statsgraphs.jpg) ] ] --- class: inverse, center, middle # Measures of Center --- # Mode - The .hi[mode] of a variable is simply its most frequent value - A variable can have multiple modes -- .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: a class of 13 students takes a quiz (out of 100 points) with the following results: `$$\{ 0, 62, 66, \mathbf{71}, \mathbf{71}, 74, 76, 79, 83, 86, 88, 93, 95 \}$$` ] --- # Mode .pull-left[ - There is no dedicated `mode()` function in `R`, surprisingly - A workaround in `dplyr`: ```r quizzes %>% count(scores) %>% arrange(desc(n)) ``` ] .pull-right[ ``` ## # A tibble: 12 × 2 ## scores n ## <dbl> <int> ## 1 71 2 ## 2 0 1 ## 3 62 1 ## 4 66 1 ## 5 74 1 ## 6 76 1 ## 7 79 1 ## 8 83 1 ## 9 86 1 ## 10 88 1 ## 11 93 1 ## 12 95 1 ``` ] --- # Multi-Modal Distributions .pull-left[ - Looking at a histogram, the modes are the "peaks" of the distribution - Note: depends on how wide you make the bins! 
- May be unimodal, bimodal, trimodal, etc. ] .pull-right[ <img src="2.1-slides_files/figure-html/unnamed-chunk-18-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Symmetry and Skew I .pull-left[ - A distribution is **symmetric** if it looks roughly the same on either side of the "center" - The thinner ends (far left and far right) are called the **tails** of a distribution ] .pull-right[ <img src="2.1-slides_files/figure-html/unnamed-chunk-19-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Symmetry and Skew II .pull-left[ - If one tail stretches farther than the other, the distribution is **skewed** in the direction of the longer tail ] .pull-right[ <img src="2.1-slides_files/figure-html/unnamed-chunk-20-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Outliers .pull-left[ - .hi[Outlier]: extreme value that does not appear to be part of the general pattern of a distribution - Can strongly affect descriptive statistics - Might be the most informative part of the data - Could be the result of errors - Should always be explored and discussed! ] .pull-right[ <img src="2.1-slides_files/figure-html/unnamed-chunk-21-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Arithmetic Mean (Population) - The natural measure of the center of a *population*'s distribution is its .hi["average"] or .hi[arithmetic mean `\\((\mu)\\)`] `$$\mu=\frac{x_1+x_2+...+x_N}{N} = \frac{1}{N} \sum^N_{i=1} x_i$$` - For `\(N\)` values of variable `\(x\)`, "mu" is the sum of all individual `\(x\)` values `\((x_i)\)` from 1 to `\(N\)`, divided by the number of values, `\(N\)`<sup>.magenta[†]</sup> - See [today's class notes](/content/2.1-content) for more about the .hi-purple[summation operator, `\\(\displaystyle\Sigma\\)`]; it'll come up again! .footnote[<sup>.magenta[†]</sup> Note the mean need not be an actual value of the data!] 
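The summation formula can be sketched directly in `R`. This is a minimal illustration using the quiz scores from the earlier example, treated here as if they were the whole population (an assumption made only for this sketch); the names `x`, `N`, and `mu` are chosen for the illustration.

```r
# Quiz scores from the earlier example, treated as a full population
x <- c(0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95)

N  <- length(x)    # number of values
mu <- sum(x) / N   # (1/N) times the sum of all x_i

mu                      # the mean computed by hand
all.equal(mu, mean(x))  # agrees with the built-in mean()
```

Note that the result is not one of the thirteen scores, which is the point of the footnote.
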
--- # Arithmetic Mean (Sample) .smaller[ - When we have a *sample*, we compute the .hi[sample mean `\\((\bar{x})\\)`] `$$\bar{x}=\frac{x_1+x_2+...+x_n}{n} = \frac{1}{n} \sum^n_{i=1} x_i$$` - For `\(n\)` values of variable `\(x\)`, "x-bar" is the sum of all individual `\(x\)` values `\((x_i)\)` divided by the `\(n\)` number of values ] -- .pull-left[ .tiny[ .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: `$$\{0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95\}$$` `$$\begin{align*} \bar{x}&=\frac{1}{13}(0+62+66+71+71+74+76+79+83+86+88+93+95)\\ \bar{x}&=\frac{944}{13}\\ \bar{x}&=72.62\\ \end{align*}$$` ] ] ] -- .pull-right[ .code50[ ```r quizzes %>% summarize(mean=mean(scores)) ``` ``` ## # A tibble: 1 × 1 ## mean ## <dbl> ## 1 72.6 ``` ] ] --- # Arithmetic Mean: Affected by Outliers - If we drop the outlier (0) -- .pull-left[ .tiny[ .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: `$$\{62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95\}$$` `$$\begin{align*} \bar{x}&=\frac{1}{12}(62+66+71+71+74+76+79+83+86+88+93+95)\\ &=\frac{944}{12}\\ &=78.67\\ \end{align*}$$` ] ] ] -- .pull-right[ .code50[ ```r quizzes %>% filter(scores>0) %>% summarize(mean=mean(scores)) ``` ``` ## # A tibble: 1 × 1 ## mean ## <dbl> ## 1 78.7 ``` ] ] --- # Median `$$\{0, 62, 66, 71, 71, 74, \mathbf{76}, 79, 83, 86, 88, 93, 95\}$$` - The .hi[median] is the midpoint of the distribution - 50% to the left of the median, 50% to the right of the median - Arrange values in numerical order - For odd `\(n\)`: median is middle observation - For even `\(n\)`: median is average of two middle observations --- # Mean, Median, and Outliers .center[ ![](../images/meanoutliers.jpg) ] --- # Mean, Median, Symmetry, Skew I .pull-left[ - Symmetric distribution: mean `\(\approx\)` median ```r symmetric %>% summarize(mean = mean(x), median = median(x)) ``` ``` ## # A tibble: 1 × 2 ## mean median ## <dbl> <dbl> ## 1 4 4 ``` ] .pull-right[ <img 
src="2.1-slides_files/figure-html/unnamed-chunk-26-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Mean, Median, Symmetry, Skew II .pull-left[ - Left-skewed: mean `\(<\)` median ```r leftskew %>% summarize(mean = mean(x), median = median(x)) ``` ``` ## mean median ## 1 4.615385 5 ``` ] .pull-right[ <img src="2.1-slides_files/figure-html/unnamed-chunk-29-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Mean, Median, Symmetry, Skew III .pull-left[ - Right-skewed: mean `\(>\)` median ```r rightskew %>% summarize(mean = mean(x), median = median(x)) ``` ``` ## # A tibble: 1 × 2 ## mean median ## <dbl> <dbl> ## 1 3.38 3 ``` ] .pull-right[ <img src="2.1-slides_files/figure-html/unnamed-chunk-32-1.png" width="504" style="display: block; margin: auto;" /> ] --- class: inverse, center, middle # Measures of Dispersion --- # Measures of Dispersion: Range - The more *variation* in the data, the less a measure of central tendency will tell us - Beyond just the center, we also want to measure the spread - The simplest metric is the .hi[range] `\(=max-min\)` --- # Measures of Dispersion: 5 Number Summary I - A common set of summary statistics of a distribution is the .hi["five number summary"]: .pull-left[ .smallest[ 1. Minimum value 2. 25<sup>th</sup> percentile `\((Q_1\)`, median of first 50% of data) 3. 50<sup>th</sup> percentile (median, `\(Q_2)\)` 4. 75<sup>th</sup> percentile `\((Q_3\)`, median of last 50% of data) 5. Maximum value ] ] -- .pull-right[ .code60[ ```r # Base R summary command (includes Mean) summary(quizzes$scores) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. 
## 0.00 71.00 76.00 72.62 86.00 95.00 ``` ```r quizzes %>% # dplyr summarize(Min = min(scores), Q1 = quantile(scores, 0.25), Median = median(scores), Q3 = quantile(scores, 0.75), Max = max(scores)) ``` ``` ## # A tibble: 1 × 5 ## Min Q1 Median Q3 Max ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0 71 76 86 95 ``` ] ] --- # Measures of Dispersion: 5 Number Summary II - The `\(p\)`<sup>th</sup> .hi-purple[percentile] of a distribution is the value that places `\(p\)` percent of values beneath it ```r quizzes %>% summarize("37th percentile" = quantile(scores,0.37)) ``` ``` ## # A tibble: 1 × 1 ## `37th percentile` ## <dbl> ## 1 72.3 ``` --- # Boxplots I .pull-left[ .smallest[ - .hi[Boxplots] are a great way to visualize the 5 number summary - **Height of box**: `\(Q_1\)` to `\(Q_3\)` (known as the .hi-purple[interquartile range (IQR)], middle 50% of data) - **Line inside box**: median (50<sup>th</sup> percentile) - **"Whiskers"** extend to the most extreme data points within `\(1.5 \times IQR\)` of the box - Points *beyond* whiskers are .hi-purple[outliers] - common definition: more than `\(1.5 \times IQR\)` below `\(Q_1\)` or above `\(Q_3\)` ] ] .pull-right[ <img src="2.1-slides_files/figure-html/unnamed-chunk-35-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Comparisons I - Boxplots (and five number summaries) are great for comparing two distributions .bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ .green[**Example**]: `$$\begin{align*} \text{Quiz 1}&: \{0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95\} \\ \text{Quiz 2}&: \{50, 62, 72, 73, 79, 81, 82, 82, 86, 90, 94, 98, 99\} \\ \end{align*}$$` ] --- # Comparisons II .pull-left[ ```r quizzes_new %>% summary() ``` ``` ## student quiz_1 quiz_2 ## Min. : 1 Min. : 0.00 Min. :50.00 ## 1st Qu.: 4 1st Qu.:71.00 1st Qu.:73.00 ## Median : 7 Median :76.00 Median :82.00 ## Mean : 7 Mean :72.62 Mean :80.62 ## 3rd Qu.:10 3rd Qu.:86.00 3rd Qu.:90.00 ## Max. :13 Max. :95.00 Max. 
:99.00 ``` ] .pull-right[ <img src="2.1-slides_files/figure-html/unnamed-chunk-38-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Aside: Making Nice Summary Tables I .smallest[ - I don't like the options available for printing out summary statistics - So I wrote my own `R function` called `summary_table()` that makes nice summary tables (it uses `dplyr` and `tidyr`!). To use: 1. Download the `summaries.R` [file](/files/summaries.R) from the website<sup>.magenta[†]</sup> and move it to your working directory/project folder 2. Load the function with the `source()` command:<sup>.magenta[‡]</sup> ] ```r source("summaries.R") ``` .footnote[<sup>.magenta[†]</sup> One day I'll make this part of a package I'll write. <sup>.magenta[‡]</sup> If it *was* a package, then you'd load with `library()`. But you can run a single `.R` script with `source()`.] --- # Aside: Making Nice Summary Tables II 3) The function has at least 2 arguments: the `data.frame` (automatically piped in if you use the pipe!) and then all variables you want to summarize, separated by commas<sup>.magenta[†]</sup> ```r mpg %>% summary_table(hwy, cty, cyl) ``` ``` ## # A tibble: 3 × 9 ## Variable Obs Min Q1 Median Q3 Max Mean `Std. Dev.` ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 cty 234 9 14 17 19 35 16.9 4.26 ## 2 cyl 234 4 4 6 8 8 5.89 1.61 ## 3 hwy 234 12 18 24 27 44 23.4 5.95 ``` .footnote[<sup>.magenta[†]</sup> There is one restriction: No variable name can have an underscore `(_)` in it. You will have to rename them or else you will break the function!] 
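For readers without `summaries.R` at hand, a broadly similar table can be sketched with `dplyr` and `tidyr` alone. This is only an approximation written for illustration, not the code inside `summary_table()`; the object name `summary_sketch` and the intermediate column `value` are hypothetical.

```r
library(dplyr)
library(tidyr)
library(ggplot2)  # provides the mpg data

summary_sketch <- mpg %>%
  select(hwy, cty, cyl) %>%
  # stack the chosen variables into one long column
  pivot_longer(everything(), names_to = "Variable", values_to = "value") %>%
  group_by(Variable) %>%
  summarize(Obs = n(),
            Min = min(value),
            Q1 = quantile(value, 0.25),
            Median = median(value),
            Q3 = quantile(value, 0.75),
            Max = max(value),
            Mean = mean(value),
            `Std. Dev.` = sd(value))

summary_sketch
```

Stacking the variables first means each statistic is written once and applied to every variable through `group_by()`.
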
--- # Aside: Making Nice Summary Tables II 4) When `knit`ted in `R markdown`, it looks nicer: ```r mpg %>% summary_table(hwy, cty, cyl) %>% knitr::kable(., format="html") ``` <table> <thead> <tr> <th style="text-align:left;"> Variable </th> <th style="text-align:right;"> Obs </th> <th style="text-align:right;"> Min </th> <th style="text-align:right;"> Q1 </th> <th style="text-align:right;"> Median </th> <th style="text-align:right;"> Q3 </th> <th style="text-align:right;"> Max </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> Std. Dev. </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> cty </td> <td style="text-align:right;"> 234 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 17 </td> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 35 </td> <td style="text-align:right;"> 16.86 </td> <td style="text-align:right;"> 4.26 </td> </tr> <tr> <td style="text-align:left;"> cyl </td> <td style="text-align:right;"> 234 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 5.89 </td> <td style="text-align:right;"> 1.61 </td> </tr> <tr> <td style="text-align:left;"> hwy </td> <td style="text-align:right;"> 234 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> 27 </td> <td style="text-align:right;"> 44 </td> <td style="text-align:right;"> 23.44 </td> <td style="text-align:right;"> 5.95 </td> </tr> </tbody> </table> - We'll talk more about using `markdown` and making final products nicer when we discuss your paper project (have you forgotten?) 
---

# Measures of Dispersion: Deviations

- Every observation `\(i\)` .hi-purple[deviates] from the mean of the data:

`$$deviation_i = x_i-\mu $$`

- There are as many deviations as there are data points `\((n)\)`

- We can measure the *average* or .hi[standard deviation] of a variable from its mean

- Before we get there...

---

# Variance (Population)

- The .hi[population variance `\\((\sigma^2)\\)`] of a *population* distribution measures the average of the *squared* deviations from the *population* mean `\((\mu)\)`

`$$\sigma^2 = \frac{1}{N}\displaystyle\sum^N_{i=1} (x_i-\mu)^2$$`

- Why do we square deviations?

- What are these units?

---

# Standard Deviation (Population)

- Square root the variance to get the .hi[population standard deviation `\\((\sigma)\\)`], the average deviation from the population mean (in same units as `\(x\)`)

`$$\sigma=\sqrt{\sigma^2}=\sqrt{\frac{1}{N}\displaystyle\sum^N_{i=1} (x_i-\mu)^2 }$$`

---

# Variance (Sample)

- The .hi[sample variance `\\((s^2)\\)`] of a *sample* distribution measures the average of the *squared* deviations from the *sample* mean `\((\bar{x})\)`

`$$s^2 = \frac{1}{n-1}\displaystyle\sum^n_{i=1} (x_i-\bar{x})^2$$`

- Why do we divide by `\(n-1\)`?
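---

# Variance (Sample): Dividing by n-1, Simulated

One common answer: the observations in a sample sit closer to their own mean `\((\bar{x})\)` than to the population mean `\((\mu)\)`, so dividing the squared deviations by `\(n\)` tends to understate the population variance, and dividing by `\(n-1\)` corrects for this on average. A small simulation can illustrate it. This is only a sketch under assumptions that are not in these slides: a normal population with mean 0 and standard deviation 2 (so a true variance of 4), samples of size 5, and made-up object names.

```r
set.seed(480) # arbitrary seed, so the results are reproducible

n <- 5        # size of each sample (assumed for illustration)
reps <- 10000 # number of samples to draw

# draw many samples from a population whose true variance is 4 (sd = 2)
# each column of the resulting matrix is one sample
samples <- replicate(reps, rnorm(n, mean = 0, sd = 2))

# for each sample (column), average the squared deviations two ways
divide_by_n   <- apply(samples, 2, function(x) sum((x - mean(x))^2) / n)
divide_by_nm1 <- apply(samples, 2, var) # var() divides by n-1

mean(divide_by_n)   # tends to fall short of 4 (near 4*(n-1)/n = 3.2)
mean(divide_by_nm1) # tends to land close to 4
```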
---

# Standard Deviation (Sample)

- Square root the sample variance to get the .hi[sample standard deviation `\\((s)\\)`], the average deviation from the *sample* mean (in same units as `\(x\)`)

`$$s=\sqrt{s^2}=\sqrt{\frac{1}{n-1}\displaystyle\sum^n_{i=1} (x_i-\bar{x})^2 }$$`

---

# Sample Standard Deviation: Example

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[
.green[**Example**]: Calculate the sample standard deviation for the following series:

`$$\{2, 4, 6, 8, 10 \}$$`
]

--

```r
sd(c(2,4,6,8,10))
```

```
## [1] 3.162278
```

---

# The Steps to Calculate sd(), Coded I

```r
# first let's save our data in a tibble
sd_example<-tibble(x=c(2,4,6,8,10))

# then find the mean (just so we know)
sd_example %>%
  summarize(mean(x))
```

```
## # A tibble: 1 × 1
##   `mean(x)`
##       <dbl>
## 1         6
```

```r
# now let's make some more columns:
sd_example <- sd_example %>%
  mutate(deviations = x-mean(x), # take deviations from mean
         deviations_sq = deviations^2) # square them
```

---

# The Steps to Calculate sd(), Coded II

.pull-left[

```r
sd_example # see what we made
```
]

.pull-right[

```
## # A tibble: 5 × 3
##       x deviations deviations_sq
##   <dbl>      <dbl>         <dbl>
## 1     2         -4            16
## 2     4         -2             4
## 3     6          0             0
## 4     8          2             4
## 5    10          4            16
```
]

---

# The Steps to Calculate sd(), Coded III

.pull-left[

```r
sd_example %>%
  # sum the squared deviations
  summarize(sum_sq_devs = sum(deviations_sq),
            # divide by n-1 to get variance
            variance = sum_sq_devs/(n()-1),
            # square root to get sd
            std_dev = sqrt(variance))
```
]

.pull-right[

```
## # A tibble: 1 × 3
##   sum_sq_devs variance std_dev
##         <dbl>    <dbl>   <dbl>
## 1          40       10    3.16
```
]

---

# Sample Standard Deviation: You Try

.content-box-blue[
.blue[**You Try**]: Calculate the sample standard deviation for the following series:

`$$\{1, 3, 5, 7 \}$$`
]

--

```r
sd(c(1,3,5,7))
```

```
## [1] 2.581989
```

---

# Descriptive Statistics: Populations vs. Samples

.pull-left[
## Population parameters

- **Population size**: `\(N\)`

- **Mean**: `\(\mu\)`

- **Variance**: `\(\sigma^2=\frac{1}{N} \displaystyle\sum^N_{i=1} (x_i-\mu)^2\)`

- **Standard deviation**: `\(\sigma = \sqrt{\sigma^2}\)`
]

.pull-right[
## Sample statistics

- **Sample size**: `\(n\)`

- **Mean**: `\(\bar{x}\)`

- **Variance**: `\(s^2=\frac{1}{n-1} \displaystyle\sum^n_{i=1} (x_i-\bar{x})^2\)`

- **Standard deviation**: `\(s = \sqrt{s^2}\)`
]
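---

# Populations vs. Samples: Both Formulas in R

To see the two sets of formulas side by side, the sketch below applies both to the series from the earlier standard deviation example. Treating those five numbers as an entire population is an assumption made purely for illustration, and the object names are made up. Note that base R's `var()` and `sd()` use the sample `\((n-1)\)` formulas, so the population versions have to be computed by hand.

```r
x <- c(2, 4, 6, 8, 10) # the series from the earlier example

# sum of squared deviations from the mean: 40
sum_sq_devs <- sum((x - mean(x))^2)

# if x is a sample: divide by n-1 (this is what var() and sd() do)
s2 <- sum_sq_devs / (length(x) - 1) # 10
s  <- sqrt(s2)                      # same as sd(x), about 3.16

# if x were the entire population (assumed here): divide by N instead
sigma2 <- sum_sq_devs / length(x)   # 8
sigma  <- sqrt(sigma2)              # about 2.83
```

The sum of squared deviations is the same in both cases; only the divisor changes, so the gap between `s` and `sigma` shrinks as the number of observations grows.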