2.1 — Data 101 & Descriptive Statistics — Class Content

Thursday, September 9, 2021

Problem Set 1 is due by class Tuesday September 14 via Blackboard Assignments.

Overview

Today we begin with a review and overview of using data and descriptive statistics. We want to quantify characteristics of samples as statistics, which we will later use to infer things about populations (between which we will later identify causal relationships).

Next class will be on random variables and distributions. This full week is your crash course/review of the basic statistics we will need to start the “meat and potatoes” of this class: linear regression, next Thursday. As such, I’ll give you a brief homework next week to review these statistical concepts (with minimal use of R!).

Readings

  • Appendix A: Math and Probability Background in Bailey

Now that we return to statistics, we will do a minimal overview of basic statistics and distributions. Review Bailey’s appendices. Today, only Appendix A is really useful; the rest will come next class.

Chapter 2 is optional, but will give you a good overview of using data.

Slides

Below, you can find the slides in two formats. Clicking the image will bring you to the HTML version of the slides in a new tab. Note that while going through the slides, you can type h to see a special list of viewing options, and type o for an outline view of all the slides.

The lower button will allow you to download a PDF version of the slides. I suggest printing the slides beforehand and using them to take additional notes in class (not everything is in the slides)!

2.1-slides

Download as PDF

R Practice

Answers from last class’s practice problems on base R are posted on that page. Today’s “practice problems” get you to practice the tools we are working with today. They are again not required, but will help you if you are interested.

Assignments

Problem Set 1

Problem Set 1 is posted and is due by class Tuesday September 14. Please see the instructions for more information on how to submit your assignment (there are multiple ways!).

Problem Set 2 (on classes 2.1-2.2) will be posted shortly, and will be due by Tuesday September 21.

Math Appendix

The Summation Operator

Many elementary propositions in econometrics (and statistics) involve sums of numbers. Mathematicians often use the summation operator (the Greek letter \(\Sigma\), “sigma”) as a shorthand, rather than writing everything out the long way. It will be worth your time to understand the summation operator, some of its properties, and how these can provide shortcuts to proving more advanced theorems in econometrics.

Let \(X\) be a random variable from which a sample of \(n\) observations is observed, so we have a sequence \(\{x_1, x_2,...,x_n\}\), i.e. \(x_i\) for \(i=1,2,...,n\). Then the total sum of the observations \((x_1+x_2+...+x_n)\) can be represented as:

\[\sum_{i=1}^n x_i = x_1+x_2+...+x_n\]

  • The term beneath \(\Sigma\) is known as the “index,” which tells us where to begin our adding (at the 1st individual \(x\) term, \(x_1\))
    • Note other letters, such as \(j\), or \(k\) may be used (especially if \(i\) is defined elsewhere)
  • The term above \(\Sigma\) is the total number of \(x\) terms we should add \((n)\)
  • Essentially, read \(\displaystyle \sum_{i=1}^{n} x_i\) as “add up all the individual \(x\) observations from the 1st \((x_1)\) to the final \(n\) \((x_n)\).”
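Since this class uses R, a quick illustration may help: R’s built-in sum() function is exactly the summation operator applied to a vector. The vector x below is made up purely for illustration.

```r
# A made-up sample of n = 5 observations of X (purely illustrative)
x <- c(2, 4, 6, 8, 10)

# sum() adds every element: x_1 + x_2 + ... + x_n
sum(x)              # 30
2 + 4 + 6 + 8 + 10  # 30, written out the long way
```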

Useful Properties of Summation Operators

Rule 1: The summation of a constant \(k\) times a random variable \(X_i\) is equal to the constant times the summation of that random variable:

\[\sum_{i=1}^n kX_i = k \sum^n_{i=1} X_i\]

Proof:

\[\begin{align*} \sum_{i=1}^n kX_i &= kX_1 + kX_2 +...+ kX_n\\ &= k(X_1+X_2+...+X_n)\\ &= k\sum_{i=1}^n X_i. \\ \end{align*}\]
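A quick numeric check of Rule 1 in R (an illustration with made-up numbers, not a proof):

```r
x <- c(2, 4, 6, 8, 10)  # made-up observations
k <- 3                  # an arbitrary constant

sum(k * x)  # 90
k * sum(x)  # 90: the constant factors out of the summation
```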

Rule 2: The summation of a sum of two random variables is equal to the sum of their summations:

\[\sum_{i=1}^n (X_i+Y_i) = \sum_{i=1}^n X_i + \sum_{i=1}^n Y_i\]

Proof:

\[\begin{align*} \sum_{i=1}^n (X_i+Y_i) &=(X_1+Y_1) + (X_2+Y_2) + ... + (X_n+Y_n)\\ &=(X_1+X_2+...+X_n) + (Y_1+Y_2+...+Y_n)\\ &=\sum_{i=1}^n X_i + \sum_{i=1}^n Y_i\\ \end{align*}\]
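Likewise, a numeric check of Rule 2 with made-up vectors:

```r
x <- c(2, 4, 6, 8, 10)  # made-up observations of X
y <- c(1, 3, 2, 5, 4)   # made-up observations of Y

sum(x + y)       # 45
sum(x) + sum(y)  # 45: the summation splits across the sum
```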

Rule 3: The summation of a constant over \(n\) observations is the product of the constant and \(n\):

\[\sum_{i=1}^n k = nk\]

Proof:

\[\sum_{i=1}^n k = \underbrace{k + k + ... + k}_{n \text{ times}} = nk\]
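In R, summing a constant over \(n\) observations amounts to summing a vector of \(n\) copies of it:

```r
n <- 5
k <- 3

sum(rep(k, n))  # 15: k added to itself n times
n * k           # 15
```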

Combining these 3 rules: the sum of a linear function of a random variable \((a+bX_i)\) is:

\[\sum_{i=1}^n (a+bX_i) = na+b\sum_{i=1}^n X_i\]

Proof: left to you as an exercise!
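While the proof is yours to work out, a numeric sanity check in R (with made-up values of a, b, and x) can at least confirm the formula:

```r
x <- c(2, 4, 6, 8, 10)  # made-up observations
n <- length(x)
a <- 2                  # arbitrary constants
b <- 3

sum(a + b * x)      # 100
n * a + b * sum(x)  # 100: matches the formula above
```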

Advanced: Useful Properties for Regression

There are some additional properties of summations that may not be immediately obvious, but will be quite essential in proving properties of linear regressions.

Using the properties above, we can describe the mean, variance, and covariance of random variables.1

First, define the means of the sequences \(\{X_i: i=1,...,n\}\) and \(\{Y_i: i=1,...,n\}\) as:

\[\bar{X}=\frac{1}{n}\sum^n_{i=1}X_i \quad \text{and} \quad \bar{Y}=\frac{1}{n}\sum^n_{i=1}Y_i\]

Second, the variance of \(X\) is:

\[var(X)=\frac{1}{n}\sum^n_{i=1}(X_i-\bar{X})^2\]

Third, the covariance of \(X\) and \(Y\) is:

\[cov(X,Y)=\frac{1}{n}\sum^n_{i=1}(X_i-\bar{X})(Y_i-\bar{Y})\]
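These three definitions translate directly into R. One caution: the formulas above divide by \(n\), while R’s built-in var() and cov() divide by \(n-1\) (the sample versions), so the built-ins will not match them exactly. A minimal sketch with made-up data:

```r
x <- c(2, 4, 6, 8, 10)  # made-up observations of X
y <- c(1, 3, 2, 5, 4)   # made-up observations of Y
n <- length(x)

x_bar  <- sum(x) / n                            # mean of X: 6
var_x  <- sum((x - x_bar)^2) / n                # variance, dividing by n: 8
cov_xy <- sum((x - x_bar) * (y - mean(y))) / n  # covariance, dividing by n: 3.2

mean(x)    # 6: matches x_bar
var(x)     # 10: differs, since R divides by (n - 1)
cov(x, y)  # 4: likewise divides by (n - 1)
```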

Rule 4: The sum of the deviations of observations of \(X_i\) from its mean (\(\bar{X}\)) is 0:

\[\sum^n_{i=1} (X_i-\bar{X})=0\]

Proof:

\[\begin{align*} \sum^n_{i=1} (X_i-\bar{X}) &=\sum^n_{i=1}X_i-\sum^n_{i=1}\bar{X} && \\ &=\sum^n_{i=1}X_i-n\bar{X} && \text{Since $\bar{X}$ is a constant (rule 3)}\\ &=n\underbrace{\frac{\displaystyle\sum^n_{i=1}X_i}{n}}_{\bar{X}}-n\bar{X} && \text{Multiply the first term by }\frac{n}{n}=1\\ &=n\bar{X}-n\bar{X}&& \text{By the definition of the mean }\bar{X}\\ &=0 && \\ \end{align*}\]
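Rule 4 is easy to verify numerically: deviations from the mean always cancel out (up to floating-point rounding in R):

```r
x <- c(2, 4, 6, 8, 10)  # made-up observations
sum(x - mean(x))        # 0 (possibly ~1e-15 from floating-point error)
```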

Rule 5: The sum of squared deviations of \(X\) is equal to the sum of the products of \(X_i\) and its deviations:

\[\sum^n_{i=1} (X_i-\bar{X})^2=\sum^n_{i=1} X_i(X_i-\bar{X})\]

Proof:

\[\begin{align*} \sum^n_{i=1} (X_i-\bar{X})^2&=\sum^n_{i=1} (X_i-\bar{X})(X_i-\bar{X}) && \text{Expanding the square}\\ &=\sum^n_{i=1} X_i(X_i-\bar{X})-\sum^n_{i=1}\bar{X}(X_i-\bar{X}) && \text{Breaking apart the first term} \\ &=\sum^n_{i=1} X_i(X_i-\bar{X})-\bar{X}\sum^n_{i=1}(X_i-\bar{X}) && \text{Since }\bar{X} \text{ is a constant, not depending on } i\\ &=\sum^n_{i=1} X_i(X_i-\bar{X})-\bar{X}(0) && \text{From rule 4} \\ &=\sum^n_{i=1} X_i(X_i-\bar{X}) && \text{Remainder after multiplying by 0}\\ \end{align*}\]
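Again, a quick numeric check in R (illustration, not proof):

```r
x <- c(2, 4, 6, 8, 10)  # made-up observations
d <- x - mean(x)        # deviations from the mean

sum(d^2)    # 40
sum(x * d)  # 40: identical, as rule 5 claims
```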

Rule 6: The following summations involving \(X\) and \(Y\) are equivalent:

\[\sum^n_{i=1} Y_i(X_i-\bar{X})=\sum^n_{i=1}X_i(Y_i-\bar{Y})=\sum^n_{i=1} (X_i-\bar{X})(Y_i-\bar{Y})\]

Proof:

\[\begin{align*} \sum^n_{i=1} (X_i-\bar{X})(Y_i-\bar{Y}) &=\sum^n_{i=1} Y_i(X_i-\bar{X})-\sum^n_{i=1}\bar{Y}(X_i-\bar{X}) && \text{Breaking apart the second term} \\ &=\sum^n_{i=1} Y_i(X_i-\bar{X})-\bar{Y}(0) && \text{From rule 4}\\ &=\sum^n_{i=1} Y_i(X_i-\bar{X}) && \text{Remainder after multiplying by 0}\\ \end{align*}\]

equivalently:

\[\begin{align*} \sum^n_{i=1} (X_i-\bar{X})(Y_i-\bar{Y}) &=\sum^n_{i=1} X_i(Y_i-\bar{Y})-\sum^n_{i=1}\bar{X}(Y_i-\bar{Y}) && \text{Breaking apart the first term} \\ &=\sum^n_{i=1} X_i(Y_i-\bar{Y})-\bar{X}(0) && \text{From rule 4}\\ &=\sum^n_{i=1} X_i(Y_i-\bar{Y}) && \text{Remainder after multiplying by 0}\\ \end{align*}\]
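One last numeric check, of all three forms in Rule 6, with made-up vectors:

```r
x <- c(2, 4, 6, 8, 10)  # made-up observations of X
y <- c(1, 3, 2, 5, 4)   # made-up observations of Y
dx <- x - mean(x)       # deviations of X
dy <- y - mean(y)       # deviations of Y

sum(y * dx)   # 16
sum(x * dy)   # 16
sum(dx * dy)  # 16: all three forms agree
```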


  1. For more beyond the mere definition, see the appendix on Covariance and Correlation.
