library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Importing Data
The janitor package
clean_names()
- by default converts column names to lowercase snake_case
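As a minimal sketch of what janitor's clean_names() does, the example below builds a small tibble with untidy column names and cleans them. The tibble `messy` and its column names are made up for illustration and are not part of the original notes; the sketch also assumes the janitor package is installed.

```r
library(tibble)
library(janitor)

# Hypothetical data with spaces and capitals in the column names
messy <- tibble(
  `First Name` = c("Ada", "Grace"),
  `Total Score` = c(90, 95)
)

# clean_names() rewrites the column names; the values are left untouched
cleaned <- clean_names(messy)
names(cleaned)
## [1] "first_name"  "total_score"
```

Because clean_names() takes a data frame and returns a data frame, it can also be placed directly in a pipe right after the import step.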
Dealing with Missing Data
When calculating statistics (e.g. with summarize()), many calculations will return NA if your data contains NAs.
Example: Calculating Mean
data_missing <- tribble(
~x, ~y,
2, 3,
1, 4,
NA, 2,
3, NA,
7, 8
)
Now if we were to get the mean of x:
data_missing %>%
summarize(mean_x = mean(x))
## # A tibble: 1 × 1
## mean_x
## <dbl>
## 1 NA
It gives us NA.
One way to handle this is to ignore the NA values in the column being summarized. Most statistics functions (like mean()) have an optional argument na.rm which, if set to TRUE, removes the NAs before performing the calculation:
data_missing %>%
summarize(mean_x = mean(x, na.rm = TRUE))
## # A tibble: 1 × 1
## mean_x
## <dbl>
## 1 3.25
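The same idea extends to several columns at once. The sketch below is one possible approach, not something from the original notes: it uses dplyr's across() to apply mean() with na.rm = TRUE to both x and y. The name `means` and the `mean_{.col}` naming pattern are choices made for this illustration.

```r
library(dplyr)
library(tibble)

# Same example data as above
data_missing <- tribble(
  ~x, ~y,
  2, 3,
  1, 4,
  NA, 2,
  3, NA,
  7, 8
)

# Apply mean(..., na.rm = TRUE) to each listed column;
# .names controls how the output columns are labelled
means <- data_missing %>%
  summarize(across(c(x, y), ~ mean(.x, na.rm = TRUE), .names = "mean_{.col}"))

means
```

Note that na.rm works column by column: the row where y is NA still contributes its x value to mean_x, and the row where x is NA still contributes its y value to mean_y.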