Data Wrangling Tips & Tricks

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Importing Data

The janitor package

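clean_names()

Standardizes column names: by default it converts them to lowercase snake_case, replacing spaces and punctuation with underscores.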
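
As a minimal sketch of what this looks like in practice, the snippet below applies janitor's clean_names() to a small made-up tibble. The data frame `messy` and its column names are hypothetical and only serve as an illustration; the janitor package is assumed to be installed.

```r
library(tidyverse)
library(janitor)

# Hypothetical data with awkward column names (spaces, capital letters)
messy <- tibble(
  `First Name` = c("Ada", "Grace"),
  `Total Sales` = c(10, 20)
)

# clean_names() returns the same data with standardized column names
cleaned <- clean_names(messy)

names(cleaned)
# expected names: "first_name" and "total_sales"
```

Because clean_names() takes a data frame and returns one, it can also sit directly in a pipe after an import step.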

Dealing with Missing Data

When calculating statistics (e.g. with summarize()), many calculations will return NA rather than a number if your data contains NAs.

Example: Calculating Mean

data_missing <- tribble(
  ~x, ~y,
  2, 3,
  1, 4,
  NA, 2,
  3, NA,
  7, 8
)

Now if we were to get the mean of x:

data_missing %>% 
  summarize(mean_x = mean(x))

## # A tibble: 1 × 1
##   mean_x
##    <dbl>
## 1     NA

The result is NA: because one value of x is missing, the mean of x is treated as unknown too.
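
The same behaviour is not specific to mean(). As a small sketch using a plain vector with the same values as the x column (the vector name `x_values` is just for illustration), other base R summaries propagate NA in the same way:

```r
# Same values as the x column in data_missing
x_values <- c(2, 1, NA, 3, 7)

NA + 1             # NA: arithmetic with a missing value is missing
sum(x_values)      # NA
max(x_values)      # NA

# is.na() flags missing entries, so summing it counts them
sum(is.na(x_values))   # 1
```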

One way to handle this is to leave the NA values out of the calculation. Most summary functions (like mean()) have an optional argument na.rm which, if set to TRUE, removes the NAs from the values being summarized before performing the calculation. Only the missing values themselves are dropped, not the whole row:

data_missing %>%
  summarize(mean_x = mean(x, na.rm = TRUE))

## # A tibble: 1 × 1
##   mean_x
##    <dbl>
## 1   3.25
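
The value 3.25 is (2 + 1 + 3 + 7) / 4, i.e. the mean of the four non-missing values of x. As a hedged sketch of how this might extend to several columns, the example below uses dplyr's across() to apply the same na.rm = TRUE calculation to both x and y, and counts the missing values in each. The output column names (mean_x, n_missing_x, and so on) are choices made for this illustration. For contrast, it also shows tidyr's drop_na(), which removes every row containing any NA and therefore gives a different mean for x.

```r
library(tidyverse)

data_missing <- tribble(
  ~x, ~y,
  2, 3,
  1, 4,
  NA, 2,
  3, NA,
  7, 8
)

# Column-wise: each mean ignores only that column's own NAs
result <- data_missing %>%
  summarize(
    across(c(x, y), ~ mean(.x, na.rm = TRUE), .names = "mean_{.col}"),
    n_missing_x = sum(is.na(x)),
    n_missing_y = sum(is.na(y))
  )

result
# mean_x is 3.25 and mean_y is 4.25; each column has one missing value

# Row-wise: drop_na() discards rows 3 and 4 entirely, leaving x = 2, 1, 7
complete_rows <- data_missing %>%
  drop_na() %>%
  summarize(mean_x = mean(x))

complete_rows
# mean_x is now 10 / 3, because the row with x = 3 was dropped along with its NA in y
```

Which approach is appropriate depends on the analysis: na.rm = TRUE keeps as much data as possible per column, while drop_na() restricts every calculation to the same set of complete rows.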

Last updated on Oct 19, 2021