NA
") 😡Ideally, you'd want to be able to do all of this in one program
"Yet far too much handcrafted work - what data scientists call "data wrangling," "data munging," and "data janitor work" - is still required. Data scientists, according to interviews and expert estimates, spend from 50 to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."
"The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
suppressPackageStartupMessages()
and insert library()
command inside# install for first time# install.packages("tidyverse") # this takes a few minutes and may give several prompts# load tidyverse suppressPackageStartupMessages(library("tidyverse"))
tidyverse
contains a lot of packages, not all are loaded automaticallytidyverse_packages()
## [1] "broom" "cli" "crayon" "dbplyr" ## [5] "dplyr" "dtplyr" "forcats" "googledrive" ## [9] "googlesheets4" "ggplot2" "haven" "hms" ## [13] "httr" "jsonlite" "lubridate" "magrittr" ## [17] "modelr" "pillar" "purrr" "readr" ## [21] "readxl" "reprex" "rlang" "rstudioapi" ## [25] "rvest" "stringr" "tibble" "tidyr" ## [29] "xml2" "tidyverse"
tibble
for friendlier dataframesmagrittr
for "pipeable" codereadr
for importing datadplyr
for data wranglingtidyr
for tidying dataggplot2
for plotting data (we've already covered)broom
for tidy regression (not part of core tidyverse)forcats
for working with factorsstringr
for working with stringslubridate
for working with dates and timespurrr
for iterationtibble
converts all data.frames
into a friendlier version called tibbles
(or tbl_df
)diamonds
## # A tibble: 53,940 × 7## carat cut color clarity depth table price## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int>## 1 0.23 Ideal E SI2 61.5 55 326## 2 0.21 Premium E SI1 59.8 61 326## 3 0.23 Good E VS1 56.9 65 327## 4 0.29 Premium I VS2 62.4 58 334## 5 0.31 Good J SI2 63.3 58 335## 6 0.24 Very Good J VVS2 62.8 57 336## 7 0.24 Very Good I VVS1 62.3 57 336## 8 0.26 Very Good H SI1 61.9 55 337## 9 0.22 Fair E VS2 65.1 61 337## 10 0.23 Very Good H VS1 59.4 61 338## # … with 53,930 more rows
Prints much nicer output
Shows a bit of the str
ucture:
nrow() x ncol()
<dbl>
is numeric ("double")<ord>
is an ordered factor<int>
is an integerFundamental grammar of tidyverse:
tibble
from a data.frame
with as_tibble()
as_tibble(mpg) # take built-in dataframe mpg
tibble
from scratch with tibble()
, works like data.frame()
example<-tibble(x = seq(2,6,2), # sequence from 2 to 6 by 2's y = rnorm(3,0,1), # 3 random draws with mean 0, sd 1 colors = c("orange", "green", "blue"))example
## # A tibble: 3 × 3## x y colors## <dbl> <dbl> <chr> ## 1 2 -0.0747 orange## 2 4 -1.01 green ## 3 6 -1.05 blue
tibble
row-by-row with tribble()
example_2<-tribble( ~x, ~y, ~color, # each variable name starts with ~ 2, 1.5, "orange", 4, 0.2, "green", 6, 0.8, "blue") # last element has no commaexample_2
## # A tibble: 3 × 3## x y color ## <dbl> <dbl> <chr> ## 1 2 1.5 orange## 2 4 0.2 green ## 3 6 0.8 blue
The magrittr
package allows us to use the "pipe" operator (%>%
)†
%>%
"pipes" the output of the left of the pipe into the (1st) argument of the right
Running a function f
on object x
as f(x)
becomes x %>% f
in pipeable form
x
and then run function f
on it"† Keyboard shortcuts in R Studio: CTRL+Shift+M
(Windows) or Cmd+Shift+M
(Mac)
With ordinary math functions, read from outside ← (inside): g(f(x))
x
and perform function f()
on x
and then take that result and perform function g()
on itWith pipes, read operations from left → right:
x %>% f %>% g
take x
and then perform function f
on it, then perform function g
on that result
%>%
mentally as "and then"Example
ln(exp(x))
x %>% exp() %>% ln()
Example
park(drive(start_car(find("keys")), to = "campus"))
find("keys") %>% start_car() %>% drive(to = "campus") %>% park()
# look at top 6 rowshead(gapminder) # use pipe insteadgapminder %>% head()
## # A tibble: 6 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Afghanistan Asia 1952 28.8 8425333 779.## 2 Afghanistan Asia 1957 30.3 9240934 821.## 3 Afghanistan Asia 1962 32.0 10267083 853.## 4 Afghanistan Asia 1967 34.0 11537966 836.## 5 Afghanistan Asia 1972 36.1 13079460 740.## 6 Afghanistan Asia 1977 38.4 14880372 786.
These two methods produce the same output (average hightway mpg of Audi cars)
Without the pipe
summarise(group_by(filter(mpg, manufacturer=="audi"), model), hwy_mean = mean(hwy))
These two methods produce the same output (average hightway mpg of Audi cars)
Without the pipe
summarise(group_by(filter(mpg, manufacturer=="audi"), model), hwy_mean = mean(hwy))
mpg %>% filter(manufacturer=="audi") %>% group_by(model) %>% summarise(hwy_mean = mean(hwy))
## # A tibble: 3 × 2## model hwy_mean## <chr> <dbl>## 1 a4 28.3## 2 a4 quattro 25.8## 3 a6 quattro 24
readr
helps load common spreadsheet files (.csv
, .tsv
) with simple commands:read_*(path/to/my_data.*)
*
can be .csv
or .tsv
Often this is enough, but many more customizations possible
You can also export your data from R into a common spreadsheet file with:
write_*(my_df, path = path/to/file_name.*)
my_df
is the name of your tibble
, and file_name
is the name of the file you want to save asRead more on the tidyverse website and the Readr Cheatsheet
readxl
has equivalent commands for Excel data types:
read_*("path/to/my/data.*")
write_*(my_dataframe, path=path/to/file_name.*)
*
can be .xls
or .xlsx
haven
has equivalent commands for other data types:
read_*("path/to/my_data.dta")
for STATA .dta
fileswrite_*(my_dataframe, path=path/to/file_name.*)
*
can be .dta
(STATA), .sav
(SPSS), .sas7bdat
(SAS)Most common: "where the hell is my data file"??
Recall R
looks for files to read_*()
in the default working directory (check what it is with getwd()
, change it with setwd()
)
You can tell R
where this data is by making the path
a part of the file's name when importing
..
to "move up one folder"/
to "enter a folder"Either use an absolute path on your computer:
# Exampledf <- read_csv("C:/Documents and Settings/Ryan Safner/Downloads/my_data.csv")
Most common: "where the hell is my data file"??
Recall R
looks for files to read_*()
in the default working directory (check what it is with getwd()
, change it with setwd()
)
You can tell R
where this data is by making the path
a part of the file's name when importing
..
to "move up one folder"/
to "enter a folder"# Example# If working directory is Documents, but data is in Downloads, like so: # # Ryan Safner/# |# |- Documents/# |- Downloads/# |- Photos/# |- Videos/df <- read_csv("../Downloads/my_data.csv")
Suggestion to make your data import easier: Download and move files to R's working directory
Your computer and working directory are different from mine (and others)
This is not a reproducible workflow!
We'll finally fix this next class with R Projects
dplyr
uses more efficient & intuitive commands to manipulate tibblesBase R
grammar passively runs functions on nouns: function(object)
dplyr
grammar actively uses verbs: verb(df, conditions)
†
Three great features:
%>%
pipe operatortibble
df %>% verb(conditions)
dplyr
verbsVerb | Does |
---|---|
filter() |
Keep only selected observations |
select() |
Keep only selected variables |
arrange() |
Reorder rows (e.g. in numerical order) |
mutate() |
Create new variables |
summarize() |
Collapse data into summary statistics |
group_by() |
Perform any of the above functions by groups/categories |
filter
keeps only selected observations (rows)filter
keeps only selected observations (rows)# look only at African observations# syntax without the pipefilter(gapminder, continent=="Africa")
# using the pipegapminder %>% filter(continent == "Africa")
filter
keeps only selected observations (rows)# look only at African observations# syntax without the pipefilter(gapminder, continent=="Africa")
# using the pipegapminder %>% filter(continent == "Africa")
## # A tibble: 624 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Algeria Africa 1952 43.1 9279525 2449.## 2 Algeria Africa 1957 45.7 10270856 3014.## 3 Algeria Africa 1962 48.3 11000948 2551.## 4 Algeria Africa 1967 51.4 12760499 3247.## 5 Algeria Africa 1972 54.5 14760787 4183.## 6 Algeria Africa 1977 58.0 17152804 4910.## 7 Algeria Africa 1982 61.4 20033753 5745.## 8 Algeria Africa 1987 65.8 23254956 5681.## 9 Algeria Africa 1992 67.7 26298373 5023.## 10 Algeria Africa 1997 69.2 29072015 4797.## # … with 614 more rows
dplyr
functions never modify their inputs (i.e. never overwrite the original tibble
)<-
to assign it to a new tibble
tibble
by namedplyr
functions never modify their inputs (i.e. never overwrite the original tibble
)<-
to assign it to a new tibble
tibble
by name# base syntaxafrica <- filter(gapminder, continent=="Africa")
# using the pipeafrica <- gapminder %>% filter(continent == "Africa")
# look at new tibbleafrica
## # A tibble: 624 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Algeria Africa 1952 43.1 9279525 2449.## 2 Algeria Africa 1957 45.7 10270856 3014.## 3 Algeria Africa 1962 48.3 11000948 2551.## 4 Algeria Africa 1967 51.4 12760499 3247.## 5 Algeria Africa 1972 54.5 14760787 4183.## 6 Algeria Africa 1977 58.0 17152804 4910.## 7 Algeria Africa 1982 61.4 20033753 5745.## 8 Algeria Africa 1987 65.8 23254956 5681.## 9 Algeria Africa 1992 67.7 26298373 5023.## 10 Algeria Africa 1997 69.2 29072015 4797.## # … with 614 more rows
(africa <- gapminder %>% filter(continent == "Africa"))
## # A tibble: 624 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Algeria Africa 1952 43.1 9279525 2449.## 2 Algeria Africa 1957 45.7 10270856 3014.## 3 Algeria Africa 1962 48.3 11000948 2551.## 4 Algeria Africa 1967 51.4 12760499 3247.## 5 Algeria Africa 1972 54.5 14760787 4183.## 6 Algeria Africa 1977 58.0 17152804 4910.## 7 Algeria Africa 1982 61.4 20033753 5745.## 8 Algeria Africa 1987 65.8 23254956 5681.## 9 Algeria Africa 1992 67.7 26298373 5023.## 10 Algeria Africa 1997 69.2 29072015 4797.## # … with 614 more rows
tibble
, it would overwrite the original!# base syntaxgapminder <- filter(gapminder, continent=="Africa")
# using the pipegapminder <- gapminder %>% filter(continent == "Africa")# this overwrites gapminder!
TRUE
†>
, <
: greater than, less than>=
, <=
: greater than or equal to, less than or equal to==
‡, !=
: is equal to‡, is not equal to%in%
: is a member of some defined set (∈)&
: AND (commas also work instead)|
: OR!
: not† See ?Comparison
and ?Base::Logic
.
‡ Recall one =
assigns values to an object, two ==
tests an object for a condition!
# look only at African observations# in 1997gapminder %>% filter(continent == "Africa", year == 1997)
# look only at African observations# in 1997gapminder %>% filter(continent == "Africa", year == 1997)
## # A tibble: 52 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Algeria Africa 1997 69.2 29072015 4797.## 2 Angola Africa 1997 41.0 9875024 2277.## 3 Benin Africa 1997 54.8 6066080 1233.## 4 Botswana Africa 1997 52.6 1536536 8647.## 5 Burkina Faso Africa 1997 50.3 10352843 946.## 6 Burundi Africa 1997 45.3 6121610 463.## 7 Cameroon Africa 1997 52.2 14195809 1694.## 8 Central African Republic Africa 1997 46.1 3696513 741.## 9 Chad Africa 1997 51.6 7562011 1005.## 10 Comoros Africa 1997 60.7 527982 1174.## # … with 42 more rows
# look only at African observations # or observations in 1997gapminder %>% filter(continent == "Africa" | year == 1997)
# look only at African observations # or observations in 1997gapminder %>% filter(continent == "Africa" | year == 1997)
## # A tibble: 714 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Afghanistan Asia 1997 41.8 22227415 635.## 2 Albania Europe 1997 73.0 3428038 3193.## 3 Algeria Africa 1952 43.1 9279525 2449.## 4 Algeria Africa 1957 45.7 10270856 3014.## 5 Algeria Africa 1962 48.3 11000948 2551.## 6 Algeria Africa 1967 51.4 12760499 3247.## 7 Algeria Africa 1972 54.5 14760787 4183.## 8 Algeria Africa 1977 58.0 17152804 4910.## 9 Algeria Africa 1982 61.4 20033753 5745.## 10 Algeria Africa 1987 65.8 23254956 5681.## # … with 704 more rows
# look only at U.S. and U.K. # observations in 2002gapminder %>% filter(country %in% c("United States", "United Kingdom"), year == 2002)
# look only at U.S. and U.K. # observations in 2002gapminder %>% filter(country %in% c("United States", "United Kingdom"), year == 2002)
## # A tibble: 2 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 United Kingdom Europe 2002 78.5 59912431 29479.## 2 United States Americas 2002 77.3 287675526 39097.
arrange
reorders observations (rows) in a logical orderarrange
reorders observations (rows) in a logical order# order by smallest to largest pop# syntax without the pipearrange(gapminder, pop)
# using the pipegapminder %>% arrange(pop)
arrange
reorders observations (rows) in a logical order# order by smallest to largest pop# syntax without the pipearrange(gapminder, pop)
# using the pipegapminder %>% arrange(pop)
## # A tibble: 1,704 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Sao Tome and Principe Africa 1952 46.5 60011 880.## 2 Sao Tome and Principe Africa 1957 48.9 61325 861.## 3 Djibouti Africa 1952 34.8 63149 2670.## 4 Sao Tome and Principe Africa 1962 51.9 65345 1072.## 5 Sao Tome and Principe Africa 1967 54.4 70787 1385.## 6 Djibouti Africa 1957 37.3 71851 2865.## 7 Sao Tome and Principe Africa 1972 56.5 76595 1533.## 8 Sao Tome and Principe Africa 1977 58.6 86796 1738.## 9 Djibouti Africa 1962 39.7 89898 3021.## 10 Sao Tome and Principe Africa 1982 60.4 98593 1890.## # … with 1,694 more rows
# order by year, with the smallest # to largest pop in each year# syntax without the pipearrange(gapminder, year, pop)
# using the pipegapminder %>% arrange(year, pop)
# order by year, with the smallest # to largest pop in each year# syntax without the pipearrange(gapminder, year, pop)
# using the pipegapminder %>% arrange(year, pop)
## # A tibble: 1,704 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Sao Tome and Principe Africa 1952 46.5 60011 880.## 2 Djibouti Africa 1952 34.8 63149 2670.## 3 Bahrain Asia 1952 50.9 120447 9867.## 4 Iceland Europe 1952 72.5 147962 7268.## 5 Comoros Africa 1952 40.7 153936 1103.## 6 Kuwait Asia 1952 55.6 160000 108382.## 7 Equatorial Guinea Africa 1952 34.5 216964 376.## 8 Reunion Africa 1952 52.7 257700 2719.## 9 Gambia Africa 1952 30 284320 485.## 10 Swaziland Africa 1952 41.4 290243 1148.## # … with 1,694 more rows
desc()
to re-order in the opposite directiondesc()
to re-order in the opposite direction# order by largest to smallest pop# syntax without the pipearrange(gapminder, desc(pop))
# using the pipegapminder %>% arrange(desc(pop))
desc()
to re-order in the opposite direction# order by largest to smallest pop# syntax without the pipearrange(gapminder, desc(pop))
# using the pipegapminder %>% arrange(desc(pop))
## # A tibble: 1,704 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 China Asia 2007 73.0 1318683096 4959.## 2 China Asia 2002 72.0 1280400000 3119.## 3 China Asia 1997 70.4 1230075000 2289.## 4 China Asia 1992 68.7 1164970000 1656.## 5 India Asia 2007 64.7 1110396331 2452.## 6 China Asia 1987 67.3 1084035000 1379.## 7 India Asia 2002 62.9 1034172547 1747.## 8 China Asia 1982 65.5 1000281000 962.## 9 India Asia 1997 61.8 959000000 1459.## 10 China Asia 1977 64.0 943455000 741.## # … with 1,694 more rows
select
keeps only selected variables (columns)select
keeps only selected variables (columns)# keep only country, year, # and population variables# syntax without the pipeselect(gapminder, country, year, pop)
# using the pipegapminder %>% select(country, year, pop)
select
keeps only selected variables (columns)# keep only country, year, # and population variables# syntax without the pipeselect(gapminder, country, year, pop)
# using the pipegapminder %>% select(country, year, pop)
## # A tibble: 1,704 × 3## country year pop## <fct> <int> <int>## 1 Afghanistan 1952 8425333## 2 Afghanistan 1957 9240934## 3 Afghanistan 1962 10267083## 4 Afghanistan 1967 11537966## 5 Afghanistan 1972 13079460## 6 Afghanistan 1977 14880372## 7 Afghanistan 1982 12881816## 8 Afghanistan 1987 13867957## 9 Afghanistan 1992 16317921## 10 Afghanistan 1997 22227415## # … with 1,694 more rows
select
"all except" by negating a variable with -
select
"all except" by negating a variable with -
# keep all *except* gdpPercap# syntax without the pipeselect(gapminder, -gdpPercap)
# using the pipegapminder %>% select(-gdpPercap)
select
"all except" by negating a variable with -
# keep all *except* gdpPercap# syntax without the pipeselect(gapminder, -gdpPercap)
# using the pipegapminder %>% select(-gdpPercap)
## # A tibble: 1,704 × 5## country continent year lifeExp pop## <fct> <fct> <int> <dbl> <int>## 1 Afghanistan Asia 1952 28.8 8425333## 2 Afghanistan Asia 1957 30.3 9240934## 3 Afghanistan Asia 1962 32.0 10267083## 4 Afghanistan Asia 1967 34.0 11537966## 5 Afghanistan Asia 1972 36.1 13079460## 6 Afghanistan Asia 1977 38.4 14880372## 7 Afghanistan Asia 1982 39.9 12881816## 8 Afghanistan Asia 1987 40.8 13867957## 9 Afghanistan Asia 1992 41.7 16317921## 10 Afghanistan Asia 1997 41.8 22227415## # … with 1,694 more rows
select
reorders the columns in the order you provideeverything()
at the endselect
reorders the columns in the order you provideeverything()
at the end# keep all and move pop first# syntax without the pipeselect(gapminder, pop, everything())
# using the pipegapminder %>% select(pop, everything())
select
reorders the columns in the order you provideeverything()
at the end# keep all and move pop first# syntax without the pipeselect(gapminder, pop, everything())
# using the pipegapminder %>% select(pop, everything())
## # A tibble: 1,704 × 6## pop country continent year lifeExp gdpPercap## <int> <fct> <fct> <int> <dbl> <dbl>## 1 8425333 Afghanistan Asia 1952 28.8 779.## 2 9240934 Afghanistan Asia 1957 30.3 821.## 3 10267083 Afghanistan Asia 1962 32.0 853.## 4 11537966 Afghanistan Asia 1967 34.0 836.## 5 13079460 Afghanistan Asia 1972 36.1 740.## 6 14880372 Afghanistan Asia 1977 38.4 786.## 7 12881816 Afghanistan Asia 1982 39.9 978.## 8 13867957 Afghanistan Asia 1987 40.8 852.## 9 16317921 Afghanistan Asia 1992 41.7 649.## 10 22227415 Afghanistan Asia 1997 41.8 635.## # … with 1,694 more rows
select
has a lot of helper functions, useful for when you have hundreds of variables?select()
for a listselect
has a lot of helper functions, useful for when you have hundreds of variables?select()
for a list# keep all variables starting with "co"gapminder %>% select(starts_with("co"))
## # A tibble: 1,704 × 2## country continent## <fct> <fct> ## 1 Afghanistan Asia ## 2 Afghanistan Asia ## 3 Afghanistan Asia ## 4 Afghanistan Asia ## 5 Afghanistan Asia ## 6 Afghanistan Asia ## 7 Afghanistan Asia ## 8 Afghanistan Asia ## 9 Afghanistan Asia ## 10 Afghanistan Asia ## # … with 1,694 more rows
select
has a lot of helper functions, useful for when you have hundreds of variables?select()
for a list# keep all variables starting with "co"gapminder %>% select(starts_with("co"))
## # A tibble: 1,704 × 2## country continent## <fct> <fct> ## 1 Afghanistan Asia ## 2 Afghanistan Asia ## 3 Afghanistan Asia ## 4 Afghanistan Asia ## 5 Afghanistan Asia ## 6 Afghanistan Asia ## 7 Afghanistan Asia ## 8 Afghanistan Asia ## 9 Afghanistan Asia ## 10 Afghanistan Asia ## # … with 1,694 more rows
# keep country and all variables # containing "per"gapminder %>% select(country, contains("per"))
## # A tibble: 1,704 × 2## country gdpPercap## <fct> <dbl>## 1 Afghanistan 779.## 2 Afghanistan 821.## 3 Afghanistan 853.## 4 Afghanistan 836.## 5 Afghanistan 740.## 6 Afghanistan 786.## 7 Afghanistan 978.## 8 Afghanistan 852.## 9 Afghanistan 649.## 10 Afghanistan 635.## # … with 1,694 more rows
rename
changes the name of a variable (column)new_name = old_name
rename
changes the name of a variable (column)new_name = old_name
# rename gdpPercap to GDP# syntax without the piperename(gapminder, GDP = gdpPercap)
# using the pipegapminder %>% rename(GDP = gdpPercap)
rename
changes the name of a variable (column)new_name = old_name
# rename gdpPercap to GDP# syntax without the piperename(gapminder, GDP = gdpPercap)
# using the pipegapminder %>% rename(GDP = gdpPercap)
## # A tibble: 1,704 × 6## country continent year lifeExp pop GDP## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Afghanistan Asia 1952 28.8 8425333 779.## 2 Afghanistan Asia 1957 30.3 9240934 821.## 3 Afghanistan Asia 1962 32.0 10267083 853.## 4 Afghanistan Asia 1967 34.0 11537966 836.## 5 Afghanistan Asia 1972 36.1 13079460 740.## 6 Afghanistan Asia 1977 38.4 14880372 786.## 7 Afghanistan Asia 1982 39.9 12881816 978.## 8 Afghanistan Asia 1987 40.8 13867957 852.## 9 Afghanistan Asia 1992 41.7 16317921 649.## 10 Afghanistan Asia 1997 41.8 22227415 635.## # … with 1,694 more rows
mutate
creates a new variable (column)new_variable_name = operation
# create variable "europe" if country # is in Europe# syntax without the pipemutate(gapminder, europe = ifelse(continent == "Europe", yes = "In Europe", no = "Not in Europe"))
# using the pipegapminder %>% mutate(europe = ifelse(continent == "Europe", yes = "In Europe", no = "Not in Europe"))
# create variable "europe" if country # is in Europe# syntax without the pipemutate(gapminder, europe = ifelse(continent == "Europe", yes = "In Europe", no = "Not in Europe"))
# using the pipegapminder %>% mutate(europe = ifelse(continent == "Europe", yes = "In Europe", no = "Not in Europe"))
## # A tibble: 1,704 × 4## country continent year europe ## <fct> <fct> <int> <chr> ## 1 Afghanistan Asia 1952 Not in Europe## 2 Afghanistan Asia 1957 Not in Europe## 3 Afghanistan Asia 1962 Not in Europe## 4 Afghanistan Asia 1967 Not in Europe## 5 Afghanistan Asia 1972 Not in Europe## 6 Afghanistan Asia 1977 Not in Europe## 7 Afghanistan Asia 1982 Not in Europe## 8 Afghanistan Asia 1987 Not in Europe## 9 Afghanistan Asia 1992 Not in Europe## 10 Afghanistan Asia 1997 Not in Europe## # … with 1,694 more rows
# create population in millions # syntax without the pipemutate(gapminder, pop_mil = pop / 1000000)
# using the pipegapminder %>% rename(pop_mil = pop / 1000000)
# create population in millions # syntax without the pipemutate(gapminder, pop_mil = pop / 1000000)
# using the pipegapminder %>% rename(pop_mil = pop / 1000000)
## # A tibble: 1,704 × 6## country continent year lifeExp pop pop_mil## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Afghanistan Asia 1952 28.8 8425333 8.43## 2 Afghanistan Asia 1957 30.3 9240934 9.24## 3 Afghanistan Asia 1962 32.0 10267083 10.3 ## 4 Afghanistan Asia 1967 34.0 11537966 11.5 ## 5 Afghanistan Asia 1972 36.1 13079460 13.1 ## 6 Afghanistan Asia 1977 38.4 14880372 14.9 ## 7 Afghanistan Asia 1982 39.9 12881816 12.9 ## 8 Afghanistan Asia 1987 40.8 13867957 13.9 ## 9 Afghanistan Asia 1992 41.7 16317921 16.3 ## 10 Afghanistan Asia 1997 41.8 22227415 22.2 ## # … with 1,694 more rows
# create GDP variable from gdpPercap # and pop, in billions# syntax without the pipemutate(gapminder, GDP = ((gdpPercap * pop)/1000000000))
# using the pipegapminder %>% mutate(GDP = ((gdpPercap * pop)/1000000000))
# create GDP variable from gdpPercap # and pop, in billions# syntax without the pipemutate(gapminder, GDP = ((gdpPercap * pop)/1000000000))
# using the pipegapminder %>% mutate(GDP = ((gdpPercap * pop)/1000000000))
## # A tibble: 1,704 × 6## country continent year pop gdpPercap GDP## <fct> <fct> <int> <int> <dbl> <dbl>## 1 Afghanistan Asia 1952 8425333 779. 6.57## 2 Afghanistan Asia 1957 9240934 821. 7.59## 3 Afghanistan Asia 1962 10267083 853. 8.76## 4 Afghanistan Asia 1967 11537966 836. 9.65## 5 Afghanistan Asia 1972 13079460 740. 9.68## 6 Afghanistan Asia 1977 14880372 786. 11.7 ## 7 Afghanistan Asia 1982 12881816 978. 12.6 ## 8 Afghanistan Asia 1987 13867957 852. 11.8 ## 9 Afghanistan Asia 1992 16317921 649. 10.6 ## 10 Afghanistan Asia 1997 22227415 635. 14.1 ## # … with 1,694 more rows
class
of a variable inside mutate()
with as.*()
gapminder %>% head(., 2)
## # A tibble: 2 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Afghanistan Asia 1952 28.8 8425333 779.## 2 Afghanistan Asia 1957 30.3 9240934 821.
# change year from an integer to a factorgapminder %>% mutate(year = as.factor(year))
## # A tibble: 1,704 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <fct> <dbl> <int> <dbl>## 1 Afghanistan Asia 1952 28.8 8425333 779.## 2 Afghanistan Asia 1957 30.3 9240934 821.## 3 Afghanistan Asia 1962 32.0 10267083 853.## 4 Afghanistan Asia 1967 34.0 11537966 836.## 5 Afghanistan Asia 1972 36.1 13079460 740.## 6 Afghanistan Asia 1977 38.4 14880372 786.## 7 Afghanistan Asia 1982 39.9 12881816 978.## 8 Afghanistan Asia 1987 40.8 13867957 852.## 9 Afghanistan Asia 1992 41.7 16317921 649.## 10 Afghanistan Asia 1997 41.8 22227415 635.## # … with 1,694 more rows
gapminder %>% mutate(GDP = gdpPercap * pop, pop_millions = pop / 1000000)
## # A tibble: 1,704 × 8## country continent year lifeExp pop gdpPercap GDP pop_millions## <fct> <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>## 1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330. 8.43## 2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670. 9.24## 3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797. 10.3 ## 4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150. 11.5 ## 5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274. 13.1 ## 6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231. 14.9 ## 7 Afghanistan Asia 1982 39.9 12881816 978. 12598563401. 12.9 ## 8 Afghanistan Asia 1987 40.8 13867957 852. 11820990309. 13.9 ## 9 Afghanistan Asia 1992 41.7 16317921 649. 10595901589. 16.3 ## 10 Afghanistan Asia 1997 41.8 22227415 635. 14121995875. 22.2 ## # … with 1,694 more rows
transmute
keeps only newly created variables (select
s only the new mutate
d variables)gapminder %>% transmute(GDP = gdpPercap * pop, pop_millions = pop / 1000000)
## # A tibble: 1,704 × 2## GDP pop_millions## <dbl> <dbl>## 1 6567086330. 8.43## 2 7585448670. 9.24## 3 8758855797. 10.3 ## 4 9648014150. 11.5 ## 5 9678553274. 13.1 ## 6 11697659231. 14.9 ## 7 12598563401. 12.9 ## 8 11820990309. 13.9 ## 9 10595901589. 16.3 ## 10 14121995875. 22.2 ## # … with 1,694 more rows
mutate()
:gapminder %>% select(country, year, lifeExp) %>% mutate(long_1 = lifeExp > 70, long_2 = ifelse(lifeExp > 70, "Long", "Short"))
## # A tibble: 1,704 × 5## country year lifeExp long_1 long_2## <fct> <int> <dbl> <lgl> <chr> ## 1 Afghanistan 1952 28.8 FALSE Short ## 2 Afghanistan 1957 30.3 FALSE Short ## 3 Afghanistan 1962 32.0 FALSE Short ## 4 Afghanistan 1967 34.0 FALSE Short ## 5 Afghanistan 1972 36.1 FALSE Short ## 6 Afghanistan 1977 38.4 FALSE Short ## 7 Afghanistan 1982 39.9 FALSE Short ## 8 Afghanistan 1987 40.8 FALSE Short ## 9 Afghanistan 1992 41.7 FALSE Short ## 10 Afghanistan 1997 41.8 FALSE Short ## # … with 1,694 more rows
mutate()
is order-aware, so you can chain multiple mutates that depend on previous mutatesgapminder %>% select(country, year, lifeExp) %>% mutate(dog_years = lifeExp * 7, comment = paste("Life expectancy in", country, "is", dog_years, "in dog years.", sep = " "))
## # A tibble: 1,704 × 5## country year lifeExp dog_years comment ## <fct> <int> <dbl> <dbl> <chr> ## 1 Afghanistan 1952 28.8 202. Life expectancy in Afghanistan is 201.60…## 2 Afghanistan 1957 30.3 212. Life expectancy in Afghanistan is 212.32…## 3 Afghanistan 1962 32.0 224. Life expectancy in Afghanistan is 223.97…## 4 Afghanistan 1967 34.0 238. Life expectancy in Afghanistan is 238.14…## 5 Afghanistan 1972 36.1 253. Life expectancy in Afghanistan is 252.61…## 6 Afghanistan 1977 38.4 269. Life expectancy in Afghanistan is 269.06…## 7 Afghanistan 1982 39.9 279. Life expectancy in Afghanistan is 278.97…## 8 Afghanistan 1987 40.8 286. Life expectancy in Afghanistan is 285.75…## 9 Afghanistan 1992 41.7 292. Life expectancy in Afghanistan is 291.71…## 10 Afghanistan 1997 41.8 292. Life expectancy in Afghanistan is 292.34…## # … with 1,694 more rows
case_when
creates a new variable with values that are conditional on values of other variables (e.g., "if/else")TRUE
: when gapminder %>% mutate(European = case_when( continent == "Europe" ~ "Aye", TRUE ~ "Nay" ))
## # A tibble: 1,704 × 7## country continent year lifeExp pop gdpPercap European## <fct> <fct> <int> <dbl> <int> <dbl> <chr> ## 1 Afghanistan Asia 1952 28.8 8425333 779. Nay ## 2 Afghanistan Asia 1957 30.3 9240934 821. Nay ## 3 Afghanistan Asia 1962 32.0 10267083 853. Nay ## 4 Afghanistan Asia 1967 34.0 11537966 836. Nay ## 5 Afghanistan Asia 1972 36.1 13079460 740. Nay ## 6 Afghanistan Asia 1977 38.4 14880372 786. Nay ## 7 Afghanistan Asia 1982 39.9 12881816 978. Nay ## 8 Afghanistan Asia 1987 40.8 13867957 852. Nay ## 9 Afghanistan Asia 1992 41.7 16317921 649. Nay ## 10 Afghanistan Asia 1997 41.8 22227415 635. Nay ## # … with 1,694 more rows
mutate
that work on a subset of variables:mutate_all()
affects every variablemutate_at()
affects named or selected variablesmutate_if()
affects variables that meet a criteria # round all observations of numeric# variables to 2 digitsgapminder %>% mutate_if(is.numeric, round, digits = 2)
## # A tibble: 1,704 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <dbl> <dbl> <dbl> <dbl>## 1 Afghanistan Asia 1952 28.8 8425333 779.## 2 Afghanistan Asia 1957 30.3 9240934 821.## 3 Afghanistan Asia 1962 32 10267083 853.## 4 Afghanistan Asia 1967 34.0 11537966 836.## 5 Afghanistan Asia 1972 36.1 13079460 740.## 6 Afghanistan Asia 1977 38.4 14880372 786.## 7 Afghanistan Asia 1982 39.8 12881816 978.## 8 Afghanistan Asia 1987 40.8 13867957 852.## 9 Afghanistan Asia 1992 41.7 16317921 649.## 10 Afghanistan Asia 1997 41.8 22227415 635.## # … with 1,694 more rows
mutate
that work on a subset of variables:mutate_all()
affects every variablemutate_at()
affects named or selected variablesmutate_if()
affects variables that meet a criteria # make all factor variables uppercasegapminder %>% mutate_if(is.factor, toupper)
## # A tibble: 1,704 × 6## country continent year lifeExp pop gdpPercap## <chr> <chr> <int> <dbl> <int> <dbl>## 1 AFGHANISTAN ASIA 1952 28.8 8425333 779.## 2 AFGHANISTAN ASIA 1957 30.3 9240934 821.## 3 AFGHANISTAN ASIA 1962 32.0 10267083 853.## 4 AFGHANISTAN ASIA 1967 34.0 11537966 836.## 5 AFGHANISTAN ASIA 1972 36.1 13079460 740.## 6 AFGHANISTAN ASIA 1977 38.4 14880372 786.## 7 AFGHANISTAN ASIA 1982 39.9 12881816 978.## 8 AFGHANISTAN ASIA 1987 40.8 13867957 852.## 9 AFGHANISTAN ASIA 1992 41.7 16317921 649.## 10 AFGHANISTAN ASIA 1997 41.8 22227415 635.## # … with 1,694 more rows
tibble
(or overwrite original) if you want to "save" the new variables!summarize
† outputs a tibble of desired summary statisticsmutate
-ing a new variablesummarize
† outputs a tibble of desired summary statisticsmutate
-ing a new variable# get average life expectancy# call it avg_LEsummarize(gapminder, avg_LE = mean(lifeExp))
# using the pipegapminder %>% summarize(avg_LE = mean(lifeExp))
summarize
† outputs a tibble of desired summary statisticsmutate
-ing a new variable# get average life expectancy# call it avg_LEsummarize(gapminder, avg_LE = mean(lifeExp))
# using the pipegapminder %>% summarize(avg_LE = mean(lifeExp))
## # A tibble: 1 × 1## avg_LE## <dbl>## 1 59.5
† Also the more civilised non-U.S. English spelling summarise
also works. dplyr
was written by a Kiwi after all!
summarize()
commands:Command | Does |
---|---|
n() * |
Number of observations |
n_distinct() * |
Number of unique observations |
sum() |
Sum all observations of a variable |
mean() |
Average of all observations of a variable |
median() |
50th percentile of all observations of a variable |
sd() |
Standard deviation of all observations of a variable |
* Most commands require you to put a variable name inside the command's argument parentheses. These commands require nothing to be in parentheses!
summarize()
commands (continued):Command | Does |
---|---|
min() |
Minimum value of a variable |
max() |
Maximum value of a variable |
quantile(., 0.25) + |
Specified percentile (example 25 th percentile) of a variable |
first() |
First value of a variable |
last() |
Last value of a variable |
nth(., 2) + |
Specified position of a variable (example 2 nd) |
+ The .
is where you would put your variable name.
# summarize with n() gives size of current group, has no argumentsgapminder %>% summarize(amount = n()) # I've called it "amount"
## # A tibble: 1 × 1## amount## <int>## 1 1704
# summarize with n() gives size of current group, has no argumentsgapminder %>% summarize(amount = n()) # I've called it "amount"
## # A tibble: 1 × 1## amount## <int>## 1 1704
# count() is a dedicated command, counts observations by specified variablegapminder %>% count(year) # counts how many observations per year
## # A tibble: 12 × 2## year n## <int> <int>## 1 1952 142## 2 1957 142## 3 1962 142## 4 1967 142## 5 1972 142## 6 1977 142## 7 1982 142## 8 1987 142## 9 1992 142## 10 1997 142## 11 2002 142## 12 2007 142
TRUE
)TRUE=1
and FALSE=0
sum(x)
becomes the number of TRUE
s in x
mean(x)
becomes the proportionTRUE
)TRUE=1
and FALSE=0
sum(x)
becomes the number of TRUE
s in x
mean(x)
becomes the proportion# How many countries have life expectancy# over 70 in 2007?gapminder %>% filter(year=="2007") %>% summarize(Over_70 = sum(lifeExp>70))
## # A tibble: 1 × 1## Over_70## <int>## 1 83
TRUE
)TRUE=1
and FALSE=0
sum(x)
becomes the number of TRUE
s in x
mean(x)
becomes the proportion# How many countries have life expectancy# over 70 in 2007?gapminder %>% filter(year=="2007") %>% summarize(Over_70 = sum(lifeExp>70))
## # A tibble: 1 × 1## Over_70## <int>## 1 83
# What *proportion* of countries have life# expectancy over 70 in 2007?gapminder %>% filter(year=="2007") %>% summarize(Over_70 = mean(lifeExp>70))
## # A tibble: 1 × 1## Over_70## <dbl>## 1 0.585
summarize()
multiple variables at once, separate by commas# get average life expectancy and GDP # call each avg_LE, avg_GDPsummarize(gapminder, avg_LE = mean(lifeExp), avg_GDP = mean(gdpPercap))
# using the pipegapminder %>% summarize(avg_LE = mean(lifeExp), avg_GDP = mean(gdpPercap))
summarize()
multiple variables at once, separate by commas# get average life expectancy and GDP # call each avg_LE, avg_GDPsummarize(gapminder, avg_LE = mean(lifeExp), avg_GDP = mean(gdpPercap))
# using the pipegapminder %>% summarize(avg_LE = mean(lifeExp), avg_GDP = mean(gdpPercap))
## # A tibble: 1 × 2## avg_LE avg_GDP## <dbl> <dbl>## 1 59.5 7215.
summarize()
multiple statistics of a variable at once, separate by commas# get count, mean, sd, min, max# of life Expectancy summarize(gapminder, obs = n(), avg_LE = mean(lifeExp), sd_LE = sd(lifeExp), min_LE = min(lifeExp), max_LE = max(lifeExp))
# using the pipegapminder %>% summarize(obs = n(), avg_LE = mean(lifeExp), sd_LE = sd(lifeExp), min_LE = min(lifeExp), max_LE = max(lifeExp))
summarize()
multiple statistics of a variable at once, separate by commas# get count, mean, sd, min, max# of life Expectancy summarize(gapminder, obs = n(), avg_LE = mean(lifeExp), sd_LE = sd(lifeExp), min_LE = min(lifeExp), max_LE = max(lifeExp))
# using the pipegapminder %>% summarize(obs = n(), avg_LE = mean(lifeExp), sd_LE = sd(lifeExp), min_LE = min(lifeExp), max_LE = max(lifeExp))
## # A tibble: 1 × 5## obs avg_LE sd_LE min_LE max_LE## <int> <dbl> <dbl> <dbl> <dbl>## 1 1704 59.5 12.9 23.6 82.6
summarize()
that work on a subset of variablessummarize_all()
: affects every variablesummarize_at()
: affects named or selected variablessummarize_if()
: affects variables that meet a criteriasummarize()
that work on a subset of variablessummarize_all()
: affects every variablesummarize_at()
: affects named or selected variablessummarize_if()
: affects variables that meet a criteria# get the average of all# numeric variables gapminder %>% summarize_if(is.numeric, funs(avg = mean))
## # A tibble: 1 × 4## year_avg lifeExp_avg pop_avg gdpPercap_avg## <dbl> <dbl> <dbl> <dbl>## 1 1980. 59.5 29601212. 7215.
summarize()
that work on a subset of variablessummarize_all()
: affects every variablesummarize_at()
: affects named or selected variablessummarize_if()
: affects variables that meet a criteria# get the average of all# numeric variables gapminder %>% summarize_if(is.numeric, funs(avg = mean))
## # A tibble: 1 × 4## year_avg lifeExp_avg pop_avg gdpPercap_avg## <dbl> <dbl> <dbl> <dbl>## 1 1980. 59.5 29601212. 7215.
# get mean and sd for# pop and lifeExpgapminder %>% summarize_at(vars(pop, lifeExp), funs("avg" = mean, "std dev" = sd))
## # A tibble: 1 × 4## pop_avg lifeExp_avg `pop_std dev` `lifeExp_std dev`## <dbl> <dbl> <dbl> <dbl>## 1 29601212. 59.5 106157897. 12.9
If we have factor
variables grouping a variable into categories, we can run dplyr
verbs by group
summarize()
First define the group with group_by()
If we have factor
variables grouping a variable into categories, we can run dplyr
verbs by group
summarize()
First define the group with group_by()
# get average life expectancy and gdp by continent gapminder %>% group_by(continent) %>% summarize(mean_life = mean(lifeExp), mean_GDP = mean(gdpPercap))
## # A tibble: 5 × 3## continent mean_life mean_GDP## <fct> <dbl> <dbl>## 1 Africa 48.9 2194.## 2 Americas 64.7 7136.## 3 Asia 60.1 7902.## 4 Europe 71.9 14469.## 5 Oceania 74.3 18622.
# track changes in average life expectancy and gdp over time gapminder %>% group_by(year) %>% summarize(mean_life = mean(lifeExp), mean_GDP = mean(gdpPercap))
## # A tibble: 12 × 3## year mean_life mean_GDP## <int> <dbl> <dbl>## 1 1952 49.1 3725.## 2 1957 51.5 4299.## 3 1962 53.6 4726.## 4 1967 55.7 5484.## 5 1972 57.6 6770.## 6 1977 59.6 7313.## 7 1982 61.5 7519.## 8 1987 63.2 7901.## 9 1992 64.2 8159.## 10 1997 65.0 9090.## 11 2002 65.7 9918.## 12 2007 67.0 11680.
# track changes in average life expectancy and gdp by continent over time gapminder %>% group_by(continent, year) %>% summarize(mean_life = mean(lifeExp), mean_GDP = mean(gdpPercap))
## # A tibble: 60 × 4## # Groups: continent [5]## continent year mean_life mean_GDP## <fct> <int> <dbl> <dbl>## 1 Africa 1952 39.1 1253.## 2 Africa 1957 41.3 1385.## 3 Africa 1962 43.3 1598.## 4 Africa 1967 45.3 2050.## 5 Africa 1972 47.5 2340.## 6 Africa 1977 49.6 2586.## 7 Africa 1982 51.6 2482.## 8 Africa 1987 53.3 2283.## 9 Africa 1992 53.6 2282.## 10 Africa 1997 53.6 2379.## # … with 50 more rows
tidyverse
uses same grammar and design philosophytidyverse
uses same grammar and design philosophygapminder %>%group_by(continent, year) %>%summarize(mean_life = mean(lifeExp), mean_GDP = mean(gdpPercap)) %>%# now pipe this tibble in as data for ggplot!ggplot(data = ., # . stands in for stuff ^! aes(x = year, y = mean_life, color = continent))+geom_path(size=1)+labs(x = "Year", y = "Average Life Expectancy (Years)", color = "Continent", title = "Average Life Expectancy Over Time")+theme_classic(base_family = "Fira Sans Condensed", base_size=20)
tidyverse
uses same grammar and design philosophygapminder %>%group_by(continent, year) %>%summarize(mean_life = mean(lifeExp), mean_GDP = mean(gdpPercap)) %>%# now pipe this tibble in as data for ggplot!ggplot(data = ., # . stands in for stuff ^! aes(x = year, y = mean_life, color = continent))+geom_path(size=1)+labs(x = "Year", y = "Average Life Expectancy (Years)", color = "Continent", title = "Average Life Expectancy Over Time")+theme_classic(base_family = "Fira Sans Condensed", base_size=20)
tally
provides counts, best used with group_by
for factors
tally
provides counts, best used with group_by
for factors
gapminder %>% tally
## # A tibble: 1 × 1## n## <int>## 1 1704
tally
provides counts, best used with group_by
for factors
gapminder %>% tally
## # A tibble: 1 × 1## n## <int>## 1 1704
gapminder %>% group_by(continent) %>% tally
## # A tibble: 5 × 2## continent n## <fct> <int>## 1 Africa 624## 2 Americas 300## 3 Asia 396## 4 Europe 360## 5 Oceania 24
slice()
subsets rows by position instead of filter
ing by valuesslice()
subsets rows by position instead of filter
ing by valuesgapminder %>% slice(15:17) # see 15th through 17th observations
## # A tibble: 3 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Albania Europe 1962 64.8 1728137 2313.## 2 Albania Europe 1967 66.2 1984060 2760.## 3 Albania Europe 1972 67.7 2263554 3313.
pull()
extracts a column from a tibble
(just like $
)pull()
extracts a column from a tibble
(just like $
)# Get all U.S. life expectancy observationsgapminder %>% filter(country == "United States") %>% pull(lifeExp)
## [1] 68.440 69.490 70.210 70.760 71.340 73.380 74.650 75.020 76.090 76.810## [11] 77.310 78.242
pull()
extracts a column from a tibble
(just like $
)# Get all U.S. life expectancy observationsgapminder %>% filter(country == "United States") %>% pull(lifeExp)
## [1] 68.440 69.490 70.210 70.760 71.340 73.380 74.650 75.020 76.090 76.810## [11] 77.310 78.242
# Get U.S. life expectancy in 2007gapminder %>% filter(country == "United States" & year == 2007) %>% pull(lifeExp)
## [1] 78.242
distinct()
shows the distinct values of a specified variable (recall n_distinct()
inside summarize()
just gives you the number of values)gapminder %>% distinct(country)
## # A tibble: 142 × 1## country ## <fct> ## 1 Afghanistan## 2 Albania ## 3 Algeria ## 4 Angola ## 5 Argentina ## 6 Australia ## 7 Austria ## 8 Bahrain ## 9 Bangladesh ## 10 Belgium ## # … with 132 more rows
tidyr
helps reshape data into more usable format† This is the namesake of the tidyverse
: all associated packages and functions use or require this data format!
‡ Alternatively, sometimes rule 3 is "every value is its own cell."
"Happy families are all alike; every unhappy family is unhappy in its own way." - Leo Tolstoy
"Tidy datasets are all alike, but every messy dataset is messy in its own way." - Hadley Wickham
# make example untidy data ex_wide<-tribble( ~"Country", ~"2000", ~"2010", "United States", 140, 180, "Canada", 102, 98, "China", 111, 123)ex_wide
## # A tibble: 3 × 3## Country `2000` `2010`## <chr> <dbl> <dbl>## 1 United States 140 180## 2 Canada 102 98## 3 China 111 123
year
variable!# make example untidy data ex_wide<-tribble( ~"Country", ~"2000", ~"2010", "United States", 140, 180, "Canada", 102, 98, "China", 111, 123)ex_wide
## # A tibble: 3 × 3## Country `2000` `2010`## <chr> <dbl> <dbl>## 1 United States 140 180## 2 Canada 102 98## 3 China 111 123
gather()
these columns into a new pair of variables2000
and 2010
)key
: name of variable whose values form the column names (we'll call it the year
)value
: name of the variable whose values are spread over the cells (we'll call it number of cases
)gather()
a wide data frame into a long data frame ex_wide
## # A tibble: 3 × 3## Country `2000` `2010`## <chr> <dbl> <dbl>## 1 United States 140 180## 2 Canada 102 98## 3 China 111 123
ex_wide %>% gather("2000","2010", key = "year", value = "cases")
## # A tibble: 6 × 3## Country year cases## <chr> <chr> <dbl>## 1 United States 2000 140## 2 Canada 2000 102## 3 China 2000 111## 4 United States 2010 180## 5 Canada 2010 98## 6 China 2010 123
ex_long # example I made (code hidden)
## # A tibble: 12 × 4## Country Year Type Count## <chr> <dbl> <chr> <dbl>## 1 United States 2000 Cases 140## 2 United States 2000 Population 300## 3 United States 2010 Cases 180## 4 United States 2010 Population 310## 5 Canada 2000 Cases 102## 6 Canada 2000 Population 110## 7 Canada 2010 Cases 98## 8 Canada 2010 Population 121## 9 China 2000 Cases 111## 10 China 2000 Population 1201## 11 China 2010 Cases 123## 12 China 2010 Population 1241
Cases
and one for Population
(categorized by type
of variable)ex_long # example I made (code hidden)
## # A tibble: 12 × 4## Country Year Type Count## <chr> <dbl> <chr> <dbl>## 1 United States 2000 Cases 140## 2 United States 2000 Population 300## 3 United States 2010 Cases 180## 4 United States 2010 Population 310## 5 Canada 2000 Cases 102## 6 Canada 2000 Population 110## 7 Canada 2010 Cases 98## 8 Canada 2010 Population 121## 9 China 2000 Cases 111## 10 China 2000 Population 1201## 11 China 2010 Cases 123## 12 China 2010 Population 1241
spread()
these columns into a new pair of variableskey
: column that contains variable names (here, the type
)value
: column that contains values from multiple variables (here, the count
)spread()
a long data frame into a wide data frame ex_long
## # A tibble: 12 × 4## Country Year Type Count## <chr> <dbl> <chr> <dbl>## 1 United States 2000 Cases 140## 2 United States 2000 Population 300## 3 United States 2010 Cases 180## 4 United States 2010 Population 310## 5 Canada 2000 Cases 102## 6 Canada 2000 Population 110## 7 Canada 2010 Cases 98## 8 Canada 2010 Population 121## 9 China 2000 Cases 111## 10 China 2000 Population 1201## 11 China 2010 Cases 123## 12 China 2010 Population 1241
ex_long %>% spread(key = "Type", value = "Count")
## # A tibble: 6 × 4## Country Year Cases Population## <chr> <dbl> <dbl> <dbl>## 1 Canada 2000 102 110## 2 Canada 2010 98 121## 3 China 2000 111 1201## 4 China 2010 123 1241## 5 United States 2000 140 300## 6 United States 2010 180 310
Often, data doesn't come from just one source, but several sources
We can combine datasets into a single dataframe (tibble) using dplyr
commands in several ways:
bind
dataframes together by row or by columnbind_rows()
adds observations (rows) to existing dataset1bind_cols()
adds variables (columns) to existing dataset2join
two dataframes by designating variable(s) as key
to match rows by identical values of that key
† Note the columns must be identical between the original dataset and the new observations
‡ Note the rows must be identical between original dataset and new variable
Sometimes you want to add rows (observations) or columns (variables) that happen to match up perfectly
In this case, simply using bind_*(old_df, new_df)
will work
bind_columns(old_df, new_df)
adds columns from new_df
to old_df
bind_rows(old_df, new_df)
adds rows from new_df
to old_df
bind_columns()
(Variables)
bind_rows()
(Observations)
x
and y
*x$x
and y$y
1
and 2
x
has observation 3
which y
does not havey
has observation 4
which x
does not havex
and y
into a single dataframe* Images on all following slides come from Garrick Aden-Buie's excellent tidyexplain
x
and y
for which there are matching rowsx
with no match in y
(3) will be droppedy
with no match in x
(4) will be droppedx
and add all columns from y
x
with no match in y
(3) will have NA
sy
with no match in x
(4) will be droppedy
and add all columns from x
y
with no match in x
(4) will have NA
sx
with no match in y
(3) will be droppedx
and y
NA
stibble
readr
and importing data
dplyr
and data wrangling
tidyr
and tidying or reshaping data
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
o | Tile View: Overview of Slides |
Esc | Back to slideshow |
NA
") 😡Ideally, you'd want to be able to do all of this in one program
"Yet far too much handcrafted work - what data scientists call "data wrangling," "data munging," and "data janitor work" - is still required. Data scientists, according to interviews and expert estimates, spend from 50 to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."
"The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
suppressPackageStartupMessages()
and insert library()
command inside# install for first time# install.packages("tidyverse") # this takes a few minutes and may give several prompts# load tidyverse suppressPackageStartupMessages(library("tidyverse"))
tidyverse
contains a lot of packages, not all are loaded automaticallytidyverse_packages()
## [1] "broom" "cli" "crayon" "dbplyr" ## [5] "dplyr" "dtplyr" "forcats" "googledrive" ## [9] "googlesheets4" "ggplot2" "haven" "hms" ## [13] "httr" "jsonlite" "lubridate" "magrittr" ## [17] "modelr" "pillar" "purrr" "readr" ## [21] "readxl" "reprex" "rlang" "rstudioapi" ## [25] "rvest" "stringr" "tibble" "tidyr" ## [29] "xml2" "tidyverse"
tibble
for friendlier dataframesmagrittr
for "pipeable" codereadr
for importing datadplyr
for data wranglingtidyr
for tidying dataggplot2
for plotting data (we've already covered)broom
for tidy regression (not part of core tidyverse)forcats
for working with factorsstringr
for working with stringslubridate
for working with dates and timespurrr
for iterationtibble
converts all data.frames
into a friendlier version called tibbles
(or tbl_df
)diamonds
## # A tibble: 53,940 × 7## carat cut color clarity depth table price## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int>## 1 0.23 Ideal E SI2 61.5 55 326## 2 0.21 Premium E SI1 59.8 61 326## 3 0.23 Good E VS1 56.9 65 327## 4 0.29 Premium I VS2 62.4 58 334## 5 0.31 Good J SI2 63.3 58 335## 6 0.24 Very Good J VVS2 62.8 57 336## 7 0.24 Very Good I VVS1 62.3 57 336## 8 0.26 Very Good H SI1 61.9 55 337## 9 0.22 Fair E VS2 65.1 61 337## 10 0.23 Very Good H VS1 59.4 61 338## # … with 53,930 more rows
Prints much nicer output
Shows a bit of the str
ucture:
nrow() x ncol()
<dbl>
is numeric ("double")<ord>
is an ordered factor<int>
is an integerFundamental grammar of tidyverse:
tibble
from a data.frame
with as_tibble()
as_tibble(mpg) # take built-in dataframe mpg
tibble
from scratch with tibble()
, works like data.frame()
example<-tibble(x = seq(2,6,2), # sequence from 2 to 6 by 2's y = rnorm(3,0,1), # 3 random draws with mean 0, sd 1 colors = c("orange", "green", "blue"))example
## # A tibble: 3 × 3## x y colors## <dbl> <dbl> <chr> ## 1 2 -0.0747 orange## 2 4 -1.01 green ## 3 6 -1.05 blue
tibble
row-by-row with tribble()
example_2<-tribble( ~x, ~y, ~color, # each variable name starts with ~ 2, 1.5, "orange", 4, 0.2, "green", 6, 0.8, "blue") # last element has no commaexample_2
## # A tibble: 3 × 3## x y color ## <dbl> <dbl> <chr> ## 1 2 1.5 orange## 2 4 0.2 green ## 3 6 0.8 blue
The magrittr
package allows us to use the "pipe" operator (%>%
)†
%>%
"pipes" the output of the left of the pipe into the (1st) argument of the right
Running a function f
on object x
as f(x)
becomes x %>% f
in pipeable form
x
and then run function f
on it"† Keyboard shortcuts in R Studio: CTRL+Shift+M
(Windows) or Cmd+Shift+M
(Mac)
With ordinary math functions, read from outside ← (inside): g(f(x))
x
and perform function f()
on x
and then take that result and perform function g()
on itWith pipes, read operations from left → right:
x %>% f %>% g
take x
and then perform function f
on it, then perform function g
on that result
%>%
mentally as "and then"Example
ln(exp(x))
x %>% exp() %>% ln()
Example
park(drive(start_car(find("keys")), to = "campus"))
find("keys") %>% start_car() %>% drive(to = "campus") %>% park()
# look at top 6 rowshead(gapminder) # use pipe insteadgapminder %>% head()
## # A tibble: 6 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Afghanistan Asia 1952 28.8 8425333 779.## 2 Afghanistan Asia 1957 30.3 9240934 821.## 3 Afghanistan Asia 1962 32.0 10267083 853.## 4 Afghanistan Asia 1967 34.0 11537966 836.## 5 Afghanistan Asia 1972 36.1 13079460 740.## 6 Afghanistan Asia 1977 38.4 14880372 786.
These two methods produce the same output (average hightway mpg of Audi cars)
Without the pipe
summarise(group_by(filter(mpg, manufacturer=="audi"), model), hwy_mean = mean(hwy))
These two methods produce the same output (average hightway mpg of Audi cars)
Without the pipe
summarise(group_by(filter(mpg, manufacturer=="audi"), model), hwy_mean = mean(hwy))
mpg %>% filter(manufacturer=="audi") %>% group_by(model) %>% summarise(hwy_mean = mean(hwy))
## # A tibble: 3 × 2## model hwy_mean## <chr> <dbl>## 1 a4 28.3## 2 a4 quattro 25.8## 3 a6 quattro 24
readr
helps load common spreadsheet files (.csv
, .tsv
) with simple commands:read_*(path/to/my_data.*)
*
can be .csv
or .tsv
Often this is enough, but many more customizations possible
You can also export your data from R into a common spreadsheet file with:
write_*(my_df, path = path/to/file_name.*)
my_df
is the name of your tibble
, and file_name
is the name of the file you want to save asRead more on the tidyverse website and the Readr Cheatsheet
readxl
has equivalent commands for Excel data types:
read_*("path/to/my/data.*")
write_*(my_dataframe, path=path/to/file_name.*)
*
can be .xls
or .xlsx
haven
has equivalent commands for other data types:
read_*("path/to/my_data.dta")
for STATA .dta
fileswrite_*(my_dataframe, path=path/to/file_name.*)
*
can be .dta
(STATA), .sav
(SPSS), .sas7bdat
(SAS)Most common: "where the hell is my data file"??
Recall R
looks for files to read_*()
in the default working directory (check what it is with getwd()
, change it with setwd()
)
You can tell R
where this data is by making the path
a part of the file's name when importing
..
to "move up one folder"/
to "enter a folder"Either use an absolute path on your computer:
# Exampledf <- read_csv("C:/Documents and Settings/Ryan Safner/Downloads/my_data.csv")
Most common: "where the hell is my data file"??
Recall R
looks for files to read_*()
in the default working directory (check what it is with getwd()
, change it with setwd()
)
You can tell R
where this data is by making the path
a part of the file's name when importing
..
to "move up one folder"/
to "enter a folder"# Example# If working directory is Documents, but data is in Downloads, like so: # # Ryan Safner/# |# |- Documents/# |- Downloads/# |- Photos/# |- Videos/df <- read_csv("../Downloads/my_data.csv")
Suggestion to make your data import easier: Download and move files to R's working directory
Your computer and working directory are different from mine (and others)
This is not a reproducible workflow!
We'll finally fix this next class with R Projects
dplyr
uses more efficient & intuitive commands to manipulate tibblesBase R
grammar passively runs functions on nouns: function(object)
dplyr
grammar actively uses verbs: verb(df, conditions)
†
Three great features:
%>%
pipe operatortibble
df %>% verb(conditions)
dplyr
verbsVerb | Does |
---|---|
filter() |
Keep only selected observations |
select() |
Keep only selected variables |
arrange() |
Reorder rows (e.g. in numerical order) |
mutate() |
Create new variables |
summarize() |
Collapse data into summary statistics |
group_by() |
Perform any of the above functions by groups/categories |
filter
keeps only selected observations (rows)filter
keeps only selected observations (rows)# look only at African observations# syntax without the pipefilter(gapminder, continent=="Africa")
# using the pipegapminder %>% filter(continent == "Africa")
filter
keeps only selected observations (rows)# look only at African observations# syntax without the pipefilter(gapminder, continent=="Africa")
# using the pipegapminder %>% filter(continent == "Africa")
## # A tibble: 624 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Algeria Africa 1952 43.1 9279525 2449.## 2 Algeria Africa 1957 45.7 10270856 3014.## 3 Algeria Africa 1962 48.3 11000948 2551.## 4 Algeria Africa 1967 51.4 12760499 3247.## 5 Algeria Africa 1972 54.5 14760787 4183.## 6 Algeria Africa 1977 58.0 17152804 4910.## 7 Algeria Africa 1982 61.4 20033753 5745.## 8 Algeria Africa 1987 65.8 23254956 5681.## 9 Algeria Africa 1992 67.7 26298373 5023.## 10 Algeria Africa 1997 69.2 29072015 4797.## # … with 614 more rows
dplyr
functions never modify their inputs (i.e. never overwrite the original tibble
)<-
to assign it to a new tibble
tibble
by namedplyr
functions never modify their inputs (i.e. never overwrite the original tibble
)<-
to assign it to a new tibble
tibble
by name# base syntaxafrica <- filter(gapminder, continent=="Africa")
# using the pipeafrica <- gapminder %>% filter(continent == "Africa")
# look at new tibbleafrica
## # A tibble: 624 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Algeria Africa 1952 43.1 9279525 2449.## 2 Algeria Africa 1957 45.7 10270856 3014.## 3 Algeria Africa 1962 48.3 11000948 2551.## 4 Algeria Africa 1967 51.4 12760499 3247.## 5 Algeria Africa 1972 54.5 14760787 4183.## 6 Algeria Africa 1977 58.0 17152804 4910.## 7 Algeria Africa 1982 61.4 20033753 5745.## 8 Algeria Africa 1987 65.8 23254956 5681.## 9 Algeria Africa 1992 67.7 26298373 5023.## 10 Algeria Africa 1997 69.2 29072015 4797.## # … with 614 more rows
(africa <- gapminder %>% filter(continent == "Africa"))
## # A tibble: 624 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Algeria Africa 1952 43.1 9279525 2449.## 2 Algeria Africa 1957 45.7 10270856 3014.## 3 Algeria Africa 1962 48.3 11000948 2551.## 4 Algeria Africa 1967 51.4 12760499 3247.## 5 Algeria Africa 1972 54.5 14760787 4183.## 6 Algeria Africa 1977 58.0 17152804 4910.## 7 Algeria Africa 1982 61.4 20033753 5745.## 8 Algeria Africa 1987 65.8 23254956 5681.## 9 Algeria Africa 1992 67.7 26298373 5023.## 10 Algeria Africa 1997 69.2 29072015 4797.## # … with 614 more rows
tibble
, it would overwrite the original!# base syntaxgapminder <- filter(gapminder, continent=="Africa")
# using the pipegapminder <- gapminder %>% filter(continent == "Africa")# this overwrites gapminder!
TRUE
†>
, <
: greater than, less than>=
, <=
: greater than or equal to, less than or equal to==
‡, !=
: is equal to‡, is not equal to%in%
: is a member of some defined set (∈)&
: AND (commas also work instead)|
: OR!
: not† See ?Comparison
and ?Base::Logic
.
‡ Recall one =
assigns values to an object, two ==
tests an object for a condition!
# look only at African observations# in 1997gapminder %>% filter(continent == "Africa", year == 1997)
# look only at African observations# in 1997gapminder %>% filter(continent == "Africa", year == 1997)
## # A tibble: 52 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Algeria Africa 1997 69.2 29072015 4797.## 2 Angola Africa 1997 41.0 9875024 2277.## 3 Benin Africa 1997 54.8 6066080 1233.## 4 Botswana Africa 1997 52.6 1536536 8647.## 5 Burkina Faso Africa 1997 50.3 10352843 946.## 6 Burundi Africa 1997 45.3 6121610 463.## 7 Cameroon Africa 1997 52.2 14195809 1694.## 8 Central African Republic Africa 1997 46.1 3696513 741.## 9 Chad Africa 1997 51.6 7562011 1005.## 10 Comoros Africa 1997 60.7 527982 1174.## # … with 42 more rows
# look only at African observations # or observations in 1997gapminder %>% filter(continent == "Africa" | year == 1997)
# look only at African observations # or observations in 1997gapminder %>% filter(continent == "Africa" | year == 1997)
## # A tibble: 714 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Afghanistan Asia 1997 41.8 22227415 635.## 2 Albania Europe 1997 73.0 3428038 3193.## 3 Algeria Africa 1952 43.1 9279525 2449.## 4 Algeria Africa 1957 45.7 10270856 3014.## 5 Algeria Africa 1962 48.3 11000948 2551.## 6 Algeria Africa 1967 51.4 12760499 3247.## 7 Algeria Africa 1972 54.5 14760787 4183.## 8 Algeria Africa 1977 58.0 17152804 4910.## 9 Algeria Africa 1982 61.4 20033753 5745.## 10 Algeria Africa 1987 65.8 23254956 5681.## # … with 704 more rows
# look only at U.S. and U.K. # observations in 2002gapminder %>% filter(country %in% c("United States", "United Kingdom"), year == 2002)
# look only at U.S. and U.K. # observations in 2002gapminder %>% filter(country %in% c("United States", "United Kingdom"), year == 2002)
## # A tibble: 2 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 United Kingdom Europe 2002 78.5 59912431 29479.## 2 United States Americas 2002 77.3 287675526 39097.
arrange
reorders observations (rows) in a logical orderarrange
reorders observations (rows) in a logical order# order by smallest to largest pop# syntax without the pipearrange(gapminder, pop)
# using the pipegapminder %>% arrange(pop)
arrange
reorders observations (rows) in a logical order# order by smallest to largest pop# syntax without the pipearrange(gapminder, pop)
# using the pipegapminder %>% arrange(pop)
## # A tibble: 1,704 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Sao Tome and Principe Africa 1952 46.5 60011 880.## 2 Sao Tome and Principe Africa 1957 48.9 61325 861.## 3 Djibouti Africa 1952 34.8 63149 2670.## 4 Sao Tome and Principe Africa 1962 51.9 65345 1072.## 5 Sao Tome and Principe Africa 1967 54.4 70787 1385.## 6 Djibouti Africa 1957 37.3 71851 2865.## 7 Sao Tome and Principe Africa 1972 56.5 76595 1533.## 8 Sao Tome and Principe Africa 1977 58.6 86796 1738.## 9 Djibouti Africa 1962 39.7 89898 3021.## 10 Sao Tome and Principe Africa 1982 60.4 98593 1890.## # … with 1,694 more rows
# order by year, with the smallest # to largest pop in each year# syntax without the pipearrange(gapminder, year, pop)
# using the pipegapminder %>% arrange(year, pop)
# order by year, with the smallest # to largest pop in each year# syntax without the pipearrange(gapminder, year, pop)
# using the pipegapminder %>% arrange(year, pop)
## # A tibble: 1,704 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Sao Tome and Principe Africa 1952 46.5 60011 880.## 2 Djibouti Africa 1952 34.8 63149 2670.## 3 Bahrain Asia 1952 50.9 120447 9867.## 4 Iceland Europe 1952 72.5 147962 7268.## 5 Comoros Africa 1952 40.7 153936 1103.## 6 Kuwait Asia 1952 55.6 160000 108382.## 7 Equatorial Guinea Africa 1952 34.5 216964 376.## 8 Reunion Africa 1952 52.7 257700 2719.## 9 Gambia Africa 1952 30 284320 485.## 10 Swaziland Africa 1952 41.4 290243 1148.## # … with 1,694 more rows
desc()
to re-order in the opposite directiondesc()
to re-order in the opposite direction# order by largest to smallest pop# syntax without the pipearrange(gapminder, desc(pop))
# using the pipegapminder %>% arrange(desc(pop))
desc()
to re-order in the opposite direction# order by largest to smallest pop# syntax without the pipearrange(gapminder, desc(pop))
# using the pipegapminder %>% arrange(desc(pop))
## # A tibble: 1,704 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 China Asia 2007 73.0 1318683096 4959.## 2 China Asia 2002 72.0 1280400000 3119.## 3 China Asia 1997 70.4 1230075000 2289.## 4 China Asia 1992 68.7 1164970000 1656.## 5 India Asia 2007 64.7 1110396331 2452.## 6 China Asia 1987 67.3 1084035000 1379.## 7 India Asia 2002 62.9 1034172547 1747.## 8 China Asia 1982 65.5 1000281000 962.## 9 India Asia 1997 61.8 959000000 1459.## 10 China Asia 1977 64.0 943455000 741.## # … with 1,694 more rows
select
keeps only selected variables (columns)select
keeps only selected variables (columns)# keep only country, year, # and population variables# syntax without the pipeselect(gapminder, country, year, pop)
# using the pipegapminder %>% select(country, year, pop)
select
keeps only selected variables (columns)# keep only country, year, # and population variables# syntax without the pipeselect(gapminder, country, year, pop)
# using the pipegapminder %>% select(country, year, pop)
## # A tibble: 1,704 × 3## country year pop## <fct> <int> <int>## 1 Afghanistan 1952 8425333## 2 Afghanistan 1957 9240934## 3 Afghanistan 1962 10267083## 4 Afghanistan 1967 11537966## 5 Afghanistan 1972 13079460## 6 Afghanistan 1977 14880372## 7 Afghanistan 1982 12881816## 8 Afghanistan 1987 13867957## 9 Afghanistan 1992 16317921## 10 Afghanistan 1997 22227415## # … with 1,694 more rows
select
"all except" by negating a variable with -
select
"all except" by negating a variable with -
# keep all *except* gdpPercap# syntax without the pipeselect(gapminder, -gdpPercap)
# using the pipegapminder %>% select(-gdpPercap)
select
"all except" by negating a variable with -
# keep all *except* gdpPercap# syntax without the pipeselect(gapminder, -gdpPercap)
# using the pipegapminder %>% select(-gdpPercap)
## # A tibble: 1,704 × 5## country continent year lifeExp pop## <fct> <fct> <int> <dbl> <int>## 1 Afghanistan Asia 1952 28.8 8425333## 2 Afghanistan Asia 1957 30.3 9240934## 3 Afghanistan Asia 1962 32.0 10267083## 4 Afghanistan Asia 1967 34.0 11537966## 5 Afghanistan Asia 1972 36.1 13079460## 6 Afghanistan Asia 1977 38.4 14880372## 7 Afghanistan Asia 1982 39.9 12881816## 8 Afghanistan Asia 1987 40.8 13867957## 9 Afghanistan Asia 1992 41.7 16317921## 10 Afghanistan Asia 1997 41.8 22227415## # … with 1,694 more rows
select
reorders the columns in the order you provideeverything()
at the endselect
reorders the columns in the order you provideeverything()
at the end# keep all and move pop first# syntax without the pipeselect(gapminder, pop, everything())
# using the pipegapminder %>% select(pop, everything())
select
reorders the columns in the order you provideeverything()
at the end# keep all and move pop first# syntax without the pipeselect(gapminder, pop, everything())
# using the pipegapminder %>% select(pop, everything())
## # A tibble: 1,704 × 6## pop country continent year lifeExp gdpPercap## <int> <fct> <fct> <int> <dbl> <dbl>## 1 8425333 Afghanistan Asia 1952 28.8 779.## 2 9240934 Afghanistan Asia 1957 30.3 821.## 3 10267083 Afghanistan Asia 1962 32.0 853.## 4 11537966 Afghanistan Asia 1967 34.0 836.## 5 13079460 Afghanistan Asia 1972 36.1 740.## 6 14880372 Afghanistan Asia 1977 38.4 786.## 7 12881816 Afghanistan Asia 1982 39.9 978.## 8 13867957 Afghanistan Asia 1987 40.8 852.## 9 16317921 Afghanistan Asia 1992 41.7 649.## 10 22227415 Afghanistan Asia 1997 41.8 635.## # … with 1,694 more rows
select
has a lot of helper functions, useful for when you have hundreds of variables?select()
for a listselect
has a lot of helper functions, useful for when you have hundreds of variables?select()
for a list# keep all variables starting with "co"gapminder %>% select(starts_with("co"))
## # A tibble: 1,704 × 2## country continent## <fct> <fct> ## 1 Afghanistan Asia ## 2 Afghanistan Asia ## 3 Afghanistan Asia ## 4 Afghanistan Asia ## 5 Afghanistan Asia ## 6 Afghanistan Asia ## 7 Afghanistan Asia ## 8 Afghanistan Asia ## 9 Afghanistan Asia ## 10 Afghanistan Asia ## # … with 1,694 more rows
select
has a lot of helper functions, useful for when you have hundreds of variables?select()
for a list# keep all variables starting with "co"gapminder %>% select(starts_with("co"))
## # A tibble: 1,704 × 2## country continent## <fct> <fct> ## 1 Afghanistan Asia ## 2 Afghanistan Asia ## 3 Afghanistan Asia ## 4 Afghanistan Asia ## 5 Afghanistan Asia ## 6 Afghanistan Asia ## 7 Afghanistan Asia ## 8 Afghanistan Asia ## 9 Afghanistan Asia ## 10 Afghanistan Asia ## # … with 1,694 more rows
# keep country and all variables # containing "per"gapminder %>% select(country, contains("per"))
## # A tibble: 1,704 × 2## country gdpPercap## <fct> <dbl>## 1 Afghanistan 779.## 2 Afghanistan 821.## 3 Afghanistan 853.## 4 Afghanistan 836.## 5 Afghanistan 740.## 6 Afghanistan 786.## 7 Afghanistan 978.## 8 Afghanistan 852.## 9 Afghanistan 649.## 10 Afghanistan 635.## # … with 1,694 more rows
rename
changes the name of a variable (column)new_name = old_name
rename
changes the name of a variable (column)new_name = old_name
# rename gdpPercap to GDP# syntax without the piperename(gapminder, GDP = gdpPercap)
# using the pipegapminder %>% rename(GDP = gdpPercap)
rename
changes the name of a variable (column)new_name = old_name
# rename gdpPercap to GDP# syntax without the piperename(gapminder, GDP = gdpPercap)
# using the pipegapminder %>% rename(GDP = gdpPercap)
## # A tibble: 1,704 × 6## country continent year lifeExp pop GDP## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Afghanistan Asia 1952 28.8 8425333 779.## 2 Afghanistan Asia 1957 30.3 9240934 821.## 3 Afghanistan Asia 1962 32.0 10267083 853.## 4 Afghanistan Asia 1967 34.0 11537966 836.## 5 Afghanistan Asia 1972 36.1 13079460 740.## 6 Afghanistan Asia 1977 38.4 14880372 786.## 7 Afghanistan Asia 1982 39.9 12881816 978.## 8 Afghanistan Asia 1987 40.8 13867957 852.## 9 Afghanistan Asia 1992 41.7 16317921 649.## 10 Afghanistan Asia 1997 41.8 22227415 635.## # … with 1,694 more rows
mutate
creates a new variable (column)new_variable_name = operation
# create variable "europe" if country # is in Europe# syntax without the pipemutate(gapminder, europe = ifelse(continent == "Europe", yes = "In Europe", no = "Not in Europe"))
# using the pipegapminder %>% mutate(europe = ifelse(continent == "Europe", yes = "In Europe", no = "Not in Europe"))
# create variable "europe" if country # is in Europe# syntax without the pipemutate(gapminder, europe = ifelse(continent == "Europe", yes = "In Europe", no = "Not in Europe"))
# using the pipegapminder %>% mutate(europe = ifelse(continent == "Europe", yes = "In Europe", no = "Not in Europe"))
## # A tibble: 1,704 × 4## country continent year europe ## <fct> <fct> <int> <chr> ## 1 Afghanistan Asia 1952 Not in Europe## 2 Afghanistan Asia 1957 Not in Europe## 3 Afghanistan Asia 1962 Not in Europe## 4 Afghanistan Asia 1967 Not in Europe## 5 Afghanistan Asia 1972 Not in Europe## 6 Afghanistan Asia 1977 Not in Europe## 7 Afghanistan Asia 1982 Not in Europe## 8 Afghanistan Asia 1987 Not in Europe## 9 Afghanistan Asia 1992 Not in Europe## 10 Afghanistan Asia 1997 Not in Europe## # … with 1,694 more rows
# create population in millions # syntax without the pipemutate(gapminder, pop_mil = pop / 1000000)
# using the pipegapminder %>% rename(pop_mil = pop / 1000000)
# create population in millions # syntax without the pipemutate(gapminder, pop_mil = pop / 1000000)
# using the pipegapminder %>% rename(pop_mil = pop / 1000000)
## # A tibble: 1,704 × 6## country continent year lifeExp pop pop_mil## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Afghanistan Asia 1952 28.8 8425333 8.43## 2 Afghanistan Asia 1957 30.3 9240934 9.24## 3 Afghanistan Asia 1962 32.0 10267083 10.3 ## 4 Afghanistan Asia 1967 34.0 11537966 11.5 ## 5 Afghanistan Asia 1972 36.1 13079460 13.1 ## 6 Afghanistan Asia 1977 38.4 14880372 14.9 ## 7 Afghanistan Asia 1982 39.9 12881816 12.9 ## 8 Afghanistan Asia 1987 40.8 13867957 13.9 ## 9 Afghanistan Asia 1992 41.7 16317921 16.3 ## 10 Afghanistan Asia 1997 41.8 22227415 22.2 ## # … with 1,694 more rows
# create GDP variable from gdpPercap # and pop, in billions# syntax without the pipemutate(gapminder, GDP = ((gdpPercap * pop)/1000000000))
# using the pipegapminder %>% mutate(GDP = ((gdpPercap * pop)/1000000000))
# create GDP variable from gdpPercap # and pop, in billions# syntax without the pipemutate(gapminder, GDP = ((gdpPercap * pop)/1000000000))
# using the pipegapminder %>% mutate(GDP = ((gdpPercap * pop)/1000000000))
## # A tibble: 1,704 × 6## country continent year pop gdpPercap GDP## <fct> <fct> <int> <int> <dbl> <dbl>## 1 Afghanistan Asia 1952 8425333 779. 6.57## 2 Afghanistan Asia 1957 9240934 821. 7.59## 3 Afghanistan Asia 1962 10267083 853. 8.76## 4 Afghanistan Asia 1967 11537966 836. 9.65## 5 Afghanistan Asia 1972 13079460 740. 9.68## 6 Afghanistan Asia 1977 14880372 786. 11.7 ## 7 Afghanistan Asia 1982 12881816 978. 12.6 ## 8 Afghanistan Asia 1987 13867957 852. 11.8 ## 9 Afghanistan Asia 1992 16317921 649. 10.6 ## 10 Afghanistan Asia 1997 22227415 635. 14.1 ## # … with 1,694 more rows
class
of a variable inside mutate()
with as.*()
gapminder %>% head(., 2)
## # A tibble: 2 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Afghanistan Asia 1952 28.8 8425333 779.## 2 Afghanistan Asia 1957 30.3 9240934 821.
# change year from an integer to a factorgapminder %>% mutate(year = as.factor(year))
## # A tibble: 1,704 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <fct> <dbl> <int> <dbl>## 1 Afghanistan Asia 1952 28.8 8425333 779.## 2 Afghanistan Asia 1957 30.3 9240934 821.## 3 Afghanistan Asia 1962 32.0 10267083 853.## 4 Afghanistan Asia 1967 34.0 11537966 836.## 5 Afghanistan Asia 1972 36.1 13079460 740.## 6 Afghanistan Asia 1977 38.4 14880372 786.## 7 Afghanistan Asia 1982 39.9 12881816 978.## 8 Afghanistan Asia 1987 40.8 13867957 852.## 9 Afghanistan Asia 1992 41.7 16317921 649.## 10 Afghanistan Asia 1997 41.8 22227415 635.## # … with 1,694 more rows
gapminder %>% mutate(GDP = gdpPercap * pop, pop_millions = pop / 1000000)
## # A tibble: 1,704 × 8## country continent year lifeExp pop gdpPercap GDP pop_millions## <fct> <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>## 1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330. 8.43## 2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670. 9.24## 3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797. 10.3 ## 4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150. 11.5 ## 5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274. 13.1 ## 6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231. 14.9 ## 7 Afghanistan Asia 1982 39.9 12881816 978. 12598563401. 12.9 ## 8 Afghanistan Asia 1987 40.8 13867957 852. 11820990309. 13.9 ## 9 Afghanistan Asia 1992 41.7 16317921 649. 10595901589. 16.3 ## 10 Afghanistan Asia 1997 41.8 22227415 635. 14121995875. 22.2 ## # … with 1,694 more rows
transmute
keeps only newly created variables (select
s only the new mutate
d variables)gapminder %>% transmute(GDP = gdpPercap * pop, pop_millions = pop / 1000000)
## # A tibble: 1,704 × 2## GDP pop_millions## <dbl> <dbl>## 1 6567086330. 8.43## 2 7585448670. 9.24## 3 8758855797. 10.3 ## 4 9648014150. 11.5 ## 5 9678553274. 13.1 ## 6 11697659231. 14.9 ## 7 12598563401. 12.9 ## 8 11820990309. 13.9 ## 9 10595901589. 16.3 ## 10 14121995875. 22.2 ## # … with 1,694 more rows
mutate()
:gapminder %>% select(country, year, lifeExp) %>% mutate(long_1 = lifeExp > 70, long_2 = ifelse(lifeExp > 70, "Long", "Short"))
## # A tibble: 1,704 × 5## country year lifeExp long_1 long_2## <fct> <int> <dbl> <lgl> <chr> ## 1 Afghanistan 1952 28.8 FALSE Short ## 2 Afghanistan 1957 30.3 FALSE Short ## 3 Afghanistan 1962 32.0 FALSE Short ## 4 Afghanistan 1967 34.0 FALSE Short ## 5 Afghanistan 1972 36.1 FALSE Short ## 6 Afghanistan 1977 38.4 FALSE Short ## 7 Afghanistan 1982 39.9 FALSE Short ## 8 Afghanistan 1987 40.8 FALSE Short ## 9 Afghanistan 1992 41.7 FALSE Short ## 10 Afghanistan 1997 41.8 FALSE Short ## # … with 1,694 more rows
mutate()
is order-aware, so you can chain multiple mutates that depend on previous mutatesgapminder %>% select(country, year, lifeExp) %>% mutate(dog_years = lifeExp * 7, comment = paste("Life expectancy in", country, "is", dog_years, "in dog years.", sep = " "))
## # A tibble: 1,704 × 5## country year lifeExp dog_years comment ## <fct> <int> <dbl> <dbl> <chr> ## 1 Afghanistan 1952 28.8 202. Life expectancy in Afghanistan is 201.60…## 2 Afghanistan 1957 30.3 212. Life expectancy in Afghanistan is 212.32…## 3 Afghanistan 1962 32.0 224. Life expectancy in Afghanistan is 223.97…## 4 Afghanistan 1967 34.0 238. Life expectancy in Afghanistan is 238.14…## 5 Afghanistan 1972 36.1 253. Life expectancy in Afghanistan is 252.61…## 6 Afghanistan 1977 38.4 269. Life expectancy in Afghanistan is 269.06…## 7 Afghanistan 1982 39.9 279. Life expectancy in Afghanistan is 278.97…## 8 Afghanistan 1987 40.8 286. Life expectancy in Afghanistan is 285.75…## 9 Afghanistan 1992 41.7 292. Life expectancy in Afghanistan is 291.71…## 10 Afghanistan 1997 41.8 292. Life expectancy in Afghanistan is 292.34…## # … with 1,694 more rows
case_when
creates a new variable with values that are conditional on values of other variables (e.g., "if/else")TRUE
: when gapminder %>% mutate(European = case_when( continent == "Europe" ~ "Aye", TRUE ~ "Nay" ))
## # A tibble: 1,704 × 7## country continent year lifeExp pop gdpPercap European## <fct> <fct> <int> <dbl> <int> <dbl> <chr> ## 1 Afghanistan Asia 1952 28.8 8425333 779. Nay ## 2 Afghanistan Asia 1957 30.3 9240934 821. Nay ## 3 Afghanistan Asia 1962 32.0 10267083 853. Nay ## 4 Afghanistan Asia 1967 34.0 11537966 836. Nay ## 5 Afghanistan Asia 1972 36.1 13079460 740. Nay ## 6 Afghanistan Asia 1977 38.4 14880372 786. Nay ## 7 Afghanistan Asia 1982 39.9 12881816 978. Nay ## 8 Afghanistan Asia 1987 40.8 13867957 852. Nay ## 9 Afghanistan Asia 1992 41.7 16317921 649. Nay ## 10 Afghanistan Asia 1997 41.8 22227415 635. Nay ## # … with 1,694 more rows
mutate
that work on a subset of variables:mutate_all()
affects every variablemutate_at()
affects named or selected variablesmutate_if()
affects variables that meet a criteria # round all observations of numeric# variables to 2 digitsgapminder %>% mutate_if(is.numeric, round, digits = 2)
## # A tibble: 1,704 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <dbl> <dbl> <dbl> <dbl>## 1 Afghanistan Asia 1952 28.8 8425333 779.## 2 Afghanistan Asia 1957 30.3 9240934 821.## 3 Afghanistan Asia 1962 32 10267083 853.## 4 Afghanistan Asia 1967 34.0 11537966 836.## 5 Afghanistan Asia 1972 36.1 13079460 740.## 6 Afghanistan Asia 1977 38.4 14880372 786.## 7 Afghanistan Asia 1982 39.8 12881816 978.## 8 Afghanistan Asia 1987 40.8 13867957 852.## 9 Afghanistan Asia 1992 41.7 16317921 649.## 10 Afghanistan Asia 1997 41.8 22227415 635.## # … with 1,694 more rows
mutate
that work on a subset of variables:mutate_all()
affects every variablemutate_at()
affects named or selected variablesmutate_if()
affects variables that meet a criteria # make all factor variables uppercasegapminder %>% mutate_if(is.factor, toupper)
## # A tibble: 1,704 × 6## country continent year lifeExp pop gdpPercap## <chr> <chr> <int> <dbl> <int> <dbl>## 1 AFGHANISTAN ASIA 1952 28.8 8425333 779.## 2 AFGHANISTAN ASIA 1957 30.3 9240934 821.## 3 AFGHANISTAN ASIA 1962 32.0 10267083 853.## 4 AFGHANISTAN ASIA 1967 34.0 11537966 836.## 5 AFGHANISTAN ASIA 1972 36.1 13079460 740.## 6 AFGHANISTAN ASIA 1977 38.4 14880372 786.## 7 AFGHANISTAN ASIA 1982 39.9 12881816 978.## 8 AFGHANISTAN ASIA 1987 40.8 13867957 852.## 9 AFGHANISTAN ASIA 1992 41.7 16317921 649.## 10 AFGHANISTAN ASIA 1997 41.8 22227415 635.## # … with 1,694 more rows
tibble
(or overwrite original) if you want to "save" the new variables!summarize
† outputs a tibble of desired summary statisticsmutate
-ing a new variablesummarize
† outputs a tibble of desired summary statisticsmutate
-ing a new variable# get average life expectancy# call it avg_LEsummarize(gapminder, avg_LE = mean(lifeExp))
# using the pipegapminder %>% summarize(avg_LE = mean(lifeExp))
summarize
† outputs a tibble of desired summary statisticsmutate
-ing a new variable# get average life expectancy# call it avg_LEsummarize(gapminder, avg_LE = mean(lifeExp))
# using the pipegapminder %>% summarize(avg_LE = mean(lifeExp))
## # A tibble: 1 × 1## avg_LE## <dbl>## 1 59.5
† Also the more civilised non-U.S. English spelling summarise
also works. dplyr
was written by a Kiwi after all!
summarize()
commands:Command | Does |
---|---|
n() * |
Number of observations |
n_distinct() * |
Number of unique observations |
sum() |
Sum all observations of a variable |
mean() |
Average of all observations of a variable |
median() |
50th percentile of all observations of a variable |
sd() |
Standard deviation of all observations of a variable |
* Most commands require you to put a variable name inside the command's argument parentheses. These commands require nothing to be in parentheses!
summarize()
commands (continued):Command | Does |
---|---|
min() |
Minimum value of a variable |
max() |
Maximum value of a variable |
quantile(., 0.25) + |
Specified percentile (example 25 th percentile) of a variable |
first() |
First value of a variable |
last() |
Last value of a variable |
nth(., 2) + |
Specified position of a variable (example 2 nd) |
+ The .
is where you would put your variable name.
# summarize with n() gives size of current group, has no argumentsgapminder %>% summarize(amount = n()) # I've called it "amount"
## # A tibble: 1 × 1## amount## <int>## 1 1704
# summarize with n() gives size of current group, has no argumentsgapminder %>% summarize(amount = n()) # I've called it "amount"
## # A tibble: 1 × 1## amount## <int>## 1 1704
# count() is a dedicated command, counts observations by specified variablegapminder %>% count(year) # counts how many observations per year
## # A tibble: 12 × 2## year n## <int> <int>## 1 1952 142## 2 1957 142## 3 1962 142## 4 1967 142## 5 1972 142## 6 1977 142## 7 1982 142## 8 1987 142## 9 1992 142## 10 1997 142## 11 2002 142## 12 2007 142
TRUE
)TRUE=1
and FALSE=0
sum(x)
becomes the number of TRUE
s in x
mean(x)
becomes the proportionTRUE
)TRUE=1
and FALSE=0
sum(x)
becomes the number of TRUE
s in x
mean(x)
becomes the proportion# How many countries have life expectancy# over 70 in 2007?gapminder %>% filter(year=="2007") %>% summarize(Over_70 = sum(lifeExp>70))
## # A tibble: 1 × 1## Over_70## <int>## 1 83
TRUE
)TRUE=1
and FALSE=0
sum(x)
becomes the number of TRUE
s in x
mean(x)
becomes the proportion# How many countries have life expectancy# over 70 in 2007?gapminder %>% filter(year=="2007") %>% summarize(Over_70 = sum(lifeExp>70))
## # A tibble: 1 × 1## Over_70## <int>## 1 83
# What *proportion* of countries have life# expectancy over 70 in 2007?gapminder %>% filter(year=="2007") %>% summarize(Over_70 = mean(lifeExp>70))
## # A tibble: 1 × 1## Over_70## <dbl>## 1 0.585
summarize()
multiple variables at once, separate by commas# get average life expectancy and GDP # call each avg_LE, avg_GDPsummarize(gapminder, avg_LE = mean(lifeExp), avg_GDP = mean(gdpPercap))
# using the pipegapminder %>% summarize(avg_LE = mean(lifeExp), avg_GDP = mean(gdpPercap))
summarize()
multiple variables at once, separate by commas# get average life expectancy and GDP # call each avg_LE, avg_GDPsummarize(gapminder, avg_LE = mean(lifeExp), avg_GDP = mean(gdpPercap))
# using the pipegapminder %>% summarize(avg_LE = mean(lifeExp), avg_GDP = mean(gdpPercap))
## # A tibble: 1 × 2## avg_LE avg_GDP## <dbl> <dbl>## 1 59.5 7215.
summarize()
multiple statistics of a variable at once, separate by commas# get count, mean, sd, min, max# of life Expectancy summarize(gapminder, obs = n(), avg_LE = mean(lifeExp), sd_LE = sd(lifeExp), min_LE = min(lifeExp), max_LE = max(lifeExp))
# using the pipegapminder %>% summarize(obs = n(), avg_LE = mean(lifeExp), sd_LE = sd(lifeExp), min_LE = min(lifeExp), max_LE = max(lifeExp))
summarize()
multiple statistics of a variable at once, separate by commas# get count, mean, sd, min, max# of life Expectancy summarize(gapminder, obs = n(), avg_LE = mean(lifeExp), sd_LE = sd(lifeExp), min_LE = min(lifeExp), max_LE = max(lifeExp))
# using the pipegapminder %>% summarize(obs = n(), avg_LE = mean(lifeExp), sd_LE = sd(lifeExp), min_LE = min(lifeExp), max_LE = max(lifeExp))
## # A tibble: 1 × 5## obs avg_LE sd_LE min_LE max_LE## <int> <dbl> <dbl> <dbl> <dbl>## 1 1704 59.5 12.9 23.6 82.6
summarize()
that work on a subset of variablessummarize_all()
: affects every variablesummarize_at()
: affects named or selected variablessummarize_if()
: affects variables that meet a criteriasummarize()
that work on a subset of variablessummarize_all()
: affects every variablesummarize_at()
: affects named or selected variablessummarize_if()
: affects variables that meet a criteria# get the average of all# numeric variables gapminder %>% summarize_if(is.numeric, funs(avg = mean))
## # A tibble: 1 × 4## year_avg lifeExp_avg pop_avg gdpPercap_avg## <dbl> <dbl> <dbl> <dbl>## 1 1980. 59.5 29601212. 7215.
summarize()
that work on a subset of variablessummarize_all()
: affects every variablesummarize_at()
: affects named or selected variablessummarize_if()
: affects variables that meet a criteria# get the average of all# numeric variables gapminder %>% summarize_if(is.numeric, funs(avg = mean))
## # A tibble: 1 × 4## year_avg lifeExp_avg pop_avg gdpPercap_avg## <dbl> <dbl> <dbl> <dbl>## 1 1980. 59.5 29601212. 7215.
# get mean and sd for# pop and lifeExpgapminder %>% summarize_at(vars(pop, lifeExp), funs("avg" = mean, "std dev" = sd))
## # A tibble: 1 × 4## pop_avg lifeExp_avg `pop_std dev` `lifeExp_std dev`## <dbl> <dbl> <dbl> <dbl>## 1 29601212. 59.5 106157897. 12.9
If we have factor
variables grouping a variable into categories, we can run dplyr
verbs by group
summarize()
First define the group with group_by()
If we have factor
variables grouping a variable into categories, we can run dplyr
verbs by group
summarize()
First define the group with group_by()
# get average life expectancy and gdp by continent gapminder %>% group_by(continent) %>% summarize(mean_life = mean(lifeExp), mean_GDP = mean(gdpPercap))
## # A tibble: 5 × 3## continent mean_life mean_GDP## <fct> <dbl> <dbl>## 1 Africa 48.9 2194.## 2 Americas 64.7 7136.## 3 Asia 60.1 7902.## 4 Europe 71.9 14469.## 5 Oceania 74.3 18622.
# track changes in average life expectancy and gdp over time gapminder %>% group_by(year) %>% summarize(mean_life = mean(lifeExp), mean_GDP = mean(gdpPercap))
## # A tibble: 12 × 3## year mean_life mean_GDP## <int> <dbl> <dbl>## 1 1952 49.1 3725.## 2 1957 51.5 4299.## 3 1962 53.6 4726.## 4 1967 55.7 5484.## 5 1972 57.6 6770.## 6 1977 59.6 7313.## 7 1982 61.5 7519.## 8 1987 63.2 7901.## 9 1992 64.2 8159.## 10 1997 65.0 9090.## 11 2002 65.7 9918.## 12 2007 67.0 11680.
# track changes in average life expectancy and gdp by continent over time gapminder %>% group_by(continent, year) %>% summarize(mean_life = mean(lifeExp), mean_GDP = mean(gdpPercap))
## # A tibble: 60 × 4## # Groups: continent [5]## continent year mean_life mean_GDP## <fct> <int> <dbl> <dbl>## 1 Africa 1952 39.1 1253.## 2 Africa 1957 41.3 1385.## 3 Africa 1962 43.3 1598.## 4 Africa 1967 45.3 2050.## 5 Africa 1972 47.5 2340.## 6 Africa 1977 49.6 2586.## 7 Africa 1982 51.6 2482.## 8 Africa 1987 53.3 2283.## 9 Africa 1992 53.6 2282.## 10 Africa 1997 53.6 2379.## # … with 50 more rows
tidyverse
uses same grammar and design philosophytidyverse
uses same grammar and design philosophygapminder %>%group_by(continent, year) %>%summarize(mean_life = mean(lifeExp), mean_GDP = mean(gdpPercap)) %>%# now pipe this tibble in as data for ggplot!ggplot(data = ., # . stands in for stuff ^! aes(x = year, y = mean_life, color = continent))+geom_path(size=1)+labs(x = "Year", y = "Average Life Expectancy (Years)", color = "Continent", title = "Average Life Expectancy Over Time")+theme_classic(base_family = "Fira Sans Condensed", base_size=20)
tidyverse
uses same grammar and design philosophygapminder %>%group_by(continent, year) %>%summarize(mean_life = mean(lifeExp), mean_GDP = mean(gdpPercap)) %>%# now pipe this tibble in as data for ggplot!ggplot(data = ., # . stands in for stuff ^! aes(x = year, y = mean_life, color = continent))+geom_path(size=1)+labs(x = "Year", y = "Average Life Expectancy (Years)", color = "Continent", title = "Average Life Expectancy Over Time")+theme_classic(base_family = "Fira Sans Condensed", base_size=20)
tally
provides counts, best used with group_by
for factors
tally
provides counts, best used with group_by
for factors
gapminder %>% tally
## # A tibble: 1 × 1## n## <int>## 1 1704
tally
provides counts, best used with group_by
for factors
gapminder %>% tally
## # A tibble: 1 × 1## n## <int>## 1 1704
gapminder %>% group_by(continent) %>% tally
## # A tibble: 5 × 2## continent n## <fct> <int>## 1 Africa 624## 2 Americas 300## 3 Asia 396## 4 Europe 360## 5 Oceania 24
slice()
subsets rows by position instead of filter
ing by valuesslice()
subsets rows by position instead of filter
ing by valuesgapminder %>% slice(15:17) # see 15th through 17th observations
## # A tibble: 3 × 6## country continent year lifeExp pop gdpPercap## <fct> <fct> <int> <dbl> <int> <dbl>## 1 Albania Europe 1962 64.8 1728137 2313.## 2 Albania Europe 1967 66.2 1984060 2760.## 3 Albania Europe 1972 67.7 2263554 3313.
pull()
extracts a column from a tibble
(just like $
)pull()
extracts a column from a tibble
(just like $
)# Get all U.S. life expectancy observationsgapminder %>% filter(country == "United States") %>% pull(lifeExp)
## [1] 68.440 69.490 70.210 70.760 71.340 73.380 74.650 75.020 76.090 76.810## [11] 77.310 78.242
pull()
extracts a column from a tibble
(just like $
)# Get all U.S. life expectancy observationsgapminder %>% filter(country == "United States") %>% pull(lifeExp)
## [1] 68.440 69.490 70.210 70.760 71.340 73.380 74.650 75.020 76.090 76.810## [11] 77.310 78.242
# Get U.S. life expectancy in 2007gapminder %>% filter(country == "United States" & year == 2007) %>% pull(lifeExp)
## [1] 78.242
distinct()
shows the distinct values of a specified variable (recall n_distinct()
inside summarize()
just gives you the number of values)gapminder %>% distinct(country)
## # A tibble: 142 × 1## country ## <fct> ## 1 Afghanistan## 2 Albania ## 3 Algeria ## 4 Angola ## 5 Argentina ## 6 Australia ## 7 Austria ## 8 Bahrain ## 9 Bangladesh ## 10 Belgium ## # … with 132 more rows
tidyr
helps reshape data into more usable format† This is the namesake of the tidyverse
: all associated packages and functions use or require this data format!
‡ Alternatively, sometimes rule 3 is "every value is its own cell."
"Happy families are all alike; every unhappy family is unhappy in its own way." - Leo Tolstoy
"Tidy datasets are all alike, but every messy dataset is messy in its own way." - Hadley Wickham
# make example untidy data ex_wide<-tribble( ~"Country", ~"2000", ~"2010", "United States", 140, 180, "Canada", 102, 98, "China", 111, 123)ex_wide
## # A tibble: 3 × 3## Country `2000` `2010`## <chr> <dbl> <dbl>## 1 United States 140 180## 2 Canada 102 98## 3 China 111 123
year
variable!# make example untidy data ex_wide<-tribble( ~"Country", ~"2000", ~"2010", "United States", 140, 180, "Canada", 102, 98, "China", 111, 123)ex_wide
## # A tibble: 3 × 3## Country `2000` `2010`## <chr> <dbl> <dbl>## 1 United States 140 180## 2 Canada 102 98## 3 China 111 123
gather()
these columns into a new pair of variables2000
and 2010
)key
: name of variable whose values form the column names (we'll call it the year
)value
: name of the variable whose values are spread over the cells (we'll call it number of cases
)gather()
a wide data frame into a long data frame ex_wide
## # A tibble: 3 × 3## Country `2000` `2010`## <chr> <dbl> <dbl>## 1 United States 140 180## 2 Canada 102 98## 3 China 111 123
ex_wide %>% gather("2000","2010", key = "year", value = "cases")
## # A tibble: 6 × 3## Country year cases## <chr> <chr> <dbl>## 1 United States 2000 140## 2 Canada 2000 102## 3 China 2000 111## 4 United States 2010 180## 5 Canada 2010 98## 6 China 2010 123
ex_long # example I made (code hidden)
## # A tibble: 12 × 4## Country Year Type Count## <chr> <dbl> <chr> <dbl>## 1 United States 2000 Cases 140## 2 United States 2000 Population 300## 3 United States 2010 Cases 180## 4 United States 2010 Population 310## 5 Canada 2000 Cases 102## 6 Canada 2000 Population 110## 7 Canada 2010 Cases 98## 8 Canada 2010 Population 121## 9 China 2000 Cases 111## 10 China 2000 Population 1201## 11 China 2010 Cases 123## 12 China 2010 Population 1241
Cases
and one for Population
(categorized by type
of variable)ex_long # example I made (code hidden)
## # A tibble: 12 × 4## Country Year Type Count## <chr> <dbl> <chr> <dbl>## 1 United States 2000 Cases 140## 2 United States 2000 Population 300## 3 United States 2010 Cases 180## 4 United States 2010 Population 310## 5 Canada 2000 Cases 102## 6 Canada 2000 Population 110## 7 Canada 2010 Cases 98## 8 Canada 2010 Population 121## 9 China 2000 Cases 111## 10 China 2000 Population 1201## 11 China 2010 Cases 123## 12 China 2010 Population 1241
spread()
these columns into a new pair of variableskey
: column that contains variable names (here, the type
)value
: column that contains values from multiple variables (here, the count
)spread()
a long data frame into a wide data frame ex_long
## # A tibble: 12 × 4## Country Year Type Count## <chr> <dbl> <chr> <dbl>## 1 United States 2000 Cases 140## 2 United States 2000 Population 300## 3 United States 2010 Cases 180## 4 United States 2010 Population 310## 5 Canada 2000 Cases 102## 6 Canada 2000 Population 110## 7 Canada 2010 Cases 98## 8 Canada 2010 Population 121## 9 China 2000 Cases 111## 10 China 2000 Population 1201## 11 China 2010 Cases 123## 12 China 2010 Population 1241
ex_long %>% spread(key = "Type", value = "Count")
## # A tibble: 6 × 4## Country Year Cases Population## <chr> <dbl> <dbl> <dbl>## 1 Canada 2000 102 110## 2 Canada 2010 98 121## 3 China 2000 111 1201## 4 China 2010 123 1241## 5 United States 2000 140 300## 6 United States 2010 180 310
Often, data doesn't come from just one source, but several sources
We can combine datasets into a single dataframe (tibble) using dplyr
commands in several ways:
bind
dataframes together by row or by columnbind_rows()
adds observations (rows) to existing dataset1bind_cols()
adds variables (columns) to existing dataset2join
two dataframes by designating variable(s) as key
to match rows by identical values of that key
† Note the columns must be identical between the original dataset and the new observations
‡ Note the rows must be identical between original dataset and new variable
Sometimes you want to add rows (observations) or columns (variables) that happen to match up perfectly
In this case, simply using bind_*(old_df, new_df)
will work
bind_columns(old_df, new_df)
adds columns from new_df
to old_df
bind_rows(old_df, new_df)
adds rows from new_df
to old_df
bind_columns()
(Variables)
bind_rows()
(Observations)
x
and y
*x$x
and y$y
1
and 2
x
has observation 3
which y
does not havey
has observation 4
which x
does not havex
and y
into a single dataframe* Images on all following slides come from Garrick Aden-Buie's excellent tidyexplain
x
and y
for which there are matching rowsx
with no match in y
(3) will be droppedy
with no match in x
(4) will be droppedx
and add all columns from y
x
with no match in y
(3) will have NA
sy
with no match in x
(4) will be droppedy
and add all columns from x
y
with no match in x
(4) will have NA
sx
with no match in y
(3) will be droppedx
and y
NA
stibble
readr
and importing data
dplyr
and data wrangling
tidyr
and tidying or reshaping data