R is an interactive scripting language. R scripts have the file extension .R or .r, and scripts can be run “start-to-finish” from the command line, e.g. Rscript helloworld.R, similar to Python. It is more common, however, to use R interactively, generally with the help of RStudio or a similar IDE. Run a single line of R code using Ctrl+Enter.
5 + (5 / 2)
## [1] 7.5
Assign a vector of numbers to a variable called v. The assignment operator <- can be inserted with Alt+-.
v <- c(1, 2, 3)
We can now access this vector by referring to v in our script. For example, we can access specific contents of the vector:
v[3] + 2
## [1] 5
v[-2]
## [1] 1 3
v[c(1, 1, 3)] # "multiple indexing" using another vector
## [1] 1 1 3
Notice that the contents of v are unchanged. Assign new contents to v like so:
# change an item
v[2] <- -1
# append new item
(v <- c(v, 4)) # print the results of an assignment using ()
## [1] 1 -1 3 4
Warning
Appending to a vector in this way is very inefficient in R compared to other languages: each append copies the entire vector. Avoid building up large vectors element by element.
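If you do need to build a vector in a loop, a common workaround (a sketch, not the only option) is to preallocate and fill by index, or better, use a vectorized expression:

```r
# Growing a vector copies it on every append -- O(n^2) work overall
slow <- c()
for (i in 1:5) slow <- c(slow, i^2)

# Preallocating avoids the repeated copies
fast <- numeric(5)
for (i in 1:5) fast[i] <- i^2

# Best of all: a fully vectorized expression
fast2 <- (1:5)^2
```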
An important defining feature of R is that many operations are automatically vectorized.
v <- 1:5
v + 1 # 1 is "recycled" and applied to each element
## [1] 2 3 4 5 6
v + 1:length(v)
## [1] 2 4 6 8 10
v + c(1, 2) # a 2-vector cannot be applied to a 5-vector with recycling
## Warning in v + c(1, 2): longer object length is not a multiple of shorter object
## length
## [1] 2 4 4 6 6
v[-5] + c(1, 2) # but it can be applied to a 4-vector
## [1] 2 4 4 6
sin(v)
## [1] 0.8414710 0.9092974 0.1411200 -0.7568025 -0.9589243
Warning
A common source of errors is using vectorization unintentionally, especially with recycled elements. Try to become aware of which R functions use vectorization.
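One way this bites in practice: comparisons recycle too, and when the shorter length divides the longer one evenly there is no warning at all. A small illustration:

```r
x <- 1:6
x == c(1, 2)  # silently compares x against 1, 2, 1, 2, 1, 2
## [1]  TRUE  TRUE FALSE FALSE FALSE FALSE
```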
The major type distinctions in R are character, numeric, logical, and NULL. Some comments:
- Character literals are written "..." or '...'
- Integer literals carry an L suffix, e.g. 3L
- The logical values are TRUE, FALSE and NA
- The NULL object is the unique “nothing” object, of length zero

Vectors can only contain a single type. Vector types are promoted eagerly.
v[2] <- 3.5
v[3] <- "foo"
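To see this eager promotion in action (a quick sketch):

```r
v <- 1:5
class(v)      # "integer"
v[2] <- 3.5
class(v)      # promoted to "numeric"
v[3] <- "foo"
class(v)      # promoted to "character" -- every element is now a string
v
## [1] "1"   "3.5" "foo" "4"   "5"
```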
Lists are like vectors, but can contain multiple types.
v <- 1:5
l <- as.list(v)
l[[2]] <- 3L
l[3:4] <- list(FALSE, list("bar", "bat"))
l
## [[1]]
## [1] 1
##
## [[2]]
## [1] 3
##
## [[3]]
## [1] FALSE
##
## [[4]]
## [[4]][[1]]
## [1] "bar"
##
## [[4]][[2]]
## [1] "bat"
##
##
## [[5]]
## [1] 5
Both vectors and lists can be named:
names(v) <- letters[1:5]; v
## a b c d e
## 1 2 3 4 5
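Names can then be used for indexing, just like numeric positions:

```r
v <- c(a = 1, b = 2, c = 3, d = 4, e = 5)
v["b"]            # single element by name
## b
## 2
v[c("a", "c")]    # multiple elements by name
## a c
## 1 3
```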
A dataframe is like an Excel sheet in code: a 2-dimensional array with named columns.
data("mtcars") # simultaneously load mtcars data and assign to variable with same name
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
We will work with tabular data extensively in a moment. For now, here are some basic operations.
mtcars[,3] # contents of third column
## [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
## [13] 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1 318.0 304.0 350.0
## [25] 400.0 79.0 120.3 95.1 351.0 145.0 301.0 121.0
mtcars[c(1, 3, 5),] # subset rows
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.62 16.46 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
Info
Most R objects implement their own summary method, including objects provided by other packages. The output depends on the nature of the object.
Packages are an essential part of the R workflow. They are what makes the open-source, community-driven R ecosystem so diverse and up-to-date with the latest research. Install a new package with install.packages("<package name>"), and load a package with library(<package name>).
If you want to learn more about R in general, here are some resources. However, we will cover many of these basics in the following hands-on tutorial.
tidyverse packages

From the website:
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
The “core” tidyverse packages are the ones automatically installed/loaded:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.1.4 ✔ stringr 1.4.0
## ✔ readr 2.1.1 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
They form a complete workflow for writing most R programs:
- readr for reading/writing tabular data
- tidyr for cleaning and organizing data into a “tidy” format
- dplyr and purrr for manipulating and summarizing
- forcats and stringr for special operations on factors and strings, respectively

In addition to these core packages, other more specific packages are considered part of the “extended” tidyverse. These packages need to be installed/loaded independently.
library(tidygraph)
##
## Attaching package: 'tidygraph'
## The following object is masked from 'package:stats':
##
## filter
library(ggraph)
Cheatsheets are a great way to learn a package quickly. I also like to use them to learn a package while working: I’ll pull up the cheatsheet for the package I want to learn, and try to find ways to use different functions from it for whatever I’m doing.
R for Data Science (R4DS) can be good for all levels, and centers around the tidyverse. It starts very gently and assumes almost no R knowledge. There are also exercises and solutions available.
Advanced R is a more general programming guide, but draws heavily from tidyverse packages.
tibble and tidyr packages

The most fundamental data structure for the tidyverse is a modernization of the dataframe. The most important differences are clearer printing and more flexible column names.
starwars
## # A tibble: 87 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs D… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
Compared to a data.frame, a tibble is printed in a truncated format by default. Notice also that the dimensions of the data, the inferred type of each column, and the dimensions of each list-column (the starships column, for example) are printed.
Creating tibbles from scratch can be done similarly to data.frame:
tibble(x = 1:5, y = LETTERS[1:5], z = x^2)
## # A tibble: 5 × 3
## x y z
## <int> <chr> <dbl>
## 1 1 A 1
## 2 2 B 4
## 3 3 C 9
## 4 4 D 16
## 5 5 E 25
or with tribble:
tribble(
~x, ~y,
1, 'A',
2, 'B',
3, 'C'
)
## # A tibble: 3 × 2
## x y
## <dbl> <chr>
## 1 1 A
## 2 2 B
## 3 3 C
Info
The readr functions such as read_csv always return a tibble.
Creating tibbles from dataframes, matrices, or named vectors:
m <- matrix(1:10, ncol=2)
as_tibble(m)
## # A tibble: 5 × 2
## V1 V2
## <int> <int>
## 1 1 6
## 2 2 7
## 3 3 8
## 4 4 9
## 5 5 10
x <- c(a = 1, b = 2, c = 3)
enframe(x)
## # A tibble: 3 × 2
## name value
## <chr> <dbl>
## 1 a 1
## 2 b 2
## 3 c 3
You can largely use tibbles and dataframes interchangeably (functions that say they work with one will work with the other), so I’ll just call them dataframes from now on. See the vignette for more features and differences from regular dataframes.
vignette('tibble')
R for Data Science gives three rules for tidy data:
- Each variable has its own column
- Each observation has its own row
- Each value must have its own cell
The tidyr package contains functions for making data tidy. For example, suppose you are given some data from a field survey.
survey <- tibble(
present_yes = sample(c('yes', NA), 15, replace = TRUE),
present_no = ifelse(is.na(present_yes), 'no', NA),
visit_id = 1:15
)
survey
## # A tibble: 15 × 3
## present_yes present_no visit_id
## <chr> <chr> <int>
## 1 yes <NA> 1
## 2 <NA> no 2
## 3 <NA> no 3
## 4 yes <NA> 4
## 5 yes <NA> 5
## 6 yes <NA> 6
## 7 yes <NA> 7
## 8 yes <NA> 8
## 9 yes <NA> 9
## 10 yes <NA> 10
## 11 yes <NA> 11
## 12 yes <NA> 12
## 13 <NA> no 13
## 14 <NA> no 14
## 15 <NA> no 15
Someone has spread a single variable over two columns, causing lots of blank spaces in the data. This may have been more convenient in the field, but programming with the data as-is will be a bit difficult. Thankfully, conversion to tidy format is easy.
unite(survey, 'present', present_yes, present_no, na.rm = TRUE)
## # A tibble: 15 × 2
## present visit_id
## <chr> <int>
## 1 yes 1
## 2 no 2
## 3 no 3
## 4 yes 4
## 5 yes 5
## 6 yes 6
## 7 yes 7
## 8 yes 8
## 9 yes 9
## 10 yes 10
## 11 yes 11
## 12 yes 12
## 13 no 13
## 14 no 14
## 15 no 15
Here, we have used unite, which pastes two or more columns of strings into a single column while removing the NAs.
Another nice function, drop_na, drops rows which contain an NA, possibly only in specific columns.
nrow(starwars)
## [1] 87
nrow(drop_na(starwars)) # drop all rows with an NA
## [1] 6
nrow(drop_na(starwars, birth_year)) # only rows with NA in "birth_year"
## [1] 43
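The inverse of unite is separate, which splits one column into several. A quick sketch with made-up data:

```r
df <- tibble(visit = c('site1-2020', 'site2-2021'))
separate(df, visit, into = c('site', 'year'), sep = '-')
## # A tibble: 2 × 2
##   site  year
##   <chr> <chr>
## 1 site1 2020
## 2 site2 2021
```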
dplyr

The dplyr package is great for performing seemingly endless operations on dataframes. Just like in base R, we can subset a dataframe by rows or columns:
# select specific cols
select(starwars, height, 6)
select(starwars, name:skin_color)
# select last 5 rows
slice_tail(starwars, n = 5)
# all cols with an underscore
select(starwars, contains('_'))
More complex slicing and subsetting is possible:
# select cols conditionally
select(starwars, where(is.double))
select(starwars, !last_col() & ends_with('s'))
# slice rows conditionally (filtering)
filter(starwars, eye_color == 'blue')
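filter accepts multiple conditions (combined with AND) as well as any vectorized logical expression:

```r
filter(starwars, eye_color == 'blue', height > 180)  # both conditions must hold
filter(starwars, species %in% c('Droid', 'Human'))   # set membership
filter(starwars, is.na(birth_year) | mass > 100)     # OR conditions with |
```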
Notice that most tidyverse functions are verbs or adverbs (coding is, after all, just giving instructions). There is a special operator, the pipe operator, which allows these commands to be chained together!
As an example, let’s create a random “population” of 1,000 of the starwars characters:
starwars |> # start of chain is the data
filter(eye_color == 'blue') |> # each step takes the previous result
select(height:species) |> # and does a new thing to it
slice_sample(n = 1000, replace = TRUE)
## # A tibble: 1,000 × 10
## height mass hair_color skin_color eye_color birth_year sex gender
## <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 183 NA brown fair blue NA <NA> <NA>
## 2 166 50 black yellow blue 40 female feminine
## 3 234 136 brown brown blue NA male masculine
## 4 175 79 none light blue 37 male masculine
## 5 183 NA brown fair blue 82 male masculine
## 6 183 NA brown fair blue 82 male masculine
## 7 180 NA auburn, grey fair blue 64 male masculine
## 8 165 75 brown light blue 47 female feminine
## 9 183 NA brown fair blue 82 male masculine
## 10 178 120 brown, grey light blue 52 male masculine
## # … with 990 more rows, and 2 more variables: homeworld <chr>, species <chr>
Info
x |> f() |> g() is the same as g(f(x)); however, when you string a lot of functions together, piping makes things more readable.
Another family of dplyr functions uses existing columns to make new ones. The most common is mutate, which appends a new column computed by a vectorized function of existing columns. transmute is like mutate, but only the new columns are kept.
starwars |>
transmute(mass_per_cm = mass / height) # divide mass column by height column
## # A tibble: 87 × 1
## mass_per_cm
## <dbl>
## 1 0.448
## 2 0.449
## 3 0.333
## 4 0.673
## 5 0.327
## 6 0.674
## 7 0.455
## 8 0.330
## 9 0.459
## 10 0.423
## # … with 77 more rows
This function, like many others, is often useful to apply separately to different groups in the data. Suppose we want to assign each character a unique id based on their homeworld:
starwars |>
group_by(homeworld) |>
mutate(world_id = str_c(homeworld, '_', 1:n())) |> # n() is short for group size
pull(world_id) # convert column "world_id" to a vector
## [1] "Tatooine_1" "Tatooine_2" "Naboo_1" "Tatooine_3"
## [5] "Alderaan_1" "Tatooine_4" "Tatooine_5" "Tatooine_6"
## [9] "Tatooine_7" "Stewjon_1" "Tatooine_8" "Eriadu_1"
## [13] "Kashyyyk_1" "Corellia_1" "Rodia_1" "Nal Hutta_1"
## [17] "Corellia_2" "Bestine IV_1" NA "Naboo_2"
## [21] "Kamino_1" NA "Trandosha_1" "Socorro_1"
## [25] "Bespin_1" "Mon Cala_1" "Chandrila_1" NA
## [29] "Endor_1" "Sullust_1" NA "Cato Neimoidia_1"
## [33] "Coruscant_1" "Naboo_3" "Naboo_4" "Naboo_5"
## [37] "Naboo_6" "Toydaria_1" "Malastare_1" "Naboo_7"
## [41] "Tatooine_9" "Dathomir_1" "Ryloth_1" "Ryloth_2"
## [45] "Vulpter_1" "Troiken_1" "Tund_1" "Haruun Kal_1"
## [49] "Cerea_1" "Glee Anselm_1" "Iridonia_1" "Coruscant_2"
## [53] "Iktotch_1" "Quermia_1" "Dorin_1" "Champala_1"
## [57] "Naboo_8" "Naboo_9" "Tatooine_10" "Geonosis_1"
## [61] "Mirial_1" "Mirial_2" "Naboo_10" "Serenno_1"
## [65] "Alderaan_2" "Concord Dawn_1" "Zolan_1" "Ojom_1"
## [69] "Kamino_2" "Kamino_3" "Coruscant_3" "Aleen Minor_1"
## [73] NA "Skako_1" "Muunilinst_1" "Shili_1"
## [77] "Kalee_1" "Kashyyyk_2" "Alderaan_3" "Umbara_1"
## [81] "Utapau_1" NA NA NA
## [85] NA NA "Naboo_11"
Warning
Here, the behavior of str_c shows us that mutate and many other dplyr verbs operate on whole vectors (columns), not individual cells.
Grouping is frequently used with summarize. Summarizing is a form of reduction; it takes a (grouped) dataframe and returns a new dataframe with one row per group:
starwars |>
group_by(gender) |>
drop_na(height) |> # remove NAs from height for mean and sd
summarize(m_height = mean(height), sd_height = sd(height))
## # A tibble: 3 × 3
## gender m_height sd_height
## <chr> <dbl> <dbl>
## 1 feminine 165. 23.6
## 2 masculine 177. 37.6
## 3 <NA> 181. 2.89
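To summarize several columns at once without repeating yourself, dplyr provides across (shown here with an anonymous function; the choice of columns is just for illustration):

```r
starwars |>
  group_by(gender) |>
  summarize(across(c(height, mass), \(x) mean(x, na.rm = TRUE)))
```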
Grouping is not just for summarizing. We have seen it can be useful with mutate, and it is useful with slice_* and filter as well. Let’s take a random sample of five characters from each homeworld, first removing rows with fewer than 5 characters sharing their homeworld:
starwars |>
group_by(homeworld) |>
mutate(count = n()) |>
filter(count >= 5) |>
slice_sample(n = 5)
## # A tibble: 15 × 15
## # Groups: homeworld [3]
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Jar Jar… 196 66 none orange orange 52 male mascu…
## 2 Dormé 165 NA brown light brown NA fema… femin…
## 3 Ric Olié 183 NA brown fair blue NA <NA> <NA>
## 4 Cordé 157 NA brown light brown NA fema… femin…
## 5 Gregar … 185 85 black dark brown NA male mascu…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 Cliegg … 183 NA brown fair blue 82 male mascu…
## 9 Darth V… 202 136 none white yellow 41.9 male mascu…
## 10 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 11 Captain… NA NA unknown unknown unknown NA <NA> <NA>
## 12 Finn NA NA black dark dark NA male mascu…
## 13 IG-88 200 140 none metal red 15 none mascu…
## 14 Rey NA NA brown light hazel NA fema… femin…
## 15 Yoda 66 17 white green brown 896 male mascu…
## # … with 6 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>, count <int>
Finally, the count function handles the common summarizing task of counting groups:
count(starwars, homeworld, sort = TRUE)
## # A tibble: 49 × 2
## homeworld n
## <chr> <int>
## 1 Naboo 11
## 2 Tatooine 10
## 3 <NA> 10
## 4 Alderaan 3
## 5 Coruscant 3
## 6 Kamino 3
## 7 Corellia 2
## 8 Kashyyyk 2
## 9 Mirial 2
## 10 Ryloth 2
## # … with 39 more rows
Info
Count group interactions by providing more than one column name, and use add_count to add the counts to the existing data.
midwest is a dataset of county-level demographic information in the Midwestern US. Consider any/all of the following tasks:
- Explore the midwest data and its documentation (?midwest).
- Use the slice_* family (type ?slice into your console) to find the 50 counties with the largest popdensity. Now, how many times is each state represented in this top sample?
- For metro counties (inmetro == 1), compute the mean percentage of people below the poverty line for each state. Also compute the variance.
dim(midwest)
## [1] 437 28
midwest |>
slice_max(popdensity, n=50) |>
count(state)
## # A tibble: 5 × 2
## state n
## <chr> <int>
## 1 IL 10
## 2 IN 9
## 3 MI 10
## 4 OH 16
## 5 WI 5
midwest |>
filter(inmetro == 1) |>
group_by(state) |>
summarize(mpov = mean(percbelowpoverty), vpov = var(percbelowpoverty))
## # A tibble: 5 × 3
## state mpov vpov
## <chr> <dbl> <dbl>
## 1 IL 9.56 15.4
## 2 IN 9.59 16.0
## 3 MI 11.3 18.5
## 4 OH 11.5 15.1
## 5 WI 9.01 16.5
The R4DS book has some exercises for cleaning and manipulating data in Chapters 5, 10-13, and 16.
There are several vignettes that cover advanced dplyr concepts, which I’ve found useful. Check out vignette('colwise') for applying dplyr functions to many columns at once, vignette('two-table') for various ways to combine dataframes, and vignette('programming') to learn more about working with tidy evaluation, the magic system underlying many tidyverse functions. Or, find all vignettes here.
One of our definitions of tidy data is “one observation per row.” What exactly is an observation? It depends!
This is why tidy data frequently contains columns with many “repeated” entries; we want each unit of analysis to be in its own row, even if it means other variables are repeated when they are shared across units.
sw_film <- starwars |>
unnest(c(films)) |> # rows corresponding to the nested entry are repeated
select(name, films, everything()) # just move films to front to emphasize we are interested in it
sw_film |>
group_by(films) |>
summarise(m_age = mean(birth_year, na.rm=TRUE)) |>
arrange(desc(m_age))
## # A tibble: 7 × 2
## films m_age
## <chr> <dbl>
## 1 The Phantom Menace 140.
## 2 Return of the Jedi 133.
## 3 The Empire Strikes Back 105.
## 4 Revenge of the Sith 103.
## 5 Attack of the Clones 98.7
## 6 A New Hope 90.9
## 7 The Force Awakens 56.8
The tidy version was easier in this case because, in the original format, our variable of interest was not easily accessible. If we wanted to make a running tally of birth_year by film, we would need a double for-loop to check which films each character was in.
When would we deliberately nest for tidy data? One example is when the observable of interest depends on a subset of the data:
get_slope <- function(df) lm(height ~ mass, df)$coefficients[2]
starwars |>
select(sex, height, mass) |>
drop_na(height, mass) |> # remove characters with unknown height/weight
nest(data = c(height, mass)) |>
rowwise() |>
mutate(slope = get_slope(data))
## # A tibble: 5 × 3
## # Rowwise:
## sex data slope
## <chr> <list> <dbl>
## 1 male <tibble [44 × 2]> 1.03
## 2 none <tibble [4 × 2]> 0.978
## 3 female <tibble [9 × 2]> 0.0578
## 4 hermaphroditic <tibble [1 × 2]> NA
## 5 <NA> <tibble [1 × 2]> NA
As we will see soon, plotting is another case where the ambiguity of “variable” and “observation” matters. For this and other cases, there are ways to pivot the data, changing the context of variables and observations.
Similar to nesting, think of pivot_longer as lengthening the dataframe. Rather than every variable getting its own column of values, we make one column of variable names and one of values:
x <- tribble(
~A, ~B, ~C,
1, 2, 3,
4, 5, 6
)
x
## # A tibble: 2 × 3
## A B C
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 6
pivot_longer(x, everything())
## # A tibble: 6 × 2
## name value
## <chr> <dbl>
## 1 A 1
## 2 B 2
## 3 C 3
## 4 A 4
## 5 B 5
## 6 C 6
An example with data, counting how many times each color appears anywhere:
starwars |>
pivot_longer(
hair_color:eye_color,
names_to='attribute',
values_to='color'
) |>
count(color)
## # A tibble: 44 × 2
## color n
## <chr> <int>
## 1 auburn 1
## 2 auburn, grey 1
## 3 auburn, white 1
## 4 black 23
## 5 blond 3
## 6 blonde 1
## 7 blue 21
## 8 blue-gray 1
## 9 blue, grey 2
## 10 brown 43
## # … with 34 more rows
Here, since we want to count colors across several columns, we pivot these columns into a single column and then count it.
who contains data on country-wide tuberculosis numbers by year. Columns starting with “new” are counts from that year for various groups. Using pivot_longer, count the number of times each country has an NA in one of these columns. Which countries had the most and least missing data? Why was pivoting helpful in this example?
us_rent_income contains data on each state’s median monthly rent and income for 2017. Use pivot_wider to spread these values into two columns, income and rent. Using these two columns, find the correlation between rent and income.
who |>
pivot_longer(contains("new")) |>
group_by(country) |>
summarize(na_ct = sum(is.na(value))) |>
filter(na_ct %in% c(min(na_ct), max(na_ct)))
## # A tibble: 3 × 2
## country na_ct
## <chr> <int>
## 1 Bonaire, Saint Eustatius and Saba 84
## 2 Curacao 84
## 3 Monaco 1889
us_rent_income |>
pivot_wider(GEOID:NAME, names_from = variable, values_from = estimate) |>
summarize(corr = cor(income, rent, use = "complete.obs"))
## # A tibble: 1 × 1
## corr
## <dbl>
## 1 0.741
ggplot2

Some of you may have used ggplot2 before. The package has its own philosophy (the “grammar of graphics”) that builds on those of the other tidyverse packages and tidy data. Let’s go over a few of its ideas to help make sense of how ggplot2 works:
Plots are made of layers. For ggplot, visual information is added to plots one “group” at a time, using geometries. For example, use geom_point to add points and geom_histogram to add histogram bins. Multiple types of geometries can be layered onto the same plot.
Plots are an interpretation of relationships in the data. These relationships can be specified with aesthetic mappings. For example, if the points you are plotting come from two species, you can color points according to their species with the mapping species -> color.
A plot is an object that we can perform operations on. This means we can always continue to modify a plot by performing new steps, much like how the |> operator lets data be modified in single steps. For example, scale_x_sqrt adds the instruction to take the square root of the variable on the x-axis before plotting. This also means that ggplot2 objects can be stored in a variable in R to be used later.
Here’s a simple example to see each of these ideas at work:
gg <- ggplot(data = slice_sample(diamonds, n=1000), mapping = aes(x = carat, y = price, col = cut)) +
geom_point(shape = 1) # shape is a "static" argument here
gg +
scale_x_log10() +
scale_y_log10() +
labs(x = 'Carat', y = 'Price', col = 'Cut quality') +
theme_bw()
Here we’ve plotted price by carat and colored the points according to cut. We then added some visual modifications on top of the stored plot gg.
Breaking down each of the above instructions, a basic anatomy of most ggplots is found:
- ggplot is the “master” function that sets a “blank slate.” It is also where the data and aesthetic mapping common to all geoms are specified.
ggplot() # totally blank slate
ggplot(diamonds, aes(x = carat, y = price)) # data and the coordinate system
- geom_* functions add on layers of data. Notice that these functions also have a data and mapping argument. These can be used to override the arguments from ggplot for that geom only.
- Last come any modifications of the existing coordinates or geoms. This includes scale_*, renaming parts of the plot, and changing the theme.
There are many geometries available, and the best way to learn them is to check out the cheatsheet and try them out on your own data. Rather than cover them here, we’ll now focus on some common tasks and challenges.
Separating data into a group of several plots is done with facets:
gg_hist <- ggplot(diamonds, aes(x = carat, fill = cut)) +
geom_histogram() +
facet_grid(cut ~ ., scales = 'free')
An example of a barplot (histogram for discrete variable) with 2d facetting:
ggplot(diamonds, aes(x = clarity)) +
geom_bar(fill = 'lightblue') +
facet_grid(cut ~ color, scales = 'free')
There are tons of different approaches to customizing color with ggplot, and it’s easy to get overwhelmed! Here we’ll focus on the essentials and leave you to research different color palettes or make your own.
Basically, color is changed by scaling a color aesthetic (col or fill). Going back to our histogram example, here we use one of the color palettes provided by RColorBrewer for discrete variables:
library(RColorBrewer)
my_brew_pal <- brewer.pal(8, 'Set2')
gg_hist +
scale_fill_discrete(type = my_brew_pal)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As we see above, a color palette can be stored and used in multiple plots. For presentations or papers, it can be nice to have a set of specific colors fixed for a variable of interest. An easy way to do this in ggplot is with scale_*_manual. Say we want all plots using a mapping with cut to have consistent colors:
my_pal <- c('blue2', 'orangered', 'lightpink', 'springgreen', 'gold')
names(my_pal) <- levels(diamonds$cut)
gg_hist +
scale_fill_manual(values = my_pal)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(diamonds, aes(price, carat, col = cut)) +
geom_point(alpha = 0.1, data = slice_sample(diamonds, prop = 0.1)) +
geom_smooth() +
scale_color_manual(values = my_pal) +
ylim(0, 3)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 32 rows containing non-finite values (stat_smooth).
## Warning: Removed 4 rows containing missing values (geom_point).
In general, use scale_*_discrete or scale_*_continuous if you want to provide existing palettes or do other scaling functions (we’ll see another use of scale in a moment), and scale_*_manual for making a discrete palette by hand.
Aesthetics other than color can be scaled in much the same way. A common example is the shape aesthetic.
ggplot(starwars, aes(eye_color, height, shape = gender)) + # shape must map a discrete variable
geom_point(size = 2) +
scale_shape_manual(values = c(19, 1), na.value = 13) +
scale_x_discrete(guide = guide_axis(angle = 45))
## Warning: Removed 6 rows containing missing values (geom_point).
diamonds |>
mutate(car_bin = ifelse(carat < mean(carat), 'low_car', 'high_car')) |>
group_by(car_bin, color) |>
summarize(mpr = mean(price)) |>
ggplot(aes(color, mpr, shape = car_bin, group = car_bin)) +
geom_point() +
geom_line()
## `summarise()` has grouped output by 'car_bin'. You can override using the
## `.groups` argument.
The R Graphics Cookbook has tons of examples for all things ggplot.
Chapter 3 in R4DS is a good review of ggplot concepts and basics, with exercises.
tidygraph + ggraph

tidygraph is a package for constructing and analyzing graphs with a workflow similar to tidyr/dplyr, and ggraph is a package similar to ggplot2 for visualization.
In tidygraph, a graph is represented with two tibbles: one of nodes (vertices) and one of edges (links). These tibbles are bundled together as one object called a tbl_graph.
(graph <- create_ring(10))
## # A tbl_graph: 10 nodes and 10 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 10 × 0 (active)
## # … with 4 more rows
## #
## # Edge Data: 10 × 2
## from to
## <int> <int>
## 1 1 2
## 2 2 3
## 3 3 4
## # … with 7 more rows
In ggraph, a graph is plotted in “layers” just like ggplot. There are specific geoms for nodes and edges:
gg_net <- ggraph(graph) +
geom_node_point(color = 'blue', size = 5)
## Using `stress` as default layout
gg_net # just the nodes
gg_net +
geom_edge_link() # now add the edges on top
dplyr verbs on graphs

Since a tbl_graph is made of tibbles, we can use all the familiar functions on them:
graph <- graph |>
mutate(name = fruit[1:n()])
Here we have added a new attribute to the nodes by adding a new column to the nodes tibble. Also, notice how the nodes are labelled active. This is how we specify whether we want to manipulate the node or edge tibble. Switch between the two using activate.
graph <- graph |>
activate(edges) |>
mutate(weight = runif(n(), 1, 5))
Info
%N>% can be used as shorthand for |> activate(nodes) |>, and %E>% for activating edges.
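For example, the weight assignment above could be written with these shorthands (an equivalent sketch; the degree column is added just for illustration):

```r
graph <- graph %E>%
  mutate(weight = runif(n(), 1, 5)) %N>%   # activate edges, then back to nodes
  mutate(degree = centrality_degree())
```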
Attributes can be used in aesthetic mappings when plotting. As with all geoms, check which aesthetics can be controlled with a mapping with ?geom_node_* etc.
ggraph(graph) +
geom_edge_link(mapping = aes(edge_color = weight), edge_width = 2) +
geom_node_label(mapping = aes(label = name))
## Using `stress` as default layout
In addition to the usual tidyverse functions, there are many new verbs specific to network science, such as for finding the shortest distance between two nodes or calculating a node’s degree. These can be useful, for example, for richer visualization.
In the following example, we generate a network using the famous Erdős–Rényi model, then compute two measures of node centrality, or how “connected” a node is. The degree is the number of neighbors and the page rank is from the formula for ordering web results in Google’s early days.
graph <- play_erdos_renyi(300, p = 0.03) |>
mutate(
degree = centrality_degree(),
pagerank = centrality_pagerank()
)
graph |>
as_tibble() |>
pivot_longer(c(degree, pagerank)) |>
ggplot(aes(x = value, fill = name)) +
geom_density() +
facet_wrap(~name, scales = 'free', nrow = 2)
ggraph(graph) +
geom_edge_link0(alpha = 0.05) + # use edge0 family for faster plotting
geom_node_point(aes(col = degree, size = degree))
## Using `stress` as default layout
One important type of network operation is clustering, where group-level patterns are extracted from the data. The basic premise behind clustering is that nodes within a cluster tend to be more densely connected to each other than to nodes in other clusters. Many different clustering algorithms have been proposed, and tidygraph supports the most popular ones with the group_* functions.
play_islands(n_islands = 5, size_islands = 10, p_within = 0.8, m_between = 3) |>
  mutate(community = group_fast_greedy()) |> # classic method of maximizing modularity
  ggraph() +
  geom_edge_link(alpha = 0.3) +
  geom_node_point(aes(color = as.factor(community)), size = 7) +
  theme_graph() # ggraph's provided theme: lots of customization is available!
## Using `stress` as default layout
To show more tidygraph functionality as part of a more realistic workflow, we will implement a simple stochastic epidemiological model. Our infection process will spread over a graph based on two rules:
- a susceptible node catches the infection from each infectious neighbor independently with probability beta
- an infectious node recovers (becoming susceptible again) with probability alpha
In epidemiological modeling, this model is known as the SIS model because a node’s state can only go from susceptible to infectious to susceptible again.
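Under these rules, a susceptible node with k infectious neighbors escapes infection only if every neighbor independently fails to transmit. A quick sketch, using the same beta and alpha values as the implementation below:

```r
beta <- 0.3   # per-neighbor transmission probability
alpha <- 0.1  # recovery probability

# probability a susceptible node is infected, given k infectious neighbors:
# one minus the chance that all k independent transmissions fail
p_infect <- function(k) 1 - (1 - beta)^k

p_infect(1) # 0.3
p_infect(2) # 0.51
```

Note that the infection probability saturates toward 1 as a node gains more infectious neighbors.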
The workhorse behind our implementation will be tidygraph::map_local, which extracts the neighborhood of each node as a list of subgraphs, then maps a function .f over each of these graphs and returns the result. From the function’s documentation:
The function provided to .f will be called with the following arguments in addition to those supplied through ...:
- neighborhood: The neighborhood graph of the node
- graph: The full tbl_graph object
- node: The index of the node currently mapped over
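As a quick illustration of this calling convention (a toy sketch of my own, separate from the epidemic model), we can count each node’s neighbors by mapping over the extracted subgraphs:

```r
library(tidygraph)

create_ring(5) |>
  mutate(nbrs = map_local_dbl(.f = function(neighborhood, graph, node, ...) {
    # the neighborhood subgraph includes the node itself, so subtract one
    igraph::gorder(neighborhood) - 1
  }))
```

In a ring, every node should end up with nbrs equal to 2.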
So, our main task will be to provide an .f that takes these arguments and updates the infection status of node by looking at its neighborhood for other infectious nodes.
init_outbreak <- function(graph, nseed = 1) {
  init <- sample(length(graph), nseed) # pick nseed nodes to start the outbreak
  graph |>
    mutate(infected = node_idx %in% init)
}
update_node <- function(neighborhood, graph, node, inf_stats, beta = 0.3, alpha = 0.1) {
  if (inf_stats[node]) # if the node is already infected, it recovers with probability alpha
    return(if (runif(1) < alpha) FALSE else TRUE)
  ninf <- sum(pull(neighborhood, infected)) # count the node's infected neighbors
  if (ninf > 0) {
    inter <- runif(ninf)
    if (any(inter < beta)) # each infected neighbor independently passes on the infection with probability beta
      return(TRUE)
  }
  return(FALSE)
}
run_step <- function(graph) {
  graph |>
    mutate(infected = map_local_lgl(.f = update_node, inf_stats = infected))
}
plot_outbreak <- function(graph) {
ggraph(graph) +
geom_edge_link0(alpha = 0.05) +
geom_node_point(aes(col = infected))
}
graph0 <- create_notable("Zachary") |>
mutate(node_idx = 1:n()) |>
init_outbreak(nseed = 2)
plot_outbreak(graph0) # plot the initial state
## Using `stress` as default layout
reduce(1:10, ~run_step(.x), .init = graph0) |> # apply run_step 10 times; the step index is ignored
plot_outbreak() # plot result after running 10 steps
## Using `stress` as default layout
A deeper summary of tidygraph’s features can be found here.
Something we didn’t get into is the layout of graphs; a number of algorithms exist for deciding where to arrange nodes in a plot. In ggraph you can specify which layout you want with ggraph(graph, layout = ...). Many other ways beyond the “ball and stick” visualization for graphs are also available; see here for a good intro.
tidygraph is essentially a wrapper around another network analysis project called igraph. While igraph itself is implemented in the C language for speed, an R package also exists. If you find something that you can’t accomplish using tidygraph, it is designed to be easy to fall back to igraph temporarily.
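In fact, a tbl_graph is a subclass of an igraph object, so igraph functions can be called on it directly, and any igraph object can be brought back into the tidy world with as_tbl_graph(). A minimal sketch:

```r
library(tidygraph)
library(igraph)

g <- create_ring(10)
diameter(g)                # igraph functions work on a tbl_graph directly
## [1] 5

as_tbl_graph(make_star(5)) # and any igraph object converts back
```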