QuEST coding workshop day 2: tidyverse introduction

tidyverse overview

From the website:

“The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”

The “core” tidyverse packages are the ones automatically install/loaded:

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.2     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

They form a complete workflow for writing most R programs:

readr for reading/writing rectangular data
tidyr for cleaning and organizing data
dplyr and purrr for manipulating and summarizing
forcats and stringr for special operations on factors/strings

In addition to these core packages, other more specific packages are considered part of the “extended” tidyverse. These packages need to be installed/loaded independently.

Learning resources

Cheatsheets are a great way to learn a package quickly. I also like to use them to learn a package while working. I’ll pull up the package I want to learn, and try to find ways to use different functions from it for whatever I’m doing.
R for Data Science (R4DS) can be good for all levels, and centers around the tidyverse. It starts very gently and assumes almost no R knowledge. There are also exercises and solutions available.
Advanced R is a more general programming guide, but draws heavily from tidyverse packages.

tibbles and tidyr

The most fundamental data structure for tidyverse is a modernization of the dataframe. The most important differences are clearer printing and more flexible column names.

(dat <- starwars)

## # A tibble: 87 x 14
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu~
##  3 R2-D2       96    32 <NA>        white, bl~ red             33   none  mascu~
##  4 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  5 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  6 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  7 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu~
##  9 Biggs ~    183    84 black       light      brown           24   male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Creating tibbles

read_* always returns a tibble. Creating tibbles from scratch can be done similar to data.frame:

tibble(x = 1:5, y = LETTERS[1:5], z = x^2)

## # A tibble: 5 x 3
##       x y         z
##   <int> <chr> <dbl>
## 1     1 A         1
## 2     2 B         4
## 3     3 C         9
## 4     4 D        16
## 5     5 E        25

or with tribble

tribble(
  ~x, ~y,
  1, 'A',
  2, 'B',
  3, 'C'
)

## # A tibble: 3 x 2
##       x y    
##   <dbl> <chr>
## 1     1 A    
## 2     2 B    
## 3     3 C

Creating tibbles from dataframes, matrices, or named vectors:

## # A tibble: 50 x 2
##    speed  dist
##    <dbl> <dbl>
##  1     4     2
##  2     4    10
##  3     7     4
##  4     7    22
##  5     8    16
##  6     9    10
##  7    10    18
##  8    10    26
##  9    10    34
## 10    11    17
## # ... with 40 more rows

## # A tibble: 3 x 2
##   name  value
##   <chr> <dbl>
## 1 a         1
## 2 b         2
## 3 c         3

You can largely use tibbles and dataframes interchangeably (functions that say they work with one will work with the other), so I’ll just call them dataframes from now on. See the vignette for more features and differences from regular dataframes

vignette('tibble')

Tidy data

R for Data Science gives three rules for tidy data:

Each variable has its own column
Each observation has its own row
Each value must have its own cell

The tidyr package contains functions for making data tidy. For example, suppose you are given some data from a field survey.

survey <- tibble(
  present_yes = sample(c('yes', NA), 15, replace = TRUE),
  present_no = ifelse(is.na(present_yes), 'no', NA)
)

survey

## # A tibble: 15 x 2
##    present_yes present_no
##    <chr>       <chr>     
##  1 <NA>        no        
##  2 yes         <NA>      
##  3 <NA>        no        
##  4 <NA>        no        
##  5 <NA>        no        
##  6 <NA>        no        
##  7 <NA>        no        
##  8 <NA>        no        
##  9 <NA>        no        
## 10 <NA>        no        
## 11 yes         <NA>      
## 12 <NA>        no        
## 13 yes         <NA>      
## 14 <NA>        no        
## 15 yes         <NA>

Someone has spread a single variable over two columns, causing lots of blank spaces in the data. This may have been more convenient in the field, but programming with the data as-is will be a bit difficult. Thankfully, conversion to tidy format is easy.

unite(survey, 'present', present_yes, present_no, na.rm = TRUE)

## # A tibble: 15 x 1
##    present
##    <chr>  
##  1 no     
##  2 yes    
##  3 no     
##  4 no     
##  5 no     
##  6 no     
##  7 no     
##  8 no     
##  9 no     
## 10 no     
## 11 yes    
## 12 no     
## 13 yes    
## 14 no     
## 15 yes

Here, we have used unite, which pastes two or more columns of strings into a single column, while removing the NAs.

Another nice tidyr use is to drop rows which contain an NA, possibly only in specific columns.

nrow(dat)

## [1] 87

nrow(drop_na(dat)) # drop all rows with an NA

## [1] 6

nrow(drop_na(dat, birth_year)) # only rows with NA in "birth_year"

## [1] 43

Dplyr

The dplyr package is great for performing seemingly endless operations on dataframes. Just like in base R, we can subset a dataframe by rows or columns:

## select specific cols
dat[,c(2, 6)]
select(dat, height, eye_color)

dat[,1:5]
select(dat, name:skin_color)

## select last 5 rows
tail(dat, 5)
slice_tail(dat, n = 5)

## all cols with an underscore
dat[, which(str_detect(colnames(dat), '_'))]
select(dat, contains('_'))

More complex slicing and subsetting is possible

## select cols conditionally
select(dat, where(is.double))
select(dat, !last_col() & ends_with('s'))

## slice rows conditionally (filtering)
filter(dat, eye_color == 'blue')

The pipe operator

Notice that most tidyverse functions are verbs or adverbs (coding is, after all, just giving instructions). There is a special operator, the pipe operator, which allows these commands to be chained together!

As an example, let’s create a random “population” of 1,000 of the starwars characters:

dat |> # start of chain is the data
  filter(eye_color == 'blue') |> # each step takes the previous result
  select(height:species) |> # and does a new thing to it
  drop_na() |>
  slice_sample(n = 1000, replace = TRUE)

## # A tibble: 1,000 x 10
##    height  mass hair_color  skin_color eye_color birth_year sex    gender   
##     <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr>  <chr>    
##  1    165    75 brown       light      blue            47   female feminine 
##  2    166    50 black       yellow     blue            40   female feminine 
##  3    166    50 black       yellow     blue            40   female feminine 
##  4    165    75 brown       light      blue            47   female feminine 
##  5    172    77 blond       fair       blue            19   male   masculine
##  6    178   120 brown, grey light      blue            52   male   masculine
##  7    175    79 none        light      blue            37   male   masculine
##  8    172    77 blond       fair       blue            19   male   masculine
##  9    172    77 blond       fair       blue            19   male   masculine
## 10    188    84 blond       fair       blue            41.9 male   masculine
## # ... with 990 more rows, and 2 more variables: homeworld <chr>, species <chr>

x |> f |> g is the same as g(f(x)); however, when you string alot of functions together piping makes things more readable.

Mutate, summarize, and count

Another family of dplyr functions use existing columns to make new ones. The most common is mutate, which appends a new row based on a vectorized function of existing rows.

dat |>
  mutate(mass_per_cm = mass / height) # divide mass column by height column

## # A tibble: 87 x 15
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu~
##  3 R2-D2       96    32 <NA>        white, bl~ red             33   none  mascu~
##  4 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  5 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  6 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  7 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu~
##  9 Biggs ~    183    84 black       light      brown           24   male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 6 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>, mass_per_cm <dbl>

This function, like many others, is often useful to apply separately to different groups in the data. Suppose we want to assign each character a unique id based on their homeworld:

dat |>
  group_by(homeworld) |>
  mutate(world_id = str_c(homeworld, '_', 1:n())) |> # n() is short for group size
  pull(world_id) # convert column "world_id" to a vector

##  [1] "Tatooine_1"       "Tatooine_2"       "Naboo_1"          "Tatooine_3"      
##  [5] "Alderaan_1"       "Tatooine_4"       "Tatooine_5"       "Tatooine_6"      
##  [9] "Tatooine_7"       "Stewjon_1"        "Tatooine_8"       "Eriadu_1"        
## [13] "Kashyyyk_1"       "Corellia_1"       "Rodia_1"          "Nal Hutta_1"     
## [17] "Corellia_2"       "Bestine IV_1"     NA                 "Naboo_2"         
## [21] "Kamino_1"         NA                 "Trandosha_1"      "Socorro_1"       
## [25] "Bespin_1"         "Mon Cala_1"       "Chandrila_1"      NA                
## [29] "Endor_1"          "Sullust_1"        NA                 "Cato Neimoidia_1"
## [33] "Coruscant_1"      "Naboo_3"          "Naboo_4"          "Naboo_5"         
## [37] "Naboo_6"          "Toydaria_1"       "Malastare_1"      "Naboo_7"         
## [41] "Tatooine_9"       "Dathomir_1"       "Ryloth_1"         "Ryloth_2"        
## [45] "Vulpter_1"        "Troiken_1"        "Tund_1"           "Haruun Kal_1"    
## [49] "Cerea_1"          "Glee Anselm_1"    "Iridonia_1"       "Coruscant_2"     
## [53] "Iktotch_1"        "Quermia_1"        "Dorin_1"          "Champala_1"      
## [57] "Naboo_8"          "Naboo_9"          "Tatooine_10"      "Geonosis_1"      
## [61] "Mirial_1"         "Mirial_2"         "Naboo_10"         "Serenno_1"       
## [65] "Alderaan_2"       "Concord Dawn_1"   "Zolan_1"          "Ojom_1"          
## [69] "Kamino_2"         "Kamino_3"         "Coruscant_3"      "Aleen Minor_1"   
## [73] NA                 "Skako_1"          "Muunilinst_1"     "Shili_1"         
## [77] "Kalee_1"          "Kashyyyk_2"       "Alderaan_3"       "Umbara_1"        
## [81] "Utapau_1"         NA                 NA                 NA                
## [85] NA                 NA                 "Naboo_11"

Grouping is frequently used with summarize. Summarizing is a form of reduction; it takes a (grouped) dataframe and returns a new dataframe with one row per group:

dat |>
  group_by(gender) |>
  drop_na(height) |> # remove NAs from height for mean and sd
  summarize(m_height = mean(height), sd_height = sd(height))

## # A tibble: 3 x 3
##   gender    m_height sd_height
##   <chr>        <dbl>     <dbl>
## 1 feminine      165.     23.6 
## 2 masculine     177.     37.6 
## 3 <NA>          181.      2.89

A more complex example, grouping by two groups, then simulating a distribution for mass, based on the observations from each group.

mass_sim <- dat |>
  drop_na(mass) |> 
  group_by(eye_color, hair_color) |> # notice the number of groups is printed
  summarize(
    mass_boot = list(sample(mass, 1000, replace = TRUE) + rnorm(1000))
  ) |>
  unnest(mass_boot) # un-collapses the list by repeating the other rows

## `summarise()` has grouped output by 'eye_color'. You can override using the `.groups` argument.

## make a histogram for one group
mass_sim |>
  filter(eye_color == 'blue', hair_color == 'brown') |>
  pull(mass_boot) |>
  hist(breaks = 20, main = '')

Here we have used the tidyr function unnest, which “expands” a column of lists (a list of lists, really) into a longer column with each entry in each list as a new row.

Finally, the count function is for the common summarizing task of counting groups:

count(dat, homeworld, sort = TRUE)

## # A tibble: 49 x 2
##    homeworld     n
##    <chr>     <int>
##  1 Naboo        11
##  2 Tatooine     10
##  3 <NA>         10
##  4 Alderaan      3
##  5 Coruscant     3
##  6 Kamino        3
##  7 Corellia      2
##  8 Kashyyyk      2
##  9 Mirial        2
## 10 Ryloth        2
## # ... with 39 more rows

Count group cominations by providing \(>1\) column name, and use add_count to add the counts to the existing data.

Going further

The R4DS book has some exercises for cleaning and manipulating data in Chapters 5, 10-13, and 16.
There are several vignettes that cover advanced dplyr concepts, which I’ve found useful. Check out vignette('colwise') for applying dplyr functions to many columns at once, vignette('two-table') for various ways to combine dataframes, and vignette('programming') to learn more about working with tidy evaluation, the magic system underlying many tidyverse functions. Or, find all vignettes here.

Some other important concepts

Pivoting

Recall that tidy data is supposed to have one column per variable, and one row per observation. As we will see soon, plotting is one case where the ambiguity of “variable” and “observation” matters. For this and other cases, there are ways to pivot the data, changing the context of variables and observations.

Think of pivot_longer as lengthening the dataframe. Rather than every variable getting a row of values, we make one row of the variables and one of the values:

x <- tribble(
  ~A, ~B, ~C,
  1, 2, 3,
  4, 5, 6
)

x

## # A tibble: 2 x 3
##       A     B     C
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

pivot_longer(x, everything())

## # A tibble: 6 x 2
##   name  value
##   <chr> <dbl>
## 1 A         1
## 2 B         2
## 3 C         3
## 4 A         4
## 5 B         5
## 6 C         6

An example with data, counting how many times each color appears anywhere:

dat |>
  pivot_longer(
    hair_color:eye_color, 
    names_to='attribute', 
    values_to='color'
  ) |>
  count(color)

## # A tibble: 44 x 2
##    color             n
##    <chr>         <int>
##  1 auburn            1
##  2 auburn, grey      1
##  3 auburn, white     1
##  4 black            23
##  5 blond             3
##  6 blonde            1
##  7 blue             21
##  8 blue-gray         1
##  9 blue, grey        2
## 10 brown            43
## # ... with 34 more rows

Here, since we want to count colors across several columns, we pivot these columns into a single column and then count it.

Plotting with ggplot

Some of you may have used ggplot2 before. The package has its own philosophy (the “grammar of graphics”) that builds on those of all tidyverse packages and tidy data. Let’s go over a few to help make sense of how ggplot2 works:

Plots are made of layers. For ggplot, visual information is added to plots one “group” at a time, using geometries. For example, use geom_point to add points and geom_histrogram to add histogram bins. Multiple types of geometries can be layered onto the same plot.
Plots are an interpretation of relationships in the data. These relationships can be specified with aesthetic mappings. For example, if the points you are plotting come from two species, you can color points according to their species with the mapping species -> color.
A plot is an object that we can perform operations on. This means we can always continue to modify a plot by performing new steps, much like how the |> operator lets data be modified in single steps. For example, scale_x_sqrt adds the instruction to take the square root of the variable on the x-axis before plotting. This also means that ggplot2 objects can be stored in a variable in R to be used later.

Here’s a simple example to see each of these ideas at work:

gg <- ggplot(data = diamonds, mapping = aes(x = price, y = carat, col = cut)) +
  geom_point(shape = 1) # shape is a "static" argument here

gg +
  scale_x_log10() +
  labs(x = 'Price', y = 'Carat', col = 'Cut quality') +
  theme_bw()

Here we’ve plotted price by carat and colored the points according to cut. We then added some visual modifications on top of the stored plot gg.

Breaking down each of the instructions, a basic anatomy of most ggplots is found:

ggplot is the “master” function that sets a “blank slate.” It is also where data and aesthetic mappings, common to all geoms, is specified.

ggplot() # totally blank slate

ggplot(diamonds, aes(price, carat)) # data and the coordinate system

geom_* functions add on layers of data. Notice that these functions also have a data and mapping argument. These can be used to override these arguments from ggplot for that geom only.
Last come any modifications of the existing coordinates or geoms. This includes scale_*, renaming parts of the plot, and changing the theme.

There are many geometries available, and the best way to learn them is to check out the cheatsheet and try them out on your own data. Rather than cover them here, we’ll now focus on some common tasks and challenges.

Facetting

Separating data into a group of several plots is done with facets:

gg_hist <- ggplot(diamonds, aes(x = carat, fill = cut)) +
  geom_histogram() +
  facet_grid(cut ~ ., scales = 'free')

An example of a barplot (histogram for discrete variable) with 2d facetting:

ggplot(diamonds, aes(x = clarity)) +
  geom_bar(fill = 'lightblue') +
  facet_grid(cut ~ color, scales = 'free')

Changing color schemes and other scales

There are tons of different approaches to customizing color with ggplot, and its easy to get overwhelmed! Here we’ll focus on the essentials and leave you to research different color palettes or making your own.

Basically, color is changed through scaling a color aesthetic (col or fill). Going back to our histogram example, here we use one of the color palettes provided by RColorBrewer for discrete variables:

library(RColorBrewer)

my_brew_pal <- brewer.pal(8, 'Set2')

gg_hist +
  scale_fill_discrete(type = my_brew_pal)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As we see above, a color palette can be stored and used in multiple plots. For presentations or papers, it can be nice to have a set of specific colors fixed for a variable of interest. An easy way to do this in ggplot is with scale_*_manual. Say we want all plots using a mapping with cut to have consistent colors:

my_pal <- c('blue2', 'orangered', 'lightpink', 'springgreen', 'gold')
names(my_pal) <- levels(diamonds$cut)

gg_hist +
  scale_fill_manual(values = my_pal)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(diamonds, aes(price, carat, col = cut)) +
  geom_point(alpha = 0.1, data = slice_sample(diamonds, prop = 0.1)) +
  geom_smooth() +
  scale_color_manual(values = my_pal) +
  ylim(0, 3)

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Warning: Removed 32 rows containing non-finite values (stat_smooth).

## Warning: Removed 2 rows containing missing values (geom_point).

In general, use scale_*_discrete or scale_*_continuous if you want to provide existing palettes or do other scaling functions (we’ll see another use of scale in a moment), and scale_*_manual for making a discrete palette by hand.

Aesthetics other than color can be scaled in much the same way. A common example is shape:

ggplot(starwars, aes(eye_color, height, shape = gender)) + # col is continuous here
  geom_point(size = 2) +
  scale_shape_manual(values = c(19, 1), na.value = 13) +
  scale_x_discrete(guide = guide_axis(angle = 45))

## Warning: Removed 6 rows containing missing values (geom_point).

diamonds |>
  mutate(car_bin = ifelse(carat < mean(carat), 'low_car', 'high_car')) |> 
  group_by(car_bin, color) |> 
  summarize(mpr = mean(price)) |>
  ggplot(aes(color, mpr, shape = car_bin, group = car_bin)) +
  geom_point() +
  geom_line()

## `summarise()` has grouped output by 'car_bin'. You can override using the `.groups` argument.

Going further

The R Graphics Cookbook has tons of examples for all things ggplot.
Chapter 3 in R4DS is a good review of ggplot concepts and basics, with exercises.

Bonus: for-loop frenzy

Control flow review

The functions documented in ?Control are reserved words. That means you can’t name a variable or function one of these words. Going over some basic examples:

## foo: add or subtract a and c, depending on b
foo <- function(a, b = 3, c) {
  if (a < b)
    ret <- bar(a, c)
  else
    ret <- bat(a, c)
  return(ret)
}

bar <- function(d, e) d + e

bat <- function(f, g) f - g

foo(1, c = 4)

## [1] 5

for (i in 1:10)
  print(paste0(i, ':', foo(i, c = 4)))

## [1] "1:5"
## [1] "2:6"
## [1] "3:-1"
## [1] "4:0"
## [1] "5:1"
## [1] "6:2"
## [1] "7:3"
## [1] "8:4"
## [1] "9:5"
## [1] "10:6"

Use break to skip out of a for/while loop entirely, or next to skip to the next iteration:

fav_fruits <- c('blueberry', 'orange')
enemy_fruit <- 'honeydew'
for (l in fruit) {
  if (l %in% fav_fruits) {
    print('yum')
    next
  }
  if (l == enemy_fruit) {
    print('ewwwww!')
    break
  }
  print(paste0("I don't like ", l))
}

## [1] "I don't like apple"
## [1] "I don't like apricot"
## [1] "I don't like avocado"
## [1] "I don't like banana"
## [1] "I don't like bell pepper"
## [1] "I don't like bilberry"
## [1] "I don't like blackberry"
## [1] "I don't like blackcurrant"
## [1] "I don't like blood orange"
## [1] "yum"
## [1] "I don't like boysenberry"
## [1] "I don't like breadfruit"
## [1] "I don't like canary melon"
## [1] "I don't like cantaloupe"
## [1] "I don't like cherimoya"
## [1] "I don't like cherry"
## [1] "I don't like chili pepper"
## [1] "I don't like clementine"
## [1] "I don't like cloudberry"
## [1] "I don't like coconut"
## [1] "I don't like cranberry"
## [1] "I don't like cucumber"
## [1] "I don't like currant"
## [1] "I don't like damson"
## [1] "I don't like date"
## [1] "I don't like dragonfruit"
## [1] "I don't like durian"
## [1] "I don't like eggplant"
## [1] "I don't like elderberry"
## [1] "I don't like feijoa"
## [1] "I don't like fig"
## [1] "I don't like goji berry"
## [1] "I don't like gooseberry"
## [1] "I don't like grape"
## [1] "I don't like grapefruit"
## [1] "I don't like guava"
## [1] "ewwwww!"

For-loop challenge

Exercise 1

Write a function boxR that takes arguments w and h and prints a \(w \times h\) rectangle of *s. Hints: use nested for loops, and use cat('*') to print * without moving to a newline, and cat('\n') to move to a newline.

Solution

boxR <- function(w, h) {
  for (i in 1:w) {
    for (j in 1:h) {
      cat('*')
    }
    cat('\n')
  }
}

boxR(3, 4)

## ****
## ****
## ****

Exercise 2

A programming rite of passage. Write a function triangleR that takes odd integer b and prints an isosceles triangle with base \(b\). Hint: draw pictures!

Solution

triangleR <- function(b) {
  space <- (b - 1) %/% 2
  for (i in space:0) {
    for (j in 1:b) {
      if (j <= i | j > b - i)
        cat(' ')
      else
        cat('*')
    }
    cat('\n')
  }
}

triangleR(11)

##      *     
##     ***    
##    *****   
##   *******  
##  ********* 
## ***********

Bonus: network analysis with tidygraph + ggraph

tidygraph is a package for constructing and analyzing graphs with a workflow similar to tidyr/dplyr, and ggraph is a package similar to ggplot for visualization.

library(tidygraph)

## 
## Attaching package: 'tidygraph'

## The following object is masked from 'package:stats':
## 
##     filter

library(ggraph)

In tidygraph, a graph is represented with two tibbles: one of nodes (vertices) and one of edges (links). These tibbles are bundled together as one object called a tbl_graph.

(graph <- create_ring(10))

## # A tbl_graph: 10 nodes and 10 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 10 x 0 (active)
## # ... with 4 more rows
## #
## # Edge Data: 10 x 2
##    from    to
##   <int> <int>
## 1     1     2
## 2     2     3
## 3     3     4
## # ... with 7 more rows

In ggraph, a graph is plotted in “layers” just like ggplot. There are specific geoms for nodes and edges:

gg_net <- ggraph(graph) +
  geom_node_point(color = 'blue', size = 5)

## Using `stress` as default layout

gg_net

gg_net +
  geom_edge_link()

dplyr verbs on graphs

Since a tbl_graph is made of tibbles, we can use all the familiar functions on them:

graph <- graph |> 
  mutate(name = fruit[1:n()])

Here we see we have added an new attribute to the nodes, by adding a new column to the nodes tibble. Also, notice how the nodes are labelled active. This is how we specify whether we want to manipulate the node or edge tibble. Switch between the two using activate.

graph <- graph |> 
  activate(edges) |> 
  mutate(weight = runif(n(), 1, 5))

Also, %N>% can be used as shorthand for |> activate(nodes) |>, and %E>% for activating edges.

Attributes can be used in aesthetic mappings when plotting. As with all geoms, check which aesthetics can be controlled with a mapping with ?geom_node_* etc.

ggraph(graph) +
  geom_edge_link(mapping = aes(edge_color = weight), edge_width = 2) +
  geom_node_label(mapping = aes(label = name))

## Using `stress` as default layout

Specialized verbs on graphs

In addition to the usual tidyverse functions, there are many new verbs specific to network science, such as for finding the shortest distance between two nodes or calculating a node’s degree. These can be useful, for example, for richer visualization.

In the following example, we generate a network using the famous Erdős–Rényi model, then compute two measures of node centrality, or how “connected” a node is. The degree is the number of neighbors and the page rank is from the formula for ordering web results in Google’s early days.

graph <- play_erdos_renyi(300, p = 0.03) |> 
  mutate(
    degree = centrality_degree(),
    pagerank = centrality_pagerank()
  )

graph |> 
  as_tibble() |> 
  pivot_longer(c(degree, pagerank)) |> 
  ggplot(aes(x = value, fill = name)) +
  geom_density() +
  facet_wrap(~name, scales = 'free', nrow = 2)

ggraph(graph) +
  geom_edge_link0(alpha = 0.1) + # use edge0 family for faster plotting
  geom_node_point(aes(col = degree, size = degree))

## Using `stress` as default layout

One important type of network operation is clustering, where group-level patterns are extracted from the data. The basic motive behind clustering is that nodes within clusters should tend to be more connected with each other than with nodes of other clusters. Many different clustering algorithms have been proposed, and tidygraph supports the most popular ones with the group_* functions.

play_islands(n_islands = 5, size_islands = 10, p_within = 0.8, m_between = 3) |> 
  mutate(community = as.factor(group_infomap())) |> 
  ggraph() + 
  geom_edge_link(alpha = 0.3) + 
  geom_node_point(aes(colour = community), size = 7) + 
  theme_graph() # a simple t

## Using `stress` as default layout

Going further

A deeper summary of tidygraph’s features can be found here
Something we didn’t get into is the layout of graphs, and a number of algorithms exist for deciding where to arrange nodes in a plot. In ggraph you can specify which layout you want with ggraph(graph, layout = ...). Many other ways beyond the “ball and stick” visualization for graphs are also available; see here for a good intro.
tidygraph borrows heavily from another network analysis project called igraph. While igraph itself is implemented in the C language for speed, an R package also exists. Anything you can do in R’s igraph can be done in tidygraph, and personally I find the igraph package a bit outdated, but if you don’t like the “two-tibble” approach know that another option exists.

Bonus: deSolve and simulating data over multiple parameters

You’ve seen how to do some work with systems of differential equations (compartmental models) on paper, but how can we see how these systems evolve over time as a whole? For that, we usually need a computer to help, using numerical integration.

In MOCS, you will learn much more about numerical integration and even write your own algorithms for it. However, there is also an R package for this using state-of-the-art algorithms, called deSolve.

library(deSolve)

The main function of interest here is ode (for ordinary differential equation), which will simulate a system given a list derivatives and an initial state. We specify a compartmental model like this:

SIS_system <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), { # add names to environment
    dI <- beta * (1 - I) * I - alpha * I # derivative for I compartment
    return(list(dI)) # always must return a list of all derivatives
  })
}

and simulate this system with \(\beta = 0.3\), \(\alpha = 0.1\), and \(I_0 = 0.1\):

sis_res <- ode(
  y = c(I = 0.1), # named vector of initial states
  times = 0:50, # integrate the system from t=0 to 50
  func = SIS_system, # where to get the derivatives for updating
  parms = c(alpha = 0.1, beta = 0.3) # fixed parameters
)

ode returns a matrix, with time as the first column and the state variables for each \(t\) as the following columns:

plot(sis_res[,'time'], sis_res[,'I'])

As we can see, \(I\) is approaching the expected steady-state \(1 - \alpha / \beta\).

Examining the reproductive number

To check our derivation of \(R_0\), we will perform a sensitivity analysis, where we basically test our model over a wide combination of variables. Since we only have two parameters and a very simple model, this can be done with a simple grid over \(\alpha\) and \(\beta\):

get_final_state <- function(a, b) {
  res <- ode(
    y = c(I = 0.1), 
    times = 0:100, 
    func = SIS_system, 
    parms = c(alpha = a, beta = b)
  )
  
  res[nrow(res), 'I']
}

## generate all pairs of parameters
param_grid <- expand.grid(alpha = seq(.05, 2, .1), beta = seq(.05, 2, .1))

## for each pair, run SIS and save the final state
inf_stead <- double(nrow(param_grid))
for(i in 1:nrow(param_grid)) {
  inf_stead[i] <- get_final_state(param_grid[i, 1], param_grid[i, 2])
}

gg_heat <- param_grid |> 
  as_tibble() |> 
  mutate(inf_stead = inf_stead) |> 
  ggplot(aes(beta, alpha, fill = inf_stead)) +
  geom_tile() +
  geom_abline(slope=1, intercept=0, col = 'red') # R0 has a func of beta

gg_heat

The blank space around the heatmap can be removed with scales:

gg_heat +
  scale_x_continuous(expand=expansion(mult=c(0, 0)), name = 'Transmission') +
  scale_y_continuous(expand=expansion(mult=c(0, 0)), name = 'Recovery')

QuEST coding workshop day 2: tidyverse introduction

8/20/2021

tidyverse overview

Learning resources

tibbles and tidyr

Creating tibbles

Tidy data

Dplyr

The pipe operator

Mutate, summarize, and count

Going further

Some other important concepts

Pivoting

Plotting with ggplot

Facetting

Changing color schemes and other scales

Going further

Bonus: for-loop frenzy

Control flow review

For-loop challenge

Exercise 1

Exercise 2

Bonus: network analysis with tidygraph + ggraph

dplyr verbs on graphs

Specialized verbs on graphs

Going further

Bonus: deSolve and simulating data over multiple parameters

Examining the reproductive number