R in 5 minutes

R is an interactive scripting language. R scripts have the file extension .R or .r, and a script can be run “start-to-finish” from the command line, e.g. Rscript helloworld.R, much like a Python script.

It is more common to use R interactively, generally with the help of RStudio or a similar IDE. Run a single line of R code with Ctrl+Enter.

5 + (5 / 2)
## [1] 7.5

Vectors

Assign a vector of numbers to a variable called v. In RStudio, the assignment operator can be inserted with Alt+-.

v <- c(1, 2, 3)

We can now access this vector by referring to v in our script. For example, we can access specific elements of the vector:

v[3] + 2
## [1] 5
v[-2]
## [1] 1 3
v[c(1, 1, 3)] # "multiple indexing" using another vector
## [1] 1 1 3

Notice that the contents of v are unchanged. Assign new contents to v like so.

# change an item
v[2] <- -1 
# append new item
(v <- c(v, 4)) # print the results of an assignment using ()
## [1]  1 -1  3  4

Warning

Appending to a vector in this way copies the whole vector each time, which is very inefficient in R compared to other languages. Avoid building up large vectors (or lists) element by element.
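
If you know the final length in advance, it is much faster to pre-allocate the vector, or to let an apply-style function build it for you. A minimal sketch, using a made-up per-element function f():

f <- function(i) i^2 # hypothetical per-element computation
n <- 10000

# slow: c() copies the whole vector on every iteration
v <- c()
for (i in 1:n) v <- c(v, f(i))

# better: pre-allocate, then fill in place
v <- numeric(n)
for (i in 1:n) v[i] <- f(i)

# often best: let vapply build the vector for you
v <- vapply(1:n, f, numeric(1))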

Vectorized operations

An important defining feature of R is that many operations are automatically vectorized.

v <- 1:5
v + 1 # 1 is "recycled" and applied to each element
## [1] 2 3 4 5 6
v + 1:length(v)
## [1]  2  4  6  8 10
v + c(1, 2) # a 2-vector cannot be applied to a 5-vector with recycling
## Warning in v + c(1, 2): longer object length is not a multiple of shorter object
## length
## [1] 2 4 4 6 6
v[-5] + c(1, 2) # but it can be applied to a 4-vector
## [1] 2 4 4 6
sin(v)
## [1]  0.8414710  0.9092974  0.1411200 -0.7568025 -0.9589243

Warning

A common source of errors is triggering vectorization unintentionally, especially when elements are recycled. Try to become aware of which R functions are vectorized.
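
For instance, comparing against a recycled shorter vector raises no warning when the lengths divide evenly, which can silently give unintended results:

c(1, 2, 3, 4) == c(1, 2) # compares 1==1, 2==2, 3==1, 4==2: TRUE TRUE FALSE FALSE, with no warning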

Types and lists

The major type distinctions in R are character, numeric, logical, and NULL. Some comments:

  • Strings and characters are equivalent, and can be created using "..." or '...'
  • Numerics include doubles like 3.1415927 and integers like 3L
  • Logical objects are TRUE, FALSE, or NA
  • The NULL object is the unique “nothing” object, of length zero
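
You can inspect the underlying type of an object with typeof():

typeof("foo")     # "character"
typeof(3.1415927) # "double"
typeof(3L)        # "integer"
typeof(TRUE)      # "logical"
typeof(NULL)      # "NULL"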

Vectors can only contain a single type. When you assign a value of a more general type, the whole vector is eagerly promoted (coerced) to that type.

v[2] <- 3.5   # the integer vector 1:5 is promoted to double
v[3] <- "foo" # the whole vector is promoted to character

Lists are like vectors, but can contain multiple types.

v <- 1:5
l <- as.list(v)
l[[2]] <- 3L
l[3:4] <- list(FALSE, list("bar", "bat"))
l
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 3
## 
## [[3]]
## [1] FALSE
## 
## [[4]]
## [[4]][[1]]
## [1] "bar"
## 
## [[4]][[2]]
## [1] "bat"
## 
## 
## [[5]]
## [1] 5

Both vectors and lists can be named:

names(v) <- letters[1:5]; v
## a b c d e 
## 1 2 3 4 5

Dataframes

A dataframe is like an Excel spreadsheet in code: a 2-dimensional table with named columns.

data("mtcars") # simultaneously load mtcars data and assign to variable with same name
mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

We will work with tabular data a lot in a moment. For now, here are some basic operations.

mtcars[,3] # contents of third column
##  [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
## [13] 275.8 275.8 472.0 460.0 440.0  78.7  75.7  71.1 120.1 318.0 304.0 350.0
## [25] 400.0  79.0 120.3  95.1 351.0 145.0 301.0 121.0
mtcars[c(1, 3, 5),] # subset rows
##                    mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.62 16.46  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.32 18.61  1  1    4    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.02  0  0    3    2
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Info

Most R objects, including those provided by other packages, implement their own summary method. The output will depend on the nature of the object.
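
For example, summary gives quartiles and the mean for a numeric vector, but a coefficient table and fit statistics for a fitted linear model:

summary(mtcars$mpg)                  # numeric vector: min, quartiles, mean, max
summary(lm(mpg ~ wt, data = mtcars)) # lm object: coefficients, residuals, R-squared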

Installing and loading packages

Packages are an essential part of the R workflow. They are what makes the open-source, community-driven R ecosystem so diverse and up-to-date with the latest research.

Install a new package with install.packages("<package name>"), and load a package with library(<package name>).
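
For example, to install and load the tidyverse (installation only needs to happen once per machine, while library() must be run in every new session):

install.packages("tidyverse") # run once
library(tidyverse)            # run in every session that uses it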

The long version

If you want to learn more about R in general, here are some resources. However, we will cover many of these basics in the following hands-on tutorial.

  • The official R manual for the most recent version of R is here
  • For quick examples of a large number of functions in base R, see Learn X in Y Minutes’ post.

The tidyverse packages

From the website:

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

The “core” tidyverse packages are the ones automatically installed/loaded:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.1.4     ✔ stringr 1.4.0
## ✔ readr   2.1.1     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Together, they form a complete workflow for writing most R programs.

In addition to these core packages, other more specific packages are considered part of the “extended” tidyverse. These packages need to be installed/loaded independently.

library(tidygraph)
## 
## Attaching package: 'tidygraph'
## The following object is masked from 'package:stats':
## 
##     filter
library(ggraph)

Learning resources

  • Cheatsheets are a great way to learn a package quickly. I also like to use them while working: I’ll pull up the cheatsheet for the package I want to learn and try to find ways to use its functions for whatever I’m doing.

  • R for Data Science (R4DS) can be good for all levels, and centers around the tidyverse. It starts very gently and assumes almost no R knowledge. There are also exercises and solutions available.

  • Advanced R is a more general programming guide, but draws heavily from tidyverse packages.

Tidy data: the tibble and tidyr packages

The most fundamental data structure in the tidyverse, the tibble, is a modernization of the dataframe. The most important differences are clearer printing and more flexible column names.

starwars
## # A tibble: 87 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
##  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Compared to a data.frame, a tibble is printed in a truncated format by default. Notice that the dimensions of the data, the inferred type of each column, and the dimensions of any list-columns (the starships column, for example) are also printed.

Creating tibbles

Creating tibbles from scratch works much like data.frame:

tibble(x = 1:5, y = LETTERS[1:5], z = x^2)
## # A tibble: 5 × 3
##       x y         z
##   <int> <chr> <dbl>
## 1     1 A         1
## 2     2 B         4
## 3     3 C         9
## 4     4 D        16
## 5     5 E        25

or, row by row, with tribble:

tribble(
  ~x, ~y,
  1, 'A',
  2, 'B',
  3, 'C'
)
## # A tibble: 3 × 2
##       x y    
##   <dbl> <chr>
## 1     1 A    
## 2     2 B    
## 3     3 C

Info

The readr functions such as read_csv always return a tibble

Creating tibbles from dataframes, matrices, or named vectors:

m <- matrix(1:10, ncol=2)
as_tibble(m)
## # A tibble: 5 × 2
##      V1    V2
##   <int> <int>
## 1     1     6
## 2     2     7
## 3     3     8
## 4     4     9
## 5     5    10
x <- c(a = 1, b = 2, c = 3)
enframe(x)
## # A tibble: 3 × 2
##   name  value
##   <chr> <dbl>
## 1 a         1
## 2 b         2
## 3 c         3

You can largely use tibbles and dataframes interchangeably (functions that say they work with one will work with the other), so I’ll just call them dataframes from now on. See the vignette for more features and differences from regular dataframes.

vignette('tibble')

What makes data “tidy”?

R for Data Science gives three rules for tidy data:

  1. Each variable has its own column
  2. Each observation has its own row
  3. Each value must have its own cell

The tidyr package contains functions for making data tidy. For example, suppose you are given some data from a field survey.

survey <- tibble(
  present_yes = sample(c('yes', NA), 15, replace = TRUE),
  present_no = ifelse(is.na(present_yes), 'no', NA),
  visit_id = 1:15
)

survey
## # A tibble: 15 × 3
##    present_yes present_no visit_id
##    <chr>       <chr>         <int>
##  1 yes         <NA>              1
##  2 <NA>        no                2
##  3 <NA>        no                3
##  4 yes         <NA>              4
##  5 yes         <NA>              5
##  6 yes         <NA>              6
##  7 yes         <NA>              7
##  8 yes         <NA>              8
##  9 yes         <NA>              9
## 10 yes         <NA>             10
## 11 yes         <NA>             11
## 12 yes         <NA>             12
## 13 <NA>        no               13
## 14 <NA>        no               14
## 15 <NA>        no               15

Someone has spread a single variable over two columns, causing lots of blank spaces in the data. This may have been more convenient in the field, but programming with the data as-is will be a bit difficult. Thankfully, conversion to tidy format is easy.

unite(survey, 'present', present_yes, present_no, na.rm = TRUE)
## # A tibble: 15 × 2
##    present visit_id
##    <chr>      <int>
##  1 yes            1
##  2 no             2
##  3 no             3
##  4 yes            4
##  5 yes            5
##  6 yes            6
##  7 yes            7
##  8 yes            8
##  9 yes            9
## 10 yes           10
## 11 yes           11
## 12 yes           12
## 13 no            13
## 14 no            14
## 15 no            15

Here, we have used unite, which pastes two or more columns of strings into a single column, while removing the NAs.
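
The counterpart to unite is separate, which splits one string column into several. A minimal sketch with made-up data:

sites <- tibble(plot_id = c('A_1', 'A_2', 'B_1'))
separate(sites, plot_id, into = c('block', 'subplot'), sep = '_')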

Another nice function, drop_na, drops rows that contain an NA, optionally checking only specific columns.

nrow(starwars)
## [1] 87
nrow(drop_na(starwars)) # drop all rows with an NA
## [1] 6
nrow(drop_na(starwars, birth_year)) # only rows with NA in "birth_year"
## [1] 43

Manipulating tabular data with dplyr

The dplyr package is great for performing seemingly endless operations on dataframes. Just like in base R, we can subset a dataframe by rows or columns:

# select specific cols
select(starwars, height, 6)
select(starwars, name:skin_color)

# select last 5 rows
slice_tail(starwars, n = 5)

# all cols with an underscore
select(starwars, contains('_'))

More complex slicing and subsetting is also possible:

# select cols conditionally
select(starwars, where(is.double))
select(starwars, !last_col() & ends_with('s'))

# slice rows conditionally (filtering)
filter(starwars, eye_color == 'blue')

The pipe operator

Notice that most tidyverse functions are named as verbs or adverbs (coding is, after all, just giving instructions). There is a special operator, the pipe |>, which allows these commands to be chained together!

As an example, let’s create a random “population” of 1,000 characters by resampling the blue-eyed starwars characters:

starwars |> # start of chain is the data
  filter(eye_color == 'blue') |> # each step takes the previous result
  select(height:species) |> # and does a new thing to it
  slice_sample(n = 1000, replace = TRUE)
## # A tibble: 1,000 × 10
##    height  mass hair_color   skin_color eye_color birth_year sex    gender   
##     <int> <dbl> <chr>        <chr>      <chr>          <dbl> <chr>  <chr>    
##  1    183    NA brown        fair       blue              NA <NA>   <NA>     
##  2    166    50 black        yellow     blue              40 female feminine 
##  3    234   136 brown        brown      blue              NA male   masculine
##  4    175    79 none         light      blue              37 male   masculine
##  5    183    NA brown        fair       blue              82 male   masculine
##  6    183    NA brown        fair       blue              82 male   masculine
##  7    180    NA auburn, grey fair       blue              64 male   masculine
##  8    165    75 brown        light      blue              47 female feminine 
##  9    183    NA brown        fair       blue              82 male   masculine
## 10    178   120 brown, grey  light      blue              52 male   masculine
## # … with 990 more rows, and 2 more variables: homeworld <chr>, species <chr>

Info

x |> f() |> g() is the same as g(f(x)); however, when you string a lot of functions together, piping makes things much more readable.
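
A concrete example:

sqrt(sum(1:10))         # nested calls read inside-out
1:10 |> sum() |> sqrt() # piped calls read left-to-right; both return ~7.42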

Mutate, summarize, and count

Another family of dplyr functions uses existing columns to make new ones. The most common is mutate, which appends a new column computed as a vectorized function of existing columns. transmute is like mutate, but only the new columns are kept.

starwars |>
  transmute(mass_per_cm = mass / height) # divide mass column by height column
## # A tibble: 87 × 1
##    mass_per_cm
##          <dbl>
##  1       0.448
##  2       0.449
##  3       0.333
##  4       0.673
##  5       0.327
##  6       0.674
##  7       0.455
##  8       0.330
##  9       0.459
## 10       0.423
## # … with 77 more rows

Like many others, this function is often useful to apply separately to different groups in the data. Suppose we want to assign each character a unique id based on their homeworld:

starwars |>
  group_by(homeworld) |>
  mutate(world_id = str_c(homeworld, '_', 1:n())) |> # n() is short for group size
  pull(world_id) # convert column "world_id" to a vector
##  [1] "Tatooine_1"       "Tatooine_2"       "Naboo_1"          "Tatooine_3"      
##  [5] "Alderaan_1"       "Tatooine_4"       "Tatooine_5"       "Tatooine_6"      
##  [9] "Tatooine_7"       "Stewjon_1"        "Tatooine_8"       "Eriadu_1"        
## [13] "Kashyyyk_1"       "Corellia_1"       "Rodia_1"          "Nal Hutta_1"     
## [17] "Corellia_2"       "Bestine IV_1"     NA                 "Naboo_2"         
## [21] "Kamino_1"         NA                 "Trandosha_1"      "Socorro_1"       
## [25] "Bespin_1"         "Mon Cala_1"       "Chandrila_1"      NA                
## [29] "Endor_1"          "Sullust_1"        NA                 "Cato Neimoidia_1"
## [33] "Coruscant_1"      "Naboo_3"          "Naboo_4"          "Naboo_5"         
## [37] "Naboo_6"          "Toydaria_1"       "Malastare_1"      "Naboo_7"         
## [41] "Tatooine_9"       "Dathomir_1"       "Ryloth_1"         "Ryloth_2"        
## [45] "Vulpter_1"        "Troiken_1"        "Tund_1"           "Haruun Kal_1"    
## [49] "Cerea_1"          "Glee Anselm_1"    "Iridonia_1"       "Coruscant_2"     
## [53] "Iktotch_1"        "Quermia_1"        "Dorin_1"          "Champala_1"      
## [57] "Naboo_8"          "Naboo_9"          "Tatooine_10"      "Geonosis_1"      
## [61] "Mirial_1"         "Mirial_2"         "Naboo_10"         "Serenno_1"       
## [65] "Alderaan_2"       "Concord Dawn_1"   "Zolan_1"          "Ojom_1"          
## [69] "Kamino_2"         "Kamino_3"         "Coruscant_3"      "Aleen Minor_1"   
## [73] NA                 "Skako_1"          "Muunilinst_1"     "Shili_1"         
## [77] "Kalee_1"          "Kashyyyk_2"       "Alderaan_3"       "Umbara_1"        
## [81] "Utapau_1"         NA                 NA                 NA                
## [85] NA                 NA                 "Naboo_11"

Warning

Here, the behavior of str_c shows us that mutate and many other dplyr verbs operate on vectors, not individual cells.

Grouping is frequently used with summarize. Summarizing is a form of reduction; it takes a (grouped) dataframe and returns a new dataframe with one row per group:

starwars |>
    group_by(gender) |>
    drop_na(height) |> # remove NAs from height for mean and sd
    summarize(m_height = mean(height), sd_height = sd(height))
## # A tibble: 3 × 3
##   gender    m_height sd_height
##   <chr>        <dbl>     <dbl>
## 1 feminine      165.     23.6 
## 2 masculine     177.     37.6 
## 3 <NA>          181.      2.89

Grouping is not just for summarizing. We have seen it can be useful with mutate, and it is useful with slice_* and filter as well. Let’s take a random sample of five characters from each homeworld, first removing rows with fewer than 5 characters sharing their homeworld:

starwars |>
    group_by(homeworld) |>
    mutate(count = n()) |> 
    filter(count >= 5) |> 
    slice_sample(n = 5)
## # A tibble: 15 × 15
## # Groups:   homeworld [3]
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Jar Jar…    196    66 none       orange     orange          52   male  mascu…
##  2 Dormé       165    NA brown      light      brown           NA   fema… femin…
##  3 Ric Olié    183    NA brown      fair       blue            NA   <NA>  <NA>  
##  4 Cordé       157    NA brown      light      brown           NA   fema… femin…
##  5 Gregar …    185    85 black      dark       brown           NA   male  mascu…
##  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  8 Cliegg …    183    NA brown      fair       blue            82   male  mascu…
##  9 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
## 10 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
## 11 Captain…     NA    NA unknown    unknown    unknown         NA   <NA>  <NA>  
## 12 Finn         NA    NA black      dark       dark            NA   male  mascu…
## 13 IG-88       200   140 none       metal      red             15   none  mascu…
## 14 Rey          NA    NA brown      light      hazel           NA   fema… femin…
## 15 Yoda         66    17 white      green      brown          896   male  mascu…
## # … with 6 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>, count <int>

Finally, the count function is for the common summarizing task of counting groups:

count(starwars, homeworld, sort = TRUE)
## # A tibble: 49 × 2
##    homeworld     n
##    <chr>     <int>
##  1 Naboo        11
##  2 Tatooine     10
##  3 <NA>         10
##  4 Alderaan      3
##  5 Coruscant     3
##  6 Kamino        3
##  7 Corellia      2
##  8 Kashyyyk      2
##  9 Mirial        2
## 10 Ryloth        2
## # … with 39 more rows

Info

Count combinations of groups by providing more than one column name, and use add_count to add the counts as a new column on the existing data.
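
For example:

count(starwars, sex, gender)   # one row per observed sex/gender combination
add_count(starwars, homeworld) # keeps all 87 rows and adds an n column of homeworld counts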

Exercises

midwest is a dataset of county-level demographic information in the Midwestern US. Consider any/all of the following tasks:

Questions

  1. What are the dimensions of midwest?
  2. Use a function from the slice_* family (type ?slice into your console) to find the 50 counties with the largest popdensity. Now, how many times is each state represented in this top sample?
  3. Considering only counties in a metro area (i.e. inmetro == 1), compute the mean percentage of people below the poverty line for each state. Also compute its variance.

Solutions

dim(midwest)
## [1] 437  28
midwest |> 
    slice_max(popdensity, n=50) |> 
    count(state)
## # A tibble: 5 × 2
##   state     n
##   <chr> <int>
## 1 IL       10
## 2 IN        9
## 3 MI       10
## 4 OH       16
## 5 WI        5
midwest |> 
    filter(inmetro == 1) |> 
    group_by(state) |> 
    summarize(mpov = mean(percbelowpoverty), vpov = var(percbelowpoverty))
## # A tibble: 5 × 3
##   state  mpov  vpov
##   <chr> <dbl> <dbl>
## 1 IL     9.56  15.4
## 2 IN     9.59  16.0
## 3 MI    11.3   18.5
## 4 OH    11.5   15.1
## 5 WI     9.01  16.5

Going further

  • The R4DS book has some exercises for cleaning and manipulating data in Chapters 5, 10-13, and 16.

  • There are several vignettes that cover advanced dplyr concepts, which I’ve found useful. Check out vignette('colwise') for applying dplyr functions to many columns at once, vignette('two-table') for various ways to combine dataframes, and vignette('programming') to learn more about working with tidy evaluation, the magic system underlying many tidyverse functions. Or, find all vignettes here.

Some other important concepts

Nesting and unnesting

One of our definitions of tidy data is “one observation per row.” What exactly is an observation? It depends!

This is why tidy data frequently contains columns with many “repeated” entries; we want each unit of analysis to be in its own row, even if it means other variables are repeated when they are shared across units.

sw_film <- starwars |> 
    unnest(c(films)) |> # rows corresponding to the nested entry are repeated
    select(name, films, everything()) # just move films to front to emphasize we are interested in it
    
sw_film |> 
    group_by(films) |> 
    summarise(m_age = mean(birth_year, na.rm=TRUE)) |> 
    arrange(desc(m_age))
## # A tibble: 7 × 2
##   films                   m_age
##   <chr>                   <dbl>
## 1 The Phantom Menace      140. 
## 2 Return of the Jedi      133. 
## 3 The Empire Strikes Back 105. 
## 4 Revenge of the Sith     103. 
## 5 Attack of the Clones     98.7
## 6 A New Hope               90.9
## 7 The Force Awakens        56.8

The tidy version was easier in this case because, in the original format, our variable of interest was not easily accessible. If we wanted to make a running tally of birth_year by film, we would need a double for-loop to check which films each character was in.

When would we deliberately nest for tidy data? One example is when the quantity of interest is computed from a subset of the data:

get_slope <- function(df) lm(height ~ mass, df)$coefficients[2]

starwars |> 
    select(sex, height, mass) |> 
    drop_na(height, mass) |> # remove characters with unknown height/weight
    nest(data = c(height, mass)) |> 
    rowwise() |> 
    mutate(slope = get_slope(data))
## # A tibble: 5 × 3
## # Rowwise: 
##   sex            data                slope
##   <chr>          <list>              <dbl>
## 1 male           <tibble [44 × 2]>  1.03  
## 2 none           <tibble [4 × 2]>   0.978 
## 3 female         <tibble [9 × 2]>   0.0578
## 4 hermaphroditic <tibble [1 × 2]>  NA     
## 5 <NA>           <tibble [1 × 2]>  NA

Pivoting

As we will see soon, plotting is another case where the ambiguity of “variable” and “observation” matters. For this and other cases, there are ways to pivot the data, changing the context of variables and observations.

Similar to nesting, think of pivot_longer as lengthening the dataframe. Rather than every variable getting its own column of values, we make one column of variable names and one column of values:

x <- tribble(
  ~A, ~B, ~C,
  1, 2, 3,
  4, 5, 6
)

x
## # A tibble: 2 × 3
##       A     B     C
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6
pivot_longer(x, everything())
## # A tibble: 6 × 2
##   name  value
##   <chr> <dbl>
## 1 A         1
## 2 B         2
## 3 C         3
## 4 A         4
## 5 B         5
## 6 C         6

An example with data, counting how many times each color appears anywhere:

starwars |>
  pivot_longer(
    hair_color:eye_color, 
    names_to='attribute', 
    values_to='color'
  ) |>
  count(color)
## # A tibble: 44 × 2
##    color             n
##    <chr>         <int>
##  1 auburn            1
##  2 auburn, grey      1
##  3 auburn, white     1
##  4 black            23
##  5 blond             3
##  6 blonde            1
##  7 blue             21
##  8 blue-gray         1
##  9 blue, grey        2
## 10 brown            43
## # … with 34 more rows

Here, since we want to count colors across several columns, we pivot these columns into a single column and then count it.
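
The inverse operation is pivot_wider, which spreads a name/value pair of columns back into one column per name (you will need it in the exercises below). A minimal sketch with made-up data:

long <- tibble(
  id = c(1, 1, 2, 2),
  name = c('A', 'B', 'A', 'B'),
  value = c(10, 20, 30, 40)
)
pivot_wider(long, names_from = name, values_from = value) # columns id, A, B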

Exercises

Questions

  1. who contains data on country-wide tuberculosis numbers by year. Columns starting with “new” are counts from that year for various groups. Using pivot_longer, count the number of times each country has an NA in one of these columns. Which countries had the most and the least missing data? Why was pivoting helpful in this example?

  2. us_rent_income contains data on each state’s median monthly rent and income for 2017. Use pivot_wider to spread these values into two columns income and rent. Using these two columns, find the correlation between rent and income.

Solutions

who |>
    pivot_longer(contains("new")) |> 
    group_by(country) |> 
    summarize(na_ct = sum(is.na(value))) |> 
    filter(na_ct %in% c(min(na_ct), max(na_ct)))
## # A tibble: 3 × 2
##   country                           na_ct
##   <chr>                             <int>
## 1 Bonaire, Saint Eustatius and Saba    84
## 2 Curacao                              84
## 3 Monaco                             1889
us_rent_income |> 
    pivot_wider(GEOID:NAME, names_from = variable, values_from = estimate) |>
    summarize(corr = cor(income, rent, use = "complete.obs"))
## # A tibble: 1 × 1
##    corr
##   <dbl>
## 1 0.741

Plotting with ggplot2

Some of you may have used ggplot2 before. The package has its own philosophy (the “grammar of graphics”) that builds on those of all tidyverse packages and tidy data. Let’s go over a few of its key ideas to help make sense of how ggplot2 works:

Here’s a simple example to see each of these ideas at work:

gg <- ggplot(data = slice_sample(diamonds, n=1000), mapping = aes(x = carat, y = price, col = cut)) +
  geom_point(shape = 1) # shape is a "static" argument here

gg +
    scale_x_log10() +
    scale_y_log10() +
    labs(x = 'Carat', y = 'Price', col = 'Cut quality') +
    theme_bw()

Here we’ve plotted price by carat and colored the points according to cut. We then added some visual modifications on top of the stored plot gg.

Breaking down each of the above instructions reveals the basic anatomy of most ggplots:

  1. ggplot is the “master” function that sets a “blank slate.” It is also where the data and aesthetic mappings common to all geoms are specified.
ggplot() # totally blank slate

ggplot(diamonds, aes(x = carat, y = price)) # data and the coordinate system

  2. geom_* functions add on layers of data. Notice that these functions also have a data and mapping argument. These can be used to override the corresponding arguments from ggplot for that geom only.

  3. Last come any modifications of the existing coordinates or geoms. This includes scale_*, renaming parts of the plot, and changing the theme.

There are many geometries available, and the best way to learn them is to check out the cheatsheet and try them out on your own data. Rather than cover them here, we’ll now focus on some common tasks and challenges.

Facetting

Separating data into a group of several plots is done with facets:

gg_hist <- ggplot(diamonds, aes(x = carat, fill = cut)) +
  geom_histogram() +
  facet_grid(cut ~ ., scales = 'free')

An example of a barplot (a histogram for a discrete variable) with 2d facetting:

ggplot(diamonds, aes(x = clarity)) +
  geom_bar(fill = 'lightblue') +
  facet_grid(cut ~ color, scales = 'free')

Changing color schemes and other scales

There are tons of different approaches to customizing color with ggplot, and it’s easy to get overwhelmed! Here we’ll focus on the essentials and leave you to research different color palettes or make your own.

Basically, color is changed through scaling a color aesthetic (col or fill). Going back to our histogram example, here we use one of the color palettes provided by RColorBrewer for discrete variables:

library(RColorBrewer)

my_brew_pal <- brewer.pal(8, 'Set2')

gg_hist +
  scale_fill_discrete(type = my_brew_pal)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As we see above, a color palette can be stored and used in multiple plots. For presentations or papers, it can be nice to have a set of specific colors fixed for a variable of interest. An easy way to do this in ggplot is with scale_*_manual. Say we want all plots using a mapping with cut to have consistent colors:

my_pal <- c('blue2', 'orangered', 'lightpink', 'springgreen', 'gold')
names(my_pal) <- levels(diamonds$cut)

gg_hist +
  scale_fill_manual(values = my_pal)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(diamonds, aes(price, carat, col = cut)) +
  geom_point(alpha = 0.1, data = slice_sample(diamonds, prop = 0.1)) +
  geom_smooth() +
  scale_color_manual(values = my_pal) +
  ylim(0, 3)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 32 rows containing non-finite values (stat_smooth).
## Warning: Removed 4 rows containing missing values (geom_point).

In general, use scale_*_discrete or scale_*_continuous if you want to provide existing palettes or perform other scaling operations (we’ll see another use of scales in a moment), and scale_*_manual for making a discrete palette by hand.

Aesthetics other than color can be scaled in much the same way. A common example is the shape aesthetic.

ggplot(starwars, aes(eye_color, height, shape = gender)) + # shape is mapped to a discrete variable
  geom_point(size = 2) +
  scale_shape_manual(values = c(19, 1), na.value = 13) +
  scale_x_discrete(guide = guide_axis(angle = 45))
## Warning: Removed 6 rows containing missing values (geom_point).

diamonds |>
  mutate(car_bin = ifelse(carat < mean(carat), 'low_car', 'high_car')) |>
  group_by(car_bin, color) |>
  summarize(mpr = mean(price)) |>
  ggplot(aes(color, mpr, shape = car_bin, group = car_bin)) +
  geom_point() +
  geom_line()
## `summarise()` has grouped output by 'car_bin'. You can override using the
## `.groups` argument.

Going further

  • The R Graphics Cookbook has tons of examples for all things ggplot.

  • Chapter 3 in R4DS is a good review of ggplot concepts and basics, with exercises.

Bonus: Intro to Network Science with tidygraph + ggraph

tidygraph is a package for constructing and analyzing graphs with a workflow similar to tidyr/dplyr, and ggraph is a package similar to ggplot for visualization.

In tidygraph, a graph is represented with two tibbles: one of nodes (vertices) and one of edges (links). These tibbles are bundled together as one object called a tbl_graph.

(graph <- create_ring(10))
## # A tbl_graph: 10 nodes and 10 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 10 × 0 (active)
## # … with 4 more rows
## #
## # Edge Data: 10 × 2
##    from    to
##   <int> <int>
## 1     1     2
## 2     2     3
## 3     3     4
## # … with 7 more rows

In ggraph, a graph is plotted in “layers” just like ggplot. There are specific geoms for nodes and edges:

gg_net <- ggraph(graph) +
  geom_node_point(color = 'blue', size = 5)
## Using `stress` as default layout
gg_net # just the nodes

gg_net +
  geom_edge_link() # now add the edges on top

dplyr verbs on graphs

Since a tbl_graph is made of tibbles, we can use all the familiar functions on them:

graph <- graph |> 
  mutate(name = fruit[1:n()])

Here we have added a new attribute to the nodes by adding a new column to the nodes tibble. Also, notice how the nodes are labelled active. This is how we specify whether we want to manipulate the node or the edge tibble. Switch between the two using activate.

graph <- graph |> 
  activate(edges) |> 
  mutate(weight = runif(n(), 1, 5))

Info

%N>% can be used as shorthand for |> activate(nodes) |>, and %E>% for activating edges
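
For example, the following are equivalent ways to modify the edge data:

graph |> activate(edges) |> mutate(weight = weight * 2)
graph %E>% mutate(weight = weight * 2)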

Attributes can be used in aesthetic mappings when plotting. As with all geoms, check which aesthetics can be controlled with a mapping with ?geom_node_* etc.

ggraph(graph) +
  geom_edge_link(mapping = aes(edge_color = weight), edge_width = 2) +
  geom_node_label(mapping = aes(label = name))
## Using `stress` as default layout

Specialized verbs on graphs

In addition to the usual tidyverse functions, there are many new verbs specific to network science, such as for finding the shortest distance between two nodes or calculating a node’s degree. These can be useful, for example, for richer visualization.

In the following example, we generate a network using the famous Erdős–Rényi model, then compute two measures of node centrality, or how “connected” a node is. The degree is the number of neighbors, and the PageRank comes from the formula Google used to rank web search results in its early days.

graph <- play_erdos_renyi(300, p = 0.03) |>
  mutate(
    degree = centrality_degree(),
    pagerank = centrality_pagerank()
  )

graph |>
  as_tibble() |>
  pivot_longer(c(degree, pagerank)) |>
  ggplot(aes(x = value, fill = name)) +
  geom_density() +
  facet_wrap(~name, scales = 'free', nrow = 2)

ggraph(graph) +
  geom_edge_link0(alpha = 0.05) + # use edge0 family for faster plotting
  geom_node_point(aes(col = degree, size = degree))
## Using `stress` as default layout

One important type of network operation is clustering, where group-level patterns are extracted from the data. The basic motive behind clustering is that nodes within clusters should tend to be more connected with each other than with nodes of other clusters. Many different clustering algorithms have been proposed, and tidygraph supports the most popular ones with the group_* functions.

play_islands(n_islands = 5, size_islands = 10, p_within = 0.8, m_between = 3) |> 
  mutate(community = group_fast_greedy()) |> # classic method of maximizing modularity
  ggraph() + 
  geom_edge_link(alpha = 0.3) + 
  geom_node_point(aes(color = as.factor(community)), size = 7) + 
  theme_graph() # ggraph's provided theme: lots of customization is available!
## Using `stress` as default layout

Epidemics on networks

To show more tidygraph functionality as part of a more realistic workflow, we will be implementing a simple stochastic epidemiological model. Our infection process will spread over a graph based on two rules.

  1. An uninfected, or susceptible, node can become infected from any neighbor that is infected, independently with probability beta
  2. A node that is already infectious can recover with probability alpha

In epidemiological modeling, this model is known as the SIS model because a node’s state can only go from susceptible to infectious to susceptible again.

The workhorse behind our implementation will be tidygraph::map_local, which extracts the neighborhood of each node as a list of subgraphs, then maps a function .f over each of these graphs and returns the result. From the function’s documentation:

The function provided to .f will be called with the following arguments in addition to those supplied through ...:

  • neighborhood: The neighborhood graph of the node
  • graph: The full tbl_graph object
  • node: The index of the node currently mapped over

So, our main task will be to provide an .f that takes these arguments and updates the infection status of node by looking at its neighborhood for other infectious nodes.

init_outbreak <- function(graph, nseed=1) {
    init <- sample(length(graph), nseed)
    
    graph |> 
        mutate(infected = ifelse(node_idx %in% init, TRUE, FALSE))
}

update_node <- function(neighborhood, graph, node, inf_stats, beta = 0.3, alpha = 0.1) {
    if (inf_stats[node]) # if the node is already infected, it recovers with probability alpha
        return(runif(1) >= alpha) # TRUE = stays infected, FALSE = recovers
    
    ninf <- sum(pull(neighborhood, infected)) # count the infected neighbors
    if (ninf > 0) {
        inter <- runif(ninf)
        if (any(inter < beta)) # each infected neighbor transmits independently with probability beta
            return(TRUE)
    }
    return(FALSE)
}

run_step <- function(graph) {
    graph |> 
        mutate(infected = map_local_lgl(.f = update_node, inf_stats = infected))
}

plot_outbreak <- function(graph) {
    ggraph(graph) +
        geom_edge_link0(alpha = 0.05) +
        geom_node_point(aes(col = infected))
}

graph0 <- create_notable("Zachary") |> 
    mutate(node_idx = 1:n()) |> 
    init_outbreak(nseed = 2)

plot_outbreak(graph0) # plot the initial state
## Using `stress` as default layout

reduce(1:10, ~run_step(.x), .init = graph0) |>
    plot_outbreak() # plot result after running 10 steps
## Using `stress` as default layout

Going further

  • A deeper summary of tidygraph’s features can be found here

  • Something we didn’t get into is the layout of graphs, and a number of algorithms exist for deciding where to arrange nodes in a plot. In ggraph you can specify which layout you want with ggraph(graph, layout = ...). Many other ways beyond the “ball and stick” visualization for graphs are also available; see here for a good intro.

  • tidygraph is essentially a wrapper around another network analysis project called igraph. While igraph itself is implemented in the C language for speed, an R package also exists. If you find something you can’t accomplish with tidygraph, it is designed to make it easy to fall back to igraph temporarily.