From the website:
“The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”
The “core” tidyverse packages are the ones automatically install/loaded:
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.2 v dplyr 1.0.6
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
They form a complete workflow for writing most R programs:
readr
for reading/writing rectangular datatidyr
for cleaning and organizing datadplyr
and purrr
for manipulating and summarizingforcats
and stringr
for special operations on factors/stringsIn addition to these core packages, other more specific packages are considered part of the “extended” tidyverse. These packages need to be installed/loaded independently.
Cheatsheets are a great way to learn a package quickly. I also like to use them to learn a package while working. I’ll pull up the package I want to learn, and try to find ways to use different functions from it for whatever I’m doing.
R for Data Science (R4DS) can be good for all levels, and centers around the tidyverse. It starts very gently and assumes almost no R knowledge. There are also exercises and solutions available.
Advanced R is a more general programming guide, but draws heavily from tidyverse packages.
The most fundamental data structure for tidyverse is a modernization of the dataframe. The most important differences are clearer printing and more flexible column names.
(dat <- starwars)
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke S~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth ~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia O~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen L~ 178 120 brown, grey light blue 52 male mascu~
## 7 Beru W~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 <NA> white, red red NA none mascu~
## 9 Biggs ~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wa~ 182 77 auburn, wh~ fair blue-gray 57 male mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
read_*
always returns a tibble. Creating tibbles from scratch can be done similar to data.frame
:
tibble(x = 1:5, y = LETTERS[1:5], z = x^2)
## # A tibble: 5 x 3
## x y z
## <int> <chr> <dbl>
## 1 1 A 1
## 2 2 B 4
## 3 3 C 9
## 4 4 D 16
## 5 5 E 25
or with tribble
tribble(
~x, ~y,
1, 'A',
2, 'B',
3, 'C'
)
## # A tibble: 3 x 2
## x y
## <dbl> <chr>
## 1 1 A
## 2 2 B
## 3 3 C
Creating tibbles from dataframes, matrices, or named vectors:
## # A tibble: 50 x 2
## speed dist
## <dbl> <dbl>
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
## 7 10 18
## 8 10 26
## 9 10 34
## 10 11 17
## # ... with 40 more rows
## # A tibble: 3 x 2
## name value
## <chr> <dbl>
## 1 a 1
## 2 b 2
## 3 c 3
You can largely use tibbles and dataframes interchangeably (functions that say they work with one will work with the other), so I’ll just call them dataframes from now on. See the vignette for more features and differences from regular dataframes
vignette('tibble')
R for Data Science gives three rules for tidy data:
The tidyr
package contains functions for making data tidy. For example, suppose you are given some data from a field survey.
survey <- tibble(
present_yes = sample(c('yes', NA), 15, replace = TRUE),
present_no = ifelse(is.na(present_yes), 'no', NA)
)
survey
## # A tibble: 15 x 2
## present_yes present_no
## <chr> <chr>
## 1 <NA> no
## 2 yes <NA>
## 3 <NA> no
## 4 <NA> no
## 5 <NA> no
## 6 <NA> no
## 7 <NA> no
## 8 <NA> no
## 9 <NA> no
## 10 <NA> no
## 11 yes <NA>
## 12 <NA> no
## 13 yes <NA>
## 14 <NA> no
## 15 yes <NA>
Someone has spread a single variable over two columns, causing lots of blank spaces in the data. This may have been more convenient in the field, but programming with the data as-is will be a bit difficult. Thankfully, conversion to tidy format is easy.
unite(survey, 'present', present_yes, present_no, na.rm = TRUE)
## # A tibble: 15 x 1
## present
## <chr>
## 1 no
## 2 yes
## 3 no
## 4 no
## 5 no
## 6 no
## 7 no
## 8 no
## 9 no
## 10 no
## 11 yes
## 12 no
## 13 yes
## 14 no
## 15 yes
Here, we have used unite
, which pastes two or more columns of strings into a single column, while removing the NA
s.
Another nice tidyr use is to drop rows which contain an NA
, possibly only in specific columns.
nrow(dat)
## [1] 87
nrow(drop_na(dat)) # drop all rows with an NA
## [1] 6
nrow(drop_na(dat, birth_year)) # only rows with NA in "birth_year"
## [1] 43
The dplyr
package is great for performing seemingly endless operations on dataframes. Just like in base R, we can subset a dataframe by rows or columns:
## select specific cols
dat[,c(2, 6)]
select(dat, height, eye_color)
dat[,1:5]
select(dat, name:skin_color)
## select last 5 rows
tail(dat, 5)
slice_tail(dat, n = 5)
## all cols with an underscore
dat[, which(str_detect(colnames(dat), '_'))]
select(dat, contains('_'))
More complex slicing and subsetting is possible
## select cols conditionally
select(dat, where(is.double))
select(dat, !last_col() & ends_with('s'))
## slice rows conditionally (filtering)
filter(dat, eye_color == 'blue')
Notice that most tidyverse functions are verbs or adverbs (coding is, after all, just giving instructions). There is a special operator, the pipe operator, which allows these commands to be chained together!
As an example, let’s create a random “population” of 1,000 of the starwars characters:
dat |> # start of chain is the data
filter(eye_color == 'blue') |> # each step takes the previous result
select(height:species) |> # and does a new thing to it
drop_na() |>
slice_sample(n = 1000, replace = TRUE)
## # A tibble: 1,000 x 10
## height mass hair_color skin_color eye_color birth_year sex gender
## <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 165 75 brown light blue 47 female feminine
## 2 166 50 black yellow blue 40 female feminine
## 3 166 50 black yellow blue 40 female feminine
## 4 165 75 brown light blue 47 female feminine
## 5 172 77 blond fair blue 19 male masculine
## 6 178 120 brown, grey light blue 52 male masculine
## 7 175 79 none light blue 37 male masculine
## 8 172 77 blond fair blue 19 male masculine
## 9 172 77 blond fair blue 19 male masculine
## 10 188 84 blond fair blue 41.9 male masculine
## # ... with 990 more rows, and 2 more variables: homeworld <chr>, species <chr>
x |> f |> g
is the same as g(f(x))
; however, when you string alot of functions together piping makes things more readable.
Another family of dplyr
functions use existing columns to make new ones. The most common is mutate
, which appends a new row based on a vectorized function of existing rows.
dat |>
mutate(mass_per_cm = mass / height) # divide mass column by height column
## # A tibble: 87 x 15
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke S~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth ~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia O~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen L~ 178 120 brown, grey light blue 52 male mascu~
## 7 Beru W~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 <NA> white, red red NA none mascu~
## 9 Biggs ~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wa~ 182 77 auburn, wh~ fair blue-gray 57 male mascu~
## # ... with 77 more rows, and 6 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>, mass_per_cm <dbl>
This function, like many others, is often useful to apply separately to different groups in the data. Suppose we want to assign each character a unique id based on their homeworld:
dat |>
group_by(homeworld) |>
mutate(world_id = str_c(homeworld, '_', 1:n())) |> # n() is short for group size
pull(world_id) # convert column "world_id" to a vector
## [1] "Tatooine_1" "Tatooine_2" "Naboo_1" "Tatooine_3"
## [5] "Alderaan_1" "Tatooine_4" "Tatooine_5" "Tatooine_6"
## [9] "Tatooine_7" "Stewjon_1" "Tatooine_8" "Eriadu_1"
## [13] "Kashyyyk_1" "Corellia_1" "Rodia_1" "Nal Hutta_1"
## [17] "Corellia_2" "Bestine IV_1" NA "Naboo_2"
## [21] "Kamino_1" NA "Trandosha_1" "Socorro_1"
## [25] "Bespin_1" "Mon Cala_1" "Chandrila_1" NA
## [29] "Endor_1" "Sullust_1" NA "Cato Neimoidia_1"
## [33] "Coruscant_1" "Naboo_3" "Naboo_4" "Naboo_5"
## [37] "Naboo_6" "Toydaria_1" "Malastare_1" "Naboo_7"
## [41] "Tatooine_9" "Dathomir_1" "Ryloth_1" "Ryloth_2"
## [45] "Vulpter_1" "Troiken_1" "Tund_1" "Haruun Kal_1"
## [49] "Cerea_1" "Glee Anselm_1" "Iridonia_1" "Coruscant_2"
## [53] "Iktotch_1" "Quermia_1" "Dorin_1" "Champala_1"
## [57] "Naboo_8" "Naboo_9" "Tatooine_10" "Geonosis_1"
## [61] "Mirial_1" "Mirial_2" "Naboo_10" "Serenno_1"
## [65] "Alderaan_2" "Concord Dawn_1" "Zolan_1" "Ojom_1"
## [69] "Kamino_2" "Kamino_3" "Coruscant_3" "Aleen Minor_1"
## [73] NA "Skako_1" "Muunilinst_1" "Shili_1"
## [77] "Kalee_1" "Kashyyyk_2" "Alderaan_3" "Umbara_1"
## [81] "Utapau_1" NA NA NA
## [85] NA NA "Naboo_11"
Grouping is frequently used with summarize
. Summarizing is a form of reduction; it takes a (grouped) dataframe and returns a new dataframe with one row per group:
dat |>
group_by(gender) |>
drop_na(height) |> # remove NAs from height for mean and sd
summarize(m_height = mean(height), sd_height = sd(height))
## # A tibble: 3 x 3
## gender m_height sd_height
## <chr> <dbl> <dbl>
## 1 feminine 165. 23.6
## 2 masculine 177. 37.6
## 3 <NA> 181. 2.89
A more complex example, grouping by two groups, then simulating a distribution for mass
, based on the observations from each group.
mass_sim <- dat |>
drop_na(mass) |>
group_by(eye_color, hair_color) |> # notice the number of groups is printed
summarize(
mass_boot = list(sample(mass, 1000, replace = TRUE) + rnorm(1000))
) |>
unnest(mass_boot) # un-collapses the list by repeating the other rows
## `summarise()` has grouped output by 'eye_color'. You can override using the `.groups` argument.
## make a histogram for one group
mass_sim |>
filter(eye_color == 'blue', hair_color == 'brown') |>
pull(mass_boot) |>
hist(breaks = 20, main = '')
Here we have used the tidyr function unnest
, which “expands” a column of lists (a list of lists, really) into a longer column with each entry in each list as a new row.
Finally, the count
function is for the common summarizing task of counting groups:
count(dat, homeworld, sort = TRUE)
## # A tibble: 49 x 2
## homeworld n
## <chr> <int>
## 1 Naboo 11
## 2 Tatooine 10
## 3 <NA> 10
## 4 Alderaan 3
## 5 Coruscant 3
## 6 Kamino 3
## 7 Corellia 2
## 8 Kashyyyk 2
## 9 Mirial 2
## 10 Ryloth 2
## # ... with 39 more rows
Count group cominations by providing \(>1\) column name, and use add_count
to add the counts to the existing data.
The R4DS book has some exercises for cleaning and manipulating data in Chapters 5, 10-13, and 16.
There are several vignettes that cover advanced dplyr concepts, which I’ve found useful. Check out vignette('colwise')
for applying dplyr functions to many columns at once, vignette('two-table')
for various ways to combine dataframes, and vignette('programming')
to learn more about working with tidy evaluation, the magic system underlying many tidyverse functions. Or, find all vignettes here.
Recall that tidy data is supposed to have one column per variable, and one row per observation. As we will see soon, plotting is one case where the ambiguity of “variable” and “observation” matters. For this and other cases, there are ways to pivot the data, changing the context of variables and observations.
Think of pivot_longer
as lengthening the dataframe. Rather than every variable getting a row of values, we make one row of the variables and one of the values:
x <- tribble(
~A, ~B, ~C,
1, 2, 3,
4, 5, 6
)
x
## # A tibble: 2 x 3
## A B C
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 6
pivot_longer(x, everything())
## # A tibble: 6 x 2
## name value
## <chr> <dbl>
## 1 A 1
## 2 B 2
## 3 C 3
## 4 A 4
## 5 B 5
## 6 C 6
An example with data, counting how many times each color appears anywhere:
dat |>
pivot_longer(
hair_color:eye_color,
names_to='attribute',
values_to='color'
) |>
count(color)
## # A tibble: 44 x 2
## color n
## <chr> <int>
## 1 auburn 1
## 2 auburn, grey 1
## 3 auburn, white 1
## 4 black 23
## 5 blond 3
## 6 blonde 1
## 7 blue 21
## 8 blue-gray 1
## 9 blue, grey 2
## 10 brown 43
## # ... with 34 more rows
Here, since we want to count colors across several columns, we pivot these columns into a single column and then count
it.
Some of you may have used ggplot2
before. The package has its own philosophy (the “grammar of graphics”) that builds on those of all tidyverse packages and tidy data. Let’s go over a few to help make sense of how ggplot2
works:
Plots are made of layers. For ggplot, visual information is added to plots one “group” at a time, using geometries. For example, use geom_point
to add points and geom_histrogram
to add histogram bins. Multiple types of geometries can be layered onto the same plot.
Plots are an interpretation of relationships in the data. These relationships can be specified with aesthetic mappings. For example, if the points you are plotting come from two species, you can color points according to their species with the mapping species -> color
.
A plot is an object that we can perform operations on. This means we can always continue to modify a plot by performing new steps, much like how the |>
operator lets data be modified in single steps. For example, scale_x_sqrt
adds the instruction to take the square root of the variable on the x-axis before plotting. This also means that ggplot2
objects can be stored in a variable in R to be used later.
Here’s a simple example to see each of these ideas at work:
gg <- ggplot(data = diamonds, mapping = aes(x = price, y = carat, col = cut)) +
geom_point(shape = 1) # shape is a "static" argument here
gg +
scale_x_log10() +
labs(x = 'Price', y = 'Carat', col = 'Cut quality') +
theme_bw()
Here we’ve plotted price
by carat
and colored the points according to cut
. We then added some visual modifications on top of the stored plot gg
.
Breaking down each of the instructions, a basic anatomy of most ggplots is found:
ggplot
is the “master” function that sets a “blank slate.” It is also where data and aesthetic mappings, common to all geoms, is specified.ggplot() # totally blank slate
ggplot(diamonds, aes(price, carat)) # data and the coordinate system
geom_*
functions add on layers of data. Notice that these functions also have a data
and mapping
argument. These can be used to override these arguments from ggplot
for that geom only.
Last come any modifications of the existing coordinates or geoms. This includes scale_*
, renaming parts of the plot, and changing the theme.
There are many geometries available, and the best way to learn them is to check out the cheatsheet and try them out on your own data. Rather than cover them here, we’ll now focus on some common tasks and challenges.
Separating data into a group of several plots is done with facets:
gg_hist <- ggplot(diamonds, aes(x = carat, fill = cut)) +
geom_histogram() +
facet_grid(cut ~ ., scales = 'free')
An example of a barplot (histogram for discrete variable) with 2d facetting:
ggplot(diamonds, aes(x = clarity)) +
geom_bar(fill = 'lightblue') +
facet_grid(cut ~ color, scales = 'free')
There are tons of different approaches to customizing color with ggplot, and its easy to get overwhelmed! Here we’ll focus on the essentials and leave you to research different color palettes or making your own.
Basically, color is changed through scaling a color aesthetic (col
or fill
). Going back to our histogram example, here we use one of the color palettes provided by RColorBrewer
for discrete variables:
library(RColorBrewer)
my_brew_pal <- brewer.pal(8, 'Set2')
gg_hist +
scale_fill_discrete(type = my_brew_pal)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As we see above, a color palette can be stored and used in multiple plots. For presentations or papers, it can be nice to have a set of specific colors fixed for a variable of interest. An easy way to do this in ggplot is with scale_*_manual
. Say we want all plots using a mapping with cut
to have consistent colors:
my_pal <- c('blue2', 'orangered', 'lightpink', 'springgreen', 'gold')
names(my_pal) <- levels(diamonds$cut)
gg_hist +
scale_fill_manual(values = my_pal)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(diamonds, aes(price, carat, col = cut)) +
geom_point(alpha = 0.1, data = slice_sample(diamonds, prop = 0.1)) +
geom_smooth() +
scale_color_manual(values = my_pal) +
ylim(0, 3)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 32 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
In general, use scale_*_discrete
or scale_*_continuous
if you want to provide existing palettes or do other scaling functions (we’ll see another use of scale in a moment), and scale_*_manual
for making a discrete palette by hand.
Aesthetics other than color can be scaled in much the same way. A common example is shape:
ggplot(starwars, aes(eye_color, height, shape = gender)) + # col is continuous here
geom_point(size = 2) +
scale_shape_manual(values = c(19, 1), na.value = 13) +
scale_x_discrete(guide = guide_axis(angle = 45))
## Warning: Removed 6 rows containing missing values (geom_point).
diamonds |>
mutate(car_bin = ifelse(carat < mean(carat), 'low_car', 'high_car')) |>
group_by(car_bin, color) |>
summarize(mpr = mean(price)) |>
ggplot(aes(color, mpr, shape = car_bin, group = car_bin)) +
geom_point() +
geom_line()
## `summarise()` has grouped output by 'car_bin'. You can override using the `.groups` argument.
The R Graphics Cookbook has tons of examples for all things ggplot.
Chapter 3 in R4DS is a good review of ggplot concepts and basics, with exercises.
The functions documented in ?Control
are reserved words. That means you can’t name a variable or function one of these words. Going over some basic examples:
## foo: add or subtract a and c, depending on b
foo <- function(a, b = 3, c) {
if (a < b)
ret <- bar(a, c)
else
ret <- bat(a, c)
return(ret)
}
bar <- function(d, e) d + e
bat <- function(f, g) f - g
foo(1, c = 4)
## [1] 5
for (i in 1:10)
print(paste0(i, ':', foo(i, c = 4)))
## [1] "1:5"
## [1] "2:6"
## [1] "3:-1"
## [1] "4:0"
## [1] "5:1"
## [1] "6:2"
## [1] "7:3"
## [1] "8:4"
## [1] "9:5"
## [1] "10:6"
Use break
to skip out of a for/while loop entirely, or next
to skip to the next iteration:
fav_fruits <- c('blueberry', 'orange')
enemy_fruit <- 'honeydew'
for (l in fruit) {
if (l %in% fav_fruits) {
print('yum')
next
}
if (l == enemy_fruit) {
print('ewwwww!')
break
}
print(paste0("I don't like ", l))
}
## [1] "I don't like apple"
## [1] "I don't like apricot"
## [1] "I don't like avocado"
## [1] "I don't like banana"
## [1] "I don't like bell pepper"
## [1] "I don't like bilberry"
## [1] "I don't like blackberry"
## [1] "I don't like blackcurrant"
## [1] "I don't like blood orange"
## [1] "yum"
## [1] "I don't like boysenberry"
## [1] "I don't like breadfruit"
## [1] "I don't like canary melon"
## [1] "I don't like cantaloupe"
## [1] "I don't like cherimoya"
## [1] "I don't like cherry"
## [1] "I don't like chili pepper"
## [1] "I don't like clementine"
## [1] "I don't like cloudberry"
## [1] "I don't like coconut"
## [1] "I don't like cranberry"
## [1] "I don't like cucumber"
## [1] "I don't like currant"
## [1] "I don't like damson"
## [1] "I don't like date"
## [1] "I don't like dragonfruit"
## [1] "I don't like durian"
## [1] "I don't like eggplant"
## [1] "I don't like elderberry"
## [1] "I don't like feijoa"
## [1] "I don't like fig"
## [1] "I don't like goji berry"
## [1] "I don't like gooseberry"
## [1] "I don't like grape"
## [1] "I don't like grapefruit"
## [1] "I don't like guava"
## [1] "ewwwww!"
Write a function boxR
that takes arguments w
and h
and prints a \(w \times h\) rectangle of *
s. Hints: use nested for loops, and use cat('*')
to print *
without moving to a newline, and cat('\n')
to move to a newline.
Solution
boxR <- function(w, h) {
for (i in 1:w) {
for (j in 1:h) {
cat('*')
}
cat('\n')
}
}
boxR(3, 4)
## ****
## ****
## ****
A programming rite of passage. Write a function triangleR
that takes odd integer b
and prints an isosceles triangle with base \(b\). Hint: draw pictures!
Solution
triangleR <- function(b) {
space <- (b - 1) %/% 2
for (i in space:0) {
for (j in 1:b) {
if (j <= i | j > b - i)
cat(' ')
else
cat('*')
}
cat('\n')
}
}
triangleR(11)
## *
## ***
## *****
## *******
## *********
## ***********
tidygraph
is a package for constructing and analyzing graphs with a workflow similar to tidyr/dplyr, and ggraph
is a package similar to ggplot for visualization.
library(tidygraph)
##
## Attaching package: 'tidygraph'
## The following object is masked from 'package:stats':
##
## filter
library(ggraph)
In tidygraph, a graph is represented with two tibbles: one of nodes (vertices) and one of edges (links). These tibbles are bundled together as one object called a tbl_graph
.
(graph <- create_ring(10))
## # A tbl_graph: 10 nodes and 10 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 10 x 0 (active)
## # ... with 4 more rows
## #
## # Edge Data: 10 x 2
## from to
## <int> <int>
## 1 1 2
## 2 2 3
## 3 3 4
## # ... with 7 more rows
In ggraph, a graph is plotted in “layers” just like ggplot. There are specific geoms for nodes and edges:
gg_net <- ggraph(graph) +
geom_node_point(color = 'blue', size = 5)
## Using `stress` as default layout
gg_net
gg_net +
geom_edge_link()
Since a tbl_graph
is made of tibbles, we can use all the familiar functions on them:
graph <- graph |>
mutate(name = fruit[1:n()])
Here we see we have added an new attribute to the nodes, by adding a new column to the nodes tibble. Also, notice how the nodes are labelled active. This is how we specify whether we want to manipulate the node or edge tibble. Switch between the two using activate
.
graph <- graph |>
activate(edges) |>
mutate(weight = runif(n(), 1, 5))
Also, %N>%
can be used as shorthand for |> activate(nodes) |>
, and %E>%
for activating edges.
Attributes can be used in aesthetic mappings when plotting. As with all geoms, check which aesthetics can be controlled with a mapping with ?geom_node_*
etc.
ggraph(graph) +
geom_edge_link(mapping = aes(edge_color = weight), edge_width = 2) +
geom_node_label(mapping = aes(label = name))
## Using `stress` as default layout
In addition to the usual tidyverse functions, there are many new verbs specific to network science, such as for finding the shortest distance between two nodes or calculating a node’s degree. These can be useful, for example, for richer visualization.
In the following example, we generate a network using the famous Erdős–Rényi model, then compute two measures of node centrality, or how “connected” a node is. The degree is the number of neighbors and the page rank is from the formula for ordering web results in Google’s early days.
graph <- play_erdos_renyi(300, p = 0.03) |>
mutate(
degree = centrality_degree(),
pagerank = centrality_pagerank()
)
graph |>
as_tibble() |>
pivot_longer(c(degree, pagerank)) |>
ggplot(aes(x = value, fill = name)) +
geom_density() +
facet_wrap(~name, scales = 'free', nrow = 2)
ggraph(graph) +
geom_edge_link0(alpha = 0.1) + # use edge0 family for faster plotting
geom_node_point(aes(col = degree, size = degree))
## Using `stress` as default layout
One important type of network operation is clustering, where group-level patterns are extracted from the data. The basic motive behind clustering is that nodes within clusters should tend to be more connected with each other than with nodes of other clusters. Many different clustering algorithms have been proposed, and tidygraph supports the most popular ones with the group_*
functions.
play_islands(n_islands = 5, size_islands = 10, p_within = 0.8, m_between = 3) |>
mutate(community = as.factor(group_infomap())) |>
ggraph() +
geom_edge_link(alpha = 0.3) +
geom_node_point(aes(colour = community), size = 7) +
theme_graph() # a simple t
## Using `stress` as default layout
A deeper summary of tidygraph’s features can be found here
Something we didn’t get into is the layout of graphs, and a number of algorithms exist for deciding where to arrange nodes in a plot. In ggraph you can specify which layout you want with ggraph(graph, layout = ...)
. Many other ways beyond the “ball and stick” visualization for graphs are also available; see here for a good intro.
tidygraph borrows heavily from another network analysis project called igraph
. While igraph itself is implemented in the C language for speed, an R package also exists. Anything you can do in R’s igraph can be done in tidygraph, and personally I find the igraph package a bit outdated, but if you don’t like the “two-tibble” approach know that another option exists.
You’ve seen how to do some work with systems of differential equations (compartmental models) on paper, but how can we see how these systems evolve over time as a whole? For that, we usually need a computer to help, using numerical integration.
In MOCS, you will learn much more about numerical integration and even write your own algorithms for it. However, there is also an R package for this using state-of-the-art algorithms, called deSolve
.
library(deSolve)
The main function of interest here is ode
(for ordinary differential equation), which will simulate a system given a list derivatives and an initial state. We specify a compartmental model like this:
SIS_system <- function(t, state, parameters) {
with(as.list(c(state, parameters)), { # add names to environment
dI <- beta * (1 - I) * I - alpha * I # derivative for I compartment
return(list(dI)) # always must return a list of all derivatives
})
}
and simulate this system with \(\beta = 0.3\), \(\alpha = 0.1\), and \(I_0 = 0.1\):
sis_res <- ode(
y = c(I = 0.1), # named vector of initial states
times = 0:50, # integrate the system from t=0 to 50
func = SIS_system, # where to get the derivatives for updating
parms = c(alpha = 0.1, beta = 0.3) # fixed parameters
)
ode
returns a matrix, with time
as the first column and the state variables for each \(t\) as the following columns:
plot(sis_res[,'time'], sis_res[,'I'])
As we can see, \(I\) is approaching the expected steady-state \(1 - \alpha / \beta\).
To check our derivation of \(R_0\), we will perform a sensitivity analysis, where we basically test our model over a wide combination of variables. Since we only have two parameters and a very simple model, this can be done with a simple grid over \(\alpha\) and \(\beta\):
get_final_state <- function(a, b) {
res <- ode(
y = c(I = 0.1),
times = 0:100,
func = SIS_system,
parms = c(alpha = a, beta = b)
)
res[nrow(res), 'I']
}
## generate all pairs of parameters
param_grid <- expand.grid(alpha = seq(.05, 2, .1), beta = seq(.05, 2, .1))
## for each pair, run SIS and save the final state
inf_stead <- double(nrow(param_grid))
for(i in 1:nrow(param_grid)) {
inf_stead[i] <- get_final_state(param_grid[i, 1], param_grid[i, 2])
}
gg_heat <- param_grid |>
as_tibble() |>
mutate(inf_stead = inf_stead) |>
ggplot(aes(beta, alpha, fill = inf_stead)) +
geom_tile() +
geom_abline(slope=1, intercept=0, col = 'red') # R0 has a func of beta
gg_heat
The blank space around the heatmap can be removed with scales:
gg_heat +
scale_x_continuous(expand=expansion(mult=c(0, 0)), name = 'Transmission') +
scale_y_continuous(expand=expansion(mult=c(0, 0)), name = 'Recovery')