We’ve noted that functions are helpful when you repeat code more than twice; we’ve also noted that a lot of statistical methods involve doing the same thing a large number of times. Putting those together motivates a careful approach to iteratation.

Meanwhile, R’s data structures, especially data frames, are surprisingly flexible. This is useful when the “observations” you want to store become more complex than single values; for example, each row many contain a few scalar observations as well a complete data set. In these cases, list columns are an appropriate column type, and map functions provide a way to interact with those columns.

This is the second module in the Iteration topic.

Overview

Learning Objectives

There are stops along the way, but our goal is to use map functions and iterate over listcolumns in data frames.

Video Lecture


Example

I’ll write code for today’s content in a new R Markdown document called iteration_and_listcols.Rmd in the iteration repo / directory I started last time. The code chunk below loads the tidyverse and sets a seed for reproducibility.

library(tidyverse)
library(rvest)
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
## 
##     guess_encoding
set.seed(1)

Things are gonna get a little weird…

Lists

We need a brief digression about lists before we do anything.

In R, vectors are limited to a single data class – all elements are characters, or all numeric, or all logical. Trying to join the following vectors will result in coersion, as would creating vectors of mixed types.

vec_numeric = 5:8
vec_char = c("My", "name", "is", "Jeff")
vec_logical = c(TRUE, TRUE, TRUE, FALSE)

Lists provide a way to store anything you want. This flexibility is great, but is offset by a certain … clunkiness. Lists contain indexed elements, and the indexed elements themselves be scalars, vectors, or other things entirely.

l = list(
  vec_numeric = 5:8,
  mat         = matrix(1:8, 2, 4),
  vec_logical = c(TRUE, FALSE),
  summary     = summary(rnorm(1000)))
l
## $vec_numeric
## [1] 5 6 7 8
## 
## $mat
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
## 
## $vec_logical
## [1]  TRUE FALSE
## 
## $summary
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.00805 -0.69737 -0.03532 -0.01165  0.68843  3.81028

Lists can be accessed using names or indices, and the things in lists can be accessed in the way you would usually access an object of that type.

l$vec_numeric
## [1] 5 6 7 8
l[[1]]
## [1] 5 6 7 8
l[[1]][1:3]
## [1] 5 6 7

Lists seem bizarre but are really useful. Right now, we’ll use them to hold general inputs and outputs of iterative processes. Even more importantly, we’ll see that data frames are actually a very specific kind of list – one comprised of vectors of the same length – which is why they can store variables of different types.

for loops

For this example, I’m going to start with the list defined below.

list_norms = 
  list(
    a = rnorm(20, 3, 1),
    b = rnorm(20, 0, 5),
    c = rnorm(20, 10, .2),
    d = rnorm(20, -3, 1)
  )

is.list(list_norms)
## [1] TRUE

I’d like to apply my simple mean_and_sd function from writing functions to each element of this list For completeness, that function is below.

mean_and_sd = function(x) {
  
  if (!is.numeric(x)) {
    stop("Argument x should be numeric")
  } else if (length(x) == 1) {
    stop("Cannot be computed for length 1 vectors")
  }
  
  mean_x = mean(x)
  sd_x = sd(x)

  tibble(
    mean = mean_x, 
    sd = sd_x
  )
}

We can apply the mean_and_sd function to each element of list_norms using the lines below.

mean_and_sd(list_norms[[1]])
## # A tibble: 1 × 2
##    mean    sd
##   <dbl> <dbl>
## 1  2.70  1.12
mean_and_sd(list_norms[[2]])
## # A tibble: 1 × 2
##    mean    sd
##   <dbl> <dbl>
## 1 0.416  4.08
mean_and_sd(list_norms[[3]])
## # A tibble: 1 × 2
##    mean    sd
##   <dbl> <dbl>
## 1  10.1 0.191
mean_and_sd(list_norms[[4]])
## # A tibble: 1 × 2
##    mean    sd
##   <dbl> <dbl>
## 1 -3.43  1.18

But now we’ve broken our “don’t repeat code more than twice” rule! Specifically, we’ve applied the same function / operation to the elements of a list sequentially. This is exactly the kind of code repetition for loops address

Below, I define an output list with the same number of entries as my target dataframe; a sequence to iterate over; and a for loop body that applies the mean_and_sd function for each sequence element and saves the result.

output = vector("list", length = 4)

for (i in 1:4) {
  output[[i]] = mean_and_sd(list_norms[[i]])
}

This is already much cleaner than using four almost-identical lines of code, and will make life easier the larger our sequence gets.

In this example, I bypassed a common first step in writing loops because I already had the function I wanted to repeat. Frequently, however, I’ll start with repeated code segements, then abstract the underlying process into a function, and then wrap things up in a for loop.

map

A criticism of for loops is that there’s a lot of overhead – you have to define your output vector / list, there’s the for loop bookkeeping to do, etc – that distracts from the purpose of the code. In this case, we want to apply mean_and_sd to each element of list_norms, but we have to scan inside the for loop to figure that out.

The map functions in purrr try to make the purpose of your code clear. Compare the loop above to the line below.

output = map(list_norms, mean_and_sd)

The first argument to map is the list (or vector, or data frame) we want to iterate over, and the second argument is the function we want to apply to each element. The line above will produce the same output as the previous loop, but is clearer and easier to understand (once you’re used to map …).

This code (using map) is why we pointed out in writing functions that functions can be passed as arguments to other functions. The second argument in map(list_norms, mean_and_sd) is a function we just wrote. To see how powerful this can be, suppose we wanted to apply a different function, say median, to each column of list_norms. The chunk below includes both the loop and the map approach.

output = vector("list", length = 4)

for (i in 1:4) {
  output[[i]] = median(list_norms[[i]])
}

output = map(list_norms, median)

Again, both options produce the same output, but the map places the focus squarely on the function you want to apply by removing much of the bookkeeping.

map variants

There are some useful variants to the basic map function if you know what kind of output you’re going to produce. Below we use map_dbl because median outputs a single numeric value each time; the result is a vector instead of a list. Using the .id argument keeps the names of the elements in the input list.

output = map_dbl(list_norms, median, .id = "input")

If we tried to use map_int or map_lgl, we’d get an error because the output of median isn’t a integer or a logical. This is a good way to help catch mistakes when they arise.

Similarly, since we know mean_and_sd produces a data frame, we can use the output-specific map_dfr; this will produce a single data frame.

output = map_dfr(list_norms, mean_and_sd, .id = "input")

The map_df variants can be helpful when your map statement is part of a longer chain of piped commands.

Lastly, the variant map2 (and map2_dbl, etc) is helpful when your function has two arguments. In these cases, I find it best to be specific about arguments using something like the following (more on anonymous functions shortly):

output = map2(input_1, input_2, \(x,y) func(arg_1 = x, arg_2 = y))

List columns and operations

listcol_df = 
  tibble(
    name = c("a", "b", "c", "d"),
    samp = list_norms
  )

The name column is a character column – if you pull this column from the listcol_df data frame, the result is a character vector. Similarly, the samp column is a list column – on it’s own, it’s a list.

listcol_df |> pull(name)
## [1] "a" "b" "c" "d"
listcol_df |> pull(samp)
## $a
##  [1] 4.134965 4.111932 2.129222 3.210732 3.069396 1.337351 3.810840 1.087654
##  [9] 1.753247 3.998154 2.459127 2.783624 1.378063 1.549036 3.350910 2.825453
## [17] 2.408572 1.665973 1.902701 5.036104
## 
## $b
##  [1] -1.63244797  3.87002606  3.92503200  3.81623040  1.47404380 -6.26177962
##  [7] -5.04751876  3.75695597 -6.54176756  2.63770049 -2.66769787 -1.99188007
## [13] -3.94784725 -1.15070568  4.38592421  2.26866589 -1.16232074  4.35002762
## [19]  8.28001867 -0.03184464
## 
## $c
##  [1] 10.094098 10.055644  9.804419  9.814683 10.383954 10.176256 10.148416
##  [8] 10.029515 10.097078 10.030371 10.008400 10.044684  9.797907 10.480244
## [15] 10.160392  9.949758 10.242578  9.874548 10.342232  9.921125
## 
## $d
##  [1] -5.321491 -1.635881 -1.867771 -3.774316 -4.410375 -4.834528 -3.269014
##  [8] -4.833929 -3.814468 -2.836428 -2.144481 -3.819963 -3.123603 -2.745052
## [15] -1.281074 -3.958544 -4.604310 -4.845609 -2.444263 -3.060119

The list column really is a list, and will behave as such elsewhere in R. So, for example, you can examine the first list entry using usual list index procedures.

listcol_df$samp[[1]]
##  [1] 4.134965 4.111932 2.129222 3.210732 3.069396 1.337351 3.810840 1.087654
##  [9] 1.753247 3.998154 2.459127 2.783624 1.378063 1.549036 3.350910 2.825453
## [17] 2.408572 1.665973 1.902701 5.036104

You will need to be able to manipulate list columns, but usual operations for columns that might appear in mutate (like mean or recode) often don’t apply to the entries in a list column. Instead, recognizing list columns as list columns motivates an approach for working with them.

Let’s apply mean_and_sd to the first element of our list column.

mean_and_sd(listcol_df$samp[[1]])
## # A tibble: 1 × 2
##    mean    sd
##   <dbl> <dbl>
## 1  2.70  1.12

Great! Keeping in mind that listcol_df$samp is a list, we can apply mean_and_sd function to each element using map.

map(listcol_df$samp, mean_and_sd)
## $a
## # A tibble: 1 × 2
##    mean    sd
##   <dbl> <dbl>
## 1  2.70  1.12
## 
## $b
## # A tibble: 1 × 2
##    mean    sd
##   <dbl> <dbl>
## 1 0.416  4.08
## 
## $c
## # A tibble: 1 × 2
##    mean    sd
##   <dbl> <dbl>
## 1  10.1 0.191
## 
## $d
## # A tibble: 1 × 2
##    mean    sd
##   <dbl> <dbl>
## 1 -3.43  1.18

The map function returns a list; we could store the results as a new list column … !!!

We’ve been using mutate to define a new variable in a data frame, especially one that is a function of an existing variable. That’s exactly what we will keep doing.

listcol_df = 
  listcol_df |> 
  mutate(summary = map(samp, mean_and_sd))

listcol_df
## # A tibble: 4 × 3
##   name  samp         summary         
##   <chr> <named list> <named list>    
## 1 a     <dbl [20]>   <tibble [1 × 2]>
## 2 b     <dbl [20]>   <tibble [1 × 2]>
## 3 c     <dbl [20]>   <tibble [1 × 2]>
## 4 d     <dbl [20]>   <tibble [1 × 2]>

Revisiting NSDUH

In reading data from the web and elsewhere, we wrote code that allowed us to import data tables from a recent NSDUH survey; in writing functions we wrapped that code into a function called nsduh_table which, for a given table_number, produces a data frame containing state, age group, year, and percents. A similar function, which omits the argument table_name, is shown below.

nsduh_table <- function(html, table_num) {
  
  table = 
    html |> 
    html_table() |> 
    nth(table_num) |>
    slice(-1) |> 
    select(-contains("P Value")) |>
    pivot_longer(
      -State,
      names_to = "age_year", 
      values_to = "percent") |>
    separate(age_year, into = c("age", "year"), sep = "\\(") |>
    mutate(
      year = str_replace(year, "\\)", ""),
      percent = str_replace(percent, "[a-c]$", ""),
      percent = as.numeric(percent)) |>
    filter(!(State %in% c("Total U.S.", "Northeast", "Midwest", "South", "West")))
  
  table
}

We can use this function to import three tables using the next code chunk, which downloads and extracts the page HTML and then iterates over table numbers. The results are combined using bind_rows(). Note that, because this version of our function doesn’t include table_name, that information is lost for now.

nsduh_url = "http://samhda.s3-us-gov-west-1.amazonaws.com/s3fs-public/field-uploads/2k15StateFiles/NSDUHsaeShortTermCHG2015.htm"

nsduh_html = read_html(nsduh_url)

output = vector("list", 3)

for (i in c(1, 4, 5)) {
  output[[i]] = nsduh_table(nsduh_html, i)
}

nsduh_results = bind_rows(output)

We can also import these data using map(). Here I’m supplying the html argument after the name of the function that I’m iterating over.

nsduh_results = 
  map(c(1, 4, 5), nsduh_table, html = nsduh_html) |> 
  bind_rows()

As with previous examples, using a for loop is pretty okay but the map call is clearer.

We can also do this using data frames and list columns.

nsduh_results= 
  tibble(
    name = c("marj", "cocaine", "heroine"),
    number = c(1, 4, 5)) |> 
  mutate(table = map(number, \(num) nsduh_table(html = nsduh_html, num))) |> 
  unnest(cols = "table")

Operations on nested data

Shifting gears a bit, let’s revisit the weather data from visualization and elsewhere; these data consist of one year of observations from three monitoring stations. The code below pulls these data into R (using the rnoaa package, which interacts with the NOAA API).

weather_df = 
  rnoaa::meteo_pull_monitors(
    c("USW00094728", "USW00022534", "USS0023B17S"),
    var = c("PRCP", "TMIN", "TMAX"), 
    date_min = "2021-01-01",
    date_max = "2022-12-31") |>
  mutate(
    name = recode(
      id, 
      USW00094728 = "CentralPark_NY", 
      USW00022534 = "Molokai_HI",
      USS0023B17S = "Waterhole_WA"),
    tmin = tmin / 10,
    tmax = tmax / 10) |>
  select(name, id, everything())
## Registered S3 method overwritten by 'hoardr':
##   method           from
##   print.cache_info httr
## using cached file: /Users/jeffgoldsmith/Library/Caches/org.R-project.R/R/rnoaa/noaa_ghcnd/USW00094728.dly
## date created (size, mb): 2023-09-19 15:41:55.07359 (8.524)
## file min/max dates: 1869-01-01 / 2023-09-30
## using cached file: /Users/jeffgoldsmith/Library/Caches/org.R-project.R/R/rnoaa/noaa_ghcnd/USW00022534.dly
## date created (size, mb): 2023-09-25 10:06:23.827176 (3.83)
## file min/max dates: 1949-10-01 / 2023-09-30
## using cached file: /Users/jeffgoldsmith/Library/Caches/org.R-project.R/R/rnoaa/noaa_ghcnd/USS0023B17S.dly
## date created (size, mb): 2023-09-19 15:42:03.139582 (0.994)
## file min/max dates: 1999-09-01 / 2023-09-30

The station name and id are constant across the year’s temperature and precipitation data. For that reason, we can reorganize these data into a new data frame with a single row for each station. Weather data will be separated into three station-specific data frames, each of which is the data “observation” for the respective station.

weather_nest = 
  nest(weather_df, data = date:tmin)

weather_nest
## # A tibble: 3 × 3
##   name           id          data              
##   <chr>          <chr>       <list>            
## 1 CentralPark_NY USW00094728 <tibble [730 × 4]>
## 2 Molokai_HI     USW00022534 <tibble [730 × 4]>
## 3 Waterhole_WA   USS0023B17S <tibble [730 × 4]>

This is a different way of producing a list column. The result is a lot like listcol_df, in that the columns in weather_nest are a vector and a list.

weather_nest |> pull(name)
## [1] "CentralPark_NY" "Molokai_HI"     "Waterhole_WA"
weather_nest |> pull(data)
## [[1]]
## # A tibble: 730 × 4
##    date        prcp  tmax  tmin
##    <date>     <dbl> <dbl> <dbl>
##  1 2021-01-01   157   4.4   0.6
##  2 2021-01-02    13  10.6   2.2
##  3 2021-01-03    56   3.3   1.1
##  4 2021-01-04     5   6.1   1.7
##  5 2021-01-05     0   5.6   2.2
##  6 2021-01-06     0   5     1.1
##  7 2021-01-07     0   5    -1  
##  8 2021-01-08     0   2.8  -2.7
##  9 2021-01-09     0   2.8  -4.3
## 10 2021-01-10     0   5    -1.6
## # ℹ 720 more rows
## 
## [[2]]
## # A tibble: 730 × 4
##    date        prcp  tmax  tmin
##    <date>     <dbl> <dbl> <dbl>
##  1 2021-01-01     0  27.8  22.2
##  2 2021-01-02     0  28.3  23.9
##  3 2021-01-03     0  28.3  23.3
##  4 2021-01-04     0  30    18.9
##  5 2021-01-05     0  28.9  21.7
##  6 2021-01-06     0  27.8  20  
##  7 2021-01-07     0  29.4  21.7
##  8 2021-01-08     0  28.3  18.3
##  9 2021-01-09     0  27.8  18.9
## 10 2021-01-10     0  28.3  18.9
## # ℹ 720 more rows
## 
## [[3]]
## # A tibble: 730 × 4
##    date        prcp  tmax  tmin
##    <date>     <dbl> <dbl> <dbl>
##  1 2021-01-01   254   3.2   0  
##  2 2021-01-02   152   0.9  -3.2
##  3 2021-01-03     0   0.2  -4.2
##  4 2021-01-04   559   0.9  -3.2
##  5 2021-01-05    25   0.5  -3.3
##  6 2021-01-06    51   0.8  -4.8
##  7 2021-01-07     0   0.2  -5.8
##  8 2021-01-08    25   0.5  -8.3
##  9 2021-01-09     0   0.1  -7.7
## 10 2021-01-10   203   0.9  -0.1
## # ℹ 720 more rows

Of course, if you can nest data you should be able to unnest it as well, and you can (with the caveat that you’re unnesting a list column that contains a data frame).

unnest(weather_nest, cols = data)
## # A tibble: 2,190 × 6
##    name           id          date        prcp  tmax  tmin
##    <chr>          <chr>       <date>     <dbl> <dbl> <dbl>
##  1 CentralPark_NY USW00094728 2021-01-01   157   4.4   0.6
##  2 CentralPark_NY USW00094728 2021-01-02    13  10.6   2.2
##  3 CentralPark_NY USW00094728 2021-01-03    56   3.3   1.1
##  4 CentralPark_NY USW00094728 2021-01-04     5   6.1   1.7
##  5 CentralPark_NY USW00094728 2021-01-05     0   5.6   2.2
##  6 CentralPark_NY USW00094728 2021-01-06     0   5     1.1
##  7 CentralPark_NY USW00094728 2021-01-07     0   5    -1  
##  8 CentralPark_NY USW00094728 2021-01-08     0   2.8  -2.7
##  9 CentralPark_NY USW00094728 2021-01-09     0   2.8  -4.3
## 10 CentralPark_NY USW00094728 2021-01-10     0   5    -1.6
## # ℹ 2,180 more rows

Nesting columns can help with data organization and comprehension by masking complexity you’re less concerned about right now and clarifying the things you are concerned about. In the weather data, it can be helpful to think of stations as the basic unit of observation, and daily weather recordings as a more granular level of observation. Nesting can also simplify the use of analytic approaches across levels of a higher variable.

Suppose we want to fit the simple linear regression relating tmax to tmin for each station-specific data frame. First I’ll write a quick function that takes a data frame as the sole argument to fit this model.

weather_lm = function(df) {
  lm(tmax ~ tmin, data = df)
}

Let’s make sure this works on a single data frame.

weather_lm(weather_nest$data[[1]])
## 
## Call:
## lm(formula = tmax ~ tmin, data = df)
## 
## Coefficients:
## (Intercept)         tmin  
##       7.514        1.034

Since weather$data is a list, we can apply our weather_lm function to each data frame using map.

map(weather_nest$data, weather_lm)
## [[1]]
## 
## Call:
## lm(formula = tmax ~ tmin, data = df)
## 
## Coefficients:
## (Intercept)         tmin  
##       7.514        1.034  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = tmax ~ tmin, data = df)
## 
## Coefficients:
## (Intercept)         tmin  
##     21.7547       0.3222  
## 
## 
## [[3]]
## 
## Call:
## lm(formula = tmax ~ tmin, data = df)
## 
## Coefficients:
## (Intercept)         tmin  
##       7.532        1.137

As an aside, you can avoid the creation of a dedicated function using map’s syntax for “anonymous” (i.e. not named and saved) functions. This is helpful for really short operations – here, rather than writing a separate (named) function that takes the argument df and returns a regression of tmax on tmin, I can use \(df) to indicate that same function. If you have a one-line function, this is a good option; otherwise creating a named function is probably better.

map(weather_nest$data, \(df) lm(tmax ~ tmin, data = df))
## [[1]]
## 
## Call:
## lm(formula = tmax ~ tmin, data = df)
## 
## Coefficients:
## (Intercept)         tmin  
##       7.514        1.034  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = tmax ~ tmin, data = df)
## 
## Coefficients:
## (Intercept)         tmin  
##     21.7547       0.3222  
## 
## 
## [[3]]
## 
## Call:
## lm(formula = tmax ~ tmin, data = df)
## 
## Coefficients:
## (Intercept)         tmin  
##       7.532        1.137

Let’s use mutate to fit this model, and to store the result in the same dataframe.

weather_nest = 
  weather_nest |> 
  mutate(models = map(data, weather_lm))

weather_nest
## # A tibble: 3 × 4
##   name           id          data               models
##   <chr>          <chr>       <list>             <list>
## 1 CentralPark_NY USW00094728 <tibble [730 × 4]> <lm>  
## 2 Molokai_HI     USW00022534 <tibble [730 × 4]> <lm>  
## 3 Waterhole_WA   USS0023B17S <tibble [730 × 4]> <lm>

This is great! We now have a data frame that has rows for each station; columns contain weather datasets and fitted models. This makes it very easy to keep track of models across stations, and to perform additional analyses.

This is, for sure, a fairly complex bit of code, but in just a few lines we’re able to fit separate linear models to each of our stations. And, once you get used to list columns, map, and the rest of it, these lines of code are pretty clear and can be extended to larger datasets with more complex structures.

Other materials

Iteration can be tricky – the readings below will help as you work through this!

  • The chapter on iteration in R for Data Science has a useful treatment of iteration using loops and map
  • Jenny Bryan’s purrr tutorial has a lot of useful information and examples
  • R Programming for Data Science has information on loops and loop functions; given Roger Peng’s tendency towards base R he focuses on lapply and others instead of map
  • This question and response on stack overflow explains why one might prefer map to lapply

List columns take some getting used to; there are some materials to help with that.

  • R for Data Science has a chapter on fitting many models
  • Jenny Bryan’s purrr tutorial has useful list-column examples
  • The R file used to create the starwars dataset (in the tidyverse) using the Star Wars API processes list output (from the API) using several map variants

The code that I produced working examples in lecture is here.