If you use the same code twice, you need a function – this will improve code readability, facilitate troubleshooting, and reduce chances for mistakes. This content looks at the best approaches for writing R functions.

This is the first module in the Iteration topic.

Overview

Learning Objectives

Create simple R functions to abstract common processes.

Video Lecture


Example

For this topic, I’ll create a GitHub repo + directory / R Project called iteration. I’ll write code for today’s content in a new R Markdown document called writing_functions.Rmd, and I’m going to load the usual packages. I’m also setting the seed so that the output on this page is fixed.

library(tidyverse)
library(rvest)
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
## 
##     pluck
## The following object is masked from 'package:readr':
## 
##     guess_encoding
set.seed(1)

My first function

The best way to build up a function is to start with code you’ve written outside a function. To see how this might work, I’ll start with a simple example: the chunk below takes a sample from a normal distribution and then computes the vector of Z scores for the sample.

x_vec = rnorm(25, mean = 5, sd = 3)

(x_vec - mean(x_vec)) / sd(x_vec)
##  [1] -0.83687228  0.01576465 -1.05703126  1.50152998  0.16928872 -1.04107494
##  [7]  0.33550276  0.59957343  0.42849461 -0.49894708  1.41364561  0.23279252
## [13] -0.83138529 -2.50852027  1.00648110 -0.22481531 -0.19456260  0.81587675
## [19]  0.68682298  0.44756609  0.78971253  0.64568566 -0.09904161 -2.27133861
## [25]  0.47485186

If I want to repeat this (admittedly simple) process for lots of samples, I might want to have a function that takes the sample as an argument, computes the vector of Z scores in the body, and returns the result. I define such a function below.

z_scores = function(x) {
  
  z = (x - mean(x)) / sd(x)
  z
  
}

z_scores(x_vec)
##  [1] -0.83687228  0.01576465 -1.05703126  1.50152998  0.16928872 -1.04107494
##  [7]  0.33550276  0.59957343  0.42849461 -0.49894708  1.41364561  0.23279252
## [13] -0.83138529 -2.50852027  1.00648110 -0.22481531 -0.19456260  0.81587675
## [19]  0.68682298  0.44756609  0.78971253  0.64568566 -0.09904161 -2.27133861
## [25]  0.47485186

I can try this with a few samples and confirm that it works. I should also try to think of ways this code might break; the attempts below try a variety of inputs to see what happens.

z_scores(3)
## [1] NA
z_scores("my name is jeff")
## Error in x - mean(x): non-numeric argument to binary operator
z_scores(iris)
## Error in is.data.frame(x): 'list' object cannot be coerced to type 'double'
z_scores(sample(c(TRUE, FALSE), 25, replace = TRUE))
##  [1] -0.7348469  1.3063945 -0.7348469 -0.7348469  1.3063945  1.3063945
##  [7] -0.7348469 -0.7348469 -0.7348469  1.3063945  1.3063945 -0.7348469
## [13] -0.7348469 -0.7348469 -0.7348469 -0.7348469 -0.7348469  1.3063945
## [19] -0.7348469 -0.7348469 -0.7348469 -0.7348469  1.3063945  1.3063945
## [25]  1.3063945

These all did something I didn’t want, but only two returned errors. To avoid behavior you don’t want (i.e. to “fail noisily and as soon as possible”) we’ll add some checks on the argument values using conditional statements.

z_scores = function(x) {
  
  if (!is.numeric(x)) {
    stop("Argument x should be numeric")
  } else if (length(x) == 1) {
    stop("Z scores cannot be computed for length 1 vectors")
  }
  
  z = mean(x) / sd(x)
  
  z
}

Fantastic – we have a pretty solid function for computing Z scores!

Multiple outputs

In some cases it might be better to return the mean and standard deviation instead of the Z scores. A first option is to store each of the values in a named list, and to return that list. (We’ll talk more about lists in iteration and listcols.)

mean_and_sd = function(x) {
  
  if (!is.numeric(x)) {
    stop("Argument x should be numeric")
  } else if (length(x) == 1) {
    stop("Cannot be computed for length 1 vectors")
  }
  
  mean_x = mean(x)
  sd_x = sd(x)

  list(mean = mean_x, 
       sd = sd_x)
}

Alternatively, we might store values in a data frame.

mean_and_sd = function(x) {
  
  if (!is.numeric(x)) {
    stop("Argument x should be numeric")
  } else if (length(x) == 1) {
    stop("Cannot be computed for length 1 vectors")
  }
  
  mean_x = mean(x)
  sd_x = sd(x)

  tibble(
    mean = mean_x, 
    sd = sd_x
  )
}

In general, either of these will be fine; which one you choose will depend on what kind of values you want to return, and what you plan to do with the function itself. If you want to return the original sample along with the computed values, a list might make sense; if you plan to run your function a lot and study the results, having a data frame will make it easier to use other tools. We’ll see more of that in iteration and simulation.

Multiple inputs

As exciting as it is to compute Z scores, let’s start setting our sights higher. I’d like to have a function that takes a given sample size along with a true mean and standard deviation, simulates data from a normal distribution, and returns the estimated mean and standard deviation. I’ll start from the code below.

sim_data = tibble(
  x = rnorm(30, mean = 2, sd = 3)
)

sim_data %>% 
  summarize(
    mu_hat = mean(x),
    sigma_hat = sd(x)
  )
## # A tibble: 1 x 2
##   mu_hat sigma_hat
##    <dbl>     <dbl>
## 1   1.63      3.06

You should take a few minutes to examine this code – make a plot of the simulated data to make sure it “makes sense”, take a look at the result of the lm call, etc.

Once you’re satisfied, it’s time to wrap things up in a function. I’d like to be able to change the sample size and parameters, so those will be my arguments; the code that simulates data and computes the sample mean and SD go in the body; and the return statement should include the estimates. A function that does all this, using default values for the mean and standard deviation, is below.

sim_mean_sd = function(n, mu = 2, sigma = 3) {
  
  sim_data = tibble(
    x = rnorm(n, mean = mu, sd = sigma),
  )
  
  sim_data %>% 
    summarize(
      mu_hat = mean(x),
      sigma_hat = sd(x)
    )
}

Repeated calls to sim_mean_sd() will give a sense of sampling variability in estimating the mean and standard deviation from a sample; take a few minutes to run sim_mean_sd(30) a few times, and then to run sim_mean_sd(300), and think about the results. We’ll examine this more formally in iteration and in simulation.

This is also a good time to point out how R handles argument matching. We can use positional matching, meaning the first value supplied is taken to be the first argument, the second value is the second argument, and so on. We do this with tidyverse functions a lot; the first argument is always a dataframe, and we just supply that dataframe in the first position. We also use positional matching when we call mean(x) or sim_mean_sd(30, 5, 1).

Alternatively, we can use named matching, which uses the argument name in the function call. Named matching can be a bit more stable when you’re writing your own functions (in case you decide to change the order of the inputs, for example) but isn’t strictly necessary. Named arguments can be supplied in any order: sim_mean_sd(n = 30, mu = 5, sd = 1) is equivalent to sim_mean_sd(sd = 1, n = 30, mu = 5).

Revisiting past examples

There have been a couple of times in this class that I’ve had to write code I didn’t like, because it would have made sense to write a function. We’ll revisit those quickly to see how we could improve our code.

Scraping Amazon

In reading data from the web, we wrote code that allowed us to scrape information in Amazon reviews. That code is below.

url = "https://www.amazon.com/product-reviews/B00005JNBQ/ref=cm_cr_arp_d_viewopt_rvwer?ie=UTF8&reviewerType=avp_only_reviews&sortBy=recent&pageNumber=1"

dynamite_html = read_html(url)

review_titles = 
  dynamite_html %>%
  html_nodes(".a-text-bold span") %>%
  html_text()

review_stars = 
  dynamite_html %>%
  html_nodes("#cm_cr-review_list .review-rating") %>%
  html_text() %>%
  str_extract("^\\d") %>%
  as.numeric()

review_text = 
  dynamite_html %>%
  html_nodes(".review-text-content span") %>%
  html_text() %>% 
  str_replace_all("\n", "") %>% 
  str_trim()

reviews = tibble(
  title = review_titles,
  stars = review_stars,
  text = review_text
)

Let’s write a quick function to scrape review information for any URL to an Amazon review page.

read_page_reviews <- function(url) {
  
  html = read_html(url)
  
  review_titles = 
    html %>%
    html_nodes(".a-text-bold span") %>%
    html_text()
  
  review_stars = 
    html %>%
    html_nodes("#cm_cr-review_list .review-rating") %>%
    html_text() %>%
    str_extract("^\\d") %>%
    as.numeric()
  
  review_text = 
    html %>%
    html_nodes(".review-text-content span") %>%
    html_text() %>% 
    str_replace_all("\n", "") %>% 
    str_trim()
  
  tibble(
    title = review_titles,
    stars = review_stars,
    text = review_text
  )
}

Next we’ll use this to read in reviews from a few pages and combine the results.

url_base = "https://www.amazon.com/product-reviews/B00005JNBQ/ref=cm_cr_arp_d_viewopt_rvwer?ie=UTF8&reviewerType=avp_only_reviews&sortBy=recent&pageNumber="
vec_urls = str_c(url_base, 1:5)

dynamite_reviews = bind_rows(
  read_page_reviews(vec_urls[1]),
  read_page_reviews(vec_urls[2]),
  read_page_reviews(vec_urls[3]),
  read_page_reviews(vec_urls[4]),
  read_page_reviews(vec_urls[5])
)

dynamite_reviews
## # A tibble: 50 x 3
##    title                               stars text                               
##    <chr>                               <dbl> <chr>                              
##  1 Great Value                             5 "Great Value"                      
##  2 I LOVE THIS MOVIE                       5 "THIS MOVIE IS SO FUNNY ONE OF MY …
##  3 Don't you wish you could go back?       5 "Watch it 100 times. Never. Gets. …
##  4 Stupid, but very funny!                 5 "If you like stupidly funny '90s t…
##  5 The beat                                5 "The best"                         
##  6 Hilarious                               5 "Super funny! Loved the online ren…
##  7 Love this movie                         5 "We love this product.  It came in…
##  8 Entertaining, limited quality           4 "Entertainment level gets a 5 star…
##  9 Boo                                     1 "We rented this movie because our …
## 10 Movie is still silly fun....amazon…     1 "We are getting really frustrated …
## # … with 40 more rows

Loading LoTR data

In tidy data, we broke the “only copy code twice” rule when we used the code below to process the LoTR words data:

fellowship_ring = readxl::read_excel("./data/LotR_Words.xlsx", range = "B3:D6") %>%
  mutate(movie = "fellowship_ring")

two_towers = readxl::read_excel("./data/LotR_Words.xlsx", range = "F3:H6") %>%
  mutate(movie = "two_towers")

return_king = readxl::read_excel("./data/LotR_Words.xlsx", range = "J3:L6") %>%
  mutate(movie = "return_king")

lotr_tidy = bind_rows(fellowship_ring, two_towers, return_king) %>%
  janitor::clean_names() %>%
  gather(key = sex, value = words, female:male) %>%
  mutate(race = str_to_lower(race)) %>% 
  select(movie, everything()) 

Learning Assessment: Try to write a function that can be used to abstract the data loading and cleaning process. Use this function to recreate the tidied LoTR dataset.

Solution

The function below will read in and clean LoTR data – it differs from the previous code by including some data tidying steps in the function rather than after data have been combined, but produces the same result.

lotr_load_and_tidy = function(path, range, movie_name) {
  
  df = readxl::read_excel(path, range = range) %>%
    janitor::clean_names() %>%
    gather(key = sex, value = words, female:male) %>%
    mutate(race = str_to_lower(race),
           movie = movie_name)
  
  df
  
}

lotr_tidy = 
  bind_rows(
    lotr_load_and_tidy("./data/LotR_Words.xlsx", "B3:D6", "fellowship_ring"),
    lotr_load_and_tidy("./data/LotR_Words.xlsx", "F3:H6", "two_towers"),
    lotr_load_and_tidy("./data/LotR_Words.xlsx", "J3:L6", "return_king")) %>%
  select(movie, everything()) 

Having a function that handles the loading and cleaning is great – if I decide I want to change the tidying step, I only have to do it once, and I don’t have to worry about mistakes creeping in through copy-and-paste errors!

Functions as arguments

One powerful tool is the ability to pass functions as arguments into functions. This might seem like a weird thing to do, but it has a lot of handy applications – we’ll see just how far it goes in the next modules in this topic.

As a quick example, suppose we wanted to get a sense of how similar or different values in a vector are to each other. There are lots of ways to measure this – variance, standard deviation, range, inter-quartile range – and some are more appropriate in some cases than in others. The function below allows you to input a vector and a function, and returns the result of applying the specified function to the vector input.

x_vec = rnorm(25, 0, 1)

my_summary = function(x, summ_func) {
  summ_func(x)
}

my_summary(x_vec, sd)
## [1] 0.753654
my_summary(x_vec, IQR)
## [1] 1.083116
my_summary(x_vec, var)
## [1] 0.5679943

This example is pretty trivial – you could just apply those functions directly to x and skip the hassle – but in many cases the idea of passing functions as arguments is really powerful. As a practical example, remember that you can reorder factors according to different summaries in fct_reorder!

Scoping and names

Take a look at the code below. Will the call f(x = y) work? If so, what will it produce? What is the current value of x, y, and z?

f = function(x) {
  z = x + y
  z
}

x = 1
y = 2

f(x = y)

Examples like this are tricky, but emphasize an issue that comes up a lot in writing functions: you define a variable in your global environment and use it in your function, but it isn’t passed as an argument. This is easy to miss, especially when you go from code written in chunks to a function, and can be hard to track down if you empty your working directory or change a variable name. The best advice I have is to give your arguments useful names and think carefully about where everything is defined, and to periodically restart R and try everything again!

Other materials

The code that I produced working examples in lecture is here.