Once you’ve imported data, you’re going to need to do some cleaning up.

This is the second module in the Data Wrangling I topic.

Overview

Learning Objectives

Clean and organize data using dplyr verbs and piping.

Video Lecture


Example

For this example, I’ll add a new R Markdown file to the repo / project I started for the Data Wrangling I topic; this will make it easy to load example data sets using the code I wrote in Data Import.

Once again we’re going to be using the tidyverse, so we’ll load that at the outset. We’re going to be looking at a lot of output, so I’ll print only three lines of each tibble by default. Lastly, we’ll focus on the data in FAS_litters.csv and FAS_pups.csv, so we’ll load those data and clean up the column names using what we learned in Data Import.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

options(tibble.print_min = 3)

litters_data = read_csv("./data/FAS_litters.csv",
  col_types = "ccddiiii")
litters_data = janitor::clean_names(litters_data)

pups_data = read_csv("./data/FAS_pups.csv",
  col_types = "ciiiii")
pups_data = janitor::clean_names(pups_data)

select

For a given analysis, you may only need a subset of the columns in a data table; extracting only what you need can helpfully de-clutter, especially when you have large datasets. Select columns using select.

You can specify the columns you want to keep by naming all of them:

select(litters_data, group, litter_number, gd0_weight, pups_born_alive)
## # A tibble: 49 × 4
##   group litter_number gd0_weight pups_born_alive
##   <chr> <chr>              <dbl>           <int>
## 1 Con7  #85                 19.7               3
## 2 Con7  #1/2/95/2           27                 8
## 3 Con7  #5/5/3/83/3-3       26                 6
## # ℹ 46 more rows

You can specify a range of columns to keep:

select(litters_data, group:gd_of_birth)
## # A tibble: 49 × 5
##   group litter_number gd0_weight gd18_weight gd_of_birth
##   <chr> <chr>              <dbl>       <dbl>       <int>
## 1 Con7  #85                 19.7        34.7          20
## 2 Con7  #1/2/95/2           27          42            19
## 3 Con7  #5/5/3/83/3-3       26          41.4          19
## # ℹ 46 more rows

You can also specify columns you’d like to remove:

select(litters_data, -pups_survive)
## # A tibble: 49 × 7
##   group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
##   <chr> <chr>              <dbl>       <dbl>       <int>           <int>
## 1 Con7  #85                 19.7        34.7          20               3
## 2 Con7  #1/2/95/2           27          42            19               8
## 3 Con7  #5/5/3/83/3-3       26          41.4          19               6
## # ℹ 46 more rows
## # ℹ 1 more variable: pups_dead_birth <int>

You can rename variables as part of this process:

select(litters_data, GROUP = group, LiTtEr_NuMbEr = litter_number)
## # A tibble: 49 × 2
##   GROUP LiTtEr_NuMbEr
##   <chr> <chr>        
## 1 Con7  #85          
## 2 Con7  #1/2/95/2    
## 3 Con7  #5/5/3/83/3-3
## # ℹ 46 more rows

If all you want to do is rename something, you can use rename instead of select. This will rename the variables you care about, and keep everything else:

rename(litters_data, GROUP = group, LiTtEr_NuMbEr = litter_number)
## # A tibble: 49 × 8
##   GROUP LiTtEr_NuMbEr gd0_weight gd18_weight gd_of_birth pups_born_alive
##   <chr> <chr>              <dbl>       <dbl>       <int>           <int>
## 1 Con7  #85                 19.7        34.7          20               3
## 2 Con7  #1/2/95/2           27          42            19               8
## 3 Con7  #5/5/3/83/3-3       26          41.4          19               6
## # ℹ 46 more rows
## # ℹ 2 more variables: pups_dead_birth <int>, pups_survive <int>

There are some handy helper functions for select; read about all of them using ?select_helpers. I use starts_with(), ends_with(), and contains() often, especially when variables are named with suffixes or other standard patterns:

select(litters_data, starts_with("gd"))
## # A tibble: 49 × 3
##   gd0_weight gd18_weight gd_of_birth
##        <dbl>       <dbl>       <int>
## 1       19.7        34.7          20
## 2       27          42            19
## 3       26          41.4          19
## # ℹ 46 more rows

I also frequently use everything(), which is handy for reorganizing columns without discarding anything:

select(litters_data, litter_number, pups_survive, everything())
## # A tibble: 49 × 8
##   litter_number pups_survive group gd0_weight gd18_weight gd_of_birth
##   <chr>                <int> <chr>      <dbl>       <dbl>       <int>
## 1 #85                      3 Con7        19.7        34.7          20
## 2 #1/2/95/2                7 Con7        27          42            19
## 3 #5/5/3/83/3-3            5 Con7        26          41.4          19
## # ℹ 46 more rows
## # ℹ 2 more variables: pups_born_alive <int>, pups_dead_birth <int>

relocate does a similar thing (and is sort of like rename in that it’s handy but not critical):

relocate(litters_data, litter_number, pups_survive)
## # A tibble: 49 × 8
##   litter_number pups_survive group gd0_weight gd18_weight gd_of_birth
##   <chr>                <int> <chr>      <dbl>       <dbl>       <int>
## 1 #85                      3 Con7        19.7        34.7          20
## 2 #1/2/95/2                7 Con7        27          42            19
## 3 #5/5/3/83/3-3            5 Con7        26          41.4          19
## # ℹ 46 more rows
## # ℹ 2 more variables: pups_born_alive <int>, pups_dead_birth <int>

In larger datasets, reorganizing columns this way can make it much easier to find the variables you care about.

Lastly, like other functions in dplyr, select will return a tibble even if you only select one column. Mostly this is fine, but sometimes you want the vector stored in the column. To pull a single variable, use pull.
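
For example, this quick sketch (output omitted) contrasts the two:

select(litters_data, gd0_weight)  # returns a 49 × 1 tibble
pull(litters_data, gd0_weight)    # returns the gd0_weight values as a numeric vector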

Learning Assessment: In the pups data, select the columns containing litter number, sex, and PD ears.

Solution

The chunk below performs this action:

select(pups_data, litter_number, sex, pd_ears)
## # A tibble: 313 × 3
##   litter_number   sex pd_ears
##   <chr>         <int>   <int>
## 1 #85               1       4
## 2 #85               1       4
## 3 #1/2/95/2         1       5
## # ℹ 310 more rows

filter

Some data tables will include rows you don’t need for your current analysis. Although you could remove specific row numbers using base R, you shouldn’t – this might break if the raw data are updated, and the thought process isn’t transparent. Instead, you should filter rows based on logical expressions using the filter function. Like select, the first argument to filter is the dataframe you’re filtering; all subsequent arguments are logical expressions.

You will often filter using comparison operators (>, >=, <, <=, ==, and !=). You may also use %in% to detect if values appear in a set, and is.na() to find missing values. The results of comparisons are logical – the statement is TRUE or FALSE depending on the values you compare – and can be combined with other comparisons using the logical operators & and |, or negated using !.

Some ways you might filter the litters data are listed below, with a couple of these shown in action after the list:

  • gd_of_birth == 20
  • pups_born_alive >= 2
  • pups_survive != 4
  • !(pups_survive == 4)
  • group %in% c("Con7", "Con8")
  • group == "Con7" & gd_of_birth == 20

A very common filtering step requires you to omit missing observations. You can do this with filter, but I recommend using drop_na from the tidyr package, as in the chunk after this list:

  • drop_na(litters_data) will remove any row with a missing value
  • drop_na(litters_data, gd0_weight) will remove rows for which gd0_weight is missing.
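
For instance:

drop_na(litters_data, gd0_weight)  # keeps only litters with a recorded gd0_weight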

Filtering can be helpful for limiting a dataset to only those observations needed for an analysis. However, I recommend against the creation of many data subsets (e.g. one for each group). This can clutter up your workspace, and we’ll see good tools for the analysis of subsets before long.

Learning Assessment: In the pups data:

  • Filter to include only pups with sex 1
  • Filter to include only pups with PD walk less than 11 and sex 2
Solution

The chunk below performs these actions:

filter(pups_data, sex == 1)
## # A tibble: 155 × 6
##   litter_number   sex pd_ears pd_eyes pd_pivot pd_walk
##   <chr>         <int>   <int>   <int>    <int>   <int>
## 1 #85               1       4      13        7      11
## 2 #85               1       4      13        7      12
## 3 #1/2/95/2         1       5      13        7       9
## # ℹ 152 more rows
filter(pups_data, sex == 2, pd_walk < 11)
## # A tibble: 127 × 6
##   litter_number   sex pd_ears pd_eyes pd_pivot pd_walk
##   <chr>         <int>   <int>   <int>    <int>   <int>
## 1 #1/2/95/2         2       4      13        7       9
## 2 #1/2/95/2         2       4      13        7      10
## 3 #1/2/95/2         2       5      13        8      10
## # ℹ 124 more rows

mutate

Sometimes you need to select columns; sometimes you need to change them or create new ones. You can do this using mutate.

The example below creates a new variable measuring the difference between gd18_weight and gd0_weight and modifies the existing group variable.

mutate(litters_data,
  wt_gain = gd18_weight - gd0_weight,
  group = str_to_lower(group)
)
## # A tibble: 49 × 9
##   group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
##   <chr> <chr>              <dbl>       <dbl>       <int>           <int>
## 1 con7  #85                 19.7        34.7          20               3
## 2 con7  #1/2/95/2           27          42            19               8
## 3 con7  #5/5/3/83/3-3       26          41.4          19               6
## # ℹ 46 more rows
## # ℹ 3 more variables: pups_dead_birth <int>, pups_survive <int>, wt_gain <dbl>

A few things in this example are worth noting:

  • Your new variables can be functions of old variables
  • New variables appear at the end of the dataset in the order that they are created
  • You can overwrite old variables
  • You can create a new variable and immediately refer to (or change) it

Creating a new variable that does exactly what you need can be a challenge; the more functions you know about, the easier this gets.
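
As a quick illustration of the last two points, the sketch below creates wt_gain and then immediately refers to it within the same mutate call (the rounding step is just for illustration):

mutate(litters_data,
  wt_gain = gd18_weight - gd0_weight,  # new variable from old ones
  wt_gain = round(wt_gain, 1)          # created above, referred to (and overwritten) here
)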

Learning Assessment: In the pups data:

  • Create a variable that subtracts 7 from PD pivot
  • Create a variable that is the sum of all the PD variables
Solution

The chunk below performs these actions:

mutate(pups_data, pivot_minus7 = pd_pivot - 7)
## # A tibble: 313 × 7
##   litter_number   sex pd_ears pd_eyes pd_pivot pd_walk pivot_minus7
##   <chr>         <int>   <int>   <int>    <int>   <int>        <dbl>
## 1 #85               1       4      13        7      11            0
## 2 #85               1       4      13        7      12            0
## 3 #1/2/95/2         1       5      13        7       9            0
## # ℹ 310 more rows
mutate(pups_data, pd_sum = pd_ears + pd_eyes + pd_pivot + pd_walk)
## # A tibble: 313 × 7
##   litter_number   sex pd_ears pd_eyes pd_pivot pd_walk pd_sum
##   <chr>         <int>   <int>   <int>    <int>   <int>  <int>
## 1 #85               1       4      13        7      11     35
## 2 #85               1       4      13        7      12     36
## 3 #1/2/95/2         1       5      13        7       9     34
## # ℹ 310 more rows

arrange

In comparison to the preceding, arranging is pretty straightforward. You can arrange the rows in your data according to the values in one or more columns:

head(arrange(litters_data, group, pups_born_alive), 10)
## # A tibble: 10 × 8
##    group litter_number   gd0_weight gd18_weight gd_of_birth pups_born_alive
##    <chr> <chr>                <dbl>       <dbl>       <int>           <int>
##  1 Con7  #85                   19.7        34.7          20               3
##  2 Con7  #5/4/2/95/2           28.5        44.1          19               5
##  3 Con7  #5/5/3/83/3-3         26          41.4          19               6
##  4 Con7  #4/2/95/3-3           NA          NA            20               6
##  5 Con7  #2/2/95/3-2           NA          NA            20               6
##  6 Con7  #1/2/95/2             27          42            19               8
##  7 Con7  #1/5/3/83/3-3/2       NA          NA            20               9
##  8 Con8  #2/2/95/2             NA          NA            19               5
##  9 Con8  #1/6/2/2/95-2         NA          NA            20               7
## 10 Con8  #3/6/2/2/95-3         NA          NA            20               7
## # ℹ 2 more variables: pups_dead_birth <int>, pups_survive <int>
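
You can also sort in descending order using desc(); for example, this sketch puts the litters with the most live-born pups first (output omitted):

head(arrange(litters_data, desc(pups_born_alive)), 10)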

|>

We’ve seen several commands you will use regularly for data manipulation and cleaning. You will rarely use them in isolation. For example, suppose you want to load the data, clean the column names, remove pups_survive, and create wt_gain. There are a couple of options for this kind of multi-step data manipulation:

  • define intermediate datasets (or overwrite data at each stage)
  • nest function calls

The following is an example of the first option:

litters_data_raw = read_csv("./data/FAS_litters.csv",
  col_types = "ccddiiii")
litters_data_clean_names = janitor::clean_names(litters_data_raw)
litters_data_selected_cols = select(litters_data_clean_names, -pups_survive)
litters_data_with_vars = 
  mutate(
    litters_data_selected_cols, 
    wt_gain = gd18_weight - gd0_weight,
    group = str_to_lower(group))
litters_data_with_vars_without_missing = 
  drop_na(litters_data_with_vars, wt_gain)
litters_data_with_vars_without_missing
## # A tibble: 31 × 8
##   group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
##   <chr> <chr>              <dbl>       <dbl>       <int>           <int>
## 1 con7  #85                 19.7        34.7          20               3
## 2 con7  #1/2/95/2           27          42            19               8
## 3 con7  #5/5/3/83/3-3       26          41.4          19               6
## # ℹ 28 more rows
## # ℹ 2 more variables: pups_dead_birth <int>, wt_gain <dbl>

Below, we try the second option:

litters_data_clean = 
  drop_na(
    mutate(
      select(
        janitor::clean_names(
          read_csv("./data/FAS_litters.csv", col_types = "ccddiiii")
          ), 
      -pups_survive
      ),
    wt_gain = gd18_weight - gd0_weight,
    group = str_to_lower(group)
    ),
  wt_gain
  )

litters_data_clean
## # A tibble: 31 × 8
##   group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
##   <chr> <chr>              <dbl>       <dbl>       <int>           <int>
## 1 con7  #85                 19.7        34.7          20               3
## 2 con7  #1/2/95/2           27          42            19               8
## 3 con7  #5/5/3/83/3-3       26          41.4          19               6
## # ℹ 28 more rows
## # ℹ 2 more variables: pups_dead_birth <int>, wt_gain <dbl>

These are both bad options: the first clutters our workspace with intermediate datasets, and the second has to be read inside out.

Piping solves this problem. It allows you to turn the nested approach into a sequential chain by passing the result of one function call as an argument to the next function call:

litters_data = 
  read_csv("./data/FAS_litters.csv", col_types = "ccddiiii") |> 
  janitor::clean_names() |> 
  select(-pups_survive) |> 
  mutate(
    wt_gain = gd18_weight - gd0_weight,
    group = str_to_lower(group)) |> 
  drop_na(wt_gain)

litters_data
## # A tibble: 31 × 8
##   group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
##   <chr> <chr>              <dbl>       <dbl>       <int>           <int>
## 1 con7  #85                 19.7        34.7          20               3
## 2 con7  #1/2/95/2           27          42            19               8
## 3 con7  #5/5/3/83/3-3       26          41.4          19               6
## # ℹ 28 more rows
## # ℹ 2 more variables: pups_dead_birth <int>, wt_gain <dbl>

All three approaches result in the same dataset, but the piped commands are by far the most straightforward. The easiest way to read |> is “then”; the keyboard shortcuts are Ctrl + Shift + M (Windows) and Cmd + Shift + M (Mac). Note that by default RStudio will insert the “traditional” pipe %>%; you can switch to the native pipe through Tools > Global Options > Code > Use Native Pipe Operator. More on the difference between |> and %>% below.

The functions in dplyr (and much of the tidyverse) are designed to work smoothly with the pipe operator. By default, the pipe will take the result of one function call and use that as the first argument of the next function call; by design, functions in dplyr will take a tibble as an input and return a tibble as a result. As a consequence, functions in dplyr are easy to connect in a data cleaning chain.
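
As a small illustration, the two lines below are equivalent — the pipe supplies litters_data as the first argument to select:

select(litters_data, group, litter_number)
litters_data |> select(group, litter_number)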

In the majority of cases (and everywhere in the tidyverse) you can trust that the first argument is the right one and be happy with life, but there are some cases where what you’re piping isn’t going into the first argument. Here, using the placeholder _ is necessary to indicate where the object being piped should go. For example, to regress wt_gain on pups_born_alive, you might use:

litters_data |>
  lm(wt_gain ~ pups_born_alive, data = _) |>
  broom::tidy()
## # A tibble: 2 × 5
##   term            estimate std.error statistic  p.value
##   <chr>              <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)       13.1       1.27      10.3  3.39e-11
## 2 pups_born_alive    0.605     0.173      3.49 1.55e- 3

There are limitations to the pipe. You shouldn’t have sequences that are too long; there isn’t a great way to deal with multiple inputs and outputs; and not everyone will know what |> means or does. That said, compared to days when R users only had the first two options, life is much better!

Learning Assessment: Write a chain of commands that:

  • loads the pups data
  • cleans the variable names
  • filters the data to include only pups with sex 1
  • removes the PD ears variable
  • creates a variable that indicates whether PD pivot is 7 or more days
Solution

The chunk below performs these actions:

read_csv("./data/FAS_pups.csv", col_types = "ciiiii") |>
  janitor::clean_names() |> 
  filter(sex == 1) |> 
  select(-pd_ears) |> 
  mutate(pd_pivot_gt7 = pd_pivot > 7)
## # A tibble: 155 × 6
##   litter_number   sex pd_eyes pd_pivot pd_walk pd_pivot_gt7
##   <chr>         <int>   <int>    <int>   <int> <lgl>       
## 1 #85               1      13        7      11 FALSE       
## 2 #85               1      13        7      12 FALSE       
## 3 #1/2/95/2         1      13        7       9 FALSE       
## # ℹ 152 more rows

|> vs %>%

There are some subtle differences between |> and %>% that it can be helpful to read about, but in the vast majority of cases these are effectively interchangeable. While the similarities and differences can matter, there are a few points that are likely to come up more often (with a short illustration after this list):

  • You’ll see %>% a lot in code online. It was introduced first, and was very prevalent in tidyverse code for a long time before |> came along.
  • The native pipe |> was introduced in R 4.1.0, and it changed some in 4.2.0 and 4.3.0; it’s pretty stable now and the main functionality is unlikely to change, but it won’t be familiar to everyone and can’t be used at all in older versions of R.
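
One concrete difference is the placeholder syntax; reusing the regression example from above as a quick sketch:

litters_data %>% lm(wt_gain ~ pups_born_alive, data = .)  # magrittr pipe uses . as the placeholder
litters_data |> lm(wt_gain ~ pups_born_alive, data = _)   # native pipe uses _ (named argument, R >= 4.2)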

Other materials

There’s lots of stuff out there on how to clean your data using dplyr.

The code that I produced while working through examples in lecture is here.