The overarching goal of data wrangling is to have a tidy, easy-to-use dataset.
This is the third module in the Data Wrangling I topic.
Understanding principles of “tidy” data, using relational databases, and merging datasets.
I’ll keep using the same repo / project as in data import and data manipulation, but create a new .Rmd for tidying. I’m also going to load some relevant packages, and limit the number of lines printed in a tibble.
library(tidyverse) ## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ── ## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4 ## ✓ tibble 3.0.3 ✓ dplyr 1.0.2 ## ✓ tidyr 1.1.2 ✓ stringr 1.4.0 ## ✓ readr 1.3.1 ✓ forcats 0.5.0 ## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ── ## x dplyr::filter() masks stats::filter() ## x dplyr::lag() masks stats::lag() options(tibble.print_min = 5)
In data import, we used the
haven package to load the PULSE biomarkers dataset from a .sas7bdat. Let’s reload those data and take a closer look:
pulse_data = haven::read_sas("./data/public_pulse_data.sas7bdat") %>% janitor::clean_names() pulse_data ## # A tibble: 1,087 x 7 ## id age sex bdi_score_bl bdi_score_01m bdi_score_06m bdi_score_12m ## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 10003 48.0 male 7 1 2 0 ## 2 10015 72.5 male 6 NA NA NA ## 3 10022 58.5 male 14 3 8 NA ## 4 10026 72.7 male 20 6 18 16 ## 5 10035 60.4 male 4 0 1 2 ## # … with 1,082 more rows
With our new understanding of tidy data, we quickly recognize a problem: the BDI score is spread across four columns, which correspond to four observation times. We can fix this problem using
pulse_tidy_data = pivot_longer( pulse_data, bdi_score_bl:bdi_score_12m, names_to = "visit", values_to = "bdi") pulse_tidy_data ## # A tibble: 4,348 x 5 ## id age sex visit bdi ## <dbl> <dbl> <chr> <chr> <dbl> ## 1 10003 48.0 male bdi_score_bl 7 ## 2 10003 48.0 male bdi_score_01m 1 ## 3 10003 48.0 male bdi_score_06m 2 ## 4 10003 48.0 male bdi_score_12m 0 ## 5 10015 72.5 male bdi_score_bl 6 ## # … with 4,343 more rows
This looks much better! However, now
visit is an issue. The original column names were informative but we probably don’t need to keep the
bdi_score_ prefix in each case. I’ll use an additional option in
pivot_longer to address this:
pulse_tidy_data = pivot_longer( pulse_data, bdi_score_bl:bdi_score_12m, names_to = "visit", names_prefix = "bdi_score_", values_to = "bdi") pulse_tidy_data ## # A tibble: 4,348 x 5 ## id age sex visit bdi ## <dbl> <dbl> <chr> <chr> <dbl> ## 1 10003 48.0 male bl 7 ## 2 10003 48.0 male 01m 1 ## 3 10003 48.0 male 06m 2 ## 4 10003 48.0 male 12m 0 ## 5 10015 72.5 male bl 6 ## # … with 4,343 more rows
In the preceding I’ve saved intermediate datasets to make each step clear. While this can be a helpful crutch as you’re trying out code, it is generally bad practice. There are also some additional transformations needed to wrap up the data wrangling process, like changing
00m for consistency across visits and converting
visit to a factor variable. (It’s possible that you would want
visit to be a numeric variable instead, which could be done with a different call to
mutate.) Lastly, it’s nice to organize the data into a reasonable order.
Altogether, then, the code below will import, tidy, and transform the PULSE dataset into a usable format:
pulse_data = haven::read_sas("./data/public_pulse_data.sas7bdat") %>% janitor::clean_names() %>% pivot_longer( bdi_score_bl:bdi_score_12m, names_to = "visit", names_prefix = "bdi_score_", values_to = "bdi") %>% relocate(visit) %>% mutate( visit = replace(visit, visit == "bl", "00m"), visit = factor(visit, levels = str_c(c("00", "01", "06", "12"), "m"))) %>% arrange(id, visit) print(pulse_data, n = 12) ## # A tibble: 4,348 x 5 ## visit id age sex bdi ## <fct> <dbl> <dbl> <chr> <dbl> ## 1 00m 10003 48.0 male 7 ## 2 01m 10003 48.0 male 1 ## 3 06m 10003 48.0 male 2 ## 4 12m 10003 48.0 male 0 ## 5 00m 10015 72.5 male 6 ## 6 01m 10015 72.5 male NA ## 7 06m 10015 72.5 male NA ## 8 12m 10015 72.5 male NA ## 9 00m 10022 58.5 male 14 ## 10 01m 10022 58.5 male 3 ## 11 06m 10022 58.5 male 8 ## 12 12m 10022 58.5 male NA ## # … with 4,336 more rows
Now we’re in pretty okay shape :-).
Learning Assessment: In the litters data, the variables
gd18_weight give the weight of the mother mouse on gestational days 0 and 18. Write a data cleaning chain that retains only
litter_number and these columns; produces new variables
weight; and makes
gd a numeric variable taking values
18 (for the last part, you might want to use
recode …). Is this version “tidy”?
The code below shows one approach to this data cleaning process:
litters_wide = read_csv("./data/FAS_litters.csv", col_types = "ccddiiii") %>% janitor::clean_names() %>% select(litter_number, ends_with("weight")) %>% pivot_longer( gd0_weight:gd18_weight, names_to = "gd", values_to = "weight") %>% mutate(gd = recode(gd, "gd0_weight" = 0, "gd18_weight" = 18))
In one sense, this is “tidy” because I have a variable for day and a variable for weight rather that using values in my variable names. However, it’s less useful if I’m interested in computing or analyzing weight gain during pregnancy.
We’ve been exclusively interested in tidying data, but we’ve admitted that sometimes untidy is better for human consumption. For that reason we’re going to take a short digression into untidying your tidy data.
The code below creates a tidy dataset that could result from an analysis. This is the correct format for additional analysis or visualization, but doesn’t facilitate quick comparisons for human readers.
analysis_result = tibble( group = c("treatment", "treatment", "placebo", "placebo"), time = c("pre", "post", "pre", "post"), mean = c(4, 8, 3.5, 4) ) analysis_result ## # A tibble: 4 x 3 ## group time mean ## <chr> <chr> <dbl> ## 1 treatment pre 4 ## 2 treatment post 8 ## 3 placebo pre 3.5 ## 4 placebo post 4
An alternative presentation of the same data might have groups in rows, times in columns, and mean values in table cells. This is decidedly non-tidy; to get there from here we’ll need to use
pivot_wider, which is the inverse of
pivot_wider( analysis_result, names_from = "time", values_from = "mean") ## # A tibble: 2 x 3 ## group pre post ## <chr> <dbl> <dbl> ## 1 treatment 4 8 ## 2 placebo 3.5 4
We’re pretty much there now – in some cases you might use
select to reorder columns, and (depending on your goal) use
knitr::kable() to produce a nicer table for reading.
We’ve looked at single-table non-tidy data, but non-tidiness often stems from relevant data spread across multiple tables. In the simplest case, these tables are basically the same and can be stacked to produce a tidy dataset. That’s the setting in
LotR_words.xlsx, where the word counts for different races and genders in each movie in the trilogy are spread across distinct data rectangles (these data are based on this example).
To produce the desired tidy dataset, we first need to read each table and do some cleaning.
fellowship_ring = readxl::read_excel("./data/LotR_Words.xlsx", range = "B3:D6") %>% mutate(movie = "fellowship_ring") two_towers = readxl::read_excel("./data/LotR_Words.xlsx", range = "F3:H6") %>% mutate(movie = "two_towers") return_king = readxl::read_excel("./data/LotR_Words.xlsx", range = "J3:L6") %>% mutate(movie = "return_king")
Here it was necessary to add a variable to each dataframe indicating the movie; that information had stored elsewhere in the original spreadsheet. As an aside, the three code snippets above are all basically the same except for the range and the movie name – later we’ll see a better way to handle cases like this by writing our own functions, but this works for now.
Once each table is ready to go, we can stack them up using
bind_rows and tidy the result:
lotr_tidy = bind_rows(fellowship_ring, two_towers, return_king) %>% janitor::clean_names() %>% pivot_longer( female:male, names_to = "gender", values_to = "words") %>% mutate(race = str_to_lower(race)) %>% select(movie, everything()) lotr_tidy ## # A tibble: 18 x 4 ## movie race gender words ## <chr> <chr> <chr> <dbl> ## 1 fellowship_ring elf female 1229 ## 2 fellowship_ring elf male 971 ## 3 fellowship_ring hobbit female 14 ## 4 fellowship_ring hobbit male 3644 ## 5 fellowship_ring man female 0 ## 6 fellowship_ring man male 1995 ## 7 two_towers elf female 331 ## 8 two_towers elf male 513 ## 9 two_towers hobbit female 0 ## 10 two_towers hobbit male 2463 ## 11 two_towers man female 401 ## 12 two_towers man male 3589 ## 13 return_king elf female 183 ## 14 return_king elf male 510 ## 15 return_king hobbit female 2 ## 16 return_king hobbit male 2673 ## 17 return_king man female 268 ## 18 return_king man male 2459
Having the data in this form will make it easier to make comparisons across movies, aggregate within races across the trilogy, and perform other analyses.
Data can be spread across multiple related tables, in which case it is necessary to combine or join them prior to analysis. We’ll focus on the problem of combining two tables only; combining three or more is done step-by-step using the same ideas.
There are four major ways join dataframes
Left joins are the most common, because they add data from a smaller table
y into a larger table
x without removing anything from
As an example, consider the data tables in
FAS_litters.csv, which are related through the
Litter Number variable. The former contains data unique to each pup, and the latter contains data unique to each litter. We can combine these using a left join of litter data into pup data; doing so retains data on each pup and adds data in new columns.
(While revisiting this example, take a look at the
group variable in the litters dataset: this encodes both dose and day of treatment! We’ll fix that bit of untidiness as part of the processing pipeline. I’m also going to address a pet peeve of mine, which is coding sex as an ambiguous numeric variable.)
pup_data = read_csv("./data/FAS_pups.csv", col_types = "ciiiii") %>% janitor::clean_names() %>% mutate(sex = recode(sex, `1` = "male", `2` = "female")) litter_data = read_csv("./data/FAS_litters.csv", col_types = "ccddiiii") %>% janitor::clean_names() %>% separate(group, into = c("dose", "day_of_tx"), sep = 3) %>% relocate(litter_number) %>% mutate( wt_gain = gd18_weight - gd0_weight, dose = str_to_lower(dose)) fas_data = left_join(pup_data, litter_data, by = "litter_number") fas_data ## # A tibble: 313 x 15 ## litter_number sex pd_ears pd_eyes pd_pivot pd_walk dose day_of_tx ## <chr> <chr> <int> <int> <int> <int> <chr> <chr> ## 1 #85 male 4 13 7 11 con 7 ## 2 #85 male 4 13 7 12 con 7 ## 3 #1/2/95/2 male 5 13 7 9 con 7 ## 4 #1/2/95/2 male 5 13 8 10 con 7 ## 5 #5/5/3/83/3-3 male 5 13 8 10 con 7 ## # … with 308 more rows, and 7 more variables: gd0_weight <dbl>, ## # gd18_weight <dbl>, gd_of_birth <int>, pups_born_alive <int>, ## # pups_dead_birth <int>, pups_survive <int>, wt_gain <dbl>
We made the key explicit in the join. By default, the
*_join functions in
dplyr will try to determine the key(s) based on variable names in the datasets you want to join. This is often but not always sufficient, and an extra step to make the key clear will help you and others reading your code.
Note that joining is not particularly amenable to the
%>% operator because it is fundamentally non-linear: two separate datasets are coming together, rather than a single dataset being processed in a step-by-step fashion.
As a final point, the
*_join functions are very much related to SQL syntax, but emphasize operations common to data analysis.
Learning Assessment: The datasets in this zip file contain de-identified responses to surveys included in past years of this course. Both contain a unique student identifier; the first has reponses to a question about operating systems, and the second has responses to questions about degree program and git experience. Write a code chunk that imports and cleans both datasets, and then joins them.
I put both datasets in the
data directory in my repo / project. The code below imports both datasets, cleans up variable names, and joins the datasets using
surv_os = read_csv("data/surv_os.csv") %>% janitor::clean_names() %>% rename(id = what_is_your_uni, os = what_operating_system_do_you_use) surv_pr_git = read_csv("data/surv_program_git.csv") %>% janitor::clean_names() %>% rename( id = what_is_your_uni, prog = what_is_your_degree_program, git_exp = which_most_accurately_describes_your_experience_with_git) left_join(surv_os, surv_pr_git) inner_join(surv_os, surv_pr_git) anti_join(surv_os, surv_pr_git) anti_join(surv_pr_git, surv_os)
Up until very recently, folks were using
spread instead of
pivot_wider. The new functions were updated for good reasons;
spread will still exist, but they’re going to be less common over time.
The code that I produced working examples in lecture is here.