In data import, we saw how to load data from a variety of formats; this is a fairly standard way to get data that have been gathered as part of a study. In a lot of cases, though, you’re going to have to go out and get the data you want or need. That’s what we’re covering now.

This is the first module in the Data Wrangling II topic.

Overview

Learning Objectives

Gather data from online sources (i.e. “scrape”) using APIs, rvest, and httr.

Video Lecture


Example

As always, I’ll create a new GH Repo + local project called data_wrangling_ii, and open a new R Markdown file called reading_data_from_the_web.Rmd. Although we’ll mostly be getting data from the web, we’ll revisit some of these examples, so I’ll create a data subdirectory and put those in it.

There are some new additions to our standard packages (rvest and httr); I’m loading everything we need now. Now’s also the time to “install” the Selector Gadget.

library(tidyverse)
library(rvest)
library(httr)

Extracting tables

This page contains data from the National Survey on Drug Use and Health; it includes tables for drug use in the past year or month, separately for specific kinds of drug use. These data are potentially useful for analysis, and we’d like to be able to read in the first table.

First, let’s make sure we can load the data from the web.

url = "http://samhda.s3-us-gov-west-1.amazonaws.com/s3fs-public/field-uploads/2k15StateFiles/NSDUHsaeShortTermCHG2015.htm"
drug_use_html = read_html(url)

drug_use_html
## {html_document}
## <html lang="en">
## [1] <head>\n<link rel="P3Pv1" href="http://www.samhsa.gov/w3c/p3p.xml">\n<tit ...
## [2] <body>\r\n\r\n<noscript>\r\n<p>Your browser's Javascript is off. Hyperlin ...

Doesn’t look like much, but we’re there. Rather than trying to grab something using a CSS selector, let’s try our luck extracting the tables from the HTML.

drug_use_html |>
  html_table()
## [[1]]
## # A tibble: 57 × 16
##    State     `12+(2013-2014)` `12+(2014-2015)` `12+(P Value)` `12-17(2013-2014)`
##    <chr>     <chr>            <chr>            <chr>          <chr>             
##  1 "NOTE: S… "NOTE: State an… "NOTE: State an… "NOTE: State … "NOTE: State and …
##  2 "Total U… "12.90a"         "13.36"          "0.002"        "13.28b"          
##  3 "Northea… "13.88a"         "14.66"          "0.005"        "13.98"           
##  4 "Midwest" "12.40b"         "12.76"          "0.082"        "12.45"           
##  5 "South"   "11.24a"         "11.64"          "0.029"        "12.02"           
##  6 "West"    "15.27"          "15.62"          "0.262"        "15.53a"          
##  7 "Alabama" "9.98"           "9.60"           "0.426"        "9.90"            
##  8 "Alaska"  "19.60a"         "21.92"          "0.010"        "17.30"           
##  9 "Arizona" "13.69"          "13.12"          "0.364"        "15.12"           
## 10 "Arkansa… "11.37"          "11.59"          "0.678"        "12.79"           
## # ℹ 47 more rows
## # ℹ 11 more variables: `12-17(2014-2015)` <chr>, `12-17(P Value)` <chr>,
## #   `18-25(2013-2014)` <chr>, `18-25(2014-2015)` <chr>, `18-25(P Value)` <chr>,
## #   `26+(2013-2014)` <chr>, `26+(2014-2015)` <chr>, `26+(P Value)` <chr>,
## #   `18+(2013-2014)` <chr>, `18+(2014-2015)` <chr>, `18+(P Value)` <chr>
## 
## [[2]]
## # A tibble: 57 × 16
##    State     `12+(2013-2014)` `12+(2014-2015)` `12+(P Value)` `12-17(2013-2014)`
##    <chr>     <chr>            <chr>            <chr>          <chr>             
##  1 "NOTE: S… "NOTE: State an… "NOTE: State an… "NOTE: State … "NOTE: State and …
##  2 "Total U… "7.96a"          "8.34"           "0.001"        "7.22"            
##  3 "Northea… "8.58a"          "9.28"           "0.001"        "7.68"            
##  4 "Midwest" "7.50a"          "7.92"           "0.009"        "6.64"            
##  5 "South"   "6.74a"          "7.02"           "0.044"        "6.31"            
##  6 "West"    "9.84"           "10.08"          "0.324"        "8.85"            
##  7 "Alabama" "5.57"           "5.35"           "0.510"        "4.98"            
##  8 "Alaska"  "11.85a"         "14.38"          "0.002"        "9.19"            
##  9 "Arizona" "8.80"           "8.51"           "0.570"        "8.30"            
## 10 "Arkansa… "6.70"           "7.17"           "0.240"        "6.22"            
## # ℹ 47 more rows
## # ℹ 11 more variables: `12-17(2014-2015)` <chr>, `12-17(P Value)` <chr>,
## #   `18-25(2013-2014)` <chr>, `18-25(2014-2015)` <chr>, `18-25(P Value)` <chr>,
## #   `26+(2013-2014)` <chr>, `26+(2014-2015)` <chr>, `26+(P Value)` <chr>,
## #   `18+(2013-2014)` <chr>, `18+(2014-2015)` <chr>, `18+(P Value)` <chr>
## 
## [[3]]
## # A tibble: 57 × 16
##    State     `12+(2013-2014)` `12+(2014-2015)` `12+(P Value)` `12-17(2013-2014)`
##    <chr>     <chr>            <chr>            <chr>          <chr>             
##  1 "NOTE: S… "NOTE: State an… "NOTE: State an… "NOTE: State … "NOTE: State and …
##  2 "Total U… "1.91"           "1.95"           "0.276"        "5.60"            
##  3 "Northea… "2.01"           "2.04"           "0.634"        "5.85b"           
##  4 "Midwest" "1.95"           "1.96"           "0.854"        "5.31"            
##  5 "South"   "1.69"           "1.75"           "0.137"        "5.18"            
##  6 "West"    "2.20"           "2.21"           "0.868"        "6.37b"           
##  7 "Alabama" "1.42"           "1.49"           "0.383"        "4.46"            
##  8 "Alaska"  "3.01a"          "3.54"           "0.012"        "6.99"            
##  9 "Arizona" "2.16"           "2.15"           "0.934"        "6.58"            
## 10 "Arkansa… "1.82"           "1.84"           "0.794"        "5.78"            
## # ℹ 47 more rows
## # ℹ 11 more variables: `12-17(2014-2015)` <chr>, `12-17(P Value)` <chr>,
## #   `18-25(2013-2014)` <chr>, `18-25(2014-2015)` <chr>, `18-25(P Value)` <chr>,
## #   `26+(2013-2014)` <chr>, `26+(2014-2015)` <chr>, `26+(P Value)` <chr>,
## #   `18+(2013-2014)` <chr>, `18+(2014-2015)` <chr>, `18+(P Value)` <chr>
## 
## [[4]]
## # A tibble: 57 × 16
##    State     `12+(2013-2014)` `12+(2014-2015)` `12+(P Value)` `12-17(2013-2014)`
##    <chr>     <chr>            <chr>            <chr>          <chr>             
##  1 "NOTE: S… "NOTE: State an… "NOTE: State an… "NOTE: State … "NOTE: State and …
##  2 "Total U… "1.66a"          "1.76"           "0.040"        "0.60"            
##  3 "Northea… "1.94a"          "2.18"           "0.012"        "0.60"            
##  4 "Midwest" "1.37"           "1.43"           "0.282"        "0.48"            
##  5 "South"   "1.45b"          "1.56"           "0.067"        "0.53"            
##  6 "West"    "2.03"           "2.05"           "0.816"        "0.82"            
##  7 "Alabama" "1.23"           "1.22"           "0.995"        "0.42"            
##  8 "Alaska"  "1.54a"          "2.00"           "0.010"        "0.51"            
##  9 "Arizona" "2.25"           "2.29"           "0.861"        "1.01"            
## 10 "Arkansa… "0.93"           "1.07"           "0.208"        "0.41"            
## # ℹ 47 more rows
## # ℹ 11 more variables: `12-17(2014-2015)` <chr>, `12-17(P Value)` <chr>,
## #   `18-25(2013-2014)` <chr>, `18-25(2014-2015)` <chr>, `18-25(P Value)` <chr>,
## #   `26+(2013-2014)` <chr>, `26+(2014-2015)` <chr>, `26+(P Value)` <chr>,
## #   `18+(2013-2014)` <chr>, `18+(2014-2015)` <chr>, `18+(P Value)` <chr>
## 
## [[5]]
## # A tibble: 57 × 16
##    State     `12+(2013-2014)` `12+(2014-2015)` `12+(P Value)` `12-17(2013-2014)`
##    <chr>     <chr>            <chr>            <chr>          <chr>             
##  1 "NOTE: S… "NOTE: State an… "NOTE: State an… "NOTE: State … "NOTE: State and …
##  2 "Total U… "0.30"           "0.33"           "0.217"        "0.12"            
##  3 "Northea… "0.43a"          "0.54"           "0.007"        "0.13"            
##  4 "Midwest" "0.30"           "0.31"           "0.638"        "0.11"            
##  5 "South"   "0.27"           "0.26"           "0.444"        "0.12"            
##  6 "West"    "0.25"           "0.29"           "0.152"        "0.13"            
##  7 "Alabama" "0.22"           "0.27"           "0.171"        "0.10"            
##  8 "Alaska"  "0.70a"          "1.23"           "0.044"        "0.11"            
##  9 "Arizona" "0.32a"          "0.55"           "0.001"        "0.17"            
## 10 "Arkansa… "0.19"           "0.17"           "0.398"        "0.10"            
## # ℹ 47 more rows
## # ℹ 11 more variables: `12-17(2014-2015)` <chr>, `12-17(P Value)` <chr>,
## #   `18-25(2013-2014)` <chr>, `18-25(2014-2015)` <chr>, `18-25(P Value)` <chr>,
## #   `26+(2013-2014)` <chr>, `26+(2014-2015)` <chr>, `26+(P Value)` <chr>,
## #   `18+(2013-2014)` <chr>, `18+(2014-2015)` <chr>, `18+(P Value)` <chr>
## 
## [[6]]
## # A tibble: 57 × 16
##    State     `12+(2013-2014)` `12+(2014-2015)` `12+(P Value)` `12-17(2013-2014)`
##    <chr>     <chr>            <chr>            <chr>          <chr>             
##  1 "NOTE: S… "NOTE: State an… "NOTE: State an… "NOTE: State … "NOTE: State and …
##  2 "Total U… "52.42"          "52.18"          "0.337"        "11.55a"          
##  3 "Northea… "57.80a"         "56.66"          "0.009"        "13.19"           
##  4 "Midwest" "55.14a"         "54.36"          "0.032"        "11.31a"          
##  5 "South"   "48.74"          "48.85"          "0.759"        "10.87a"          
##  6 "West"    "51.67"          "52.07"          "0.383"        "11.71a"          
##  7 "Alabama" "44.72"          "43.94"          "0.533"        "10.53a"          
##  8 "Alaska"  "54.02"          "54.98"          "0.444"        "9.22"            
##  9 "Arizona" "51.80"          "51.19"          "0.613"        "11.90b"          
## 10 "Arkansa… "42.45"          "41.81"          "0.588"        "9.90"            
## # ℹ 47 more rows
## # ℹ 11 more variables: `12-17(2014-2015)` <chr>, `12-17(P Value)` <chr>,
## #   `18-25(2013-2014)` <chr>, `18-25(2014-2015)` <chr>, `18-25(P Value)` <chr>,
## #   `26+(2013-2014)` <chr>, `26+(2014-2015)` <chr>, `26+(P Value)` <chr>,
## #   `18+(2013-2014)` <chr>, `18+(2014-2015)` <chr>, `18+(P Value)` <chr>
## 
## [[7]]
## # A tibble: 57 × 4
##    State    Alcohol Use inPast M…¹ Alcohol Use inPast M…² Alcohol Use inPast M…³
##    <chr>    <chr>                  <chr>                  <chr>                 
##  1 "NOTE: … "NOTE: State and cens… "NOTE: State and cens… "NOTE: State and cens…
##  2 "Total … "22.76a"               "21.57"                "0.000"               
##  3 "Northe… "26.11"                "25.98"                "0.792"               
##  4 "Midwes… "23.73a"               "22.00"                "0.000"               
##  5 "South"  "20.68a"               "19.66"                "0.010"               
##  6 "West"   "22.73a"               "21.01"                "0.000"               
##  7 "Alabam… "19.25"                "18.19"                "0.305"               
##  8 "Alaska" "21.47b"               "23.91"                "0.058"               
##  9 "Arizon… "22.01a"               "19.25"                "0.009"               
## 10 "Arkans… "18.07"                "16.65"                "0.106"               
## # ℹ 47 more rows
## # ℹ abbreviated names: ¹​`Alcohol Use inPast Month(2013-2014)`,
## #   ²​`Alcohol Use inPast Month(2014-2015)`,
## #   ³​`Alcohol Use inPast Month(P Value)`
## 
## [[8]]
## # A tibble: 57 × 16
##    State     `12+(2013-2014)` `12+(2014-2015)` `12+(P Value)` `12-17(2013-2014)`
##    <chr>     <chr>            <chr>            <chr>          <chr>             
##  1 "NOTE: S… "NOTE: State an… "NOTE: State an… "NOTE: State … "NOTE: State and …
##  2 "Total U… "25.36a"         "24.56"          "0.000"        "7.42a"           
##  3 "Northea… "23.76"          "23.30"          "0.216"        "7.18a"           
##  4 "Midwest" "28.57a"         "27.39"          "0.000"        "8.58a"           
##  5 "South"   "26.91a"         "26.24"          "0.034"        "7.57a"           
##  6 "West"    "21.20a"         "20.29"          "0.015"        "6.31a"           
##  7 "Alabama" "31.62"          "30.46"          "0.295"        "8.43"            
##  8 "Alaska"  "29.30a"         "31.51"          "0.048"        "10.73"           
##  9 "Arizona" "22.14"          "22.51"          "0.678"        "6.81"            
## 10 "Arkansa… "35.57"          "34.05"          "0.188"        "10.89"           
## # ℹ 47 more rows
## # ℹ 11 more variables: `12-17(2014-2015)` <chr>, `12-17(P Value)` <chr>,
## #   `18-25(2013-2014)` <chr>, `18-25(2014-2015)` <chr>, `18-25(P Value)` <chr>,
## #   `26+(2013-2014)` <chr>, `26+(2014-2015)` <chr>, `26+(P Value)` <chr>,
## #   `18+(2013-2014)` <chr>, `18+(2014-2015)` <chr>, `18+(P Value)` <chr>
## 
## [[9]]
## # A tibble: 57 × 16
##    State     `12+(2013-2014)` `12+(2014-2015)` `12+(P Value)` `12-17(2013-2014)`
##    <chr>     <chr>            <chr>            <chr>          <chr>             
##  1 "NOTE: S… "NOTE: State an… "NOTE: State an… "NOTE: State … "NOTE: State and …
##  2 "Total U… "21.05a"         "20.12"          "0.000"        "5.24a"           
##  3 "Northea… "19.69"          "19.13"          "0.101"        "5.10a"           
##  4 "Midwest" "23.84a"         "22.50"          "0.000"        "6.16a"           
##  5 "South"   "22.37a"         "21.41"          "0.001"        "5.22a"           
##  6 "West"    "17.43a"         "16.66"          "0.026"        "4.56a"           
##  7 "Alabama" "25.90"          "24.25"          "0.106"        "5.66"            
##  8 "Alaska"  "22.00a"         "24.64"          "0.008"        "6.15"            
##  9 "Arizona" "18.73"          "19.23"          "0.576"        "4.95"            
## 10 "Arkansa… "28.51"          "27.81"          "0.519"        "7.13"            
## # ℹ 47 more rows
## # ℹ 11 more variables: `12-17(2014-2015)` <chr>, `12-17(P Value)` <chr>,
## #   `18-25(2013-2014)` <chr>, `18-25(2014-2015)` <chr>, `18-25(P Value)` <chr>,
## #   `26+(2013-2014)` <chr>, `26+(2014-2015)` <chr>, `26+(P Value)` <chr>,
## #   `18+(2013-2014)` <chr>, `18+(2014-2015)` <chr>, `18+(P Value)` <chr>
## 
## [[10]]
## # A tibble: 57 × 16
##    State     `12+(2013-2014)` `12+(2014-2015)` `12+(P Value)` `12-17(2013-2014)`
##    <chr>     <chr>            <chr>            <chr>          <chr>             
##  1 "NOTE: S… "NOTE: State an… "NOTE: State an… "NOTE: State … "NOTE: State and …
##  2 "Total U… "6.50a"          "6.14"           "0.001"        "2.76"            
##  3 "Northea… "6.64"           "6.39"           "0.193"        "2.88"            
##  4 "Midwest" "6.59a"          "6.25"           "0.033"        "2.70"            
##  5 "South"   "6.20a"          "5.76"           "0.003"        "2.65"            
##  6 "West"    "6.80"           "6.47"           "0.120"        "2.91"            
##  7 "Alabama" "5.76a"          "4.64"           "0.007"        "2.84b"           
##  8 "Alaska"  "6.72"           "7.43"           "0.153"        "2.12"            
##  9 "Arizona" "7.60b"          "6.66"           "0.062"        "3.37"            
## 10 "Arkansa… "5.23"           "4.88"           "0.347"        "2.63"            
## # ℹ 47 more rows
## # ℹ 11 more variables: `12-17(2014-2015)` <chr>, `12-17(P Value)` <chr>,
## #   `18-25(2013-2014)` <chr>, `18-25(2014-2015)` <chr>, `18-25(P Value)` <chr>,
## #   `26+(2013-2014)` <chr>, `26+(2014-2015)` <chr>, `26+(P Value)` <chr>,
## #   `18+(2013-2014)` <chr>, `18+(2014-2015)` <chr>, `18+(P Value)` <chr>
## 
## [[11]]
## # A tibble: 57 × 16
##    State     `12+(2013-2014)` `12+(2014-2015)` `12+(P Value)` `12-17(2013-2014)`
##    <chr>     <chr>            <chr>            <chr>          <chr>             
##  1 "NOTE: S… "NOTE: State an… "NOTE: State an… "NOTE: State … "NOTE: State and …
##  2 "Total U… "3.04"           "2.97"           "0.321"        "1.01"            
##  3 "Northea… "3.01"           "3.09"           "0.500"        "1.02"            
##  4 "Midwest" "3.06b"          "2.86"           "0.061"        "1.03"            
##  5 "South"   "2.92a"          "2.73"           "0.047"        "0.96"            
##  6 "West"    "3.25"           "3.37"           "0.409"        "1.05"            
##  7 "Alabama" "2.99a"          "2.34"           "0.026"        "0.98"            
##  8 "Alaska"  "3.21"           "3.67"           "0.200"        "0.77"            
##  9 "Arizona" "3.44"           "3.62"           "0.591"        "1.11"            
## 10 "Arkansa… "2.73"           "2.42"           "0.255"        "0.96"            
## # ℹ 47 more rows
## # ℹ 11 more variables: `12-17(2014-2015)` <chr>, `12-17(P Value)` <chr>,
## #   `18-25(2013-2014)` <chr>, `18-25(2014-2015)` <chr>, `18-25(P Value)` <chr>,
## #   `26+(2013-2014)` <chr>, `26+(2014-2015)` <chr>, `26+(P Value)` <chr>,
## #   `18+(2013-2014)` <chr>, `18+(2014-2015)` <chr>, `18+(P Value)` <chr>
## 
## [[12]]
## # A tibble: 57 × 10
##    State     `18+(2013-2014)` `18+(2014-2015)` `18+(P Value)` `18-25(2013-2014)`
##    <chr>     <chr>            <chr>            <chr>          <chr>             
##  1 "NOTE: S… "NOTE: State an… "NOTE: State an… "NOTE: State … "NOTE: State and …
##  2 "Total U… "4.15"           "4.05"           "0.325"        "4.52a"           
##  3 "Northea… "3.93"           "3.94"           "0.953"        "4.67a"           
##  4 "Midwest" "4.45"           "4.36"           "0.511"        "4.93a"           
##  5 "South"   "4.17"           "4.00"           "0.206"        "4.18a"           
##  6 "West"    "4.02"           "3.96"           "0.681"        "4.56b"           
##  7 "Alabama" "4.53"           "4.64"           "0.749"        "4.30"            
##  8 "Alaska"  "3.90"           "4.02"           "0.707"        "4.60a"           
##  9 "Arizona" "4.09"           "4.33"           "0.491"        "4.45"            
## 10 "Arkansa… "5.24"           "5.27"           "0.942"        "4.49"            
## # ℹ 47 more rows
## # ℹ 5 more variables: `18-25(2014-2015)` <chr>, `18-25(P Value)` <chr>,
## #   `26+(2013-2014)` <chr>, `26+(2014-2015)` <chr>, `26+(P Value)` <chr>
## 
## [[13]]
## # A tibble: 57 × 10
##    State     `18+(2013-2014)` `18+(2014-2015)` `18+(P Value)` `18-25(2013-2014)`
##    <chr>     <chr>            <chr>            <chr>          <chr>             
##  1 "NOTE: S… "NOTE: State an… "NOTE: State an… "NOTE: State … "NOTE: State and …
##  2 "Total U… "18.29"          "18.01"          "0.146"        "19.75a"          
##  3 "Northea… "17.87"          "17.76"          "0.735"        "20.80a"          
##  4 "Midwest" "18.61"          "18.34"          "0.356"        "20.45a"          
##  5 "South"   "18.01"          "17.82"          "0.519"        "18.22a"          
##  6 "West"    "18.79b"         "18.18"          "0.088"        "20.70a"          
##  7 "Alabama" "19.51"          "18.85"          "0.469"        "18.09"           
##  8 "Alaska"  "18.12"          "18.11"          "0.989"        "20.33a"          
##  9 "Arizona" "18.59"          "18.32"          "0.756"        "18.76"           
## 10 "Arkansa… "20.00"          "19.77"          "0.816"        "19.99"           
## # ℹ 47 more rows
## # ℹ 5 more variables: `18-25(2014-2015)` <chr>, `18-25(P Value)` <chr>,
## #   `26+(2013-2014)` <chr>, `26+(2014-2015)` <chr>, `26+(P Value)` <chr>
## 
## [[14]]
## # A tibble: 57 × 10
##    State     `18+(2013-2014)` `18+(2014-2015)` `18+(P Value)` `18-25(2013-2014)`
##    <chr>     <chr>            <chr>            <chr>          <chr>             
##  1 "NOTE: S… "NOTE: State an… "NOTE: State an… "NOTE: State … "NOTE: State and …
##  2 "Total U… "3.94"           "3.99"           "0.526"        "7.44a"           
##  3 "Northea… "3.81"           "3.93"           "0.407"        "7.60b"           
##  4 "Midwest" "4.11"           "4.14"           "0.791"        "7.83"            
##  5 "South"   "3.84"           "3.86"           "0.896"        "6.89a"           
##  6 "West"    "4.02"           "4.12"           "0.522"        "7.84"            
##  7 "Alabama" "3.98"           "4.02"           "0.885"        "7.31"            
##  8 "Alaska"  "4.21"           "4.68"           "0.237"        "8.30b"           
##  9 "Arizona" "4.23"           "4.34"           "0.775"        "7.04"            
## 10 "Arkansa… "4.58"           "4.41"           "0.682"        "6.67"            
## # ℹ 47 more rows
## # ℹ 5 more variables: `18-25(2014-2015)` <chr>, `18-25(P Value)` <chr>,
## #   `26+(2013-2014)` <chr>, `26+(2014-2015)` <chr>, `26+(P Value)` <chr>
## 
## [[15]]
## # A tibble: 57 × 13
##    State     `18+(2013-2014)` `18+(2014-2015)` `18+(P Value)` `12-17(2013-2014)`
##    <chr>     <chr>            <chr>            <chr>          <chr>             
##  1 "NOTE: S… "NOTE: State an… "NOTE: State an… "NOTE: State … "NOTE: State and …
##  2 "Total U… "6.63"           "6.64"           "0.915"        "11.01a"          
##  3 "Northea… "6.66"           "6.82"           "0.458"        "10.63a"          
##  4 "Midwest" "6.81"           "6.87"           "0.736"        "10.81a"          
##  5 "South"   "6.47"           "6.52"           "0.750"        "10.77a"          
##  6 "West"    "6.67"           "6.47"           "0.353"        "11.82a"          
##  7 "Alabama" "6.85"           "6.81"           "0.948"        "10.74"           
##  8 "Alaska"  "6.57"           "6.73"           "0.770"        "9.92a"           
##  9 "Arizona" "7.32"           "6.77"           "0.362"        "13.23"           
## 10 "Arkansa… "7.31"           "7.78"           "0.446"        "11.95"           
## # ℹ 47 more rows
## # ℹ 8 more variables: `12-17(2014-2015)` <chr>, `12-17(P Value)` <chr>,
## #   `18-25(2013-2014)` <chr>, `18-25(2014-2015)` <chr>, `18-25(P Value)` <chr>,
## #   `26+(2013-2014)` <chr>, `26+(2014-2015)` <chr>, `26+(P Value)` <chr>

This has extracted all of the tables on the original page; that’s why we have a list with 15 elements. (We haven’t really talked about lists yet, but for now you can think of them as a general collection of objects in R. As we proceed, syntax for extracting individual elements from a list will become clear, and we’ll talk lots about lists in list columns.)

We’re only focused on the first table for now, so let’s get the contents from the first list element.

table_marj = 
  drug_use_html |> 
  html_table() |> 
  first() 

I won’t print the table here, but if you look at it you’ll notice a problem: the “note” at the bottom of the table appears in every column in the first row. We need to remove that…

table_marj = 
  drug_use_html |> 
  html_table() |> 
  first() |>
  slice(-1) 

table_marj
## # A tibble: 56 × 16
##    State     `12+(2013-2014)` `12+(2014-2015)` `12+(P Value)` `12-17(2013-2014)`
##    <chr>     <chr>            <chr>            <chr>          <chr>             
##  1 Total U.… 12.90a           13.36            0.002          13.28b            
##  2 Northeast 13.88a           14.66            0.005          13.98             
##  3 Midwest   12.40b           12.76            0.082          12.45             
##  4 South     11.24a           11.64            0.029          12.02             
##  5 West      15.27            15.62            0.262          15.53a            
##  6 Alabama   9.98             9.60             0.426          9.90              
##  7 Alaska    19.60a           21.92            0.010          17.30             
##  8 Arizona   13.69            13.12            0.364          15.12             
##  9 Arkansas  11.37            11.59            0.678          12.79             
## 10 Californ… 14.49            15.25            0.103          15.03             
## # ℹ 46 more rows
## # ℹ 11 more variables: `12-17(2014-2015)` <chr>, `12-17(P Value)` <chr>,
## #   `18-25(2013-2014)` <chr>, `18-25(2014-2015)` <chr>, `18-25(P Value)` <chr>,
## #   `26+(2013-2014)` <chr>, `26+(2014-2015)` <chr>, `26+(P Value)` <chr>,
## #   `18+(2013-2014)` <chr>, `18+(2014-2015)` <chr>, `18+(P Value)` <chr>

Success!! At least, mostly. These data aren’t tidy, but we’ll worry about that soon.

Learning assessment: Create a data frame that contains the cost of living table for New York from this page.

Solution

The code below shows one approach to this data cleaning process:

nyc_cost = 
  read_html("https://www.bestplaces.net/cost_of_living/city/new_york/new_york") |>
  html_table(header = TRUE) |>
  first()

In case you hadn’t known, NYC is kind of expensive.

CSS Selectors

Suppose we’d like to scrape the data about the Star Wars Movies from the IMDB page. The first step is the same as before – we need to get the HTML.

swm_html = 
  read_html("https://www.imdb.com/list/ls070150896/")

The information isn’t stored in a handy table, so we’re going to isolate the CSS selector for elements we care about. A bit of clicking around gets me something like below.

For each element, I’ll use the CSS selector in html_elements() to extract the relevant HTML code, and convert it to text. Then I can combine these into a data frame.

title_vec = 
  swm_html |>
  html_elements(".lister-item-header a") |>
  html_text()

gross_rev_vec = 
  swm_html |>
  html_elements(".text-small:nth-child(7) span:nth-child(5)") |>
  html_text()

runtime_vec = 
  swm_html |>
  html_elements(".runtime") |>
  html_text()

swm_df = 
  tibble(
    title = title_vec,
    rev = gross_rev_vec,
    runtime = runtime_vec)

Some of these movies made real money …

Learning Assessment: This page contains some (made up) books. Use a process similar to the one above to extract the book titles, stars, and prices.

Solution

The code below will give me relevant information for the books on this page:

url = "http://books.toscrape.com"

books_html = read_html(url)

books_titles = 
  books_html |>
  html_elements("h3") |>
  html_text2()

books_stars = 
  books_html |>
  html_elements(".star-rating") |>
  html_attr("class")

books_price = 
  books_html |>
  html_elements(".price_color") |>
  html_text()

books = tibble(
  title = books_titles,
  stars = books_stars,
  price = books_price
)

Using an API

New York City has a great open data resource, and we’ll use that for our API examples. Although most (all?) of these datasets can be accessed by clicking through a website, we’ll access them directly using the API to improve reproducibility and make it easier to update results to reflect new data.

As a simple example, this page is about a dataset for annual water consumption in NYC, along with the population in that year. First, we’ll import this as a CSV and parse it.

nyc_water = 
  GET("https://data.cityofnewyork.us/resource/ia2d-e54m.csv") |> 
  content("parsed")
## Rows: 44 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): year, new_york_city_population, nyc_consumption_million_gallons_per...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We can also import this dataset as a JSON file. This takes a bit more work (and this is, really, a pretty easy case), but it’s still doable.

nyc_water = 
  GET("https://data.cityofnewyork.us/resource/ia2d-e54m.json") |> 
  content("text") |>
  jsonlite::fromJSON() |>
  as_tibble()

Data.gov also has a lot of data available using their API; often this is available as CSV or JSON as well. For example, we might be interested in data coming from BRFSS. This is importable via the API as a CSV (JSON, in this example, is more complicated).

brfss_smart2010 = 
  GET("https://chronicdata.cdc.gov/resource/acme-vg9e.csv",
      query = list("$limit" = 5000)) |> 
  content("parsed")

By default, the CDC API limits data to the first 1000 rows. Here I’ve increased that by changing an element of the API query – I looked around the website describing the API to find the name of the argument, and then used the appropriate syntax for GET. To get the full data, I could increase this so that I get all the data at once or I could try iterating over chunks of a few thousand rows.

Both of the previous examples are, actually, pretty easy – we accessed data that is essentially a data table, and we had a very straightforward API (although updating queries isn’t obvious at first).

To get a sense of how this becomes complicated, let’s look at the Pokemon API (which is also pretty nice).

poke = 
  GET("http://pokeapi.co/api/v2/pokemon/1") |>
  content()

poke$name
## [1] "bulbasaur"
poke$height
## [1] 7
poke$abilities
## [[1]]
## [[1]]$ability
## [[1]]$ability$name
## [1] "overgrow"
## 
## [[1]]$ability$url
## [1] "https://pokeapi.co/api/v2/ability/65/"
## 
## 
## [[1]]$is_hidden
## [1] FALSE
## 
## [[1]]$slot
## [1] 1
## 
## 
## [[2]]
## [[2]]$ability
## [[2]]$ability$name
## [1] "chlorophyll"
## 
## [[2]]$ability$url
## [1] "https://pokeapi.co/api/v2/ability/34/"
## 
## 
## [[2]]$is_hidden
## [1] TRUE
## 
## [[2]]$slot
## [1] 3

To build a Pokemon dataset for analysis, you’d need to distill the data returned from the API into a useful format; iterate across all pokemon; and combine the results.

For both of the API examples we saw today, it wouldn’t be terrible to just download the CSV, document where it came from carefully, and move on. APIs are more helpful when the full dataset is complex and you only need pieces, or when the data are updated regularly.

Be reasonable

When you’re reading data from the web, remember you’re accessing resources on someone else’s server – either by reading HTML or by accessing data via an API. In some cases, those who make data public will take steps to limit bandwidth devoted to a small number of users. Amazon and IMDB, for example, probably won’t notice if you scrape small amounts of data but would notice if you tried to read data from thousands of pages every time you knitted a document.

Similarly, API developers can (and will) limit the number of database entries that can be accessed in a single request. In those cases you’d have to take some steps to iterate over “pages” and combine the results; as an example, our code for the NYC Restaurant Inspections does this. In some cases, API developers protect themselves from unreasonable use by requiring users to be authenticated – it’s still possible to use httr in these cases, but we won’t get into it.

Other materials

  • A recent short course presented similar topics to those above; a GitHub repo for the course is here
  • A lot of NYC data is public; this is a good place to start looking for interesting data
  • There are some cool projects based on scraped data; the RStudio community collected some here
  • Check out the R file used to create the starwars dataset (in the tidyverse) using the Star Wars API (from the maker of the Pokemon API).
  • Some really helpful R packages are wrappers for APIs – the rnoaa package we’ve used is an example, and so is rtweet

The code that I produced working examples in lecture is here.