Update 24-web_scraping.Rmd (#129)
* Update 24-web_scraping.Rmd

Request from Andrew Pua of r4ds Cohort 09 to add presented material to Chapter 24 Web Scraping.

* Update 24-web_scraping.Rmd

Added `eval: FALSE` in the relevant places so that the file can be rendered.

* Update 24-web_scraping.Rmd

* Update 24-web_scraping.Rmd

Moved cohort videos to end of book slides and removed slide templates

* Update 24-web_scraping.Rmd

Tweaked the Learning Objectives section so it appears on the title page.

---------

Co-authored-by: Lydia Gibson, MS, GStat <[email protected]>
Co-authored-by: Ken Đinh Vũ <[email protected]>
Co-authored-by: Jon Harmon <[email protected]>
4 people authored Apr 10, 2024
1 parent 173ed30 commit 498e959
Showing 1 changed file with 187 additions and 6 deletions.
# Web scraping

**Learning objectives:**

1. Learn a bit of HTML
2. Use that bit of HTML to learn web scraping.

## Curious features of the chapter

- No exercises!
- Ethical implications of web scraping
- Using an external tool like SelectorGadget
- Limitations of scraping for dynamic websites

## Typical HTML structure

- HTML has a hierarchical structure.
- The structure is composed of elements.
- Each element has
  - A start tag
  - Attributes
  - Content
  - An end tag
- Each element can have children, which are themselves elements.
- This consistent structure is what makes web scraping possible (see the sketch below).
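
- A small sketch (not from the book) using a made-up fragment and rvest's `minimal_html()` to make the vocabulary concrete:

```{r}
library(rvest)

# Made-up fragment: <p id="intro"> is a start tag, id="intro" an attribute,
# the text inside is content, and </p> is the end tag.
html <- minimal_html('
  <div class="greeting">
    <p id="intro">Hello, <b>world</b>!</p>
  </div>
')

# <p> is a child of <div>, and <b> is in turn a child of <p>.
html |> html_element("div") |> html_children()
```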

## Key commands to extract data from HTML

- Key package: `rvest`
- Load html to scrape: `read_html()`
- Extract data through selectors
- `html_elements()`, `html_element()`
- `html_attr()`
- `html_text2()`
- Extract data from HTML tables: `html_table()` (a short sketch follows)
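
- A minimal sketch, again with made-up HTML rather than the book's example, exercising these functions:

```{r}
library(rvest)

# A tiny page containing a link, a paragraph, and a table.
html <- minimal_html('
  <a href="https://example.com">A link</a>
  <p>Some    spaced    text</p>
  <table>
    <tr><th>x</th><th>y</th></tr>
    <tr><td>1</td><td>2</td></tr>
  </table>
')

html |> html_element("a") |> html_attr("href")   # "https://example.com"
html |> html_element("p") |> html_text2()        # whitespace-collapsed text
html |> html_element("table") |> html_table()    # a tibble with columns x, y
```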

## Example: Loading IMDb data

- Load packages

```{r}
library(tidyverse)
library(rvest)
```

- The code chunk below may be slow on weak internet connections, and it can fail outright on websites that guard against scraping.

```{r}
#| eval: FALSE
# Using the book's URL directly can time out or be blocked
url <- "https://web.archive.org/web/20220201012049/https://www.imdb.com/chart/top/"
html <- read_html(url)
```

- Download the HTML file locally instead, then use `read_html()` directly on the downloaded file (a caching sketch follows the chunk below).

```{r}
#| eval: FALSE
# Suggestion from https://stackoverflow.com/questions/33295686/rvest-error-in-open-connectionx-rb-timeout-was-reached
download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
html.book <- read_html("scrapedpage.html")
```
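
- A small caching sketch (my addition, reusing `url` from above): skip the download when a local copy already exists, which also avoids hitting the site repeatedly.

```{r}
#| eval: FALSE
# Only hit the network when the local copy is missing.
if (!file.exists("scrapedpage.html")) {
  download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
}
html.book <- read_html("scrapedpage.html")
```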

- A quirk of the Internet Archive: some website snapshots are listed as links but are not actually accessible.

- You may encounter a 403 Forbidden error (a sketch of guarding against this follows the chunk below).

```{r}
#| eval: FALSE
# Not just any URL will work
url <- "https://web.archive.org/web/20240223185506/https://www.imdb.com/chart/top/"
download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
```
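
- A hedged sketch of failing gracefully: `download.file()` signals an error on an HTTP 403, so it can be wrapped in `tryCatch()`. The helper `safe_download()` is hypothetical, not part of any package.

```{r}
#| eval: FALSE
# Hypothetical helper: returns TRUE on success, FALSE (with a message) on failure.
safe_download <- function(url, destfile) {
  tryCatch(
    {
      download.file(url, destfile = destfile, quiet = TRUE)
      TRUE
    },
    error = function(e) {
      message("Download failed: ", conditionMessage(e))
      FALSE
    }
  )
}

safe_download(url, "scrapedpage.html")
```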

- And even when a snapshot is accessible, the scraped file may be part of a dynamic website, so its content may not match what a browser shows.
- A very old snapshot is available for comparison.

```{r}
#| eval: FALSE
url <- "https://web.archive.org/web/20040704034814/https://www.imdb.com/chart/top/"
download.file(url, destfile = "scrapedpage-old.html", quiet = TRUE)
```

- But the structure of the website and the associated HTML have changed.
- This means the code you see in the book will not work out of the box.

```{r}
#| eval: FALSE
# Read the old snapshot downloaded above.
html.old <- read_html("scrapedpage-old.html")

# Check which <table> elements carry border = "1"; on this snapshot the
# chart turns out to live in table 21.
temp <- html.old |>
  html_elements("table") |>
  html_attr("border")
which(temp == "1")

temp <- html.old |>
  html_elements("table")
table.old <- html_table(temp[21], header = TRUE)

# Rename the columns, then split "Title (Year)" into title and year.
ratings <- table.old[[1]] |>
  select(
    rank = "Rank",
    title_year = "Title",
    rating = "Rating",
    votes = "Votes"
  ) |>
  mutate(votes = parse_number(votes)) |>
  separate_wider_regex(
    title_year,
    patterns = c(
      title = ".+", " +\\(",
      year = "\\d+", "\\)"
    )
  )
ratings
```

- Let us compare this with the book's version.

```{r}
#| eval: FALSE
html.book <- read_html("scrapedpage.html")

table.book <- html.book |>
  html_element("table") |>
  html_table()

ratings.book <- table.book |>
  select(
    rank_title_year = `Rank & Title`,
    rating = `IMDb Rating`
  ) |>
  mutate(
    rank_title_year = str_replace_all(rank_title_year, "\n +", " "),
    # The vote counts live in the title attribute of <td><strong> elements.
    rating_n = html.book |> html_elements("td strong") |> html_attr("title")
  ) |>
  separate_wider_regex(
    rank_title_year,
    patterns = c(
      rank = "\\d+", "\\. ",
      title = ".+", " +\\(",
      year = "\\d+", "\\)"
    )
  ) |>
  separate_wider_regex(
    rating_n,
    patterns = c(
      "[0-9.]+ based on ",
      number = "[0-9,]+",
      " user ratings"
    )
  ) |>
  mutate(
    number = parse_number(number)
  )
ratings.book$title
```

- Let me point out some things about the two datasets that make post-processing challenging.
- I don't resolve them here; resolving them requires first deciding what questions you want answered. (A diagnostic sketch follows the chunk below.)

```{r}
#| eval: FALSE
# Titles formatted differently in the two snapshots will not match in the join,
# leaving NA values in the joined columns.
ratings$title
left_join(ratings, ratings.book, by = "title")
```
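
- One way (my suggestion, not part of the presented code) to see the mismatches directly is `dplyr::anti_join()`, which keeps only the rows of `ratings` whose titles find no match in `ratings.book`:

```{r}
#| eval: FALSE
# Rows of the old snapshot with no exact title match in the book's data.
anti_join(ratings, ratings.book, by = "title")
```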

## Example: Quarantine Zine Club

- The task is to create a spreadsheet with the titles of the zines, their authors, the links to download the zines, and the first social-media link of each author.
- I used XPath here, which is not discussed in the book (see the selector comparison after the chunk below).
- The code chunk below should work, but it gives an error when processed at the GitHub repo of bookclub-r4ds even though it works on my machine. I have yet to figure out why.

```{r}
#| eval: FALSE
url <- "https://quarantinezineclub.neocities.org/zinelibrary"
html <- read_html(url)

# Collect the pieces that id selectors and one XPath query can reach directly.
zinelib <- tibble(
  authors = html |> html_elements("#zname") |> html_text2(),
  titles = html |> html_elements("#libraryheader") |> html_text2(),
  pdflink = html |>
    html_elements(xpath = "//div[@class='column right']") |>
    html_element("a") |>
    html_attr("href"),
  mark = 2:156  # positions of the zine <div>s on this particular page
)

# The first social-media link needs a positional XPath query per zine.
get.first.social <- function(num) {
  html |>
    html_elements(xpath = paste0("/html/body/div[2]/div[", num, "]/div[2]/a[1]")) |>
    html_attr("href")
}

zinelib <- zinelib |>
  mutate(
    pdflink = paste0("https://quarantinezineclub.neocities.org/", pdflink),
    social1 = get.first.social(mark)
  ) |>
  select(-mark)
zinelib
```
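
- For comparison (an assumption on my part, not from the book): the attribute-equality XPath above should line up with a CSS class selector on this page, though CSS also matches elements that carry additional classes.

```{r}
#| eval: FALSE
# XPath: class attribute exactly equal to "column right".
html |> html_elements(xpath = "//div[@class='column right']") |> length()
# CSS: elements with both the "column" and "right" classes (order-insensitive).
html |> html_elements("div.column.right") |> length()
```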

## Meeting Videos

### Cohort 8

`r knitr::include_url("https://www.youtube.com/embed/dOVWSSqUvt0")`

### Cohort 9

`r knitr::include_url("https://www.youtube.com/embed/Hs928CH-_E4")`

