Update 24-web_scraping.Rmd (#129)
* Update 24-web_scraping.Rmd

Request from Andrew Pua of r4ds Cohort 09 to add presented material to Chapter 24 Web Scraping.

* Update 24-web_scraping.Rmd

Added `eval: FALSE` in the relevant places so that the file can be rendered.

* Update 24-web_scraping.Rmd

* Update 24-web_scraping.Rmd

Moved cohort videos to end of book slides and removed slide templates

* Update 24-web_scraping.Rmd

Tweaked the Learning Objectives section so it appears on the title page.

---------

Co-authored-by: Lydia Gibson, MS, GStat <[email protected]>
Co-authored-by: Ken Đinh Vũ <[email protected]>
Co-authored-by: Jon Harmon <[email protected]>
4 people authored Apr 10, 2024
1 parent 173ed30 commit 498e959
Showing 1 changed file with 187 additions and 6 deletions.
# Web scraping

**Learning objectives:**

1. Learn a bit of HTML
2. Use that bit of HTML to learn web scraping.

## Curious features of the chapter

- No exercises!
- Ethical implications of web scraping
- Using an external tool like SelectorGadget
- Limitations of scraping for dynamic websites

## Typical HTML structure

- HTML has a hierarchical structure.
- The structure is composed of elements.
- Each element has
  - A start tag
  - Attributes
  - Content
  - An end tag
- Each element can have children, which are themselves elements.
- This consistent structure is what makes web scraping possible (see the sketch below).
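
- A small sketch (not from the book) using a made-up fragment and rvest's `minimal_html()` to make the vocabulary concrete:

```{r}
library(rvest)

# Made-up fragment: <p id="intro"> is a start tag, id="intro" an attribute,
# the text inside is content, and </p> is the end tag.
html <- minimal_html('
  <div class="greeting">
    <p id="intro">Hello, <b>world</b>!</p>
  </div>
')

# <p> is a child of <div>, and <b> is in turn a child of <p>.
html |> html_element("div") |> html_children()
```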

## Key commands to extract data from HTML

- Key package: `rvest`
- Load html to scrape: `read_html()`
- Extract data through selectors
- `html_elements()`, `html_element()`
- `html_attr()`
- `html_text2()`
- Extract data from HTML tables: `html_table()` (a short sketch follows)
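
- A minimal sketch, again with made-up HTML rather than the book's example, exercising these functions:

```{r}
library(rvest)

# A tiny page containing a link, a paragraph, and a table.
html <- minimal_html('
  <a href="https://example.com">A link</a>
  <p>Some    spaced    text</p>
  <table>
    <tr><th>x</th><th>y</th></tr>
    <tr><td>1</td><td>2</td></tr>
  </table>
')

html |> html_element("a") |> html_attr("href")   # "https://example.com"
html |> html_element("p") |> html_text2()        # whitespace-collapsed text
html |> html_element("table") |> html_table()    # a tibble with columns x, y
```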

## Example: Loading IMDb data

- Load packages

```{r}
library(tidyverse)
library(rvest)
```

- The code chunk below may be slow on weak internet connections, and it can fail outright on websites that guard against scraping.

```{r}
#| eval: FALSE
# Using the book's URL directly can time out or be blocked
url <- "https://web.archive.org/web/20220201012049/https://www.imdb.com/chart/top/"
html <- read_html(url)
```

- Download the HTML file locally instead, then use `read_html()` directly on the downloaded file (a caching sketch follows the chunk below).

```{r}
#| eval: FALSE
# Suggestion from https://stackoverflow.com/questions/33295686/rvest-error-in-open-connectionx-rb-timeout-was-reached
download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
html.book <- read_html("scrapedpage.html")
```
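
- A small caching sketch (my addition, reusing `url` from above): skip the download when a local copy already exists, which also avoids hitting the site repeatedly.

```{r}
#| eval: FALSE
# Only hit the network when the local copy is missing.
if (!file.exists("scrapedpage.html")) {
  download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
}
html.book <- read_html("scrapedpage.html")
```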

- A quirk of the Internet Archive: some website snapshots are listed as links but are not actually accessible.

- You may encounter a 403 Forbidden error (a sketch of guarding against this follows the chunk below).

```{r}
#| eval: FALSE
# Not just any URL will work
url <- "https://web.archive.org/web/20240223185506/https://www.imdb.com/chart/top/"
download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
```
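
- A hedged sketch of failing gracefully: `download.file()` signals an error on an HTTP 403, so it can be wrapped in `tryCatch()`. The helper `safe_download()` is hypothetical, not part of any package.

```{r}
#| eval: FALSE
# Hypothetical helper: returns TRUE on success, FALSE (with a message) on failure.
safe_download <- function(url, destfile) {
  tryCatch(
    {
      download.file(url, destfile = destfile, quiet = TRUE)
      TRUE
    },
    error = function(e) {
      message("Download failed: ", conditionMessage(e))
      FALSE
    }
  )
}

safe_download(url, "scrapedpage.html")
```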

- And even when a snapshot is accessible, the scraped file may be part of a dynamic website, so its content may not match what a browser shows.
- A very old snapshot is available for comparison.

```{r}
#| eval: FALSE
url <- "https://web.archive.org/web/20040704034814/https://www.imdb.com/chart/top/"
download.file(url, destfile = "scrapedpage-old.html", quiet = TRUE)
```

- But the structure of the website and the associated HTML have changed.
- This means the code you see in the book will not work out of the box.

```{r}
#| eval: FALSE
# Read the old snapshot downloaded above.
html.old <- read_html("scrapedpage-old.html")

# Check which <table> elements carry border = "1"; on this snapshot the
# chart turns out to live in table 21.
temp <- html.old |>
  html_elements("table") |>
  html_attr("border")
which(temp == "1")

temp <- html.old |>
  html_elements("table")
table.old <- html_table(temp[21], header = TRUE)

# Rename the columns, then split "Title (Year)" into title and year.
ratings <- table.old[[1]] |>
  select(
    rank = "Rank",
    title_year = "Title",
    rating = "Rating",
    votes = "Votes"
  ) |>
  mutate(votes = parse_number(votes)) |>
  separate_wider_regex(
    title_year,
    patterns = c(
      title = ".+", " +\\(",
      year = "\\d+", "\\)"
    )
  )
ratings
```

- Let us compare this with the book's version.

```{r}
#| eval: FALSE
html.book <- read_html("scrapedpage.html")

table.book <- html.book |>
  html_element("table") |>
  html_table()

ratings.book <- table.book |>
  select(
    rank_title_year = `Rank & Title`,
    rating = `IMDb Rating`
  ) |>
  mutate(
    rank_title_year = str_replace_all(rank_title_year, "\n +", " "),
    # The vote counts live in the title attribute of <td><strong> elements.
    rating_n = html.book |> html_elements("td strong") |> html_attr("title")
  ) |>
  separate_wider_regex(
    rank_title_year,
    patterns = c(
      rank = "\\d+", "\\. ",
      title = ".+", " +\\(",
      year = "\\d+", "\\)"
    )
  ) |>
  separate_wider_regex(
    rating_n,
    patterns = c(
      "[0-9.]+ based on ",
      number = "[0-9,]+",
      " user ratings"
    )
  ) |>
  mutate(
    number = parse_number(number)
  )
ratings.book$title
```

- Let me point out some things about the two datasets that make post-processing challenging.
- I don't resolve them here; resolving them requires first deciding what questions you want answered. (A diagnostic sketch follows the chunk below.)

```{r}
#| eval: FALSE
# Titles formatted differently in the two snapshots will not match in the join,
# leaving NA values in the joined columns.
ratings$title
left_join(ratings, ratings.book, by = "title")
```
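
- One way (my suggestion, not part of the presented code) to see the mismatches directly is `dplyr::anti_join()`, which keeps only the rows of `ratings` whose titles find no match in `ratings.book`:

```{r}
#| eval: FALSE
# Rows of the old snapshot with no exact title match in the book's data.
anti_join(ratings, ratings.book, by = "title")
```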

## Example: Quarantine Zine Club

- The task is to create a spreadsheet with the titles of the zines, their authors, the links to download the zines, and the first social-media link of each author.
- I used XPath here, which is not discussed in the book (see the selector comparison after the chunk below).
- The code chunk below should work, but it gives an error when processed at the GitHub repo of bookclub-r4ds even though it works on my machine. I have yet to figure out why.

```{r}
#| eval: FALSE
url <- "https://quarantinezineclub.neocities.org/zinelibrary"
html <- read_html(url)

# Collect the pieces that id selectors and one XPath query can reach directly.
zinelib <- tibble(
  authors = html |> html_elements("#zname") |> html_text2(),
  titles = html |> html_elements("#libraryheader") |> html_text2(),
  pdflink = html |>
    html_elements(xpath = "//div[@class='column right']") |>
    html_element("a") |>
    html_attr("href"),
  mark = 2:156  # positions of the zine <div>s on this particular page
)

# The first social-media link needs a positional XPath query per zine.
get.first.social <- function(num) {
  html |>
    html_elements(xpath = paste0("/html/body/div[2]/div[", num, "]/div[2]/a[1]")) |>
    html_attr("href")
}

zinelib <- zinelib |>
  mutate(
    pdflink = paste0("https://quarantinezineclub.neocities.org/", pdflink),
    social1 = get.first.social(mark)
  ) |>
  select(-mark)
zinelib
```
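
- For comparison (an assumption on my part, not from the book): the attribute-equality XPath above should line up with a CSS class selector on this page, though CSS also matches elements that carry additional classes.

```{r}
#| eval: FALSE
# XPath: class attribute exactly equal to "column right".
html |> html_elements(xpath = "//div[@class='column right']") |> length()
# CSS: elements with both the "column" and "right" classes (order-insensitive).
html |> html_elements("div.column.right") |> length()
```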

## Meeting Videos

### Cohort 8

`r knitr::include_url("https://www.youtube.com/embed/dOVWSSqUvt0")`

### Cohort 9

`r knitr::include_url("https://www.youtube.com/embed/Hs928CH-_E4")`

