Hardcover Web Scrape

Goals of the notebook

In this notebook, I’m going to

  • import the Post45 Data Collective’s NYT bestseller data set
  • web scrape the NYT Hardcover Fiction List
  • clean the data
  • join the data sets
  • save the data

I’ll load any libraries I might need.

library(readr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ purrr     1.0.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(lubridate)
library(rvest)

Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding

Main Data Import

I’ll import the years I already have. I downloaded this data from the Post45 Data Collective, which is run by Emory University. It covers every week of the New York Times Bestsellers list from the list’s start in 1931 through December 2020.

bestsellers_post_45 <- read_tsv("data-raw/nyt_full.tsv") |> 
  glimpse()
Rows: 60386 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (2): title, author
dbl  (3): year, rank, title_id
date (1): week

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 60,386
Columns: 6
$ year     <dbl> 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1…
$ week     <date> 1931-10-12, 1931-10-12, 1931-10-12, 1931-10-12, 1931-10-12, …
$ rank     <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2…
$ title_id <dbl> 6477, 1808, 5304, 4038, 3946, 2878, 7031, 413, 859, 4235, 287…
$ title    <chr> "THE TEN COMMANDMENTS", "FINCHE'S FORTUNE", "THE GOOD EARTH",…
$ author   <chr> "Warwick Deeping", "Mazo de la Roche", "Pearl S. Buck", "Will…

Hardcover Fiction Web Scrape

Now, I’m going to get the missing years of the hardcover fiction data by web scraping directly from the NYT’s website. You might need to run the code multiple times before it completes successfully (see the retry sketch after the chunk). I also recommend commenting the code out after it completes, since it takes a while to run.

# start_date <- as.Date("2020-12-13")
# current_date <- Sys.Date()
# day_of_week <- wday(current_date)
# 
# # note: the list for the following week is published on Wednesdays at 7pm EST
# 
# if (day_of_week == 1) {
#   end_date <- current_date + 7
# } else if (day_of_week >= 2 & day_of_week <= 4) {
#   days_until_sunday <- 8 - day_of_week
#   end_date <- current_date + days_until_sunday # end_date is the Sunday after current_date
# } else if (day_of_week >= 5 & day_of_week <= 7) {
#   days_until_sunday <- 8 - day_of_week
#   end_date <- current_date + days_until_sunday + 7 # end_date is two Sundays after current_date
# }
# 
# dates <- seq(start_date, end_date, by = "week")
# 
# # Create output directory if it doesn't exist
# output_dir <- "data-raw/bestsellers-hardcover-weeks"
# if (!dir.exists(output_dir)) dir.create(output_dir, recursive = TRUE)
# 
# # Function to scrape and save data for each date
# walk(dates, function(date) { # walk() rather than map(), since we only want the side effect of saving files
#   url <- sprintf("https://www.nytimes.com/books/best-sellers/%s/hardcover-fiction/", format(date, "%Y/%m/%d"))
#   page <- read_html(url)
#   
#   titles <- page %>% html_nodes(".css-2jegzb") %>% html_text()
#   authors <- page %>% html_nodes(".css-1aaqvca") %>% html_text()
#   publishers <- page %>% html_nodes(".css-1w6oav3") %>% html_text()
#   descriptions <- page %>% html_nodes(".css-17af87k") %>% html_text()
#   
#   ranks <- seq_along(titles)
# 
#   df <- data.frame(title = titles, author = authors, rank = ranks, date = date, publisher = publishers, description = descriptions)
#   
#   # Save each week's data as an RDS file
#   file_path <- file.path(output_dir, paste0("bestsellers-hardcover-", date, ".rds"))
#   write_rds(df, file_path)
#   
#   #print(date) # for status
# })
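
One reason the scrape fails intermittently is that read_html() can error out on a slow or flaky connection. Here’s a minimal retry sketch of my own (the retry_read_html() name, attempt count, and pause length are assumptions, not part of the original code); you’d call it in place of read_html(url) inside the chunk above.

retry_read_html <- function(url, attempts = 3, pause = 2) {
  for (i in seq_len(attempts)) {
    page <- tryCatch(read_html(url), error = function(e) NULL) # NULL if the request fails
    if (!is.null(page)) return(page)
    Sys.sleep(pause) # brief pause before trying again
  }
  stop("Failed to read ", url, " after ", attempts, " attempts")
}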

Now I’ll combine all the weeks into one file.

file_list <- list.files("data-raw/bestsellers-hardcover-weeks", full.names = TRUE)
dfs <- map_dfr(file_list, read_rds)
write_rds(dfs, "data-raw/bestsellers-hardcover-all-weeks.rds")

I’ll read the combined file into a new object and glimpse it.

bestsellers_web_scrape <- read_rds("data-raw/bestsellers-hardcover-all-weeks.rds") # reading the combined file into a new object

bestsellers_web_scrape |> glimpse() # glimpsing the data
Rows: 3,345
Columns: 6
$ title       <chr> "READY PLAYER TWO", "DEADLY CROSS", "THE RETURN", "A TIME …
$ author      <chr> "by Ernest Cline", "by James Patterson", "by Nicholas Spar…
$ rank        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1, 2, 3…
$ date        <date> 2020-12-13, 2020-12-13, 2020-12-13, 2020-12-13, 2020-12-1…
$ publisher   <chr> "Ballantine", "Little, Brown", "Grand Central", "Doubleday…
$ description <chr> "In a sequel to “Ready Player One,” Wade Watts discovers a…
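
Because the scrape can fail partway through, it’s worth confirming that no weeks are missing before moving on. This continuity check is my own addition, not part of the original workflow; it flags any pair of consecutive scraped dates that aren’t exactly seven days apart.

bestsellers_web_scrape |> 
  distinct(date) |> 
  arrange(date) |> 
  mutate(gap = as.numeric(date - lag(date))) |> # days since the previous scraped week
  filter(gap != 7) # zero rows means no missing weeks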

Cleaning

First, I’ll clean the main data file from the Post45 Data Collective. The only thing I need to do is remove the title_id column since I won’t be using it.

bestsellers_post_45_clean <- bestsellers_post_45 |> # saving this chunk into a new object and starting with the data
  select(-title_id) |> # removing the title_id column
  glimpse() # glimpsing the data
Rows: 60,386
Columns: 5
$ year   <dbl> 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 193…
$ week   <date> 1931-10-12, 1931-10-12, 1931-10-12, 1931-10-12, 1931-10-12, 19…
$ rank   <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, …
$ title  <chr> "THE TEN COMMANDMENTS", "FINCHE'S FORTUNE", "THE GOOD EARTH", "…
$ author <chr> "Warwick Deeping", "Mazo de la Roche", "Pearl S. Buck", "Willa …

Now, I’ll clean the web scrape to make it match the first dataset.

bestsellers_web_scrape_clean <- bestsellers_web_scrape |> # saving this chunk into a new object and starting with the data
  mutate(year = year(date), # making a year column
         week = date, # making a new date column called "week" to match the first dataset
         author = str_remove(author, "^by "), # removing only the leading "by " so a "by" inside a name stays intact
         rank = as.numeric(rank)) |> # changing the rank column from int to dbl
  select(year,
         week,
         rank,
         title,
         author,
         publisher,
         description) |> # putting the columns in the same order as the first dataset
  glimpse() # glimpsing the data
Rows: 3,345
Columns: 7
$ year        <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020…
$ week        <date> 2020-12-13, 2020-12-13, 2020-12-13, 2020-12-13, 2020-12-1…
$ rank        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1, 2, 3…
$ title       <chr> "READY PLAYER TWO", "DEADLY CROSS", "THE RETURN", "A TIME …
$ author      <chr> "Ernest Cline", "James Patterson", "Nicholas Sparks", "Joh…
$ publisher   <chr> "Ballantine", "Little, Brown", "Grand Central", "Doubleday…
$ description <chr> "In a sequel to “Ready Player One,” Wade Watts discovers a…

Combining the data

Now that both of the hardcover fiction datasets have the same columns, I’ll combine them. All of the rows from the Post45 dataset will have NA for publisher and description, but I’m not worried about that.

bestsellers_full <- bestsellers_post_45_clean |> # saving this chunk into a new object and starting with the data
  bind_rows(bestsellers_web_scrape_clean) |> # binding the two datasets together
  glimpse()
Rows: 63,731
Columns: 7
$ year        <dbl> 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931…
$ week        <date> 1931-10-12, 1931-10-12, 1931-10-12, 1931-10-12, 1931-10-1…
$ rank        <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1…
$ title       <chr> "THE TEN COMMANDMENTS", "FINCHE'S FORTUNE", "THE GOOD EART…
$ author      <chr> "Warwick Deeping", "Mazo de la Roche", "Pearl S. Buck", "W…
$ publisher   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ description <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
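
As a quick sanity check (this snippet is my own addition, not part of the original notebook), every row missing a publisher should have come from the Post45 data, so the TRUE count here should match its 60,386 rows.

bestsellers_full |> 
  count(missing_publisher = is.na(publisher)) # TRUE = Post45 rows, FALSE = scraped rows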

Fixing non-split rows

There are actually some rows in the Post45 dataset where the title and author aren’t split properly.

I’ll start by removing those rows from the main dataset, then creating a new object with just those rows so I can fix them.

Then, I’ll separate the title and author into new columns using the comma between them. I’ll also remove the "by " and the publisher information.

bestsellers_no_na <- bestsellers_full |> # saving the data into a new object
  filter(!is.na(author)) # removing rows where author is NA

bestsellers_na <- bestsellers_full |> # saving the data into a new object
  filter(is.na(author)) # keeping only rows where author is NA

bestsellers_na_clean <- bestsellers_na |> # saving the data into a new object
  separate(title, sep = ",", into = c("title", "author")) |> # separate the title column into two columns named title and author based on the comma
  mutate(author = str_remove(author, "^\\s*by ")) |> # removing the leading "by " and any space before it
  mutate(author = str_remove_all(author, "\\(.*")) |> # removing the publisher information, which starts at the open parenthesis
  mutate(author = str_sub(author, end = -3)) # removing the trailing period and space by dropping the last two characters
Warning: Expected 2 pieces. Additional pieces discarded in 7 rows [1, 2, 3, 4,
5, 6, 7].
bestsellers_na_clean
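
The warning above means seven of these rows contained more than one comma, so anything after the second comma was discarded. If that ever drops text you need, separate() can merge the extra pieces into the author column instead and let the later cleanup steps handle them. A variant sketch, not what I ran above:

bestsellers_na |> 
  separate(title, sep = ",", into = c("title", "author"), extra = "merge") # extra pieces are kept in author rather than discarded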

Combining the data again

Now that I’ve fixed the rows that didn’t separate properly, I’ll bind everything back together.

bestsellers_full_clean <- bestsellers_no_na |> 
  bind_rows(bestsellers_na_clean) |> 
  glimpse()
Rows: 63,731
Columns: 7
$ year        <dbl> 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931…
$ week        <date> 1931-10-12, 1931-10-12, 1931-10-12, 1931-10-12, 1931-10-1…
$ rank        <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1…
$ title       <chr> "THE TEN COMMANDMENTS", "FINCHE'S FORTUNE", "THE GOOD EART…
$ author      <chr> "Warwick Deeping", "Mazo de la Roche", "Pearl S. Buck", "W…
$ publisher   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ description <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

Exporting the data

I’ll export the data as both an RDS and a CSV file.

bestsellers_full_clean |> write_rds("data-processed/bestsellers-hardcover.rds")
bestsellers_full_clean |> write_csv("data-processed/bestsellers-hardcover.csv")
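
As a final check (again my own addition), reading the RDS back should give the same number of rows as the object I exported.

nrow(read_rds("data-processed/bestsellers-hardcover.rds")) == nrow(bestsellers_full_clean) # should be TRUE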