Combined Print and E-Book Web Scrape

Goals of the notebook

In this notebook, I’m going to

  • web scrape the NYT Combined Print and E-Book Fiction List
  • clean the data
  • save the data

I’ll load any libraries I might need.

library(readr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ purrr     1.0.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(lubridate)
library(rvest)

Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding

Combined Print and E-Book Web Scrape

Now I’ll do a similar web scrape to get the combined print and e-book list. This scrape works through more than a decade of weekly lists, so it takes quite a while to run, and you may have to try it a few times before it completes successfully (I sketch one way to retry failed requests after the chunk). I also recommend commenting the chunk out once you’ve finished running it, which is what I’ve done below.

# start_date <- as.Date("2011-02-13")
# current_date <- Sys.Date()
# day_of_week <- wday(current_date)
# 
# # note: the list for the following week is published on Wednesdays at 7pm EST
# 
# if (day_of_week == 1) {
#   end_date <- current_date + 7
# } else if (day_of_week >= 2 & day_of_week <= 4) {
#   days_until_sunday <- 8 - day_of_week
#   end_date <- current_date + days_until_sunday # end_date is the Sunday after current_date
# } else if (day_of_week >= 5 & day_of_week <= 7) {
#   days_until_sunday <- 8 - day_of_week
#   end_date <- current_date + days_until_sunday + 7 # end_date is two Sundays after current_date
# }
# 
# dates <- seq(start_date, end_date, by = "week")
# 
# # create the output directory if it doesn't exist
# output_dir <- "data-raw/bestsellers-combined-weeks"
# if (!dir.exists(output_dir)) dir.create(output_dir, recursive = TRUE)
# 
# # scrape each week's list and save it as its own file
# map(dates, function(date) {
#   url <- sprintf("https://www.nytimes.com/books/best-sellers/%s/combined-print-and-e-book-fiction/", format(date, "%Y/%m/%d"))
#   page <- read_html(url)
# 
#   titles <- page %>% html_nodes(".css-2jegzb") %>% html_text()
#   authors <- page %>% html_nodes(".css-1aaqvca") %>% html_text()
#   publishers <- page %>% html_nodes(".css-1w6oav3") %>% html_text()
#   descriptions <- page %>% html_nodes(".css-17af87k") %>% html_text()
# 
#   ranks <- seq_along(titles)
# 
#   df <- data.frame(title = titles, author = authors, rank = ranks, date = date, publisher = publishers, description = descriptions)
# 
#   # save each week's data as an RDS file
#   file_path <- file.path(output_dir, paste0("bestsellers-combined-", date, ".rds"))
#   write_rds(df, file_path)
# 
#   # print(date) # uncomment for a progress indicator
# })
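
Since a single failed request stops the whole run, a small retry wrapper around read_html() is one way to make the loop more resilient. The sketch below isn’t part of the scrape above; the helper name and the retry settings (three tries, a five-second pause) are just illustrative.

read_html_retry <- function(url, tries = 3, pause = 5) { # hypothetical helper, not used above
  for (i in seq_len(tries)) {
    page <- tryCatch(read_html(url), error = function(e) NULL) # NULL if the request fails
    if (!is.null(page)) return(page)
    Sys.sleep(pause) # wait a moment before trying again
  }
  stop("Could not fetch ", url, " after ", tries, " tries")
}

Swapping read_html_retry(url) in for read_html(url) inside the map() call would give each week a few chances before the run fails. purrr’s insistently() offers similar retry behavior if you’d rather not write the loop by hand.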

Now I’ll combine all of the weekly files into one file.

file_list <- list.files("data-raw/bestsellers-combined-weeks", full.names = TRUE) # listing every weekly file
dfs <- map_dfr(file_list, read_rds) # reading each file and binding the rows into one data frame
write_rds(dfs, "data-raw/bestsellers-combined-all-weeks.rds") # saving the combined file
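
A quick way to check that every week came back with a full list is to count rows per date. This is just a sketch of that check (it isn’t saved anywhere), assuming each week should have 15 ranked titles, which is what the rank column in the glimpse below suggests.

dfs |>
  count(date, name = "titles_scraped") |> # number of rows scraped for each week
  filter(titles_scraped != 15) # any week listed here came back short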

I’ll save the new file into a new object and glimpse it.

bestsellers_combined <- read_rds("data-raw/bestsellers-combined-all-weeks.rds") # reading the combined file into a new object

bestsellers_combined |> glimpse() # glimpsing the data
Rows: 7,125
Columns: 6
$ title       <chr> "NYPD RED 4", "SPIDER GAME", "THE BANDS OF MOURNING", "THE…
$ author      <chr> "by James Patterson and Marshall Karp", "by Christine Feeh…
$ rank        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1, 2, 3…
$ date        <date> 2016-02-14, 2016-02-14, 2016-02-14, 2016-02-14, 2016-02-1…
$ publisher   <chr> "Little, Brown", "Jove", "Tor/Tom Doherty", "Grand Central…
$ description <chr> "Detective Zach Gordon and his partner, members of an elit…

I’m going to clean the combined list to make it match the hardcover list.

bestsellers_combined_clean <- bestsellers_combined |> # saving this chunk into a new object and starting with the data
  mutate(year = year(date), # making a year column
         week = date, # making a new date column called "week" to match the first dataset
         author = str_remove(author, "^by "), # removing the leading "by " from the author column
         rank = as.numeric(rank)) |> # changing the rank column from int to dbl
  select(year,
         week,
         rank,
         title,
         author,
         publisher,
         description) |> # putting the columns in the same order as the first dataset
  glimpse() # glimpsing the data
Rows: 7,125
Columns: 7
$ year        <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016…
$ week        <date> 2016-02-14, 2016-02-14, 2016-02-14, 2016-02-14, 2016-02-1…
$ rank        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1, 2, 3…
$ title       <chr> "NYPD RED 4", "SPIDER GAME", "THE BANDS OF MOURNING", "THE…
$ author      <chr> "James Patterson and Marshall Karp", "Christine Feehan", "…
$ publisher   <chr> "Little, Brown", "Jove", "Tor/Tom Doherty", "Grand Central…
$ description <chr> "Detective Zach Gordon and his partner, members of an elit…

Exporting the data

Everything parsed correctly with this data, so I can go right to exporting it. I’ll export the data to my computer as an .rds and as a .csv.

bestsellers_combined_clean |> write_rds("data-processed/bestsellers-combined.rds")
bestsellers_combined_clean |> write_csv("data-processed/bestsellers-combined.csv")