In this post I will:

- web scrape the NYT Combined Print and E-Book Fiction List
- clean the data
- save the data
First, I’ll load the libraries I’ll need.
```r
library(readr)
library(tidyverse)
```
```
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ purrr     1.0.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
```
```r
library(janitor)
```

```
Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
```
```r
library(lubridate)
library(rvest)
```

```
Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding
```
## Combined Print and E-book Web Scrape
Now I’ll do a similar web scrape to get the combined print and e-book list. The loop walks through more than a decade of weekly lists, so it takes quite a while to run, and you may have to try it a few times before it completes successfully. I also recommend commenting it out once you’ve finished running it, which is why it appears commented out below. (For a more failure-tolerant variant, see the sketch after the chunk.)
```r
# start_date <- as.Date("2011-02-13")
# current_date = Sys.Date()
# day_of_week = wday(current_date)
# 
# # note: the list for the following week is published on Wednesdays at 7pm EST
# 
# if (day_of_week == 1) {
#   end_date = current_date + 7
# } else if (day_of_week >= 2 & day_of_week <= 4) {
#   days_until_sunday = 8 - day_of_week
#   end_date = current_date + days_until_sunday # end_date is equal to the sunday after the current_date
# } else if (day_of_week >= 5 & day_of_week <= 7) {
#   days_until_sunday = 8 - day_of_week
#   end_date = current_date + days_until_sunday + 7 # end_date is equal to 2 sundays after the current_date
# }
# 
# dates <- seq(start_date, end_date, by = "week")
# 
# # Create output directory if it doesn't exist
# output_dir <- "data-raw/bestsellers-combined-weeks"
# if (!dir.exists(output_dir)) dir.create(output_dir, recursive = TRUE)
# 
# # Scrape and save data for each date
# map(dates, function(date) {
#   url <- sprintf("https://www.nytimes.com/books/best-sellers/%s/combined-print-and-e-book-fiction/", format(date, "%Y/%m/%d"))
#   page <- read_html(url)
# 
#   titles <- page %>% html_nodes(".css-2jegzb") %>% html_text()
#   authors <- page %>% html_nodes(".css-1aaqvca") %>% html_text()
#   publishers <- page %>% html_nodes(".css-1w6oav3") %>% html_text()
#   descriptions <- page %>% html_nodes(".css-17af87k") %>% html_text()
# 
#   ranks <- seq_along(titles)
# 
#   df = data.frame(title = titles, author = authors, rank = ranks, date = date, publisher = publishers, description = descriptions)
# 
#   # Save each week's data as an RDS file
#   file_path <- file.path(output_dir, paste0("bestsellers-combined-", date, ".rds"))
#   write_rds(df, file_path)
# 
#   # print(date) # for status
# })
```
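As an aside: if re-running the whole chunk by hand gets tedious, one failure-tolerant alternative (not part of the original script) is to skip weeks that are already on disk and swallow per-week errors with `purrr::possibly()`, then rerun until nothing is missing. A minimal sketch, assuming the same `dates`, `output_dir`, and CSS selectors as above; `scrape_week` is a hypothetical helper name:

```r
# Hypothetical failure-tolerant wrapper: possibly() returns NULL on error
# instead of stopping the whole loop, and the file.exists() check skips
# weeks already saved, so the chunk can simply be rerun until every week
# is on disk.
scrape_week <- possibly(function(date) {
  file_path <- file.path(output_dir, paste0("bestsellers-combined-", date, ".rds"))
  if (file.exists(file_path)) return(file_path) # already scraped; skip

  url <- sprintf("https://www.nytimes.com/books/best-sellers/%s/combined-print-and-e-book-fiction/",
                 format(date, "%Y/%m/%d"))
  page <- read_html(url)

  titles <- page %>% html_nodes(".css-2jegzb") %>% html_text()
  df <- data.frame(
    title       = titles,
    author      = page %>% html_nodes(".css-1aaqvca") %>% html_text(),
    rank        = seq_along(titles),
    date        = date,
    publisher   = page %>% html_nodes(".css-1w6oav3") %>% html_text(),
    description = page %>% html_nodes(".css-17af87k") %>% html_text()
  )
  write_rds(df, file_path)
  file_path
}, otherwise = NULL)

saved <- map(dates, scrape_week) # rerun until no NULLs remain
```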
Now I’ll combine all of the weekly files into one file.
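That combining chunk isn’t shown here, but a minimal sketch of one way to do it might look like the following. It assumes the per-week files written by the scraping chunk above; `week_files` and `bestsellers_all` are names I’ve made up, and `list_rbind()` is from purrr.

```r
# Read every per-week RDS file and stack the data frames into one,
# then save the combined result. Paths are assumed to match the
# scraping chunk above.
week_files <- list.files("data-raw/bestsellers-combined-weeks",
                         pattern = "\\.rds$", full.names = TRUE)

bestsellers_all <- map(week_files, read_rds) |> list_rbind()

write_rds(bestsellers_all, "data-raw/bestsellers-combined-all-weeks.rds")
```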
I’ll read the new file into a new object and glimpse it.
```r
bestsellers_combined <- read_rds("data-raw/bestsellers-combined-all-weeks.rds") # creating a new object from the file

bestsellers_combined |>
  glimpse() # glimpsing the data
```
```
Rows: 7,125
Columns: 6
$ title       <chr> "NYPD RED 4", "SPIDER GAME", "THE BANDS OF MOURNING", "THE…
$ author      <chr> "by James Patterson and Marshall Karp", "by Christine Feeh…
$ rank        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1, 2, 3…
$ date        <date> 2016-02-14, 2016-02-14, 2016-02-14, 2016-02-14, 2016-02-1…
$ publisher   <chr> "Little, Brown", "Jove", "Tor/Tom Doherty", "Grand Central…
$ description <chr> "Detective Zach Gordon and his partner, members of an elit…
```
I’m going to clean the combined list to make it match the hardcover list.
```r
bestsellers_combined_clean <- bestsellers_combined |> # saving this chunk into a new object and starting with the data
  mutate(year = year(date), # making a year column
         week = date, # making a new date column called "week" to match the first dataset
         author = str_remove(author, "^by "), # removing the leading "by " from the author column (anchored so a "by " inside a name isn't also removed)
         rank = as.numeric(rank)) |> # changing the rank column from int to dbl
  select(year, week, rank, title, author, publisher, description) |> # putting the columns in the same order as the first dataset
  glimpse() # glimpsing the data
```
```
Rows: 7,125
Columns: 7
$ year        <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016…
$ week        <date> 2016-02-14, 2016-02-14, 2016-02-14, 2016-02-14, 2016-02-1…
$ rank        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1, 2, 3…
$ title       <chr> "NYPD RED 4", "SPIDER GAME", "THE BANDS OF MOURNING", "THE…
$ author      <chr> "James Patterson and Marshall Karp", "Christine Feehan", "…
$ publisher   <chr> "Little, Brown", "Jove", "Tor/Tom Doherty", "Grand Central…
$ description <chr> "Detective Zach Gordon and his partner, members of an elit…
```
## Exporting the data
Everything parsed correctly with this data, so I can go right to exporting it. I’ll export the data to my computer as an .rds file and as a .csv file.
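The export chunk itself isn’t shown above; a minimal sketch, assuming the same `data-raw/` folder as earlier (the output file names here are my guesses, not necessarily the originals):

```r
# Save the cleaned data both as RDS (preserves column types) and as CSV
# (portable). File names are assumptions, not the originals.
write_rds(bestsellers_combined_clean, "data-raw/bestsellers-combined-clean.rds")
write_csv(bestsellers_combined_clean, "data-raw/bestsellers-combined-clean.csv")
```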