In this post, I will:

1. Import the Post45 Data Collective's NYT bestseller data set
2. Web scrape the NYT Hardcover Fiction List
3. Clean the data
4. Join the data sets
5. Save the data
I’ll load any libraries I might need.
library(readr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ purrr 1.0.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
library(lubridate)
library(rvest)
Attaching package: 'rvest'
The following object is masked from 'package:readr':
guess_encoding
Main Data Import
I’ll import the years I already have. I downloaded this from the Post45 Data Collective, which is run by Emory University. It covers every week of the New York Times Bestsellers list from its start in 1931 through December 2020.
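The import itself looks something like this (the exact file path is an assumption for illustration; the Post45 set is a tab-separated file, which matches the column specification below):

```r
# import the Post45 TSV (file path is an assumption for illustration)
bestsellers_post_45 <- read_tsv("data-raw/post45-nyt-bestsellers.tsv")

bestsellers_post_45 |>
  glimpse() # glimpsing the data
```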
Rows: 60386 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): title, author
dbl (3): year, rank, title_id
date (1): week
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 60,386
Columns: 6
$ year <dbl> 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1…
$ week <date> 1931-10-12, 1931-10-12, 1931-10-12, 1931-10-12, 1931-10-12, …
$ rank <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2…
$ title_id <dbl> 6477, 1808, 5304, 4038, 3946, 2878, 7031, 413, 859, 4235, 287…
$ title <chr> "THE TEN COMMANDMENTS", "FINCHE'S FORTUNE", "THE GOOD EARTH",…
$ author <chr> "Warwick Deeping", "Mazo de la Roche", "Pearl S. Buck", "Will…
Hardcover Fiction Web Scrape
Now, I’m going to get the missing years of the hardcover fiction data by web scraping directly from the NYT’s website. You might need to run the code multiple times before it successfully completes. I also recommend commenting out the code after it finishes, since it takes a while to run.
# start_date = as.Date("2020-12-13")
# current_date = Sys.Date()
# day_of_week = wday(current_date)
#
# # note: the list for the following week is published on Wednesdays at 7pm EST
#
# if (day_of_week == 1) {
#   end_date = current_date + 7
# } else if (day_of_week >= 2 & day_of_week <= 4) {
#   days_until_sunday = 8 - day_of_week
#   end_date = current_date + days_until_sunday # end_date is equal to the sunday after the current_date
# } else if (day_of_week >= 5 & day_of_week <= 7) {
#   days_until_sunday = 8 - day_of_week
#   end_date = current_date + days_until_sunday + 7 # end_date is equal to 2 sundays after the current_date
# }
#
# dates <- seq(start_date, end_date, by = "week")
#
# # Create output directory if it doesn't exist
# output_dir <- "data-raw/bestsellers-hardcover-weeks"
# if (!dir.exists(output_dir)) dir.create(output_dir, recursive = TRUE)
#
# # Function to scrape and save data for each date
# map(dates, function(date) {
#   url <- sprintf("https://www.nytimes.com/books/best-sellers/%s/hardcover-fiction/", format(date, "%Y/%m/%d"))
#   page <- read_html(url)
#
#   titles <- page %>% html_nodes(".css-2jegzb") %>% html_text()
#   authors <- page %>% html_nodes(".css-1aaqvca") %>% html_text()
#   publishers <- page %>% html_nodes(".css-1w6oav3") %>% html_text()
#   descriptions <- page %>% html_nodes(".css-17af87k") %>% html_text()
#
#   ranks <- seq_along(titles)
#
#   df = data.frame(title = titles, author = authors, rank = ranks, date = date, publisher = publishers, description = descriptions)
#
#   # Save each week's data as an RDS file
#   file_path <- file.path(output_dir, paste0("bestsellers-hardcover-", date, ".rds"))
#   write_rds(df, file_path)
#
#   # print(date) # for status
# })
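The scrape writes one RDS file per week, so those weekly files need to be stacked into a single all-weeks file before moving on. A sketch of that step, assuming the weekly files above were all written successfully:

```r
# stack the weekly RDS files into one data frame
weekly_files <- list.files("data-raw/bestsellers-hardcover-weeks", full.names = TRUE)

weekly_files |>
  map(read_rds) |> # read each week's data frame
  list_rbind() |>  # stack them row-wise
  write_rds("data-raw/bestsellers-hardcover-all-weeks.rds")
```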
I’ll save the new file into a new object and glimpse it.
bestsellers_web_scrape <- read_rds("data-raw/bestsellers-hardcover-all-weeks.rds") # creating a new object with a file

bestsellers_web_scrape |>
  glimpse() # glimpsing the data
Rows: 3,345
Columns: 6
$ title <chr> "READY PLAYER TWO", "DEADLY CROSS", "THE RETURN", "A TIME …
$ author <chr> "by Ernest Cline", "by James Patterson", "by Nicholas Spar…
$ rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1, 2, 3…
$ date <date> 2020-12-13, 2020-12-13, 2020-12-13, 2020-12-13, 2020-12-1…
$ publisher <chr> "Ballantine", "Little, Brown", "Grand Central", "Doubleday…
$ description <chr> "In a sequel to “Ready Player One,” Wade Watts discovers a…
Cleaning
First, I’ll clean the main data file from the Post45 Data Collective. The only thing I need to do is remove the title_id column, since I won’t be using it.
bestsellers_post_45_clean <- bestsellers_post_45 |> # saving this chunk into a new object and starting with the data
  select(-title_id) |> # removing the title_id column
  glimpse() # glimpsing the data
Rows: 60,386
Columns: 5
$ year <dbl> 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 193…
$ week <date> 1931-10-12, 1931-10-12, 1931-10-12, 1931-10-12, 1931-10-12, 19…
$ rank <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, …
$ title <chr> "THE TEN COMMANDMENTS", "FINCHE'S FORTUNE", "THE GOOD EARTH", "…
$ author <chr> "Warwick Deeping", "Mazo de la Roche", "Pearl S. Buck", "Willa …
Now, I’ll clean the web scrape to make it match the first dataset.
bestsellers_web_scrape_clean <- bestsellers_web_scrape |> # saving this chunk into a new object and starting with the data
  mutate(
    year = year(date), # making a year column
    week = date, # making a new date column called "week" to match the first dataset
    author = str_remove_all(author, "by "), # removing the "by " from the author column
    rank = as.numeric(rank) # changing the rank column from int to dbl
  ) |>
  select(year, week, rank, title, author, publisher, description) |> # putting the columns in the same order as the first dataset
  glimpse() # glimpsing the data
Rows: 3,345
Columns: 7
$ year <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020…
$ week <date> 2020-12-13, 2020-12-13, 2020-12-13, 2020-12-13, 2020-12-1…
$ rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1, 2, 3…
$ title <chr> "READY PLAYER TWO", "DEADLY CROSS", "THE RETURN", "A TIME …
$ author <chr> "Ernest Cline", "James Patterson", "Nicholas Sparks", "Joh…
$ publisher <chr> "Ballantine", "Little, Brown", "Grand Central", "Doubleday…
$ description <chr> "In a sequel to “Ready Player One,” Wade Watts discovers a…
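One small caveat on the author cleanup above: str_remove_all() deletes every occurrence of the pattern, not just the leading "by ", so a name that happens to contain "by " would get clipped. Anchoring the pattern is the safer version (a minor point — the glimpse above suggests the current data is unaffected):

```r
# str_remove_all() removes every match, which can clip unlucky names
str_remove_all("by Bobby Smith", "by ") # "BobSmith"

# anchoring to the start of the string only removes the leading "by "
str_remove("by Bobby Smith", "^by ") # "Bobby Smith"
```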
Combining the data
Now that both of the hardcover fiction datasets have the same columns, I’ll combine them. All of the rows from the Post45 data set will have NA for the publisher and description, but I’m not worried about it.
bestsellers_full <- bestsellers_post_45_clean |> # saving this chunk into a new object and starting with the data
  bind_rows(bestsellers_web_scrape_clean) |> # binding the two datasets together
  glimpse()
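bind_rows() matches columns by name and fills in NA for any column one data frame is missing, which is exactly why the Post45 rows end up with NA for publisher and description. A toy illustration:

```r
# two data frames with partially overlapping columns
post45  <- tibble(title = "THE GOOD EARTH",   rank = 1)
scraped <- tibble(title = "READY PLAYER TWO", rank = 1, publisher = "Ballantine")

bind_rows(post45, scraped) # the Post45 row gets NA for publisher
```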
There are actually some rows from the Post45 data set where the title and author aren’t split properly.
I’ll start by removing those rows from the main data set, then creating a new object with just those rows so I can fix them.
Then, I’ll separate the title and author into new columns using the comma between them. I’ll also remove the “by” and the publisher information.
bestsellers_no_na <- bestsellers_full |> # saving the data into a new object
  filter(!is.na(author)) # removing rows where author is NA

bestsellers_na <- bestsellers_full |> # saving the data into a new object
  filter(is.na(author)) # only including rows where author is NA

bestsellers_na_clean <- bestsellers_na |> # saving the data into a new object
  separate(title, sep = ",", into = c("title", "author")) |> # separating the title column into two columns named title and author based on the comma
  mutate(author = str_remove_all(author, "by ")) |> # removing the "by "
  mutate(author = str_remove_all(author, "\\(.*")) |> # removing the publisher information using the parentheses
  mutate(author = str_sub(author, end = -3)) # removing the period and spaces by removing the last 2 characters of the column
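To see what that pipeline does, here it is run on a single made-up row in the combined format the code implies — title, comma, “by Author. (Publisher.)” — where the example string itself is hypothetical:

```r
demo <- tibble(title = "THE GOOD EARTH, by Pearl S. Buck. (John Day.)") # hypothetical example row

demo |>
  separate(title, sep = ",", into = c("title", "author")) |> # split on the comma
  mutate(author = str_remove_all(author, "by ")) |>   # drop the "by "
  mutate(author = str_remove_all(author, "\\(.*")) |> # drop "(John Day.)"
  mutate(author = str_sub(author, end = -3))          # drop the trailing ". "
# title: "THE GOOD EARTH", author: " Pearl S. Buck" (str_trim() would drop the leading space)
```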