TDM 20200: Project 5 — 2024
Motivation: We practice the skill of scraping information about books from a real publisher’s website. (We only do this for academic purposes and not for commercial purposes.)
Context: We use the Selenium skills that we learned in Project 3 to extract information about books from the publisher O’Reilly’s website.
Scope: Python, XML, Selenium
Dataset(s)
The following questions examine the No Starch Press books that are available from the O’Reilly website:
No Starch Press is one of the Publishers available from the dropdown menu on the left-hand side of the website. At the moment, there are 350 books available from No Starch Press. (This might change during the project, if more books are published in the next week or two.)
The order of the books is dynamic. For this reason, if you look at the books in a browser on your computer at the same time that you are scraping books from the O’Reilly website, the order of the books is likely to change.
Since there are 350 No Starch Press books available at present, if we load 100 books per webpage, we can see all 350 books by scraping these 4 webpages:
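As a sketch, the four page URLs can be generated from the search URL used in this project (the query string shown below matches the `driver.get` example later on):

```python
# Search URL for No Starch Press books, 100 results per page.
BASE = ("https://www.oreilly.com/search/"
        "?publishers=No%20Starch%20Press&type=book&rows=100&page={page}")

# Pages 1 through 4 cover all 350 books (100 + 100 + 100 + 50).
page_urls = [BASE.format(page=p) for p in range(1, 5)]
```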
We prepare to work with Selenium in Python, using the same setup as in Project 3. Then we can load the pages; for instance, here is the second page:
driver.get("https://www.oreilly.com/search/?publishers=No%20Starch%20Press&type=book&rows=100&page=2")
Each book entry is wrapped in the following (example) XML code. Some XML entries have been removed, so that there might be other siblings and/or children that are not shown here. This example is from the XML for the book
It is necessary to give each page a few seconds to load; otherwise, the query for a page might come back blank. Therefore, it is advisable to SLOWLY run one cell at a time when you are checking your work, waiting a few seconds between cells. This allows the O’Reilly pages to load in the browser.
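One lightweight way to build in that pause is to wrap the page load in a small helper (a sketch; `load_page` and `wait_seconds` are names invented here, and `driver` is assumed to be the Selenium driver created earlier):

```python
import time

def load_page(driver, url, wait_seconds=5):
    """Navigate to url, then pause so the page has time to render."""
    driver.get(url)
    time.sleep(wait_seconds)
```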
Dr. Ward created 9 videos to help with this project.
Questions
Question 1 (2 points)
- Load the formats for all 350 entries into a list of length 350, and make sure that each entry says "Format: Book".
- Load the publishers for all 350 entries into a list of length 350, and make sure that each entry says "Publisher: No Starch Press".
For Question 1a, you can use the XPath
For Question 1b, you can use the XPath
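Once the two lists are collected, the sanity checks can be one-liners. In this sketch, `formats` and `publishers` are placeholder lists standing in for the scraped text values:

```python
# Placeholder lists standing in for the text scraped from the 350 entries.
formats = ["Format: Book"] * 350
publishers = ["Publisher: No Starch Press"] * 350

# Each list should have length 350, with every entry identical.
assert len(formats) == 350 and all(f == "Format: Book" for f in formats)
assert len(publishers) == 350 and all(p == "Publisher: No Starch Press" for p in publishers)
```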
Question 2 (2 points)
From the first page of 100 entries:
- Extract a list of the 100 URLs
- Extract a list of the 100 titles
- Extract a list of the 100 authors (it is OK if the word "By" is included in each author result)
- Extract a list of the 100 dates
- Extract a list of the 100 pages
For the URLs, use XPath
For the titles, use XPath
For the authors, use XPath
For the dates, use XPath
For the pages, use XPath
Question 3 (2 points)
Extract the content from pages 2, 3, and 4 (i.e., from the next 250 entries), and add this content to the lists from question 2, so that you have altogether:
- A list of the 350 URLs
- A list of the 350 titles
- A list of the 350 authors (it is OK if the word "By" is included in each author result)
- A list of the 350 dates
- A list of the 350 pages
You might want to use a for loop; if you do, it is worthwhile to pause for a few seconds between page loads, so that each page has time to load (as noted above).
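The accumulation step might look like the sketch below. Here `collect_all` and `scrape_page` are hypothetical names: `scrape_page(driver)` is assumed to return the five lists for whichever page is currently loaded, and `driver` is the Selenium driver from earlier.

```python
import time

def collect_all(driver, scrape_page, wait_seconds=5):
    """Scrape pages 1-4, pausing between loads, and pool the results.

    scrape_page(driver) is assumed to return a dict of five lists
    (urls, titles, authors, dates, pages) for the currently loaded page.
    """
    base = ("https://www.oreilly.com/search/"
            "?publishers=No%20Starch%20Press&type=book&rows=100&page={}")
    pooled = {k: [] for k in ("urls", "titles", "authors", "dates", "pages")}
    for page in range(1, 5):
        driver.get(base.format(page))
        time.sleep(wait_seconds)   # give the page time to render before querying
        for key, values in scrape_page(driver).items():
            pooled[key].extend(values)
    return pooled
```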
Question 4 (2 points)
- For the list of pages, remove the phrase " pages" (including the space), remove the commas, and then convert the strings to integers.
- Now make a data frame of the URLs, titles, authors, dates, and (the new numeric) pages.
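The two steps above can be sketched as follows; the `raw_pages` strings and the other column values below are made-up placeholders in the format scraped from the site:

```python
import pandas as pd

# Example page-count strings in the scraped format (values are made up).
raw_pages = ["1,056 pages", "248 pages"]

# Strip " pages" and the commas, then convert to integers.
pages = [int(p.replace(" pages", "").replace(",", "")) for p in raw_pages]

# Build the data frame from the five lists (placeholder values shown).
df = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/b"],
    "title": ["Book A", "Book B"],
    "author": ["By Someone", "By Someone Else"],
    "date": ["June 2021", "July 2021"],
    "pages": pages,
})
```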
Question 5 (2 points)
- If you drop the duplicates from your data frame in Question 4b, you will likely not (yet) have 350 distinct No Starch Press books. Repeat the steps above, building (say) one or two more data frames, until you have all 350 distinct titles.
- Once you have all 350 distinct titles in a data frame, sort the results by the date column, and find which month-and-year pair had the largest number of pages written.
You should find that, in June 2021, there were a total of 3096 pages written, in these 7 books:
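The dedup-and-aggregate step can be sketched on toy data like this (the titles, dates, and page counts below are made up, standing in for the pooled scrape results):

```python
import pandas as pd

# Toy frame standing in for the pooled scrape results (values are made up).
df = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "date":  ["June 2021", "June 2021", "June 2021", "May 2020"],
    "pages": [100, 100, 250, 300],
}).drop_duplicates()

# Total pages per month-and-year pair, largest first.
totals = df.groupby("date")["pages"].sum().sort_values(ascending=False)
```

On the real data, the first entry of `totals` should then be the June 2021 total reported above.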
Project 05 Assignment Checklist
- Jupyter Lab notebook with your code, comments, and output for the assignment: firstname-lastname-project05.ipynb
- Python file with code and comments for the assignment: firstname-lastname-project05.py
- Submit files through Gradescope
Please make sure to double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.