Snoek Quotes Scraping
Scraping Your Favorite Quotes from BrainyQuote using Python
Web scraping is the extraction of information from web pages, typically in an automated fashion. There are several approaches to accomplish this. In this project, I will demonstrate the use of Python to scrape information from a website that hosts thousands of quotes. The method I outline relies primarily on the Python libraries Requests and Beautiful Soup. The output of the project is one master function (with several underlying helper functions) that takes the topic of interest as its argument and produces a CSV file containing all the quotes belonging to that topic, together with their authors and links that lead directly to each quote.
The website BrainyQuote claims to be the world's largest quotation site, and indeed forms an extensive reservoir of quotes. As put on its website:
Originally published in 2001, BrainyQuote is one of the oldest and most established quotation sites on the web. Our site was built from scratch into the behemoth it is today. In the beginning, we used library books to enter famous quotations by hand. Armed with eyedrops and comfy wrist-rests at our computers, we typed, and typed, and typed! Today, you can enjoy the fruits of our labors; we are a shining example of the little engine that could.
Despite the large amounts of data that can be harnessed to provide novel insights, quotes remain a powerful way of capturing the essence of a phenomenon in a concise and appealing way. For that reason, book authors often use one or several quotes to start a chapter. A quote, in essence, consists of two parts: the exact quote, and the author of that quote. Although in theory a good quote stands on its own, in practice it is the combination of what is said and who said it that makes a quote powerful. Therefore, in this project we will extract both the exact quote and the person to whom it can be attributed (the author).
On BrainyQuote.com, quotes are categorized by author and by topic. Visitors can also view the quote of the day or use the search bar.
BrainyQuote is a great resource for browsing through quotes. However, it can also be valuable to collect quotes for documentation, inspiration, or further analysis. An overview of all the topics available on the site can be found at https://www.brainyquote.com/topics. In this project we will use web scraping to extract a subset of quotes of interest from this site using the Python libraries Requests and Beautiful Soup.
The goal of this project is to use web scraping to download all the quotes that belong to a certain topic. As an example, we will focus on the topic 'motivational' to work out the steps to be taken. Afterwards, we will derive a set of functions that can be used to scrape any topic of interest.
The outline of the steps is given below:
- Identify the webpages
- Download a webpage using Requests
- Use Beautiful Soup to parse the HTML source code
- Extract author, quote text and URL for each quote on the page
- Collect the downloaded data into Python lists
- Extract and combine data from multiple pages
- Create a CSV file with the extracted information
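The first steps of the outline above can be sketched as follows. This is a minimal illustration, not the final code of this project: the helper names `download_page` and `parse_quotes` are hypothetical, and `SAMPLE_HTML` is an invented stand-in for a topic page. The class names `quote-card`, `quote-text` and `quote-author` are assumptions made for the example, so the real BrainyQuote markup has to be inspected in the browser before adapting the selectors.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = "https://www.brainyquote.com"

# Hypothetical stand-in for a topic page; the real markup may differ.
SAMPLE_HTML = """
<div class="quote-card">
  <a class="quote-text" href="/quotes/st_jerome_389605?src=t_motivational">Good, better, best. Never let it rest. 'Til your good is better and your better is best.</a>
  <a class="quote-author" href="/authors/st-jerome-quotes">St. Jerome</a>
</div>
<div class="quote-card">
  <a class="quote-text" href="/quotes/charles_r_swindoll_388332?src=t_motivational">Life is 10% what happens to you and 90% how you react to it.</a>
  <a class="quote-author" href="/authors/charles-r-swindoll-quotes">Charles R. Swindoll</a>
</div>
"""


def download_page(url):
    """Download a web page with Requests and return its HTML source."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def parse_quotes(html):
    """Parse HTML with Beautiful Soup and collect the data into lists."""
    soup = BeautifulSoup(html, "html.parser")
    authors, quotes, urls = [], [], []
    # Assumed structure: one <div class="quote-card"> per quote.
    for card in soup.find_all("div", class_="quote-card"):
        text_tag = card.find("a", class_="quote-text")
        author_tag = card.find("a", class_="quote-author")
        if text_tag is None or author_tag is None:
            continue  # skip cards that do not match the assumed layout
        authors.append(author_tag.get_text(strip=True))
        quotes.append(text_tag.get_text(strip=True))
        # Turn the relative link into an absolute URL.
        urls.append(urljoin(BASE_URL, text_tag["href"]))
    return {"author": authors, "quote": quotes, "url": urls}


# Parse the offline sample; for a live page, pass download_page(url) instead.
data = parse_quotes(SAMPLE_HTML)
print(data["author"])
```

Keeping the download and the parsing in separate functions makes it possible to test the parsing logic on a saved or hand-written HTML snippet without sending requests to the site.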
Ultimately, the results will be exported to a CSV file in the following format:
author, quote, url
St. Jerome, Good, better, best. Never let it rest. 'Til your good is better and your better is best., https://www.brainyquote.com/quotes/st_jerome_389605?src=t_motivational
Charles R. Swindoll, Life is 10% what happens to you and 90% how you react to it., https://www.brainyquote.com/quotes/charles_r_swindoll_388332?src=t_motivational
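Because quote text often contains commas itself, one way to produce such a file is Python's built-in `csv` module, which puts quotation marks around fields where needed so that each row still splits into exactly three columns. The sketch below is an assumption about how the export could be done, not necessarily the approach used later in this project. The helper name `write_quotes_csv` is hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Two example rows in the author / quote / url layout described above.
rows = [
    {
        "author": "St. Jerome",
        "quote": "Good, better, best. Never let it rest. "
                 "'Til your good is better and your better is best.",
        "url": "https://www.brainyquote.com/quotes/st_jerome_389605?src=t_motivational",
    },
    {
        "author": "Charles R. Swindoll",
        "quote": "Life is 10% what happens to you and 90% how you react to it.",
        "url": "https://www.brainyquote.com/quotes/charles_r_swindoll_388332?src=t_motivational",
    },
]


def write_quotes_csv(rows, file_obj):
    """Write the rows to an open text file object, header first."""
    writer = csv.DictWriter(file_obj, fieldnames=["author", "quote", "url"])
    writer.writeheader()
    writer.writerows(rows)  # fields containing commas are quoted automatically


# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_quotes_csv(rows, buffer)
csv_text = buffer.getvalue()

# Read the text back to confirm that each row has three intact fields.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
print(parsed[0]["author"])
```

To write to disk instead, the same function can be called with a file opened as `open("motivational.csv", "w", newline="", encoding="utf-8")`, where the file name is only an example.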
How to Run the Code
In order to execute the code, please use the "Run" button at the top of this page and select "Run on Binder". You can edit the notebook and save a personal version to Jovian by executing the cells below:
!pip install jovian --upgrade --quiet
import jovian
# Execute this to save new versions of the notebook
jovian.commit(project="snoek-quotes-scraping")