Scraping Books Website
Scraping the 'Books to Scrape' Website Using Python
What we do:
- According to Wikipedia, web scraping (also called web harvesting or web data extraction) is data scraping used for extracting data from websites. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
- I am going to use the http://books.toscrape.com/ website.
- I will be using Python, requests, BeautifulSoup, and Pandas.
Here are the steps to follow:
- We are going to scrape http://books.toscrape.com/.
- We'll get a list of books. For each book, we'll get the book title, book page URL, and book description.
- For each book, we'll get the top 25 pages in the book from the book page.
- For each page, we'll grab the UPC, product type, price, tax, availability, and number of reviews.
- For each book we'll create a CSV file.
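The per-book step above can be sketched as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the HTML snippet is a hand-written stand-in with made-up values, it assumes the book page lists its details as a table of `<th>`/`<td>` rows, and `parse_product_info` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a book page (values are made up).
# Assumption: the real page lists product details as <th>/<td> table rows.
SAMPLE_BOOK_HTML = """
<table class="table table-striped">
  <tr><th>UPC</th><td>example-upc-001</td></tr>
  <tr><th>Product Type</th><td>Books</td></tr>
  <tr><th>Price (excl. tax)</th><td>£10.00</td></tr>
  <tr><th>Tax</th><td>£0.00</td></tr>
  <tr><th>Availability</th><td>In stock</td></tr>
  <tr><th>Number of reviews</th><td>0</td></tr>
</table>
"""

def parse_product_info(html):
    """Return a dict mapping each table header to the text of its cell."""
    doc = BeautifulSoup(html, 'html.parser')
    info = {}
    for row in doc.find_all('tr'):
        header = row.find('th')
        value = row.find('td')
        # Skip rows that do not have both a header and a value cell.
        if header is not None and value is not None:
            info[header.get_text(strip=True)] = value.get_text(strip=True)
    return info

product_info = parse_product_info(SAMPLE_BOOK_HTML)
```

A one-row DataFrame built from this dict, for example `pd.DataFrame([product_info]).to_csv(path, index=False)`, would then cover the CSV step.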
Scraping Home Page of the website
How to do it:
- use requests to download the page
- use BeautifulSoup (bs4) to parse and extract information
- convert to a Pandas DataFrame
Let's write a function to download the home page.
    import requests
    from bs4 import BeautifulSoup

    def get_topics_page():
        # URL of the home page to download
        topics_url = 'http://books.toscrape.com/'
        # Download the page
        response = requests.get(topics_url)
        page_contents = response.text
        # Parse the HTML so it can be searched with BeautifulSoup
        doc = BeautifulSoup(page_contents, 'html.parser')
        return doc
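The remaining two steps, parsing and converting to a DataFrame, might look like the sketch below. It works on a small inline HTML sample so it runs without network access. The `article.product_pod` selector and the `title` attribute on the link are assumptions about the home page markup, the sample entries are made up, and `parse_books` is a hypothetical helper. In practice, the `doc` returned by `get_topics_page()` would take the place of `sample_doc`.

```python
from urllib.parse import urljoin

import pandas as pd
from bs4 import BeautifulSoup

BASE_URL = 'http://books.toscrape.com/'

# Hand-written sample (made-up books) assumed to resemble the home page listings.
SAMPLE_HOME_HTML = """
<article class="product_pod">
  <h3><a href="catalogue/example-book_1/index.html" title="Example Book">Example ...</a></h3>
</article>
<article class="product_pod">
  <h3><a href="catalogue/another-book_2/index.html" title="Another Book">Another ...</a></h3>
</article>
"""

def parse_books(doc):
    """Collect the title and absolute page URL of each book in a parsed page."""
    books = []
    for article in doc.select('article.product_pod'):
        link = article.find('h3').find('a')
        books.append({
            # The link text may be truncated, so read the full title attribute.
            'title': link['title'],
            # Links on the page are relative; resolve them against the site root.
            'url': urljoin(BASE_URL, link['href']),
        })
    return books

sample_doc = BeautifulSoup(SAMPLE_HOME_HTML, 'html.parser')
books_df = pd.DataFrame(parse_books(sample_doc))
```

Building the DataFrame from a list of dicts keeps one row per book and lets the dict keys become the column names.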