Final Web Scraping Project
Q1. What is Web Scraping?
In the simplest terms, web scraping is the process of extracting data from a website and saving it in a form that is easy to read, understand, and work with.
By 'easy to work with', we mean that the extracted data can be used to draw useful insights and answer questions that would be hard to answer if the data were not stored in a simple, organized form, typically a CSV file, an Excel file, or a database.
Q2. How does web scraping work?
To understand web scraping, it's important to first understand that web pages are built with text-based markup languages, the most common being HTML.
A markup language defines the structure of a website's content. Because markup languages share universal components and tags, it is much easier for web scrapers to locate and pull the information they need.
A scraper first downloads the page's HTML and parses it. Once the HTML is parsed, the scraper extracts the necessary data and stores it.
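The parse-then-extract step described above can be sketched with Python's standard-library HTML parser. This is a minimal illustration, not the project's actual scraper: the sample markup, the class names (`movie`, `title`, `year`), and the `MovieParser` and `extract_movies` names are all assumptions made for the example, and IMDb's real pages are structured differently. A real project would more likely download the page over HTTP and use a dedicated parsing library.

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration only; it does not mirror IMDb's real HTML.
SAMPLE_HTML = """
<ul>
  <li class="movie"><span class="title">The Shawshank Redemption</span><span class="year">1994</span></li>
  <li class="movie"><span class="title">Placeholder Film</span><span class="year">2000</span></li>
</ul>
"""


class MovieParser(HTMLParser):
    """Collects the text inside <span class="title"> and <span class="year"> tags."""

    def __init__(self):
        super().__init__()
        self.movies = []     # one dict per <li class="movie">
        self._field = None   # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "movie":
            self.movies.append({})           # start a new record
        elif tag == "span" and css_class in ("title", "year"):
            self._field = css_class          # the next text node belongs to this field

    def handle_data(self, data):
        if self._field and self.movies:
            self.movies[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None               # stop capturing text


def extract_movies(html):
    """Parse an HTML string and return the extracted movie records."""
    parser = MovieParser()
    parser.feed(html)
    return parser.movies


movies = extract_movies(SAMPLE_HTML)
print(movies)
```

The parser relies only on tag names and class attributes, which is why consistent markup makes extraction straightforward: the same rule applies to every entry on the page.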
Note: Not all websites allow web scraping, especially when users' personal information is involved, so we should always make sure that we do not explore more than we need and do not collect information that belongs to someone else.
Websites generally have protections in place, and they may block our access if they detect us scraping a large amount of data.
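One common way to respect a site's wishes is to consult its robots.txt rules and pause between requests. The sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the `example.com` URLs, the `example-scraper` user agent, and the helper names are hypothetical and are not taken from IMDb or any other real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, written for this example only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-scraper"  # assumed name for our scraper


def build_robot_rules(robots_txt):
    """Parse robots.txt text into a rules object we can query."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed_urls(rules, urls, user_agent=USER_AGENT):
    """Keep only the URLs that the rules permit this user agent to fetch."""
    return [url for url in urls if rules.can_fetch(user_agent, url)]


rules = build_robot_rules(SAMPLE_ROBOTS_TXT)
candidates = [
    "https://example.com/chart/top/",
    "https://example.com/private/users",
]
allowed = allowed_urls(rules, candidates)

# Seconds to wait between requests; fall back to 1 if the site gives no hint.
delay = rules.crawl_delay(USER_AGENT) or 1

print(allowed, delay)
```

In a real scraper, one would call `time.sleep(delay)` between requests to the allowed URLs. Checking robots.txt does not replace reading a site's terms of use, but it is a reasonable first safeguard.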
IMDb is an online database of information related to films, television programs, home videos, video games, and streaming content online – including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews.
Almost all of us have, at some point, looked up a movie's or show's reviews and ratings on IMDb to decide whether we want to watch it.
As of December 2020, IMDb has approximately 7.5 million titles (including episodes) and 10.4 million personalities in its database, as well as 83 million registered users.
In this project, we will parse IMDb's Top Rated Movies page to get details about the top-rated movies from around the world.
We will retrieve information from the ‘Top Rated Movies’ page using web scraping: a process of extracting information from a website programmatically. Web scraping isn’t magic; many readers already gather information from websites by hand on a daily basis. For example, a recent graduate may copy and paste information about the companies they applied to into a spreadsheet to manage their job applications.
The goal of the project is to build a web scraper that extracts all the desired information and assembles it into a single CSV file. The format of the output CSV file is shown below:
| # | Movie Name | Summary | Year of Release | Genre | Rating | Number of Reviews | Director | Lead Actors | Poster | Movie Page URL |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Shawshank R | Two imprisoned.. | 1994 | Drama | 9.4 | 2411874 | Frank Darabont | Morgan Freeman, Tim Robbins | https://w.. | https://w.. |
| 2 | The Good Doctor | The doctor... | 86% | ... | .... |
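Assembling the scraped records into a CSV with the columns listed above might look like the following sketch, which uses Python's standard `csv` module. The `write_movies_csv` helper and the sample record are assumptions made for illustration; the record contains placeholder values, not scraped data. The example writes to an in-memory buffer so that it runs without touching the disk.

```python
import csv
import io

# Column names taken from the output format described above.
COLUMNS = [
    "#", "Movie Name", "Summary", "Year of Release", "Genre", "Rating",
    "Number of Reviews", "Director", "Lead Actors", "Poster", "Movie Page URL",
]


def write_movies_csv(movies, fileobj):
    """Write movie dicts to fileobj as CSV, numbering the rows from 1."""
    # restval="" leaves a cell empty when a record lacks that field.
    writer = csv.DictWriter(fileobj, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    for position, movie in enumerate(movies, start=1):
        writer.writerow({"#": position, **movie})


# Placeholder record for illustration only.
sample_movies = [
    {
        "Movie Name": "Example Movie",
        "Year of Release": 1994,
        "Genre": "Drama",
        "Lead Actors": "Actor One, Actor Two",
    }
]

buffer = io.StringIO()
write_movies_csv(sample_movies, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open("movies.csv", "w", newline="", encoding="utf-8")`; the `newline=""` argument is what the `csv` module documentation recommends to avoid blank lines on some platforms. Note that the writer quotes the "Lead Actors" value automatically because it contains a comma.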