Web Scraping Imdb Tv Shows
Scraping IMDb for Top 250 TV Shows using Python
Web Scraping is process of collecting information from a website in an automated manner using code. This information is collected and then exported into a format that is more useful for the user. I will be using the libraries Requests and BeautifulSoup in Python to scrape data from the website.
IMDb is the world's most popular website that contains information about TV Shows, movies, video games, and streaming content online – including cast, production crew, plot summaries, trivia, ratings, fan and critical reviews and a lot more. It is designed to help fans explore the world of movies and shows and decide what to watch.
The page https://www.imdb.com/chart/toptv/ provides a list of the top 250 TV Shows as rated by the IMDb users. In this project I will retrive information from this page using web scraping. This is how the IMDb site looks:
Here is the outline of the steps that I am planning to follow:
- Import the required libraries and download the website using
requests
- Parse the HTML source code using
BeautifulSoup
- Extract the top 250 TV Shows rank, title, release year, IMDb ratings, genre, runtime, and plot summary from the website
- Complie extracted information into Python dictionaries and
Pandas
DataFrame - Save the extracted information to a CSV file.
By the end of the project I will create a CSV file in the following format:
Rank, Title, Release Year, IMDb Ratings, Genre, Runtime, Plot Summary
1, Planet Earth II, 2016, 9.5, Documentary, 4h 58min, Wildlife documentary series with David Attenborough, beginning with a look at the remote islands which offer sanctuary to some of the planet's rarest creatures, to the beauty of cities, which are home to humans, and animals..
2, Planet Earth, 2006, 9.4, Documentary, 8h 58min, Emmy Award-winning, 11 episodes, five years in the making, the most expensive nature documentary series ever commissioned by the BBC, and the first to be filmed in high definition.
....
Download the webpage using requests
These are the libraries that I will be using:
- Requests Library to use built-in methods for making HTTP requests
- BeautifulSoup Library to pull data out of HTML File
- Pandas Library for Data Manipulation
The libraries can be installed using pip
.
# Installing the libraries
!pip install jovian requests beautifulsoup4 pandas --upgrade --quiet
# Importing the libraries
import jovian
import requests
from bs4 import BeautifulSoup
import pandas as pd