Web Scraping Oscar-Winning Movies from IMDB (www.imdb.com)

This project is part of the Zero to Data Analyst course from Jovian.
IMDB (the Internet Movie Database) describes itself as the world's most popular and authoritative source for movie, TV, and celebrity content, with ratings and reviews for the newest movies and TV shows.
What is web scraping?
Web scraping is the process of extracting information from a website programmatically. In this project we will extract information using Python and two libraries commonly used for web scraping: requests and Beautiful Soup.
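
As a minimal illustration of the idea, the sketch below parses a small hand-written HTML snippet with Beautiful Soup instead of a live page. The markup, the class names, and the movie title are made up for this example and are not IMDB's actual page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML written by hand for this example (not real IMDB markup).
html = """
<div class="movie">
  <h3 class="title">Example Movie</h3>
  <span class="year">(2017)</span>
</div>
"""

# Parse the HTML with Python's built-in parser.
soup = BeautifulSoup(html, "html.parser")

# Locate elements by tag name and CSS class, then read their text.
title = soup.find("h3", class_="title").get_text(strip=True)
year = soup.find("span", class_="year").get_text(strip=True).strip("()")

print(title, year)
```

On a real page the same two steps apply: find the element that holds the value, then pull out its text.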

The steps to extract the information are:

  1. Load the libraries needed for the project
  2. Download the page using requests
  3. Parse the HTML source code using BeautifulSoup4
  4. Extract the data we want to collect about the movies
  5. Extract and combine data from multiple pages of the portal
  6. Compile the extracted information into Python data objects
  7. Finally, write and save the information to CSV files

By the end of this project we will have saved information about Oscar-winning movies from a given period to files. For each movie we will gather: movie title, year, IMDB rating, movie rating, movie duration, genre, Metascore, votes, and gross (USD).
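
To make the target output concrete, here is a rough sketch of how one movie record could be held as a Python dictionary and written out as CSV. The column names are assumptions derived from the field list above, the record is a placeholder rather than real data, and the sketch uses the standard-library csv module (pandas, which this notebook imports, is another option).

```python
import csv
import io

# Assumed column names, one per field we plan to collect.
FIELDS = ["title", "year", "imdb_rating", "movie_rating", "duration",
          "genre", "metascore", "votes", "gross_usd"]


def movies_to_csv(movies, fileobj):
    """Write a list of movie dictionaries to an open text file as CSV."""
    # Keys missing from a record are written as empty cells (restval default).
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(movies)


# Placeholder record, only to show the shape of one row (not real data).
example_movies = [{"title": "Example Movie", "year": 2017}]

# Write to an in-memory buffer here; a real run would use open("file.csv", "w", newline="").
buffer = io.StringIO()
movies_to_csv(example_movies, buffer)
print(buffer.getvalue())
```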

So, let's get to it!

How to run the code

This Jupyter notebook is hosted on Jovian and can be executed by anyone.
Running using free online resources (1-click, recommended):
The easiest way to start executing the code is to click the Run button at the top of this page and select Run on Binder. You can also select "Run on Colab" or "Run on Kaggle", but you'll need to create an account on Google Colab or Kaggle to use these platforms.

Load the libraries needed for the project

import os  # file and path handling
import re  # regular expressions (may not be needed; imported just in case)
import urllib  # utilities for working with URLs
import pandas as pd  # working with DataFrames
import requests  # user-friendly HTTP requests; used to download web content
from bs4 import BeautifulSoup  # pulls data out of HTML and XML files
requests.__version__  # check the installed version of the package
'2.25.1'

Download the page using requests

The goal is to build the Oscars dataset while sending as few requests as possible to IMDB, so as not to bog down their servers with lots of people like me learning about web scraping. So let's start slow.
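
One way to keep the number of requests low, sketched here as an assumption and not as something the notebook necessarily does, is to save each downloaded page to a local file and reuse it on later runs. The helper name `load_or_download` and its parameters are hypothetical.

```python
import os


def load_or_download(path, download):
    """Return the page text stored at `path`; call `download()` only if the file is missing."""
    if os.path.exists(path):
        # Reuse the saved copy instead of hitting the server again.
        with open(path, encoding="utf-8") as f:
            return f.read()
    text = download()
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text
```

It could be called as `load_or_download("oscars_2017.html", lambda: requests.get(url).text)`, where the file name and `url` are placeholders.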

Go to the portal and use the advanced search to retrieve all the movies that won an Oscar in 2017 (our test year), so we can walk through each attribute that we want to collect and then build the final dataset.
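
The search described above could be sketched roughly as follows. The endpoint and the query parameter names (`title_type`, `release_date`, `groups`) are assumptions about IMDB's advanced title search, so compare them with the URL the search form produces in your browser before relying on them. The function names are hypothetical, and the request itself is left commented out so that nothing is sent until you are ready.

```python
from urllib.parse import urlencode

import requests

# Assumed endpoint for IMDB's advanced title search.
SEARCH_URL = "https://www.imdb.com/search/title/"


def build_search_url(year):
    """Build a search URL for Oscar-winning feature films released in `year`."""
    params = {
        "title_type": "feature",                       # assumed parameter name
        "release_date": f"{year}-01-01,{year}-12-31",  # assumed parameter name
        "groups": "oscar_winner",                      # assumed parameter name
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


def download_page(url):
    """Download a page and return its HTML, raising an error on a bad status code."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


url = build_search_url(2017)
print(url)
# html = download_page(url)  # uncomment to send the actual request
```

Some sites reject requests that lack a browser-like User-Agent header. If that happens here, a `headers` dictionary can be passed to `requests.get`.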