Zerotopandas Course Project Starter
Probing IMDb's Top 250 Movies: An Exploratory Data Analysis in Python
The IMDb Top 250 Movies dataset consists of 250 of the highest-rated films of all time as determined by IMDb users.
According to the dataset's description,
...the dataset was scraped from the IMDB website (www.imdb.com). The top 250 movies were selected based on their IMDB ratings, and information such as movie title, director, cast, rating, votes, and year of release was collected. No data was altered or modified in any way, and all data was collected in accordance with IMDB's terms of use (https://www.kaggle.com/datasets/rajugc/imdb-top-250-movies-dataset).
The dataset consists of a single CSV file, with each row representing a single movie and containing the following information:
- rank - Rank of the movie
- name - Name of the movie
- year - Release year
- rating - Rating of the movie
- genre - Genre of the movie
- certificate - Certificate of the movie
- run_time - Total movie run time
- tagline - Tagline of the movie
- budget - Budget of the movie
- box_office - Total box office collection across the world
- casts - Cast members of the movie
- directors - Director(s) of the movie
- writers - Writer(s) of the movie
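As a quick sanity check on the column list above, the sketch below compares a DataFrame's columns against the expected names. It is only an illustration: the helper name missing_columns is hypothetical, and a header-only CSV built in memory stands in for the real file so that the example runs without the download.

```python
import io

import pandas as pd

# Column names as listed in the dataset description above.
EXPECTED_COLUMNS = (
    "rank", "name", "year", "rating", "genre", "certificate", "run_time",
    "tagline", "budget", "box_office", "casts", "directors", "writers",
)


def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]


# Header-only stand-in for the real CSV (assumption: the real file uses
# these exact column names in its header row).
header_only_csv = io.StringIO(",".join(EXPECTED_COLUMNS) + "\n")
movies_df = pd.read_csv(header_only_csv)

print(missing_columns(movies_df))                             # → []
print(missing_columns(movies_df.drop(columns=["tagline"])))   # → ['tagline']
```

With the real file, the same check would be applied to the DataFrame returned by pd.read_csv on the downloaded CSV.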
In this project, we will analyze the dataset with NumPy and Pandas to identify patterns and trends, and use Matplotlib and Seaborn to create visualizations. Through this analysis, we hope to shed light on the characteristics that distinguish top-rated movies and to provide useful information for anyone interested in the film industry.
This project was done in connection with "Data Analysis with Python: Zero to Pandas" from Jovian (https://www.jovian.ml), a practical and beginner-friendly introduction to data analysis covering the basics of Python, NumPy, Pandas, data visualization, and exploratory data analysis.
How to run the code
This is an executable Jupyter notebook hosted on Jovian.ml, a platform for sharing data science projects. You can run and experiment with the code in a couple of ways: using free online resources (recommended) or on your own computer.
Option 1: Running using free online resources (1-click, recommended)
The easiest way to start executing this notebook is to click the "Run" button at the top of this page, and select "Run on Binder". This will run the notebook on mybinder.org, a free online service for running Jupyter notebooks. You can also select "Run on Colab" or "Run on Kaggle".
Option 2: Running on your computer locally
- Install Conda by following these instructions. Add Conda binaries to your system PATH, so you can use the conda command on your terminal.
- Create a Conda environment and install the required libraries by running these commands on the terminal:
conda create -n zerotopandas -y python=3.8
conda activate zerotopandas
pip install jovian jupyter numpy pandas matplotlib seaborn opendatasets --upgrade
- Press the "Clone" button above to copy the command for downloading the notebook, and run it on the terminal. This will create a new directory and download the notebook. The command will look something like this:
jovian clone notebook-owner/notebook-id
- Enter the newly created directory using cd directory-name and start the Jupyter notebook:
jupyter notebook
You can now access Jupyter's web interface by clicking the link that shows up on the terminal or by visiting http://localhost:8888 in your browser. Click on the notebook file (it has a .ipynb extension) to open it.
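Before opening the notebook, it can help to confirm that the libraries from the pip install command are visible to the active environment. The following is a minimal sketch using only the standard library. It assumes each package's import name matches its install name, and the helper name find_missing_packages is purely illustrative.

```python
import importlib.util

# Packages taken from the pip install command above.
# Assumption: each one is importable under the same name it is installed with.
REQUIRED_PACKAGES = (
    "jovian", "jupyter", "numpy", "pandas",
    "matplotlib", "seaborn", "opendatasets",
)


def find_missing_packages(packages=REQUIRED_PACKAGES):
    """Return the names of packages that cannot be located for import."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = find_missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

If anything is reported missing, re-running the pip install command inside the activated zerotopandas environment is the likely fix.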
Downloading the Dataset
To download the IMDb Top 250 Movies dataset from Kaggle:
- Go to https://www.kaggle.com/datasets/rajugc/imdb-top-250-movies-dataset and sign in to Kaggle using your account credentials.
- Scroll down to "Data Sources" and click "Download (8KB)".
- Extract the downloaded ZIP file to access the dataset in CSV format.
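The extraction step can also be done from Python. The sketch below uses the standard-library zipfile module. The archive name archive.zip and the target directory imdb-top-250 are placeholders, not names confirmed by the dataset page, and extract_dataset is a hypothetical helper.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return every CSV file found under dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return sorted(dest.rglob("*.csv"))


# Hypothetical file name; adjust to whatever the Kaggle download is called.
archive_path = Path("archive.zip")
if archive_path.exists():
    for csv_path in extract_dataset(archive_path, "imdb-top-250"):
        print(csv_path)
```

The guard around the call keeps the sketch from failing when the archive has not been downloaded yet.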
!pip install jovian opendatasets --upgrade --quiet
Let's begin by downloading the data and listing the files within the dataset.