Web Scraping Popular Television Shows on TMDB (themoviedb.org)
Data source: TMDb website (https://www.themoviedb.org)
A disclaimer before beginning: many websites restrict or outright bar scraping of data from their pages, and users may be subject to legal ramifications depending on where and how they attempt to scrape information. Many sites document their scraping restrictions on a dedicated page at www.[site].com/robots.txt. Be extremely careful with sites that house user data; services like Facebook, LinkedIn, and even Craigslist do not take kindly to data being scraped from their pages. When in doubt, contact the site's team directly.
The Movie Database (TMDb) is a community-built database of movies and television shows. The community has added every piece of data since 2008. Users can search for topics of interest and discover new favorites by browsing its large catalog, and they can contribute back by reviewing and scoring shows for the benefit of the community. In summary, TMDb is an excellent website for someone like me who wants to practice web scraping skills.
Project motivation
For this project, we will retrieve information from the 'Popular TV Shows' page using web scraping: the process of extracting information from a website programmatically. Web scraping isn't magic; many readers already do a manual version of it every day. For example, a recent graduate may copy and paste information about companies they applied to into a spreadsheet to keep track of job applications.
Project goals
The project goal is to build a web scraper that extracts all the desired information and assembles it into a single CSV file. The format of the output CSV file is shown below:
| # | movie_title | released_date | score | image_link | detail-page |
|---|---|---|---|---|---|
| 1 | The Falcon and the Winter Soldier | 19 Mar 2021 | 78% | ... | ... |
| 2 | The Good Doctor | 25 Sep 2017 | 86% | ... | ... |
| ... | | | | | |
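Once the scraper has collected rows in this shape, writing the CSV is straightforward with Python's built-in `csv` module. Below is a minimal sketch; the sample row is hand-written here (in the real scraper, rows would come from the parsed pages), and the output filename `popular_tv_shows.csv` is just an example.

```python
import csv

# Column names matching the output format shown in the table above
fieldnames = ["movie_title", "released_date", "score", "image_link", "detail-page"]

# A hand-written sample row; the real scraper would build this list
# from data extracted out of the HTML pages
rows = [
    {"movie_title": "The Falcon and the Winter Soldier",
     "released_date": "19 Mar 2021",
     "score": "78%",
     "image_link": "...",
     "detail-page": "..."},
]

# DictWriter maps each dictionary onto the CSV columns by key
with open("popular_tv_shows.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```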
Project steps
Here is an outline of the steps we'll follow.
- Download the webpage using `requests`
- Parse the HTML source code using the `BeautifulSoup` library
- Build the scraper components
- Compile the extracted information into Python lists and dictionaries
- Write the information to CSV files
- Extract and combine data from multiple pages
- Future work and references
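The first two steps can be sketched briefly. Here, instead of fetching a live page (which you would do with `requests.get("https://www.themoviedb.org/tv").text`), we parse a small hand-written HTML snippet; the `card` class name and the tag layout are assumptions for illustration, and the real TMDb markup will differ.

```python
from bs4 import BeautifulSoup

# A simplified, hand-written stand-in for a show "card" on the page.
# In the real scraper, this HTML would come from requests.get(...).text
html = """
<div class="card">
  <h2><a href="/tv/88396">The Falcon and the Winter Soldier</a></h2>
  <p>Mar 19, 2021</p>
</div>
"""

# Parse the raw HTML into a navigable tree
doc = BeautifulSoup(html, "html.parser")

# Locate the card and pull out its pieces by tag
card = doc.find("div", class_="card")
title = card.find("a").text
date = card.find("p").text
print(title, date)
```

The same `find` / `find_all` calls form the core of the scraper components built later in the tutorial.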
How to run the code
This tutorial is an executable Jupyter notebook hosted on Jovian. You can run this tutorial and experiment with the code examples in a couple of ways: using free online resources (recommended) or on your computer.
Option 1: Running using free online resources (1-click, recommended)
The easiest way to start executing the code is to click the Run button at the top of this page and select Run on Binder. You can also select "Run on Colab" or "Run on Kaggle", but you'll need to create an account on Google Colab or Kaggle to use these platforms.
Option 2: Running on your computer locally
To run the code on your computer locally, you'll need to set up Python, download the notebook and install the required libraries. We recommend using the Conda distribution of Python. Click the Run button at the top of this page, select the Run Locally option, and follow the instructions.
Jupyter Notebooks: This tutorial is a Jupyter notebook - a document made of cells. Each cell can contain code written in Python or explanations in plain English. You can execute code cells and view the results, e.g., numbers, messages, graphs, tables, files, etc., instantly within the notebook. Jupyter is a powerful platform for experimentation and analysis. Don't be afraid to mess around with the code & break things - you'll learn a lot by encountering and fixing errors. You can use the "Kernel > Restart & Clear Output" menu option to clear all outputs and start again from the top.