Building a Python Web Scraping Project From Scratch
This project guide is a part of the Zero to Data Analyst Bootcamp by Jovian.
Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. Follow these steps to build a web scraping project from scratch using Python and its ecosystem of libraries:
-
Pick a website and describe your objective
- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.
-
Use the requests library to download web pages
- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
-
Use Beautiful Soup to parse and extract information
- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract information from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.
-
Create CSV file(s) with the extracted information
- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.
-
Document and share your work
- Add proper headings and documentation in your Jupyter notebook.
- Publish your Jupyter notebook to your Jovian profile.
- (Optional) Write a blog post about your project and share it online.
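The download, parse, and save steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not a prescribed solution: it targets the Quotes To Scrape practice site from the "Project Ideas" section, and it assumes a /tag/<name>/ URL pattern and the class names quote, text, author and tag in the page's HTML. The function names download_page, parse_quotes, save_csv and scrape_tag are hypothetical. Inspect the actual page source before relying on any of these.

```python
import csv

import requests
from bs4 import BeautifulSoup

# Assumption: the practice site from the Project Ideas list.
BASE_URL = "http://quotes.toscrape.com"
FIELDS = ["text", "author", "tags"]


def download_page(tag):
    """Download the listing page for one tag and return its HTML text."""
    # Assumption: pages for a tag live at /tag/<name>/.
    url = f"{BASE_URL}/tag/{tag}/"
    response = requests.get(url, timeout=10)
    # Raise an exception on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return response.text


def parse_quotes(html):
    """Extract one dictionary per quote from a page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Assumption: each quote sits in <div class="quote"> and its parts
    # carry the classes "text", "author" and "tag".
    for block in soup.find_all("div", class_="quote"):
        tags = [a.get_text(strip=True) for a in block.find_all("a", class_="tag")]
        rows.append({
            "text": block.find("span", class_="text").get_text(strip=True),
            "author": block.find("small", class_="author").get_text(strip=True),
            "tags": ";".join(tags),
        })
    return rows


def save_csv(rows, path):
    """Write the list of dictionaries to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def scrape_tag(tag):
    """Run download, parse and save for one tag; return the CSV path."""
    path = f"{tag}.csv"
    save_csv(parse_quotes(download_page(tag)), path)
    return path
```

Calling scrape_tag("love") would write love.csv to the working directory, and calling it in a loop over several tags would produce one CSV file per tag. Reading a file back with pandas.read_csv("love.csv") is one way to verify the result.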
Notes
-
Use the "New" button on Jovian to create a new notebook, and select "Run on Binder" to get started.
-
Follow this tutorial to learn web scraping: https://jovian.ai/aakashns/python-web-scraping-and-rest-api
-
Check out the 20-week bootcamp to learn Python programming, web scraping, data analysis, and more: http://zerotoanalyst.com
Tweet your projects and tag @JovianML. We're retweeting 3 interesting projects every day!
Project Ideas
Here are some project ideas to get you started. You can work on one of these ideas, or pick something entirely different.
-
Filmography of Actors/Directors (Wikipedia): The list of Films and TV shows an actor has been a part of is called their filmography. Here's an example filmography page on Wikipedia: https://en.wikipedia.org/wiki/Christian_Bale_filmography . Can you scrape this information and create a dataset of filmographies of famous actors/actresses/directors with information like film title, year of release, etc.?
-
Discography of an Artist (Wikipedia): The list of albums released by an artist is called their discography. Here's an example discography page on Wikipedia: https://en.wikipedia.org/wiki/Linkin_Park_discography . Can you scrape this information and create a dataset of discographies or music albums with information like the album title, release date, etc.?
-
Dataset of Movies (TMDb): The Movie Database (TMDb) contains information about thousands of movies from around the world: https://www.themoviedb.org/movie . Can you scrape the site to create a dataset of movies containing information like title, release date, cast, etc.? You can also create datasets of movie actors/actresses/directors using this site.
-
Dataset of TV Shows (TMDb): The Movie Database (TMDb) contains information about thousands of TV shows from around the world: https://www.themoviedb.org/tv . Can you scrape the site to create a dataset of TV shows containing information like title, release date, cast, crew, etc.? You can also create datasets of TV actors/actresses/directors using this site.
-
Collections of Popular Repositories (GitHub): Scrape GitHub collections ( https://github.com/collections ) to create a dataset of popular repositories organized by different use cases.
-
Dataset of Books (BooksToScrape): Create a dataset of popular books in different genres by scraping the site Books To Scrape: http://books.toscrape.com
-
Dataset of Quotes (QuotesToScrape): Create a dataset of popular quotes for different tags by scraping the site Quotes To Scrape: http://quotes.toscrape.com
-
Scrape a User's Repositories (GitHub): Given someone's GitHub username, can you scrape their GitHub profile to create a list of their repositories with information like repository name, no. of stars, no. of forks, etc.?
-
Bibliography of an Author (Wikipedia): The list of books/publications by an author is called their bibliography. Here's an example bibliography page on Wikipedia: https://en.wikipedia.org/wiki/Charles_Dickens_bibliography . Can you scrape this information and create a dataset of bibliographies for popular authors?
-
Country Demographics (Wikipedia): Wikipedia provides detailed demographics information for several countries e.g. https://en.wikipedia.org/wiki/Demographics_of_India . Can you scrape these pages to create a dataset of demographics for several countries containing information like population, density, life expectancy, fertility rate, infant mortality rate, age groups, etc.?
-
Stock Prices (Yahoo Finance): Yahoo Finance provides detailed information about stocks of publicly listed companies e.g. https://finance.yahoo.com/quote/TWTR . Can you scrape this information to create a dataset of stock prices for popular companies?
-
Create a Dataset of YouTube Videos (YouTube): Can you write a program to scrape information about videos from a YouTube channel page e.g. https://www.youtube.com/c/JovianML/videos ? Use this to create a dataset of top videos from popular channels.
-
Songs Dataset (AZLyrics): Create a dataset of songs by scraping AZLyrics: https://www.azlyrics.com/f.html . Capture information like song title, artist name, year of release and lyrics URL.
-
Scrape a Popular Blog: Create a dataset of blog posts on a popular blog e.g. https://m.signalvnoise.com/search/ . The dataset can contain information like the blog title, published date, tags, author, link to blog post, etc.
-
Weekly Top Songs (Top 40 Weekly): Create a dataset of the top 40 songs of each week in a given year by scraping the site https://top40weekly.com . Capture information like song title, artist, weekly rank, etc.
-
Video Games Dataset (Steam): Create a dataset of popular or trending video games by scraping the listing pages on platforms like Steam: https://store.steampowered.com/genre/Free%20to%20Play/ .
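For ideas such as "Scrape a User's Repositories (GitHub)", the optional REST API step can supplement or replace HTML parsing. The sketch below is one possible approach, not the guide's required method: it assumes GitHub's public endpoint for listing a user's repositories and the name, stargazers_count and forks_count fields in its JSON response. The function names repo_to_row and get_user_repos are hypothetical. Only the first page of up to 100 repositories is fetched, and unauthenticated requests are rate limited.

```python
import requests


def repo_to_row(repo):
    """Keep only the fields of interest from one repository object."""
    return {
        "name": repo["name"],
        "stars": repo["stargazers_count"],
        "forks": repo["forks_count"],
    }


def get_user_repos(username):
    """Return a list of dictionaries describing a user's public repositories."""
    url = f"https://api.github.com/users/{username}/repos"
    # per_page=100 is the largest page size the API accepts;
    # users with more repositories would need pagination.
    response = requests.get(url, params={"per_page": 100}, timeout=10)
    response.raise_for_status()
    return [repo_to_row(repo) for repo in response.json()]
```

The resulting list of dictionaries can be written to a CSV file in the same way as scraped data, for example with csv.DictWriter or pandas.DataFrame(rows).to_csv(...).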
Also check out these projects and tutorials:
- https://medium.com/@msalmon00/web-scraping-job-postings-from-indeed-96bd588dcb4b
- https://medium.com/the-innovation/scraping-medium-with-python-beautiful-soup-3314f898bbf5
- https://medium.com/brainstation23/how-to-become-a-pro-with-scraping-youtube-videos-in-3-minutes-a6ac56021961
- https://www.freecodecamp.org/news/web-scraping-python-tutorial-how-to-scrape-data-from-a-website/
- https://www.freecodecamp.org/news/scraping-wikipedia-articles-with-python/
- https://towardsdatascience.com/web-scraping-yahoo-finance-477fe3daa852
!pip install jovian --upgrade --quiet
import jovian
jovian.commit(project="python-web-scraping-project-guide")