Building a Python Web Scraping Project From Scratch

This project guide is a part of the Zero to Data Analyst Bootcamp by Jovian.

Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. Follow these steps to build a web scraping project from scratch using Python and its ecosystem of libraries:

  1. Pick a website and describe your objective

    • Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
    • Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
    • Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.
  2. Use the requests library to download web pages

    • Inspect the website's HTML source and identify the right URLs to download.
    • Download and save web pages locally using the requests library.
    • Create a function to automate downloading for different topics/search queries.
  3. Use Beautiful Soup to parse and extract information

    • Parse and explore the structure of downloaded web pages using Beautiful Soup.
    • Use the right properties and methods to extract the required information.
    • Create functions to extract information from the page into lists and dictionaries.
    • (Optional) Use a REST API to acquire additional information if required.
  4. Create CSV file(s) with the extracted information

    • Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
    • Execute the function with different inputs to create a dataset of CSV files.
    • Verify the information in the CSV files by reading them back using Pandas.
  5. Document and share your work

    • Add proper headings and documentation in your Jupyter notebook.
    • Publish your Jupyter notebook to your Jovian profile.
    • (Optional) Write a blog post about your project and share it online.
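
Steps 2 to 4 above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it uses Quotes To Scrape from the project ideas, and the URL pattern (`/tag/<tag>/`) and the tag and class names (`div.quote`, `span.text`, `small.author`) are assumptions about that site's layout that you should confirm by inspecting the page source. The function names `download_page`, `parse_quotes`, `write_csv` and `scrape_tag` are hypothetical helpers made up for this example.

```python
import csv
from pathlib import Path

import requests
from bs4 import BeautifulSoup


def download_page(url, path):
    """Step 2: download `url` and save the HTML locally at `path`."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    Path(path).write_text(response.text, encoding="utf-8")
    return response.text


def parse_quotes(html):
    """Step 3: extract a list of dictionaries from the page HTML.

    The tag and class names used here are assumptions about the page layout.
    """
    doc = BeautifulSoup(html, "html.parser")
    rows = []
    for block in doc.find_all("div", class_="quote"):
        text_tag = block.find("span", class_="text")
        author_tag = block.find("small", class_="author")
        if text_tag is None or author_tag is None:
            continue  # skip blocks that do not match the assumed layout
        rows.append({
            "text": text_tag.get_text(strip=True),
            "author": author_tag.get_text(strip=True),
        })
    return rows


def write_csv(rows, path, fieldnames=("text", "author")):
    """Step 4: write the extracted rows to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def scrape_tag(tag, out_dir="."):
    """End-to-end: download, parse, and save one CSV for a given tag."""
    out_dir = Path(out_dir)
    url = f"http://quotes.toscrape.com/tag/{tag}/"  # assumed URL pattern
    html = download_page(url, out_dir / f"{tag}.html")
    rows = parse_quotes(html)
    write_csv(rows, out_dir / f"{tag}.csv")
    return rows
```

Calling `scrape_tag("love")` and then `scrape_tag("life")` would produce one CSV per tag, assuming those tag pages exist. Reading a file back with `pandas.read_csv("love.csv")` is a quick way to verify the result. Keeping download, parsing, and saving in separate functions also lets you rerun the parsing on saved pages without hitting the site again.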

Notes

Tweet your projects and tag @JovianML. We're retweeting 3 interesting projects every day!

Project Ideas

Here are some project ideas to get you started. You can work on one of these ideas, or pick something entirely different.

  1. Filmography of Actors/Directors (Wikipedia): The list of films and TV shows an actor has been a part of is called their filmography. Here's an example filmography page on Wikipedia: https://en.wikipedia.org/wiki/Christian_Bale_filmography . Can you scrape this information and create a dataset of filmographies of famous actors/actresses/directors with information like film title, year of release, etc.?

  2. Discography of an Artist (Wikipedia): The list of albums released by an artist is called their discography. Here's an example discography page on Wikipedia: https://en.wikipedia.org/wiki/Linkin_Park_discography . Can you scrape this information and create a dataset of discographies or music albums with information like the album title, release date, etc.?

  3. Dataset of Movies (TMDb): The Movie Database (TMDb) contains information about thousands of movies from around the world: https://www.themoviedb.org/movie . Can you scrape the site to create a dataset of movies containing information like title, release date, cast, etc.? You can also create datasets of movie actors/actresses/directors using this site.

  4. Dataset of TV Shows (TMDb): The Movie Database (TMDb) contains information about thousands of TV shows from around the world: https://www.themoviedb.org/tv . Can you scrape the site to create a dataset of TV shows containing information like title, release date, cast, crew, etc.? You can also create datasets of TV actors/actresses/directors using this site.

  5. Collections of Popular Repositories (GitHub): Scrape GitHub collections ( https://github.com/collections ) to create a dataset of popular repositories organized by different use cases.

  6. Dataset of Books (BooksToScrape): Create a dataset of popular books in different genres by scraping the site Books To Scrape: http://books.toscrape.com

  7. Dataset of Quotes (QuotesToScrape): Create a dataset of popular quotes for different tags by scraping the site Quotes To Scrape: http://quotes.toscrape.com

  8. Scrape a User's Repositories (GitHub): Given someone's GitHub username, can you scrape their GitHub profile to create a list of their repositories with information like repository name, no. of stars, no. of forks, etc.?

  9. Bibliography of an Author (Wikipedia): The list of books/publications by an author is called their bibliography. Here's an example bibliography page on Wikipedia: https://en.wikipedia.org/wiki/Charles_Dickens_bibliography . Can you scrape this information and create a dataset of bibliographies for popular authors?

  10. Country Demographics (Wikipedia): Wikipedia provides detailed demographics information for several countries e.g. https://en.wikipedia.org/wiki/Demographics_of_India . Can you scrape these pages to create a dataset of demographics for several countries containing information like population, density, life expectancy, fertility rate, infant mortality rate, age groups, etc.?

  11. Stock Prices (Yahoo Finance): Yahoo Finance provides detailed information about stocks of publicly listed companies e.g. https://finance.yahoo.com/quote/TWTR . Can you scrape this information to create a dataset of stock prices for popular companies?

  12. Create a Dataset of YouTube Videos (YouTube): Can you write a program to scrape information about videos from a YouTube channel page e.g. https://www.youtube.com/c/JovianML/videos ? Use this to create a dataset of top videos from popular channels.

  13. Songs Dataset (AZLyrics): Create a dataset of songs by scraping AZLyrics: https://www.azlyrics.com/f.html . Capture information like song title, artist name, year of release and lyrics URL.

  14. Scrape a Popular Blog: Create a dataset of blog posts on a popular blog e.g. https://m.signalvnoise.com/search/ . The dataset can contain information like the blog title, published date, tags, author, link to blog post, etc.

  15. Weekly Top Songs (Top 40 Weekly): Create a dataset of the top 40 songs of each week in a given year by scraping the site https://top40weekly.com . Capture information like song title, artist, weekly rank, etc.

  16. Video Games Dataset (Steam): Create a dataset of popular or trending video games by scraping the listing pages on platforms like Steam: https://store.steampowered.com/genre/Free%20to%20Play/ .
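
Several of the Wikipedia ideas above (filmography, discography, bibliography) share a URL shape, which makes them a natural fit for a function that generates URLs for different inputs. The sketch below assumes that other pages follow the same naming pattern as the three example links, which is not guaranteed: many people have no dedicated page of this kind, so always check the response status before parsing. `wikipedia_list_url` is a hypothetical helper name.

```python
from urllib.parse import quote

WIKIPEDIA_BASE = "https://en.wikipedia.org/wiki/"


def wikipedia_list_url(name, kind):
    """Build a URL such as .../Christian_Bale_filmography.

    Assumes the "<Name>_<kind>" title pattern seen in the example links.
    """
    title = f"{name.strip().replace(' ', '_')}_{kind}"
    return WIKIPEDIA_BASE + quote(title)  # percent-encode unsafe characters


names = ["Christian Bale", "Linkin Park", "Charles Dickens"]
kinds = ["filmography", "discography", "bibliography"]
urls = [wikipedia_list_url(n, k) for n, k in zip(names, kinds)]
```

Each URL in `urls` can then be passed to whatever download function you write for step 2.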

Also check out these projects and tutorials:

!pip install jovian --upgrade --quiet
import jovian
jovian.commit(project="python-web-scraping-project-guide")
[jovian] Attempting to save notebook..