Web Scraping Project

Scraping Popular Topics on GitHub using Python


GitHub is a popular website for sharing open-source projects and code repositories. For example, the tensorflow repository contains the entire source code of the TensorFlow deep learning framework.

Repositories on GitHub can be tagged with topics. For example, the tensorflow repository has the topics python, machine-learning and deep-learning, among others.

GitHub's topics page provides a list of the top topics on GitHub. In this project, we'll retrieve information from this page using web scraping: the process of extracting information from a website in an automated fashion using code. We'll use the Python libraries Requests and Beautiful Soup to scrape data from this page.

Here's an outline of the steps we'll follow:

  1. Download the webpage using Requests
  2. Parse the HTML source code using Beautiful Soup
  3. Extract topic names, descriptions and URLs from the page
  4. Compile the extracted information into Python lists and dictionaries
  5. Extract and combine data from multiple pages
  6. Save the extracted information to a CSV file
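
Steps 1 to 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the `SAMPLE_HTML` snippet and the class names `topic`, `topic-name` and `topic-desc` are invented for the example and do not reflect GitHub's real markup, which has to be inspected in the browser before writing selectors. The helpers `download_page` and `parse_topics` are hypothetical names, and `download_page` is defined but not called so that the sketch runs without network access.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; GitHub's real markup differs.
SAMPLE_HTML = """
<div class="topic">
  <a href="/topics/3d"><span class="topic-name">3D</span></a>
  <p class="topic-desc">3D modeling builds three-dimensional shapes.</p>
</div>
<div class="topic">
  <a href="/topics/ajax"><span class="topic-name">Ajax</span></a>
  <p class="topic-desc">Ajax is a technique for dynamic web pages.</p>
</div>
"""


def download_page(url):
    """Step 1: download a page and return its HTML as text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def parse_topics(html, base_url="https://github.com"):
    """Steps 2-4: parse the HTML and collect one dictionary per topic."""
    doc = BeautifulSoup(html, "html.parser")
    topics = []
    # The tag and class names below are assumptions tied to SAMPLE_HTML.
    for block in doc.find_all("div", class_="topic"):
        name_tag = block.find("span", class_="topic-name")
        desc_tag = block.find("p", class_="topic-desc")
        link_tag = block.find("a", href=True)
        topics.append({
            "title": name_tag.get_text(strip=True),
            "description": desc_tag.get_text(strip=True),
            "url": base_url + link_tag["href"],
        })
    return topics


topics = parse_topics(SAMPLE_HTML)
for topic in topics:
    print(topic["title"], topic["url"])
```

Keeping the download and the parsing in separate functions means the parsing logic can be tried on a saved or hand-written HTML string before any request is sent to the live site.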

By the end of the project, we'll create a CSV file in the following format:

3d,3d modeling xyz,
Ajax,Ajax is a new xyz,
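
A file in this shape could be produced with the `csv` module from Python's standard library. The sketch below is only illustrative: the column names `title`, `description` and `url` and the header row are assumptions (the sample above shows no header), the rows are the same placeholders as in the sample with the third column left empty, and `write_topics_csv` is a hypothetical helper name.

```python
import csv
import io

# Placeholder rows mirroring the sample above; the third column is left empty.
rows = [
    {"title": "3d", "description": "3d modeling xyz", "url": ""},
    {"title": "Ajax", "description": "Ajax is a new xyz", "url": ""},
]


def write_topics_csv(rows, file_obj):
    """Write a header line followed by one CSV line per topic dictionary."""
    writer = csv.DictWriter(file_obj, fieldnames=["title", "description", "url"])
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the example free of side effects on disk.
buffer = io.StringIO()
write_topics_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, the same helper can be given a file opened with `open("topics.csv", "w", newline="")`; passing `newline=""` is what the `csv` documentation recommends so that line endings are not doubled on some platforms.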

How to Run the Code

You can execute the code by clicking the "Run" button at the top of this page and selecting "Run on Binder". You can make changes and save your own version of the notebook to Jovian by executing the following cells:

!pip install jovian --upgrade --quiet
import jovian
[jovian] Attempting to save notebook..
Aakash N S, 6 months ago