Scraping Top Repositories for Topics on GitHub

Web scraping is the process of extracting data from websites programmatically: instead of copying information by hand, we download a page's HTML and parse out the pieces we need. GitHub is the largest platform for hosting and collaborating on open-source code, and it organizes popular projects under topics at https://github.com/topics. In this project, we'll scrape that listing to build a dataset of the top repositories for each topic. We'll use Python throughout: requests to download pages, Beautiful Soup to parse the HTML, and Pandas to organize the results and write them out as CSV files.

Here are the steps we'll follow:

  • We're going to scrape https://github.com/topics
  • We'll get a list of topics. For each topic, we'll grab the topic title, topic page URL, and topic description
  • For each topic, we'll get the top 25 repositories in the topic from the topic page
  • For each repository, we'll grab the repo name, username, stars and repo URL
  • For each topic, we'll create a CSV file in the following format (a sketch of how these steps fit together follows the example below):
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
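
To make the plan concrete, here's a minimal sketch of how the finished pipeline might fit together. The helpers scrape_topics and scrape_topic are placeholders for functions we'll build up step by step in the sections below; only the overall flow matters here.

import os

def scrape_topics():
    # Placeholder: we'll implement this below. It should return a Pandas
    # DataFrame with one row per topic (title, description, url).
    raise NotImplementedError

def scrape_topic(topic_url, path):
    # Placeholder: we'll implement this below. It should write the top
    # repositories for one topic page to a CSV file at `path`.
    raise NotImplementedError

def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    os.makedirs('data', exist_ok=True)  # one CSV file per topic goes in here
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))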

Scrape the list of topics from GitHub

Here's how we'll do it:

  • use requests to download the page
  • use Beautiful Soup (bs4) to parse and extract information
  • convert the results to a Pandas DataFrame

Let's write a function to download the page.

import requests
from bs4 import BeautifulSoup

def get_topics_page():
    # Download the GitHub topics page
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    # A status code other than 200 means the download failed
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the downloaded HTML using Beautiful Soup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

The get_topics_page function downloads the HTML of the GitHub topics page using requests. If the request fails (any status code other than 200), it raises an exception; otherwise it parses the HTML with Beautiful Soup and returns a BeautifulSoup document object that we can query for the elements we need.
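
For example, we can call the function and inspect the result. The tag and CSS class used below are assumptions based on inspecting the topics page in a browser; GitHub updates its markup from time to time, so verify them with your browser's developer tools before relying on them.

doc = get_topics_page()
print(type(doc))        # <class 'bs4.BeautifulSoup'>
print(doc.title.text)   # the page's <title> text

# Assumed markup: topic titles appear in <p> tags with this CSS class.
# Check the actual class name with your browser's dev tools.
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class': selection_class})
topic_titles = [tag.text for tag in topic_title_tags]
print(topic_titles[:5])

From lists like topic_titles, building the Pandas DataFrame mentioned earlier is a one-liner, e.g. pd.DataFrame({'title': topic_titles}).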