Web Scraping Project
Scraping Popular Topics on GitHub using Python
GitHub is a popular website for sharing open source projects and code repositories. For example, the tensorflow repository contains the entire source code of the TensorFlow deep learning framework.
Repositories on GitHub can be tagged using topics. For example, the tensorflow repository has the topics python, machine-learning, deep-learning, etc.
The page https://github.com/topics provides a list of the top topics on GitHub. In this project, we'll retrieve information from this page using web scraping: the process of extracting information from a website in an automated fashion using code. We'll use the Python libraries Requests and Beautiful Soup to scrape data from this page.
Here's an outline of the steps we'll follow:
- Download the webpage using requests
- Parse the HTML source code using Beautiful Soup
- Extract topic names, descriptions and URLs from the page
- Compile extracted information into Python lists and dictionaries
- Extract and combine data from multiple pages
- Save the extracted information to a CSV file
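The first four steps of this outline can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's final code: the function names `download_page` and `parse_topics`, the sample HTML, and the CSS class names `topic-link`, `topic-title` and `topic-desc` are all made up for the example. The real markup on https://github.com/topics has to be inspected in the browser and will likely differ.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://github.com"


def download_page(url):
    """Download a page and return its HTML source as text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def text_or_empty(tag):
    """Return the stripped text of a tag, or an empty string if the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else ""


def parse_topics(html):
    """Extract topic titles, descriptions and URLs into a dictionary of lists.

    The class names used here are hypothetical placeholders, not GitHub's real ones.
    """
    doc = BeautifulSoup(html, "html.parser")
    topics = {"title": [], "description": [], "url": []}
    for link_tag in doc.find_all("a", class_="topic-link"):
        topics["title"].append(text_or_empty(link_tag.find("p", class_="topic-title")))
        topics["description"].append(text_or_empty(link_tag.find("p", class_="topic-desc")))
        # Links in the sample are relative, so prepend the site's base URL
        topics["url"].append(BASE_URL + link_tag["href"])
    return topics


# Made-up HTML standing in for a downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<a class="topic-link" href="/topics/3d">
  <p class="topic-title">3d</p>
  <p class="topic-desc">3d modeling xyz</p>
</a>
<a class="topic-link" href="/topics/ajax">
  <p class="topic-title">Ajax</p>
  <p class="topic-desc">Ajax is a new xyz</p>
</a>
"""

topics = parse_topics(SAMPLE_HTML)
print(topics["title"])
print(topics["url"])
```

To run the same sketch against the live site, `parse_topics(download_page("https://github.com/topics"))` would be the natural call, once the placeholder class names are replaced with the ones actually found on the page.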
By the end of the project, we'll create a CSV file in the following format:
title,description,url
3d,3d modeling xyz,https://github.com/topics/3d
Ajax,Ajax is a new xyz,https://github.com/topics/ajax
...
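One way to produce a file in this format is Python's built-in `csv` module. The sketch below is an assumption about how the final step could look, not necessarily the approach the project ends up using. The function name `write_topics_csv` and the file name `topics.csv` are illustrative, and the two rows are copied from the placeholder output above.

```python
import csv


def write_topics_csv(rows, path):
    """Write a list of topic dictionaries to a CSV file with a header row."""
    fieldnames = ["title", "description", "url"]
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Placeholder rows taken from the sample output format
rows = [
    {"title": "3d", "description": "3d modeling xyz", "url": "https://github.com/topics/3d"},
    {"title": "Ajax", "description": "Ajax is a new xyz", "url": "https://github.com/topics/ajax"},
]

write_topics_csv(rows, "topics.csv")

# Read the file back to check what was written
with open("topics.csv", newline="", encoding="utf-8") as f:
    print(f.read())
```

Using `csv.DictWriter` rather than joining strings by hand means that descriptions containing commas or quotes are quoted automatically, so the file stays parseable.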
How to Run the Code
You can execute the code by clicking the "Run" button at the top of this page and selecting "Run on Binder". You can make changes and save your own version of the notebook to Jovian by executing the following cells:
!pip install jovian --upgrade --quiet
import jovian
jovian.commit()