Web Scraping Project
Scraping Popular Topics on GitHub using Python
GitHub is a popular website for sharing open-source projects and code repositories. For example, the tensorflow repository contains the entire source code of the TensorFlow deep learning framework.
Repositories in GitHub can be tagged using topics. For example, the
tensorflow repository has the topics
The page https://github.com/topics provides a list of the top topics on GitHub. In this project, we'll retrieve information from this page using web scraping: the process of extracting information from a website in an automated fashion using code. We'll use the Python libraries Requests and Beautiful Soup to scrape data from this page.
Here's an outline of the steps we'll follow:
- Download the web page using Requests
- Parse the HTML source code using Beautiful Soup
- Extract topic names, descriptions and URLs from the page
- Compile extracted information into Python lists and dictionaries
- Extract and combine data from multiple pages
- Save the extracted information to a CSV file
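To make the outline concrete, here is a minimal sketch of the download, parse, extract and combine steps. It is not the final implementation. It assumes that each topic on the page is an anchor tag whose href starts with /topics/ and that contains two paragraph tags (title, then description), and that further pages are reachable through a ?page=N query parameter. The real markup may differ and should be checked in the browser's inspector. The helper names download_page, parse_topics and scrape_topics, and the SAMPLE_HTML snippet, are hypothetical.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://github.com"
TOPICS_URL = BASE_URL + "/topics"


def download_page(url):
    """Return the HTML of `url`, raising an exception if the request fails."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def parse_topics(html):
    """Extract a list of {title, description, url} dictionaries from page HTML."""
    doc = BeautifulSoup(html, "html.parser")
    topics = []
    for link in doc.find_all("a", href=True):
        # Assumption: a topic is an <a href="/topics/..."> wrapping two <p> tags,
        # the first holding the title and the second the description.
        paragraphs = link.find_all("p")
        if not link["href"].startswith("/topics/") or len(paragraphs) < 2:
            continue
        topics.append({
            "title": paragraphs[0].get_text(strip=True),
            "description": paragraphs[1].get_text(strip=True),
            "url": BASE_URL + link["href"],
        })
    return topics


def scrape_topics(num_pages=1):
    """Combine topics from several pages (assumes a ?page=N query parameter)."""
    topics = []
    for page in range(1, num_pages + 1):
        topics.extend(parse_topics(download_page(f"{TOPICS_URL}?page={page}")))
    return topics


# Hypothetical markup, used to try the parser without making a network request.
SAMPLE_HTML = """
<a href="/topics/3d"><p>3D</p><p>3D modeling xyz</p></a>
<a href="/topics/ajax"><p>Ajax</p><p>Ajax is a new xyz</p></a>
<a href="/about"><p>About</p><p>Not a topic</p></a>
"""

sample_topics = parse_topics(SAMPLE_HTML)
print(sample_topics)
```

Keeping the download step separate from the parsing step means the parser can be tested on saved or hand-written HTML, and only scrape_topics touches the network.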
By the end of the project, we'll create a CSV file in the following format:
title,description,url
3d,3d modeling xyz,https://github.com/topics/3d
Ajax,Ajax is a new xyz,https://github.com/topics/ajax
...
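A file in this format can be written with Python's built-in csv module. The sketch below assumes the scraped topics are held as a list of dictionaries with the keys title, description and url. The helper name topics_to_csv and the sample rows are illustrative only; the rows reuse the placeholder values shown above.

```python
import csv
import io


def topics_to_csv(topics, file):
    """Write topic dictionaries to an open text file as CSV with a header row."""
    writer = csv.DictWriter(file, fieldnames=["title", "description", "url"])
    writer.writeheader()
    writer.writerows(topics)


# Placeholder rows mirroring the sample format above, not real scraped data.
sample_rows = [
    {"title": "3d", "description": "3d modeling xyz", "url": "https://github.com/topics/3d"},
    {"title": "Ajax", "description": "Ajax is a new xyz", "url": "https://github.com/topics/ajax"},
]

# Write to an in-memory buffer here; pass open("topics.csv", "w", newline="") to write a file.
buffer = io.StringIO()
topics_to_csv(sample_rows, buffer)
print(buffer.getvalue())
```

Using csv.DictWriter rather than joining strings by hand means descriptions that contain commas or quotes are escaped correctly.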
How to Run the Code
You can execute the code by clicking the "Run" button at the top of this page and selecting "Run on Binder". You can make changes and save your own version of the notebook to Jovian by executing the following cells:
!pip install jovian --upgrade --quiet
[jovian] Updating notebook "sridhar-thalari/web-scraping-project" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/sridhar-thalari/web-scraping-project