Zero to Analyst: Project 1
Scraping the AUR with Python
Arch Linux
Arch Linux is a Linux distribution that follows the KISS principle, focusing on simplicity, modernity, user centrality, and versatility. In practice, this means the project makes minimal distribution-specific changes, which keeps breakage from updates to a minimum while still delivering the latest software to users very quickly. Arch and Arch-based systems remain very popular in the Linux community, overshadowed only by larger corporate-backed distributions such as Ubuntu.
Arch User Repository
One of the best features of Arch Linux and its derivatives is the Arch User Repository, or AUR: a community-maintained repository that greatly expands the software available to Arch users. Where other Linux distributions like Ubuntu rely on third parties for each program not available in the main repos, the AUR provides a one-stop shop for nearly any software package available for Linux, from system tools and hardware drivers to proprietary programs like Slack, Discord, and Spotify.
While many Arch users install their software from the terminal, searching for available programs there can be awkward, and the AUR website is a great resource for browsing what's on offer.
The aur.archlinux.org website gives information about every package in the AUR, including each package's name, description, version, votes, popularity, and maintainer. In this project we'll explore the AUR website and scrape this information for the listed packages.
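To give a sense of the shape of the data we're after, each package can be represented as a Python dictionary with one key per field; the values below are invented purely for illustration.
package = {
    'name': 'example-package',          # invented values for illustration
    'description': 'An example AUR package',
    'version': '1.0.0-1',
    'votes': 1234,
    'popularity': 5.67,
    'maintainer': 'somebody',
}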
Project Outline
This project will use several Python libraries to scrape data from the Arch User Repository. We will use the Python libraries requests and Beautiful Soup to scrape data from the pages, then save our data in a CSV file.
Here's an outline of the steps we'll follow (a condensed sketch of the whole pipeline appears after the list):
- Download the webpage using requests
- Inspect the HTML in the browser
- Parse the webpage's HTML code using Beautiful Soup
- Extract the information we want from the code
- Use Python lists and dictionaries to organize the extracted information
- Extract and combine data from multiple pages
- Save the extracted information to a CSV file
- Conclusion
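Before diving into the individual steps, here is a condensed sketch of the whole pipeline: download the page with requests, parse it with Beautiful Soup, and write the results out with Python's built-in csv module. The extraction logic is deliberately left as a placeholder, since the selectors depend on inspecting the page's HTML structure, which we do in a later step.
import csv
import requests
from bs4 import BeautifulSoup

response = requests.get('https://aur.archlinux.org/packages/?SB=p&SO=d&O=0')
soup = BeautifulSoup(response.text, 'html.parser')

packages = []  # will hold one dictionary per package
# ... extraction logic goes here once we've inspected the HTML ...

with open('aur_packages.csv', 'w', newline='') as f:
    fields = ['name', 'description', 'version', 'votes', 'popularity', 'maintainer']
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(packages)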
!pip install jovian --upgrade --quiet
import jovian
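The scraping libraries can be installed the same way if they aren't already present in the environment; requests and beautifulsoup4 are the package names on PyPI.
!pip install requests beautifulsoup4 --upgrade --quiet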
Download the webpage using requests
The Python requests library, specifically requests.get(), will allow us to extract the source code of a web page by passing in a URL. To keep our code clean, we'll assign the URL to a variable.
aur_url = 'https://aur.archlinux.org/packages/?SB=p&SO=d&O=0'
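A minimal sketch of the download step: requests.get() returns a response object whose status_code attribute indicates success (200) and whose text attribute holds the page's HTML source.
import requests

response = requests.get(aur_url)
response.status_code    # 200 means the page downloaded successfully
page_contents = response.text    # the page's HTML source as a string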