Webscraping Project Final
Jovian Project #1 Web Scraping
Author: Samantha Roberts
Date: April 21, 2021
Wiki Country Demographics Webscraping Project
Here is a link to the assignment:
https://jovian.ai/learn/zero-to-data-analyst-bootcamp/assignment/project-1-web-scraping-with-python
And here is a link to a Wiki list of all countries by population:
https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population
It looks like if I crawl the links in this table, I can reach each country's demographics page, with its particular spelling and format.
Some Wiki demographics pages like these have an easily accessible table containing the info I want to scrape:
Some demographics pages like these have the info in a different format:
Some pages like these do not have these specific summary tables at all:
My plan of attack for this project
- Scrape the page that contains a table of countries and the demographics page links
- Make a list of dictionaries that contains the above country information
- Read this into a CSV file that can be handled by Pandas
- Crawl the links in the CSV file and see how many country pages have the table I am trying to scrape
- Write code specific to the majority of demographics page tables to scrape the data I want
- Handle instances where the relevant tables do not exist in the demographics page
- Look at the data in Pandas and see if there is some first-order way to clean it further
- Write the demographics info to a CSV file
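The "crawl the links and check for the table" step in the plan above can be sketched like this. The helper names and the `wikitable` class check are my own assumptions, not final code; the real check may need to match the exact table class used on the demographics pages:

```python
# Sketch of the "crawl and check" step: fetch each country link from the
# CSV and record whether the page has a summary table we could scrape.
import requests
from bs4 import BeautifulSoup


def has_summary_table(html):
    """Return True if the page HTML contains at least one 'wikitable'."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find('table', class_='wikitable') is not None


def check_country_pages(links, base='https://en.wikipedia.org'):
    """Fetch each relative wiki link and record whether it has a table."""
    results = {}
    for link in links:
        resp = requests.get(base + link)
        results[link] = has_summary_table(resp.text)
    return results
```

Counting the True values in the returned dict would tell me how many pages the main scraper will cover and how many need special handling.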
(1) We first build a list of countries and their relative wiki links
- Scrape https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population
- Put the rows of the HTML table into a dictionary
- Convert the dictionary to a Pandas DataFrame and write it to a csv file
# now let's get all of the countries from the table
# this is from the table at https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population
from bs4 import BeautifulSoup
import pandas as pd
import requests


def get_headers(table_header_row):
    """
    Takes an HTML table header row
    returns a list of strings of the headers
    """
    cols = []
    for col in table_header_row:
        cols.append(col.text.strip())
    return cols[:4]  # we don't want the date or the source references


def get_row_values(country):
    """
    takes in HTML for a row of table data
    returns a dictionary of that data
    """
    info = {
        'rank': country.th.text.strip(),
        'link': country.a['href'],
        'name': country.a.text.strip(),
        'population': country.find('td', style='text-align:right').text.strip().replace(',', ''),
        'percent_of_world': country.find_all('td', style='text-align:right')[1].text
                                   .replace('"', '').strip().strip('%')
    }
    return info


def build_country_list(rows):
    """
    takes in a BeautifulSoup result set containing rows of an HTML table
    returns a list of dictionaries, each containing the data from one row
    """
    country_list = []
    for row in rows:
        country_list.append(get_row_values(row))
    return country_list


def scrape_wiki_table():
    """
    does what it says
    returns a list of dictionaries
    """
    world_countries_url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
    soup = BeautifulSoup(requests.get(world_countries_url).text, 'html.parser')
    tables = soup.find_all('table', class_='wikitable sortable plainrowheaders')
    rows = tables[0].find_all('tr')
    # headers = get_headers(rows[0].find_all('th'))
    return build_country_list(rows[1:242])  # skip the header row; 241 country rows follow


def write_csv(path, list_of_dicts):
    """
    writes a list of dicts to a csv file
    """
    df = pd.DataFrame(list_of_dicts)
    df.to_csv(path, index=False, header=True)
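As a quick, offline sanity check of the CSV step, here is a round trip through pandas. The two sample rows are made up for illustration, not real scraped values:

```python
# Round-trip a couple of made-up rows through a CSV file to confirm the
# DataFrame -> CSV path behaves the way write_csv() expects.
import pandas as pd

sample = [
    {'rank': '1', 'link': '/wiki/Demographics_of_China', 'name': 'China',
     'population': '1411778724', 'percent_of_world': '17.9'},
    {'rank': '2', 'link': '/wiki/Demographics_of_India', 'name': 'India',
     'population': '1383524897', 'percent_of_world': '17.5'},
]

df = pd.DataFrame(sample)
df.to_csv('countries_sample.csv', index=False, header=True)

round_trip = pd.read_csv('countries_sample.csv')
print(len(round_trip))           # 2 rows survive the round trip
print(list(round_trip.columns))  # same five columns we wrote
```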