Webscraping Project Final

Jovian Project #1 Web Scraping
Author: Samantha Roberts
Date: April 21, 2021

Wiki Country Demographics Webscraping Project

My plan of attack for this project

  1. Scrape the page that contains a table of countries and the demographics page links
  2. Make a list of dictionaries that contains the above country information
  3. Write this to a CSV file that can be read back with Pandas
  4. Crawl the links in the CSV file and see how many country pages have the table I am trying to scrape
  5. Write code specific to the majority of demographics page tables to scrape the data I want
  6. Handle instances where the relevant tables do not exist in the demographics page
  7. Look at the data in Pandas and see if there is a simple, first-order way to clean it further
  8. Write the demographics info to a CSV file

(1) We first build a list of countries and their relative wiki links

  1. Scrape https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population
  2. Put each row of the HTML table into a dictionary
  3. Convert the list of dictionaries to a Pandas DataFrame and write it to a CSV file
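
Step 2 depends on stripping formatting out of the cell text before it goes into the dictionary. Below is a minimal sketch of that cleaning, assuming the cells look like `1,234,567` and `0.0157%`. The helper names `clean_population` and `clean_percent` and the sample values are hypothetical and not part of the project code. Converting to numbers here is an extra step that could equally be left to Pandas later.

```python
def clean_population(cell_text):
    """Strip whitespace and thousands separators, then convert to int."""
    return int(cell_text.strip().replace(',', ''))

def clean_percent(cell_text):
    """Strip whitespace and a trailing percent sign, then convert to float."""
    return float(cell_text.strip().rstrip('%'))

# Made-up sample cells, only for illustration (not real table values)
row = {
    'population': clean_population(' 1,234,567\n'),
    'percent_of_world': clean_percent(' 0.0157%\n'),
}
print(row)
```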

# Now let's get all of the countries from the table
# at https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population

from bs4 import BeautifulSoup
import pandas as pd
import requests

def get_headers(table_header_row):
    """
    Takes the cells of an HTML table header row
    returns a list of the header strings
    """
    cols = []
    for col in table_header_row:
        cols.append(col.text.strip())
    return cols[:4] # we don't want the date or the source references

def get_row_values(country):
    """
    Takes in the HTML for one row of table data
    returns a dictionary of that data
    """
    info = {
            'rank': country.th.text.strip(),
            'link': country.a['href'],
            'name': country.a.text.strip(),
            'population': country.find('td', style='text-align:right').text.strip().replace(',',""),
            'percent_of_world': country.find_all('td', style='text-align:right')[1].text \
                                .replace('"', '').strip().strip('%')
            }
    return info

def build_country_list(rows):
    """
    Takes in a list of BeautifulSoup tags, one per row of an HTML table
    returns a list of dictionaries, one per row
    """
    country_list = []
    for row in rows:
        country_list.append(get_row_values(row))
    return country_list

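def scrape_wiki_table():
    """
    Downloads the Wikipedia population page and parses the first matching table
    returns a list of dictionaries, one per country row
    """
    world_countries_url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
    response = requests.get(world_countries_url)
    response.raise_for_status() # stop early if the request failed
    # name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(response.text, 'html.parser')
    tables = soup.find_all('table', class_='wikitable sortable plainrowheaders')
    rows = tables[0].find_all('tr')
    #headers = get_headers(rows[0].find_all('th'))
    # skip the header row (index 0) and keep rows 1 through 241
    return build_country_list(rows[1:242])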

def write_csv(path, list_of_dicts):
    """
    Writes a list of dicts to a CSV file at path
    """
    df = pd.DataFrame(list_of_dicts)
    df.to_csv(path, index=False, header=True)
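
The `link` values collected above appear to be site-relative hrefs, so they need to be turned into absolute URLs before the crawl in step 4 of the plan. Below is a minimal sketch using the standard library's `urllib.parse.urljoin`. `to_absolute` is a hypothetical helper, and the sample row is made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'

def to_absolute(link, base=BASE_URL):
    """Resolve a site-relative href such as '/wiki/...' against the page URL."""
    return urljoin(base, link)

# Made-up sample row in the same shape that get_row_values returns
sample = {'name': 'Exampleland', 'link': '/wiki/Demographics_of_Exampleland'}
sample['url'] = to_absolute(sample['link'])
print(sample['url'])
```

Resolving against the page URL rather than concatenating strings means an href that is already absolute is passed through unchanged.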