Web Scraping Yahoo! Finance using Python

A detailed guide for web scraping https://finance.yahoo.com/ using Requests, BeautifulSoup, Selenium, HTML tags & embedded JSON data.

Introduction

What is Web scraping?
Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning.

Objective
The main objective of this tutorial is to showcase different web scraping methods which can be applied to any web page. This is for educational purposes only. Please read the Terms & Conditions carefully for any website to see whether you can legally use the data.

In this project, we will perform web scraping using the following 3 techniques based on the problem statement.

  • use Requests, BeautifulSoup and HTML tags to extract a web page
  • use Selenium to scrape data from dynamically loading websites
  • use embedded JSON data to scrape a website

The problem statement

  1. Web Scraping Stock Market News (URL: https://finance.yahoo.com/topic/stock-market-news/)
    This web page shows the latest news related to the stock market. We will extract data from this page and store it in a CSV (comma-separated values) file.
  2. Web Scraping Cryptocurrencies (URL: https://finance.yahoo.com/cryptocurrencies)
    This Yahoo! Finance web page shows a list of trending cryptocurrencies in tabular format. We will scrape the first 10 columns for the top 100 cryptocurrencies and save them in CSV format.
  3. Web Scraping Market Events Calendar (URL: https://finance.yahoo.com/calendar)
    This page shows date-wise market events. Users have the option to select a date and choose any one of the following market events: Earnings, Stock Splits, Economic Events & IPO. Our aim is to create a script which can be run for any single date and market event, grabs the data, and loads it into a CSV file.

Prerequisites

  • Knowledge of Python
  • Basic knowledge of HTML (helpful, but not necessary)

How to run the Code
You can execute the code using the "Run" button at the top of this page and selecting "Run on Colab" or "Run Locally".

Setup and Tools
Run on Colab : You will need to provide a Google login to run this notebook on Colab.
Run Locally : Download and install the Anaconda framework. We will be using Jupyter Notebook for writing & executing code.

Version control

You can make changes and save your version of the notebook to Jovian by executing the following cells.

!pip install jovian --quiet
import jovian
# Execute this to save new versions of the notebook
jovian.commit(project="yahoo-finance-web-scraper")

1. Web Scraping Stock Market News

In this section we will learn a basic Python web scraping technique using Requests, BeautifulSoup and HTML tags. The objective here is to perform web scraping of Yahoo! Finance Stock Market News.

Let's kick-start with the first objective. Here's an outline of the steps we'll follow
1.1 Download & Parse web page using Requests and BeautifulSoup
1.2 Exploring and locating Elements
1.3 Extract & Compile the information into python list
1.4 Save the extracted information to a CSV file

1.1 Download & Parse webpage using Requests and BeautifulSoup

The first step is to install the requests & beautifulsoup4 libraries using pip.

!pip install requests --quiet
!pip install beautifulsoup4 --quiet

import requests
from bs4 import BeautifulSoup

The libraries are installed and imported.

To download the page, we can use requests.get, which returns a response object. We can access the content of the web page using response.text.
Also, response.ok & response.status_code can be used for error trapping & tracking.
Finally, we can use BeautifulSoup to parse the HTML data. This returns a bs4.BeautifulSoup object.

my_url = 'https://finance.yahoo.com/topic/stock-market-news/'
response = requests.get(my_url)
print("response.ok : {} , response.status_code : {}".format(response.ok , response.status_code))
response.ok : True , response.status_code : 200
print("Preview of response.text : ", response.text[:500])
Preview of response.text : <!DOCTYPE html><html data-color-theme="light" id="atomic" class="NoJs chrome desktop failsafe" lang="en-US"><head prefix="og: http://ogp.me/ns#"><script>window.performance && window.performance.mark && window.performance.mark('PageStart');</script><meta charset="utf-8" /><title>Latest Stock Market News</title><meta name="keywords" content="401k, Business, Financial Information, Investing, Investor, Market News, Stock Research, Stock Valuation, business news, economy, finance, investment tools, m
 

Let's create a function to perform this step.

def get_page(url):
    """Download a webpage and return a beautiful soup doc"""
    response = requests.get(url)
    if not response.ok:
        print('Status code:', response.status_code)
        raise Exception('Failed to load page {}'.format(url))
    page_content = response.text
    doc = BeautifulSoup(page_content, 'html.parser')
    return doc

Calling the function get_page and analyzing the output.

doc = get_page(my_url)
print('Type of doc: ',type(doc))
Type of doc: <class 'bs4.BeautifulSoup'>

You can access different properties of the HTML web page from doc. The following example displays the title of the web page.

doc.find('title')
<title>Latest Stock Market News</title>

We can use the function get_page to download any web page and parse it using beautiful soup.

1.2 Exploring and locating Elements

Now it's time to explore the elements to find the required data points on the web page. Web pages are written in a language called HTML (Hyper Text Markup Language). HTML is a fairly simple language comprised of tags (also called nodes or elements), e.g. <a href="https://finance.yahoo.com/" target="_blank">Go to Yahoo! Finance</a>. An HTML tag has three parts:

  1. Name: (html, head, body, div, etc.) Indicates what the tag represents and how a browser should interpret the information inside it.
  2. Attributes: (href, target, class, id, etc.) Properties of tag used by the browser to customize how a tag is displayed and decide what happens on user interactions.
  3. Children: A tag can contain some text or other tags or both between the opening and closing segments, e.g., <div>Some content</div>.
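These three parts can be inspected programmatically with BeautifulSoup. A minimal, self-contained sketch using the example anchor tag above:

```python
from bs4 import BeautifulSoup

# Parse the example anchor tag from above
html = '<a href="https://finance.yahoo.com/" target="_blank">Go to Yahoo! Finance</a>'
tag = BeautifulSoup(html, 'html.parser').find('a')

print(tag.name)           # name: a
print(tag.attrs['href'])  # attribute: https://finance.yahoo.com/
print(tag.text)           # children (text): Go to Yahoo! Finance
```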

Let's inspect the webpage source code by right-clicking and selecting the "Inspect" option. First, we need to identify the tag which represents the news listing.

In this case we can see that the <div> tag with class name "Ov(h) Pend(44px) Pstart(25px)" represents a news listing. We can apply the find_all method to grab this information.

div_tags = doc.find_all('div', {'class': "Ov(h) Pend(44px) Pstart(25px)"})

The total number of elements in the <div> tag list matches the number of news items displayed on the webpage, so we are heading in the right direction.

len(div_tags)
9

The next step is to inspect an individual <div> tag and try to find more information. I am using "Visual Studio Code", but you can use any tool, even one as simple as Notepad.

print(div_tags[1])
<div class="Ov(h) Pend(44px) Pstart(25px)"><div class="C(#959595) Fz(11px) D(ib) Mb(6px)">Bloomberg</div><h3 class="Mb(5px)"><a class="js-content-viewer wafer-caas Fw(b) Fz(18px) Lh(23px) LineClamp(2,46px) Fz(17px)--sm1024 Lh(19px)--sm1024 LineClamp(2,38px)--sm1024 mega-item-header-link Td(n) C(#0078ff):h C(#000) LineClamp(2,46px) LineClamp(2,38px)--sm1024 not-isInStreamVideoEnabled" data-uuid="ee555bdf-8ced-3ec5-8527-f11ab477237d" data-wf-caas-prefetch="1" data-wf-caas-uuid="ee555bdf-8ced-3ec5-8527-f11ab477237d" href="/news/bond-yields-jump-asia-stocks-223246963.html"><u class="StretchedBox"></u>U.S. Index Futures Drop as Yields, Netflix Watched: Markets Wrap</a></h3><p class="Fz(14px) Lh(19px) Fz(13px)--sm1024 Lh(17px)--sm1024 LineClamp(2,38px) LineClamp(2,34px)--sm1024 M(0)">(Bloomberg) -- U.S. index futures fell as investors weighed the impact of rising real yields on the appeal of riskier assets and Netflix Inc. damped the earnings-season outlook. European stocks rose with focus turning to corporate results.Most Read from BloombergNetflix Tumbles as 200,000 Users Exit for First Drop in DecadeIn Defense of Elon Musk's Managerial ExcellenceTwitter Has a Poison Pill NowPutin Calls Time on Foreign Listings in Fresh Hit to TycoonsU.S. Stops Mask Requirement on Planes A</p></div>

Luckily, most of the required data points are available in this <div>, so we can use the find method to grab each item.

print("Source: ", div_tags[1].find('div').text)
print("Head Line : {}".format(div_tags[1].find('a').text))
Source:  Bloomberg
Head Line : U.S. Index Futures Drop as Yields, Netflix Watched: Markets Wrap

If any tag is not accessible directly, you can use methods like findParent() or findChild() to navigate to the required tag.
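For instance, find_parent() (the snake_case spelling of findParent()) can reach a tag that sits outside the one you matched. A small self-contained sketch on an invented snippet, not the Yahoo! page itself:

```python
from bs4 import BeautifulSoup

# Invented snippet: the <img> sits outside the <a> we matched
html = '<div><h3><a href="/news/1.html">Headline</a></h3><img src="thumb.jpg"/></div>'
doc = BeautifulSoup(html, 'html.parser')

a_tag = doc.find('a')
# Climb to the enclosing <div>, then search down for the <img>
img = a_tag.find_parent('div').find('img')
print(img['src'])  # thumb.jpg
```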

The key takeaway from this exercise is to identify the optimal tag which will provide the required information. Mostly this is straightforward, but sometimes you will have to do a little more research.

1.3 Extract & Compile the information into python list

We've identified all the required tags and information. Let's put this together in functions.

def get_news_tags(doc):
    """Get the list of tags containing news information"""
    news_class = "Ov(h) Pend(44px) Pstart(25px)" ## class name of div tag 
    news_list  = doc.find_all('div', {'class': news_class})
    return news_list

Sample run of the function get_news_tags:

my_news_tags = get_news_tags(doc)

We will create one more function to parse individual <div> tags and return the information in dictionary form.

BASE_URL = 'https://finance.yahoo.com' #Global Variable 

def parse_news(news_tag):
    """Get the news data point and return dictionary"""
    news_source = news_tag.find('div').text #source
    news_headline = news_tag.find('a').text #heading
    news_url = news_tag.find('a')['href'] #link
    news_content = news_tag.find('p').text #content
    news_image = news_tag.findParent().find('img')['src'] #thumb image
    return { 'source' : news_source,
            'headline' : news_headline,
            'url' : BASE_URL + news_url,
            'content' : news_content,
            'image' : news_image
           }

Testing the parse_news function for first <div> tag

parse_news(my_news_tags[0])
{'source': 'Barrons.com',
 'headline': 'A Buoyant Housing Market Complicates the Fed’s Job of Taming Inflation',
 'url': 'https://finance.yahoo.com/m/72fe26d5-feca-3515-b60d-10e0f092562c/a-buoyant-housing-market.html',
 'content': 'Housing activity is showing minimal effects from the sharp rise in mortgage interest rates.  New housing starts increased 0.3% to a seasonally adjusted annual rate of 1.793 million units in March, the Commerce Department reported Tuesday.  The backlog of houses under construction and the large number that have been authorized but haven’t started construction means homebuilders will remain busy through the year.',
 'image': 'https://s.yimg.com/uu/api/res/1.2/TSeqis7904utWuz0Zr3fgQ--~B/Zmk9c3RyaW07aD0xMjM7cT04MDt3PTIyMDthcHBpZD15dGFjaHlvbg--/https://s.yimg.com/uu/api/res/1.2/WYiyjKWjCHToP4bVcMKmbA--~B/aD02NDA7dz0xMjgwO2FwcGlkPXl0YWNoeW9u/https://media.zenfs.com/en/Barrons.com/37892ac8a0c8df46df2dc51c4742e76d.cf.jpg'}

We can use the get_news_tags & parse_news functions to parse news.

1.4 Save the extracted information to a CSV file

This is the last step of this section. We are going to use Python library pandas to save the data in CSV format. Install and then import the pandas Library.

!pip install pandas --upgrade --quiet

import pandas as pd

Creating a wrapper function which will call the previously created helper functions.

The get_page function will download the HTML page; then we can pass the result to get_news_tags to identify the list of <div> tags for news.
After that we will use a list comprehension to parse each <div> tag using parse_news; the output will be a list of dictionaries.
Finally, we will use the DataFrame constructor to create a pandas DataFrame and its to_csv method to store the required data in CSV format.

def scrape_yahoo_news(url, path=None):
    """Get the yahoo finance market news and write them to CSV file """
    if path is None:
        path = 'stock-market-news.csv'
        
    print('Requesting html page')
    doc = get_page(url)

    print('Extracting news tags')
    news_list = get_news_tags(doc)

    print('Parsing news tags')
    news_data = [parse_news(news_tag) for news_tag in news_list]

    print('Save the data to a CSV')
    news_df = pd.DataFrame(news_data)
    news_df.to_csv(path, index=None)
    
    #This return statement is optional, we are doing this just analyze the final output 
    return news_df 

Scraping the news using scrape_yahoo_news function

YAHOO_NEWS_URL = BASE_URL+'/topic/stock-market-news/'
news_df = scrape_yahoo_news(YAHOO_NEWS_URL)
Requesting html page
Extracting news tags
Parsing news tags
Save the data to a CSV

The "stock-market-news.csv" file should be available in the File \(\rightarrow\) Open menu. You can download the file or open it directly in the browser. Please verify the file content and compare it with the actual information available on the webpage.
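A quick programmatic way to sanity-check a CSV written by to_csv is to read it back with pandas. A self-contained sketch using dummy data and a throwaway filename (demo.csv), not the scraped file:

```python
import pandas as pd

# Dummy rows standing in for the scraped news records
rows = [{'source': 'A', 'headline': 'H1'}, {'source': 'B', 'headline': 'H2'}]
pd.DataFrame(rows).to_csv('demo.csv', index=None)

# Read the file back and confirm row count and columns survived the round trip
check_df = pd.read_csv('demo.csv')
print(len(check_df))           # 2
print(list(check_df.columns))  # ['source', 'headline']
```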

You can also check the data by grabbing a few rows from the data frame returned by the scrape_yahoo_news function

news_df.head()

Summary : Hopefully I was able to explain this simple but very powerful Python technique to scrape Yahoo! Finance market news. These steps can be used to scrape any web page; you just have to do a little research to identify the required tags and use the relevant Python methods to collect the data.

2. Web Scraping Cryptocurrencies

In Section 1 we were able to scrape the Yahoo! market news web page. However, as you may have noticed, more news appears at the bottom of the page as we scroll down. This is called dynamic page loading. The previous technique is a basic Python method useful for scraping static data; to scrape dynamically loading data we will use a different method: web scraping with Selenium. Let's move ahead with this topic. The goal of this section is to extract the top-listed cryptocurrencies from Yahoo! Finance.

Here's an outline of the steps we'll follow
2.1 Introduction of selenium
2.2 Downloads & Installation
2.3 Install & Import libraries
2.4 Create Web Driver
2.5 Exploring and locating Elements
2.6 Extract & Compile the information into a python list
2.7 Save the extracted information to a CSV file

2.1 Introduction of selenium

Selenium is an open-source web-based automation tool. Python and other languages are used with Selenium for testing as well as web scraping. Here we will use the Chrome browser, but you can try any browser.

Why should you use Selenium? Because it can automate browser actions such as:

  • Clicking on buttons
  • Filling forms
  • Scrolling
  • Taking a screen-shot
  • Refreshing the page

You can find proper documentation on selenium here

The following methods will help to find elements in a webpage (these methods will return a list):

  • find_elements_by_name
  • find_elements_by_xpath
  • find_elements_by_link_text
  • find_elements_by_partial_link_text
  • find_elements_by_tag_name
  • find_elements_by_class_name
  • find_elements_by_css_selector

In this tutorial we will use only find_elements_by_xpath and find_elements_by_tag_name. You can find complete documentation of these methods here

2.2 Downloads & Installation

Unlike the previous section, here we'll have to do some prep work to implement this method. We will need to install Selenium & the proper web browser driver.

If you are using the Google Colab platform, execute the following code to perform the initial installation. The expression 'google.colab' in str(get_ipython()) is used to detect the Google Colab platform.

if 'google.colab' in str(get_ipython()):
    print('Google CoLab Installation')
    !apt update --quiet
    !apt install chromium-chromedriver --quiet

To run it locally you will need the WebDriver for Chrome on your machine. You can download it from https://chromedriver.chromium.org/downloads and just copy the file into the folder where we will create the Python file (no installation needed). But make sure that the driver's version matches the Chrome browser version installed on the local machine.

2.3 Install & Import libraries

Installation of the required libraries.

!pip install selenium --quiet
!pip install pandas --quiet

Once the libraries are installed, the next step is to import all the required modules / libraries.

print('Library Import')
if 'google.colab' not in str(get_ipython()):
    print('Not running on CoLab')
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    import os
else:
    print('Running on CoLab')
    
print('Common Library Import')
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd 
import time

So all the necessary prep work is done. Let's move ahead to implement this method.

2.4 Create Web Driver

In this step we will first create an instance of the Chrome WebDriver using the webdriver.Chrome() method, and then the driver.get() method will navigate to the page given by the URL. Here too there is a slight variation based on the platform. We have also used options parameters; for example, the --headless option runs the browser in the background.

if 'google.colab' in str(get_ipython()):
    print('Running on CoLab')
    def get_driver(url):
        """Return web driver"""
        colab_options = webdriver.ChromeOptions()
        colab_options.add_argument('--no-sandbox')
        colab_options.add_argument('--disable-dev-shm-usage')
        colab_options.add_argument('--headless')
        colab_options.add_argument('--start-maximized') 
        colab_options.add_argument('--start-fullscreen')
        colab_options.add_argument('--single-process')
        driver = webdriver.Chrome(options=colab_options)
        driver.get(url)
        return driver
else:
    print('Not running on CoLab')
    def get_driver(url):
        """Return web driver"""
        chrome_options = Options()
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--start-maximized') 
        chrome_options.add_argument('--start-fullscreen')
        chrome_options.add_argument('--single-process')
        serv = Service(os.getcwd()+'/chromedriver')
        driver = webdriver.Chrome(options=chrome_options, service=serv)
        driver.get(url)
        return driver

Test run of get_driver

driver = get_driver('https://finance.yahoo.com/cryptocurrencies')
print(driver.title)

2.5 Exploring and locating Elements

This step is almost the same as what we did in Section 1. We will try to identify relevant information like tags, class names, XPath etc. from the web page. Right-click and select "Inspect" to do further analysis.

The webpage shows cryptocurrency information in table form. We can grab the table header using the <th> tag; we will use find_elements with By.TAG_NAME to get the table headers. These headers can be used as columns for the CSV file.

header = driver.find_elements(By.TAG_NAME, value= 'th')
print(header[0].text)
print(header[2].text)

Creating a helper function to get the first 10 columns from the header. We have used a list comprehension with a condition; you can also check out the usage of the enumerate method.

def get_table_header(driver):
    """Return Table columns in list form """
    header = driver.find_elements(By.TAG_NAME, value= 'th')
    header_list = [item.text for index, item in enumerate(header) if index < 10]
    return header_list
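The conditional list comprehension with enumerate can be seen on plain data. A standalone sketch with invented dummy header names (not the real <th> texts):

```python
# enumerate pairs each element with its index; the `if index < 10`
# condition keeps only the first ten items, as in get_table_header
dummy_headers = ['col{}'.format(i) for i in range(12)]  # stand-in for <th> texts
first_ten = [item for index, item in enumerate(dummy_headers) if index < 10]
print(len(first_ten))  # 10
print(first_ten[-1])   # col9
```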

Next we find out the number of rows available on a page. You can see the table rows are placed in <tr> tags; we can capture the XPath by selecting a <tr> tag, then Right Click \(\rightarrow\) Copy \(\rightarrow\) Copy XPath.

So we get the XPath value //*[@id="scr-res-table"]/div[1]/table/tbody/tr[1]. Let's use this with find_element() & By.XPATH.

txt=driver.find_element(By.XPATH, value='//*[@id="scr-res-table"]/div[1]/table/tbody/tr[1]').text
txt

The above XPath points to the first row. We can remove the row-number part from the XPath and use it with find_elements to get hold of all the available rows. Let's implement this in a function.

def get_table_rows(driver):
    """Get number of rows available on the page """
    tablerows = len(driver.find_elements(By.XPATH, value='//*[@id="scr-res-table"]/div[1]/table/tbody/tr'))
    return tablerows    
print(get_table_rows(driver))

Similarly, we can take the XPath of any column value.

This is the XPath for a column: //*[@id="scr-res-table"]/div[1]/table/tbody/tr[1]/td[2].
As you may have noticed, the numbers after tr & td represent the row_number and column_number. We can check this with the find_element() method.

driver.find_element(By.XPATH, value='//*[@id="scr-res-table"]/div[1]/table/tbody/tr[1]/td[2]').text

So we can change the row_number & column_number in the XPath and loop through the row count and column count to get all the available column values. Let's generalize this and put it in a function. We will get the data for one row at a time and return the column values in the form of a dictionary.

def parse_table_rows(rownum, driver, header_list):
    """get the data for one row at a time and return column value in the form of dictionary"""
    row_dictionary = {}
    #time.sleep(1/3)
    for index , item in enumerate(header_list):
        time.sleep(1/20)
        column_xpath = '//*[@id="scr-res-table"]/div[1]/table/tbody/tr[{}]/td[{}]'.format(rownum, index+1)
        row_dictionary[item] = driver.find_element(By.XPATH, value=column_xpath).text
    return row_dictionary

The Yahoo! Finance web page shows only 25 cryptocurrencies per page, and the user has to click the Next button to load the next set of cryptocurrencies. This is called pagination, and it is the main reason we are implementing Selenium methods: they can handle events like clicking, scrolling, and refreshing on a webpage.

Now we will grab the XPath of the Next button, find the element using the find_element method, and then perform the click action using the .click() method.

button_element = driver.find_element(By.XPATH, value = '//*[@id="scr-res-table"]/div[2]/button[3]')
button_element.click()
txt=driver.find_element(By.XPATH, value='//*[@id="scr-res-table"]/div[1]/table/tbody/tr[1]').text
txt

Now let's check the first row on the web page to verify whether .click() really worked. You will see the first row has changed; the click action was successful.

In this section we have learned how to get required data points, and perform events on webpage.

#terminating driver from test runs 
driver.close()
driver.quit() 

2.6 Extract & Compile the information into python list

Let's put all the pieces of the puzzle together. We will pass the integer total_crypto, i.e. the number of rows to be scraped (in this case 100), to the function. It parses each row on the page and appends the data to a list until the total parsed row count reaches total_crypto. In addition, it performs a Next button click when we reach the last row of the table.

Please note: here, to identify the Next button element, we have used the WebDriverWait class instead of the find_element() method. With this technique we can allow some wait time before grabbing the element. This is done to avoid a StaleElementReferenceException.

Code Sample:

element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="scr-res-table"]/div[2]/button[3]')))
def parse_multiple_pages(driver, total_crypto):
    """Loop through each row, perform Next button click at the end of page 
    return total_crypto numbers of rows 
    """
    table_data = []
    page_num = 1
    is_scraping = True
    header_list = get_table_header(driver)

    while is_scraping:
        table_rows = get_table_rows(driver)
        print('Found {} rows on Page : {}'.format(table_rows, page_num))
        print('Parsing Page : {}'.format(page_num))
        table_data += [parse_table_rows(i, driver, header_list) for i in range (1, table_rows + 1)]
        total_count = len(table_data)
        print('Total rows scraped : {}'.format(total_count))
        if total_count >= total_crypto:
            print('Done Parsing..')
            is_scraping = False
        else:    
            print('Clicking Next Button')
            element = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, '//*[@id="scr-res-table"]/div[2]/button[3]')))
            element.click() 
            page_num += 1
    return table_data

2.7 Save the extracted information to a CSV file.

This is the last step of this section. We are creating one last function which acts as a wrapper for all the helper functions, and at the end we save the data in CSV format using the DataFrame's to_csv method.

def scrape_yahoo_crypto(url, total_crypto, path=None):
    """Get the list of yahoo finance crypto-currencies and write them to CSV file """
    if path is None:
        path = 'crypto-currencies.csv'
    print('Creating driver')
    driver = get_driver(url)    
    table_data = parse_multiple_pages(driver, total_crypto)
    driver.close()
    driver.quit()
    print('Save the data to a CSV')
    table_df = pd.DataFrame(table_data)
    table_df.to_csv(path, index=None)
    #This return statement is optional, we are doing this just analyze the final output 
    return table_df 

Time to scrape some cryptos! We will scrape the top 100 cryptocurrencies from the Yahoo! Finance webpage by calling scrape_yahoo_crypto.

YAHOO_FINANCE_URL = BASE_URL+'/cryptocurrencies'
TOTAL_CRYPTO = 100
crypto_df = scrape_yahoo_crypto(YAHOO_FINANCE_URL, TOTAL_CRYPTO,'crypto-currencies.csv')

The "crypto-currencies.csv" file should be available in the File \(\rightarrow\) Open menu. You can download the file or open it directly in the browser. Please verify the file content and compare it with the actual information available on the webpage.

You can also check the data by grabbing a few rows from the data frame returned by the scrape_yahoo_crypto function

crypto_df.head()

Summary : Hope you've enjoyed this tutorial. Selenium enables us to perform multiple actions on the web browser, which is really very handy for scraping different types of data from any webpage.

3. Web Scraping Market Events Calendar

This is the final segment of the tutorial. In this section we will learn how to extract embedded JSON-formatted data, which can easily be converted to a Python dictionary. The problem statement for this section is to scrape date-wise market events from Yahoo! Finance.

Here's an outline of the steps we'll follow
3.1 Install & Import libraries
3.2 Download & Parse web page
3.3 Get Embedded Json data
3.4 Locating Json Keys
3.5 Pagination & Compiling the information into a python list
3.6 Save the extracted information to a CSV file

3.1 Install & Import libraries

The first step is to install and import the Python libraries.

!pip install requests --quiet
!pip install beautifulsoup4 --quiet
!pip install pandas --quiet
import re
import json
import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup
from IPython.display import display

3.2 Download & Parse web page

This is exactly the same step that we performed to download the webpage in section 1.1, except here we have used a custom header in requests.get().

Most of the details are explained in section 1.1. Creating the helper function:

def get_event_page(scraper_url):
    """Download a webpage and return a beautiful soup doc"""
    headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
                  "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    response = requests.get(scraper_url, headers=headers)
    if not response.ok:
        print('Status code:', response.status_code)
        raise Exception('Failed to fetch web page ' + scraper_url)
    # Construct a beautiful soup document
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc
doc = get_event_page('https://finance.yahoo.com/calendar/earnings?from=2022-02-27&to=2022-03-05&day=2022-02-28')
print(doc.find('title'))
<title>Company Earnings Calendar - Yahoo Finance</title>

3.3 Get Embedded Json data

In this step we will locate the JSON-formatted data. Open the web page and do Right Click \(\rightarrow\) View Page Source. If you scroll down the source page, you will notice the JSON-formatted data. This information is in a <script> tag which contains the text /* -- Data -- */.

We will use a regular expression to get the text inside the <script> tag.

pattern = re.compile(r'\s--\sData\s--\s')
#script_data = doc.find('script', text=pattern).text
script_data = doc.find('script', text=pattern).contents[0]

Furthermore, the JSON-formatted string has context as its first key, and it ends 12 characters before the end of the script text.

print(script_data[:150])
print(script_data[-150:])
(function (root) { /* -- Data -- */ root.App || (root.App = {}); root.App.now = 1650517499902; root.App.main = {"context":{"dispatcher":{"stores":{"P
odal":{"strings":1},"tdv2-wafer-header":{"strings":1},"yahoodotcom-layout":{"strings":1}}},"options":{"defaultBundle":"td-app-finance"}}}}; }(this));

We can grab the Json string using Python slicing.

start  = script_data.find('context')-2
json_text  = script_data[start:-12]
print(json_text[:100])
{"context":{"dispatcher":{"stores":{"PageStore":{"currentPageName":"calendar","currentEvent":{"event

Using the json.loads() method to convert the JSON string into a Python dictionary.

parsed_dictionary = json.loads(json_text)
type(parsed_dictionary)
dict

Creating a function using the above information.

def get_json_dictionary(doc):
    """Get JSON formatted data in the form of a Python dictionary"""
    pattern = re.compile(r'\s--\sData\s--\s')
    script_data = doc.find('script', text=pattern).contents[0]
    
    start  = script_data.find('context')-2
    json_text  = script_data[start:-12]
    
    parsed_dictionary = json.loads(json_text)
    return parsed_dictionary    

3.4 Locating Json Keys

Basically, the JSON text is a set of multi-level nested dictionaries, and some keys store all the metadata displayed on the webpage. In this section we will identify the keys for the data we are trying to scrape.

We'll need a JSON formatter tool to navigate through the multiple levels of keys. I am using the online tool https://jsonblob.com/; however, you can choose any tool.

We will write the JSON text into the file my_json_file.json, then grab the file content and paste it into the left panel of https://jsonblob.com/. JSON Blob will format it nicely, so we can easily navigate through the keys and search for any item.

with open('my_json_file.json', 'w', encoding="utf-8") as file:
    file.write(json_text)
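As an alternative to an online formatter, a small recursive helper can locate the path to a key programmatically. A sketch; the helper name find_key_path and the miniature sample dictionary are ours, not part of the page:

```python
def find_key_path(obj, target, path=()):
    """Yield the key paths at which `target` appears in nested dicts/lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == target:
                yield path + (key,)
            yield from find_key_path(value, target, path + (key,))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            yield from find_key_path(value, target, path + (i,))

# Miniature stand-in for parsed_dictionary (invented data)
sample = {'context': {'dispatcher': {'stores': {
    'ScreenerResultsStore': {'results': {'rows': [{'ticker': 'DDD'}]}}}}}}
print(next(find_key_path(sample, 'rows')))
# ('context', 'dispatcher', 'stores', 'ScreenerResultsStore', 'results', 'rows')
```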

The next step is to find the required key location. Let's search for the company name 3D Systems Corporation (displayed on the webpage) in the JSON Blob formatter.

You can see the table data is stored under the rows key, and we can track down its parent keys in the formatter. Let's check out the content of the rows key.

parsed_dictionary['context']['dispatcher']['stores']['ScreenerResultsStore']['results']['rows'][:3]
[{'ticker': 'DDD',
  'companyshortname': '3D Systems Corporation',
  'startdatetime': '2022-02-28T16:05:00.000Z',
  'startdatetimetype': 'TAS',
  'epsestimate': 0.03,
  'epsactual': 0.09,
  'epssurprisepct': 181.25,
  'timeZoneShortName': 'EST',
  'gmtOffsetMilliSeconds': -18000000,
  'quoteType': 'EQUITY'},
 {'ticker': 'FNNTF',
  'companyshortname': 'flatexDEGIRO AG',
  'startdatetime': '2022-02-28T16:31:00.000Z',
  'startdatetimetype': 'TAS',
  'epsestimate': None,
  'epsactual': None,
  'epssurprisepct': None,
  'timeZoneShortName': 'EST',
  'gmtOffsetMilliSeconds': -18000000,
  'quoteType': 'EQUITY'},
 {'ticker': 'GCP',
  'companyshortname': 'GCP Applied Technologies Inc.',
  'startdatetime': '2022-02-28T19:00:00.000Z',
  'startdatetimetype': 'TAS',
  'epsestimate': 0.18,
  'epsactual': 0.12,
  'epssurprisepct': -33.33,
  'timeZoneShortName': 'EST',
  'gmtOffsetMilliSeconds': -18000000,
  'quoteType': 'EQUITY'}]
print('Total Rows on the Current page :',len(parsed_dictionary['context']['dispatcher']['stores']['ScreenerResultsStore']['results']['rows']))
Total Rows on the Current page : 100

This sub-dictionary shows all the data displayed on the current page.
You can do more research and exploration to get different information from the web page.

print('Total Rows for the search criteria :',parsed_dictionary['context']['dispatcher']['stores']['ScreenerResultsStore']['results']['total'])
Total Rows for the search criteria : 271
print("Columns")
parsed_dictionary['context']['dispatcher']['stores']['ScreenerResultsStore']['results']['columns']
Columns
[{'data': 'ticker', 'content': 'Symbol'},
 {'data': 'companyshortname', 'content': 'Company Name'},
 {'data': 'startdatetime', 'content': 'Event Start Date'},
 {'data': 'startdatetimetype', 'content': 'Event Start Time'},
 {'data': 'epsestimate', 'content': 'EPS Estimate'},
 {'data': 'epsactual', 'content': 'Reported EPS'},
 {'data': 'epssurprisepct', 'content': 'Surprise (%)'},
 {'data': 'timeZoneShortName', 'content': 'Timezone short name'},
 {'data': 'gmtOffsetMilliSeconds', 'content': 'GMT Offset'}]
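The columns metadata pairs each internal field name (`data`) with its display label (`content`), so we can use it to relabel the scraped rows with the same headers the web page shows. A sketch with a trimmed sample of the rows and columns above:

```python
import pandas as pd

# 'data' is the key used inside each row dict; 'content' is the on-page label.
columns_meta = [
    {'data': 'ticker', 'content': 'Symbol'},
    {'data': 'companyshortname', 'content': 'Company Name'},
    {'data': 'epsactual', 'content': 'Reported EPS'},
]
rows = [
    {'ticker': 'DDD', 'companyshortname': '3D Systems Corporation', 'epsactual': 0.09},
    {'ticker': 'GCP', 'companyshortname': 'GCP Applied Technologies Inc.', 'epsactual': 0.12},
]

# Build a rename map from the metadata and apply it to the DataFrame
label_map = {col['data']: col['content'] for col in columns_meta}
df = pd.DataFrame(rows).rename(columns=label_map)
print(df.columns.tolist())
# ['Symbol', 'Company Name', 'Reported EPS']
```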

Putting this into functions:

def get_total_rows(parsed_dictionary):
    '''Get the total number of rows for the search criteria'''
    total_rows = parsed_dictionary['context']['dispatcher']['stores']['ScreenerResultsStore']['results']['total']
    return total_rows

def get_page_rows(parsed_dictionary):
    """Get the row data of the current page"""
    data_dictionary = parsed_dictionary['context']['dispatcher']['stores']['ScreenerResultsStore']['results']['rows']
    return data_dictionary
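One caveat with these long bracket chains: if Yahoo! ever renames or removes a key, they raise a `KeyError`. A defensive variant walks the path one key at a time and falls back to a default. `deep_get` is a hypothetical helper, not part of the tutorial code:

```python
def deep_get(dictionary, keys, default=None):
    """Follow a list of keys into nested dicts, returning `default` on any miss."""
    current = dictionary
    for key in keys:
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return default
    return current

# Shared prefix for the screener results keys
RESULTS_PATH = ['context', 'dispatcher', 'stores', 'ScreenerResultsStore', 'results']

payload = {'context': {'dispatcher': {'stores': {'ScreenerResultsStore': {'results': {'total': 271}}}}}}
print(deep_get(payload, RESULTS_PATH + ['total']))            # 271
print(deep_get(payload, RESULTS_PATH + ['missing'], default=0))  # 0
```

With this helper a renamed key produces an empty result (and an empty CSV) instead of a crash mid-scrape.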

3.5 Pagination & Compiling the information into python list

In the previous section we saw how to handle pagination using Selenium methods; here we'll learn a new technique for accessing multiple pages.

Most of the time the web page URL changes at runtime depending on the user's selection. For example, in the screenshot below I selected Earnings for 1-March-2022; notice how that information is passed in the URL.

Similarly, when I click the Next button, the offset and size values change in the URL.

So we can work out the pattern and structure of the URL and how it drives page navigation.

In this case the web page URL pattern is as follows:

  • The following values are used for calendar event types: event_types = ['splits','economic','ipo','earnings']
  • The date is passed in yyyy-mm-dd format
  • The page number is controlled by the offset value (offset=0 for the first page)
  • The maximum number of rows per page is assigned to size

Based on the above information, we can build the URL at runtime, download the page and extract the information. This is how we handle pagination.
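The offset arithmetic can be checked in isolation before wiring it into the scraper. A minimal sketch, assuming the URL pattern described above (`build_calendar_url` is an illustrative helper, not part of the tutorial code):

```python
BASE_URL = 'https://finance.yahoo.com'
YAHOO_CAL_URL = BASE_URL + '/calendar/{}?day={}&offset={}&size={}'

def build_calendar_url(event_type, date, page_number, size=100):
    """Return the calendar URL for a 1-based page number: offset = (page - 1) * size."""
    offset = (page_number - 1) * size
    return YAHOO_CAL_URL.format(event_type, date, offset, size)

print(build_calendar_url('earnings', '2022-02-28', 1))
# https://finance.yahoo.com/calendar/earnings?day=2022-02-28&offset=0&size=100
print(build_calendar_url('earnings', '2022-02-28', 3))
# https://finance.yahoo.com/calendar/earnings?day=2022-02-28&offset=200&size=100
```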

Let's put all the pieces together in a function. We pass event_type and date, then calculate the total number of matching rows with the get_total_rows function. The maximum rows per page is constant (100), so we can iterate page by page, computing the offset for each page, until all rows for the current criteria have been extracted.

def scrape_all_pages(event_type, date):
    """Loop through each page and return a list of data dictionaries"""
    YAHOO_CAL_URL = BASE_URL + '/calendar/{}?day={}&offset={}&size={}'
    max_rows_per_page = 100  # maximum rows per page
    page_number = 1
    final_data_dictionary = []
    
    while page_number > 0:
        print("Processing page # {}".format(page_number))
        offset = (page_number - 1) * max_rows_per_page
        scrape_url = YAHOO_CAL_URL.format(event_type, date, offset, max_rows_per_page)
        print("Scrape url for page {} is {}".format(page_number, scrape_url))
        page_doc = get_event_page(scrape_url)
        parse_dict = get_json_dictionary(page_doc)
        if page_number == 1:
            total_rows = get_total_rows(parse_dict)
        final_data_dictionary += get_page_rows(parse_dict)
        # stop once we have collected every matching row
        if len(final_data_dictionary) >= total_rows:
            return final_data_dictionary
        page_number += 1

Creating one more variant of the same function for a date range:

def scrape_all_pages_daterange(event_type, dates):
    """Loop through each date and page and return a list of data dictionaries"""
    YAHOO_CAL_URL = BASE_URL + '/calendar/{}?day={}&offset={}&size={}'
    max_rows_per_page = 100  # maximum rows per page
    data_dictionary_date = []
    for date in dates:
        print("Processing for date : {}".format(date))
        page_number = 1
        final_data_dictionary = []
    
        while page_number > 0:
            print("Processing page # {}".format(page_number))
            offset = (page_number - 1) * max_rows_per_page
            scrape_url = YAHOO_CAL_URL.format(event_type, date, offset, max_rows_per_page)
            print("Scrape url for page {} is {}".format(page_number, scrape_url))
            page_doc = get_event_page(scrape_url)
            parse_dict = get_json_dictionary(page_doc)
            if page_number == 1:
                total_rows = get_total_rows(parse_dict)
            final_data_dictionary += get_page_rows(parse_dict)
            if len(final_data_dictionary) >= total_rows:
                page_number = 0
                # tag each row with its date so rows from different dates stay distinguishable
                data_dictionary_date += [{'primary-key': date, **item} for item in final_data_dictionary]
            else:
                page_number += 1
        print("Processing done")
    return data_dictionary_date

3.6 Save the extracted information to a CSV file

In this last section, we will save the data in CSV format using pd.DataFrame() and to_csv(), and call everything from a single wrapper function.

def scrape_yahoo_calendar(event_types, date_param):
    """Get the yahoo finance calendar data and write it to CSV files"""
    for event in event_types:
        print('Web Scraping for', event)
        data_dict = scrape_all_pages(event, date_param)
        if len(data_dict) > 0:
            scraped_df = pd.DataFrame(data_dict)
            scraped_df.to_csv(event + '_' + date_param + '.csv', index=False)
            print("checking few rows.. for event : {} & date : {}".format(event, date_param))
            display(scraped_df.head())
        else:
            print("No data found for event : {} & date : {}".format(event, date_param))

Creating one more variant of the same function for a date range:

def scrape_yahoo_calendar_daterange(event_types, date_param):
    """Get the yahoo finance calendar data for a date range and write it to CSV files"""
    for event in event_types:
        print('Web Scraping for', event)
        data_dict = scrape_all_pages_daterange(event, date_param)
        if len(data_dict) > 0:
            scraped_df = pd.DataFrame(data_dict)
            scraped_df.to_csv(event + '.csv', index=False)
            print("checking few rows.. for event : {}".format(event))
            display(scraped_df.head())
        else:
            print("No data found for event : {}".format(event))

Calling the function scrape_yahoo_calendar_daterange:

import pandas as pd
from datetime import datetime

BASE_URL = 'https://finance.yahoo.com'  # global variable
# date_param = ['2022-02-28','2022-03-01','2022-03-02']  # dates can also be listed explicitly
# the two pd.date_range calls below build the same 3-day list in equivalent ways
date_param = pd.date_range(start="2022-02-28", end="2022-03-02").strftime("%Y-%m-%d").tolist()
date_param = pd.date_range(start="2022-02-28", periods=3).strftime("%Y-%m-%d").tolist()
event_types = ['splits','economic','ipo','earnings']
scrape_yahoo_calendar_daterange(event_types, date_param)
Web Scraping for splits
Processing for date : 2022-02-28
Processing page # 1
Scrape url for page 1 is https://finance.yahoo.com/calendar/splits?day=2022-02-28&offset=0&size=100
Processing done
Processing for date : 2022-03-01
Processing page # 1
Scrape url for page 1 is https://finance.yahoo.com/calendar/splits?day=2022-03-01&offset=0&size=100
Processing done
Processing for date : 2022-03-02
Processing page # 1
Scrape url for page 1 is https://finance.yahoo.com/calendar/splits?day=2022-03-02&offset=0&size=100
Processing done
checking few rows.. for event : splits
Web Scraping for economic
Processing for date : 2022-02-28
Processing page # 1
Scrape url for page 1 is https://finance.yahoo.com/calendar/economic?day=2022-02-28&offset=0&size=100
Processing done
Processing for date : 2022-03-01
Processing page # 1
Scrape url for page 1 is https://finance.yahoo.com/calendar/economic?day=2022-03-01&offset=0&size=100
Processing done
Processing for date : 2022-03-02
Processing page # 1
Scrape url for page 1 is https://finance.yahoo.com/calendar/economic?day=2022-03-02&offset=0&size=100
Processing done
checking few rows.. for event : economic
Web Scraping for ipo
Processing for date : 2022-02-28
Processing page # 1
Scrape url for page 1 is https://finance.yahoo.com/calendar/ipo?day=2022-02-28&offset=0&size=100
Processing done
Processing for date : 2022-03-01
Processing page # 1
Scrape url for page 1 is https://finance.yahoo.com/calendar/ipo?day=2022-03-01&offset=0&size=100
Processing done
Processing for date : 2022-03-02
Processing page # 1
Scrape url for page 1 is https://finance.yahoo.com/calendar/ipo?day=2022-03-02&offset=0&size=100
Processing done
checking few rows.. for event : ipo
Web Scraping for earnings
Processing for date : 2022-02-28
Processing page # 1
Scrape url for page 1 is https://finance.yahoo.com/calendar/earnings?day=2022-02-28&offset=0&size=100
Processing page # 2
Scrape url for page 2 is https://finance.yahoo.com/calendar/earnings?day=2022-02-28&offset=100&size=100
Processing page # 3
Scrape url for page 3 is https://finance.yahoo.com/calendar/earnings?day=2022-02-28&offset=200&size=100
Processing done
Processing for date : 2022-03-01
Processing page # 1
Scrape url for page 1 is https://finance.yahoo.com/calendar/earnings?day=2022-03-01&offset=0&size=100
Processing page # 2
Scrape url for page 2 is https://finance.yahoo.com/calendar/earnings?day=2022-03-01&offset=100&size=100
Processing page # 3
Scrape url for page 3 is https://finance.yahoo.com/calendar/earnings?day=2022-03-01&offset=200&size=100
Processing done
Processing for date : 2022-03-02
Processing page # 1
Scrape url for page 1 is https://finance.yahoo.com/calendar/earnings?day=2022-03-02&offset=0&size=100
Processing page # 2
Scrape url for page 2 is https://finance.yahoo.com/calendar/earnings?day=2022-03-02&offset=100&size=100
Processing done
checking few rows.. for event : earnings

Calling the function scrape_yahoo_calendar:

BASE_URL = 'https://finance.yahoo.com' #Global Variable 
#date_param = '2022-03-18' # no data condition
date_param = '2022-02-28'
event_types = ['splits','economic','ipo','earnings']
scrape_yahoo_calendar(event_types, date_param)
Web Scraping for splits
Processing page # 1
Scrape url for page 1 is https://finance.yahoo.com/calendar/splits?day=2022-02-28&offset=0&size=100
checking few rows.. for event : splits & date : 2022-02-28
Web Scraping for economic
Processing page # 1
Scrape url for page 1 is https://finance.yahoo.com/calendar/economic?day=2022-02-28&offset=0&size=100
checking few rows.. for event : economic & date : 2022-02-28
Web Scraping for ipo
Processing page # 1
Scrape url for page 1 is https://finance.yahoo.com/calendar/ipo?day=2022-02-28&offset=0&size=100
checking few rows.. for event : ipo & date : 2022-02-28
Web Scraping for earnings
Processing page # 1
Scrape url for page 1 is https://finance.yahoo.com/calendar/earnings?day=2022-02-28&offset=0&size=100
Processing page # 2
Scrape url for page 2 is https://finance.yahoo.com/calendar/earnings?day=2022-02-28&offset=100&size=100
Processing page # 3
Scrape url for page 3 is https://finance.yahoo.com/calendar/earnings?day=2022-02-28&offset=200&size=100
checking few rows.. for event : earnings & date : 2022-02-28

All 4 CSV files named "event_type_yyyy-mm-dd.csv" should be available in the File → Open menu. You can download the files or open them directly in a browser. Please verify the file content and compare it with the actual information available on the web page.

Summary: This is a very useful technique that is easily replicable. Without writing any customized code, we were able to extract data from multiple types of web pages just by changing one variable (in this case event_type).

Future Work

Here are some ideas for future work:

  • Automate this process using AWS Lambda to download the daily market calendar, cryptocurrencies & market news in CSV format.
  • Move the old files to an Archive folder, appending a date-stamp to the file names if required, and delete archived files older than 2 weeks.
  • Process the raw data extracted by the third technique using different pandas methods.
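The archiving idea above can be sketched in a few lines with the standard library. This is a hedged sketch only: the `Archive` folder name, the date-stamp suffix and the 14-day cutoff are assumptions, not part of the original scripts.

```python
import shutil
import time
from pathlib import Path

ARCHIVE_DIR = Path('Archive')  # assumed archive folder name

def archive_csv_files(source_dir='.', max_age_days=14):
    """Move all CSVs in source_dir into ARCHIVE_DIR with a date-stamp,
    then delete archived CSVs older than max_age_days."""
    ARCHIVE_DIR.mkdir(exist_ok=True)
    stamp = time.strftime('%Y-%m-%d')
    for csv_file in Path(source_dir).glob('*.csv'):
        # append today's date-stamp to the file name before archiving
        target = ARCHIVE_DIR / f"{csv_file.stem}_{stamp}{csv_file.suffix}"
        shutil.move(str(csv_file), target)
    # purge archived files whose modification time is past the cutoff
    cutoff = time.time() - max_age_days * 86400
    for old_file in ARCHIVE_DIR.glob('*.csv'):
        if old_file.stat().st_mtime < cutoff:
            old_file.unlink()
```

On AWS Lambda this would target S3 objects instead of a local folder, but the retention logic stays the same.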

Conclusion

In this tutorial we implemented the following web scraping techniques.

  • Use Requests, BeautifulSoup and HTML tags to extract web page.
  • Use Selenium to scrape data from dynamically loading websites.
  • Use embedded JSON data to scrape website.

I hope I was able to teach you these web scraping methods, and that you can use this knowledge to scrape other websites.

If you have any questions or feedback, feel free to post a comment or contact me on LinkedIn. Thank you for reading, and if you liked this post, please consider following me. Until next time… Happy coding !!

Don’t forget to give your 👏 !

jovian.commit(project="yahoo-finance-web-scraper",files=['stock-market-news.csv','crypto-currencies.csv','splits_2022-02-28.csv','economic_2022-02-28.csv','ipo_2022-02-28.csv','earnings_2022-02-28.csv'])
jovian.commit(project="yahoo-finance-web-scraper",files=['yahooscraper.py','lambda_function.py'])
jovian.commit(project="yahoo-finance-web-scraper")
Vinod Dhole