Web scraping news articles on Economics.
North, East, West, South. Coming from the days when news arrived through a one-hour programme in the evening, the volume of news that reaches us now is quite overwhelming. For a rational decision maker, it is wise to stay up to date with the current state of the world.
Keeping that in mind, this web scraping project gathers data on Economics from one website, namely MoneyControl.com. Later, the project will be extended to other websites as well, to accumulate as much information as possible on particular topics and to process that data further.
What is scraping, you ask? Websites receive data from all over the place: the main frame of a website stays the same, but the data it shows changes over time. Web scraping is a way to collect the required data for personal or professional use.
This project uses the Python programming language. Several built-in and community-built libraries can download a web page as text and extract the required data from it. We will be using the Requests and BeautifulSoup libraries for this.
Here, MoneyControl.com has been used to download the data for the first 40 pages on the topic of Economics. MoneyControl.com is an Indian online news media website and a subsidiary of the media house TV18.
- Find the potential news websites to scrape.
- Download the desired webpage using Requests library.
- Find all the tags containing the necessary information from the webpage.
- Create a list of dictionaries from the information.
- Write the information into a CSV file.
- Repeat the steps for all 40 pages.
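The steps above can be sketched roughly as follows. The URL pattern, tag names, and CSS classes below are assumptions for illustration; the actual selectors must be read off MoneyControl.com's listing pages, which may change over time.

```python
import csv

import requests
from bs4 import BeautifulSoup

# Assumed listing-page URL pattern; verify against the live site.
BASE_URL = "https://www.moneycontrol.com/news/business/economy/page-{}/"

def get_page(page_number):
    """Download one listing page and return its parsed HTML."""
    response = requests.get(BASE_URL.format(page_number))
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

def parse_articles(doc):
    """Extract date, title, brief, and link from each article item.

    The tag/class names here are placeholders, not the site's real markup.
    """
    articles = []
    for item in doc.find_all("li", class_="clearfix"):
        link_tag = item.find("a")
        date_tag = item.find("span")
        brief_tag = item.find("p")
        articles.append({
            "Date": date_tag.get_text(strip=True) if date_tag else "",
            "Title": link_tag.get("title", "") if link_tag else "",
            "Brief": brief_tag.get_text(strip=True) if brief_tag else "",
            "Link": link_tag.get("href", "") if link_tag else "",
        })
    return articles

def scrape_to_csv(path, num_pages=40):
    """Repeat the download/parse steps for each page and write one CSV."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["Date", "Title", "Brief", "Link"])
        writer.writeheader()
        for page in range(1, num_pages + 1):
            writer.writerows(parse_articles(get_page(page)))
```

A call such as `scrape_to_csv("economy_news.csv")` would then produce the CSV described below, one row per article.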
By the end of this project, a CSV file will be created with the columns Date, Title, Brief, and news link, followed by the corresponding data for the next 975 entries.
How to run the code.
The current project is hosted on [Jovian]. You can execute the notebook by going to the THIS link, clicking the [run] button, and selecting the [run on binder] option there. Make sure to clear all the output and run the cells one after another.
Installing and importing the Libraries required for the current project.
Before starting the project importing some of the dependencies is a good idea. The current project uses the following libraries:
- Requests - to download the web page as text.
- BeautifulSoup - to parse the text downloaded with the Requests library for further processing.
- Jovian - Jovian.ai provides a neat way to save versions of the project. It is useful whether one is working online or offline, in case the machine crashes.
```
# Install the libraries
!pip install requests --upgrade --quiet
!pip install jovian --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet
!pip install pandas --upgrade --quiet
```
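Once installed, the libraries can be imported and given a quick smoke test. A minimal sketch, assuming any reachable page (the MoneyControl URL in the comment is only an example):

```python
import requests
from bs4 import BeautifulSoup

def fetch_page(url):
    """Download a web page as text and return the parsed document."""
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on HTTP errors
    return BeautifulSoup(response.text, "html.parser")

# Example (URL assumed):
# doc = fetch_page("https://www.moneycontrol.com/news/business/economy/")

# BeautifulSoup can also parse any HTML string directly, which is handy for testing:
doc = BeautifulSoup("<html><head><title>Demo</title></head></html>", "html.parser")
print(doc.title.get_text())  # → Demo
```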