
Scraping and Crawling Springer Journal Articles using Scrapy

Introduction

Scrapy is a Python library that provides an open-source, collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.

Springer is a German multinational publishing company of books, e-books, and peer-reviewed journals in science, humanities, and technical and medical (STM) publishing.

Web scraping is an automated method of obtaining large amounts of data from websites. Its use cases range from individuals automating their own tasks to data collectors in big corporations meeting their companies' data needs. The scraped data can then be used for product review analysis, sales analysis, competitor analysis, and various other applications in the data science field.

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web.

Springer journals are reputable collections of articles compiled around a topic in one of the STM fields. They are a valuable scientific resource where articles from researchers across the world are published for other researchers to read and cite in their own research.

Why is scraping needed here?

For a researcher looking for articles for their own research and learning, going to Springer and looking through thousands of articles each day is a very tedious task. Even though articles are grouped into journals by topic, there is no way to mark which articles you have already gone through, and when each topic has thousands of articles it takes a lot of clicks on the website to reach the ones you actually want to read.

As a scripting language, Python really shines at automating tasks like these. Using the power of Scrapy spiders, we are going to make this poor researcher's job a little bit easier by:

  1. Scraping titles, author names, links, PDF download links, etc. from a page for a particular topic using Scrapy (see the spider sketch after this list).
  2. Crawling more pages to uncover more articles on the same topic and scraping them as well.
  3. Cleaning and processing the scraped data.
  4. Finally, outputting the scraped data in CSV format and into a MySQL database via MySQL Connector/Python (another Python library).
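
To make this plan concrete, here is a minimal sketch of the kind of spider we will write. The spider name, start URL, and CSS selectors below are illustrative assumptions rather than the exact ones used in the final project, since Springer's markup can change:

import scrapy

class SpringerArticlesSpider(scrapy.Spider):
    # Hypothetical name and start URL; point start_urls at the topic you care about.
    name = "springer_articles"
    start_urls = ["https://link.springer.com/search?query=machine+learning"]

    def parse(self, response):
        # The CSS selectors are assumptions about Springer's search-result markup.
        for article in response.css("li.app-card-open"):
            yield {
                "title": article.css("h3 a::text").get(),
                "authors": article.css("div.app-card-open__authors::text").get(),
                "link": response.urljoin(article.css("h3 a::attr(href)").get()),
            }

        # Follow the "next page" link, if there is one, to crawl further result pages.
        next_page = response.css("a[rel='next']::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
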
Project Outline

This is the path we will take to get our data from the webpage into an SQL database and a CSV file. I will explain each step as we go.

But why a CSV and MySQL, you ask?

Our hardworking and organized researcher can use Microsoft Excel or Google Sheets to import this data via the CSV and keep track of which articles he has already read, which ones he is going to pull citations from, which ones have nothing related to his research, and so on. The same goes for an SQL database; we are giving our researcher a choice between the two.
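
For the CSV side, Scrapy's built-in feed exports can do the job without any extra code. A minimal sketch of the relevant settings.py entry (the file name articles.csv is just an assumption):

# settings.py: ask Scrapy's feed exporter to write every scraped item to a CSV file.
FEEDS = {
    "articles.csv": {
        "format": "csv",
    },
}

The MySQL side is handled by an item pipeline, sketched after the setup commands below.
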

Setting Up

To manage the package versions of my project, I created a development environment with Anaconda using the terminal command:

conda create -n myscrapingenv python=3.9

in the Ubuntu terminal. Then, to activate the environment:

conda activate myscrapingenv

To add the packages we need in the project, I used:

conda install scrapy

and

pip install mysql-connector-python
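
As a preview of how this connector gets used later, here is a minimal item-pipeline sketch that inserts each scraped item into MySQL. The credentials, database name, table name, and column names are placeholder assumptions; the pipeline also has to be enabled under ITEM_PIPELINES in settings.py.

import mysql.connector

class MySQLStorePipeline:
    def open_spider(self, spider):
        # Placeholder connection details; replace with your own MySQL credentials.
        self.conn = mysql.connector.connect(
            host="localhost", user="root", password="password", database="springer"
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Assumes a table named articles with matching columns already exists.
        self.cursor.execute(
            "INSERT INTO articles (title, authors, link) VALUES (%s, %s, %s)",
            (item.get("title"), item.get("authors"), item.get("link")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
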

Project Code on GitHub:

This project has been uploaded to GitHub and anyone is free to use the code under the MIT license.

GitHub Link

The GitHub repository contains the code and the scraped CSV from Springer, with 218 rows and 6 columns.

Executing the Code

The code can be executed by navigating to the project folder in the terminal and then using the command:

scrapy crawl "spider name"

The spider's name must be supplied without the quotes.
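
For example, if the spider were named springer_articles (a hypothetical name used here for illustration), the command would be:

scrapy crawl springer_articles

On recent Scrapy versions, appending -O articles.csv to the command also writes the scraped items straight to a CSV file.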