Scraping smartphone details from GSMArena using Python.

GSMArena is a popular website that documents the specifications of mobile devices as they are launched. In this project we will scrape the website and collect useful information about various mobile devices, tablets, and smartwatches.

We will scrape multiple URLs and end up with 110 CSV files, one per brand (such as Asus, Nokia, Samsung, etc.). Each file will contain three columns: Model, Release Date, and Features.

These three columns give the model name, when it was released, and some basic features such as screen size, battery, and camera resolution.

The legal disclaimer from the website can be found here. Let's get started with the project.

Outline of the project:

Introduction to Web-Scraping
Installing and importing the required libraries
Creating the first CSV file
Using the first CSV file to scrape more web pages
Compiling all the data into a DataFrame using Pandas and saving the data into a CSV file

Introduction to Web-Scraping

Web scraping is an automatic method to obtain large amounts of data from websites. Most of this data is unstructured data in an HTML format which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.
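
To make the idea of turning unstructured HTML into structured data concrete, here is a minimal sketch using Beautiful Soup. The HTML fragment, its tag layout, and the field names are made up for illustration and are not GSMArena's actual markup.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment standing in for a downloaded page
# (an assumption for illustration, not real GSMArena markup).
html = """
<ul>
  <li><a href="/brand-a.php">Brand A</a> <span>12 devices</span></li>
  <li><a href="/brand-b.php">Brand B</a> <span>7 devices</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Turn each <li> element into one structured record (a plain dict).
rows = []
for item in soup.find_all("li"):
    link = item.find("a")
    count = item.find("span")
    rows.append({
        "name": link.get_text(strip=True),
        "url": link["href"],
        "devices": count.get_text(strip=True),
    })

print(rows)
```

Each dictionary in rows is one structured record, which is the form that can later be written to a spreadsheet or database.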

Web scrapers can extract all the data on a particular site or only the specific data that a user wants. Web scraping is an important technique because it helps collect data from various sources. Once collected, the data can be used to create visualizations or make decisions.

Web scraping is technically not illegal in itself, but whether a particular use is acceptable depends on various factors, such as how the extracted data is used. I have gone through the website's legal disclaimer and, as stated there, the data can be used for personal and non-commercial purposes, and GSMArena is not responsible for the accuracy of the data.

Introduction to Various Tools used for web-scraping.

Python Requests Module:
The requests module allows you to send HTTP requests using Python. The HTTP request returns a Response Object with all the response data (content, encoding, status, etc).
Beautiful Soup Module:
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
Pandas library:
Pandas is a Python library used to create and manipulate data frames. In this project we will mainly use it to create CSV files.
Regular Expression Python Library:
The re module provides regular expression matching operations, which we will use to separate, clean, and organize our data.
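
As a small illustration of how regular expressions can separate and clean scraped text, the sketch below splits a label into a brand name and a device count. The input strings and the helper split_label are hypothetical; the text on the real page may be shaped differently.

```python
import re

# Hypothetical raw strings: a brand name followed directly by a device count.
raw_labels = ["Brand A12 devices", "Brand B 7 devices", "Brand C1 device"]

# Shortest possible name, then the digits, then "device" or "devices" at the end.
pattern = re.compile(r"^(?P<name>.+?)\s*(?P<count>\d+)\s+devices?$")

def split_label(label):
    """Return (brand_name, device_count), or None if the label does not match."""
    match = pattern.match(label.strip())
    if match is None:
        return None
    return match.group("name"), int(match.group("count"))

parsed = [split_label(label) for label in raw_labels]
print(parsed)
```

One limitation of this pattern: a brand name that itself ends in a digit would be split in the wrong place, so names like that would need separate handling.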

Scraping GSMArena to create the first .CSV file.

The first page we will be scraping is this. Here we will create a CSV file containing these three columns:

Brand Name
Number of Devices
Brand URL
Brand URL is the URL of that particular brand on the GSMArena website.
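
A rough sketch of what this first file could look like, assuming the three values have already been collected for each brand. The rows, the example.com URLs, and the file name brands.csv are placeholders chosen for illustration, not scraped data.

```python
import pandas as pd

# Placeholder rows only; the real values would come from the scraped page.
brands = [
    {"Brand Name": "Brand A", "Number of Devices": 12,
     "Brand URL": "https://example.com/brand-a"},
    {"Brand Name": "Brand B", "Number of Devices": 7,
     "Brand URL": "https://example.com/brand-b"},
]

# Fix the column order explicitly so the file always has the same layout.
columns = ["Brand Name", "Number of Devices", "Brand URL"]
brands_df = pd.DataFrame(brands, columns=columns)

# index=False leaves out the DataFrame's row index, keeping only the three columns.
brands_df.to_csv("brands.csv", index=False)
print(brands_df)
```

Keeping the brand URL in its own column is what allows this file to drive the later scraping of each brand's page.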

Let us start the project by installing the required libraries, i.e. requests and beautifulsoup4, and then gathering the data from this webpage.

import re  # Regular expression module from the standard library; needed later to clean the data.

# The leading "!" runs pip as a shell command from inside a Jupyter notebook.
!pip install requests --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet

import requests
from bs4 import BeautifulSoup