Project: Web Scraping With Python
Scraping the most popular CAD models in different categories from GrabCAD
GrabCAD (the largest online community of professional engineers, designers and students, who use it to work on and share their CAD models with the community)
INTRODUCTION:
GrabCAD is a platform where we can upload or download CAD models to showcase our work, with a chance to win prizes too. GrabCAD evolved as a community of engineers and currently has 52 lakh (5.2 million) registered users and 31 lakh (3.1 million) open-source models available on the website. This vast free CAD model library is very helpful for students and early-career professionals who want to move into CAD-related jobs or research, and for learning design software such as SolidWorks, CATIA, AutoCAD and Pro/E.
It brings together all the tools engineers need to manage and share CAD files into one platform.
OBJECTIVE:
As data science engineers, we aim to get the all-time most downloaded design models by parsing the information from this website into tabular data, organized by knowledge domain such as Machine Design, 3D Printing, Aerospace and Electrical. This lets us learn what the community is interested in, gauge how difficult the models are to design and, of course, distribute the prizes for the most popular ones.
(In this notebook we limit the objective to scraping the data for each category separately, to keep the dataset small. The data for different categories can also be combined, and further analysis and testing on the complete data can follow a similar path.)
The overall steps I'll follow are:
- Understand the structure of the GrabCAD website
- Install and import the required libraries
- Download the page and extract the category URLs from GrabCAD's all-time most downloaded library page using selenium.webdriver and kora.selenium (there are 33 categories in total on the page)
- Extract the model links (100 per page) from each URL extracted above, for the required categories among those 33
- Download each model link and parse four fields out of it: Names, Downloads, Likes, Comments
- Combine the extracted data from each category into a dictionary
- Compile all the details into a Pandas dataframe and create a CSV file
By the end of the project, we expect to have a CSV file with the following information for the Machine Design category:
name, downloads, likes, comments
Stepper Motor Nema 17, 41925, 575, 78
MQ-1 Predator UAV, 31373, 802, 144
CNC 3-axis, 30116, 994, 175
Planetary Gearbox, 29050, 900, 189
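The last two steps (combining the extracted fields into a dictionary, then writing a CSV) can be sketched roughly as follows. This is only an illustration: the values are copied from the sample rows above, the helper name columns_to_csv is hypothetical, and the standard-library csv module is swapped in for Pandas so the sketch runs without extra installs.

```python
import csv
import io

# Sample values copied from the expected output above (not freshly scraped).
names = ["Stepper Motor Nema 17", "MQ-1 Predator UAV"]
downloads = [41925, 31373]
likes = [575, 802]
comments = [78, 144]

# Combine the per-field lists into one dictionary of columns.
data = {"name": names, "downloads": downloads, "likes": likes, "comments": comments}


def columns_to_csv(columns):
    """Serialise a dict of equal-length columns to CSV text (hypothetical helper)."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns.keys())            # header row
    writer.writerows(zip(*columns.values()))   # one row per model
    return buffer.getvalue()


csv_text = columns_to_csv(data)
print(csv_text)
```

With Pandas, the same dictionary can be passed to pd.DataFrame(data) and saved with to_csv(path, index=False), where path is a file name of your choice.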
NOTE:
- GrabCAD is a dynamic website that renders its content with JavaScript, so we cannot get the full page HTML by simply downloading the page and parsing it with Beautiful Soup. Selenium is therefore preferred for this kind of website. On some websites we can still use Beautiful Soup to parse the HTML once the webdriver has returned it.
- If you want to code on your local computer, install Selenium and a webdriver that matches your browser to extract the page. If you are coding on a cloud-based service such as Google Colab, you need to install kora and use its kora.selenium module instead. Be aware that kora.selenium will not work on Binder and other services.
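As a minimal sketch of the pattern described in the note, the webdriver loads the page and Beautiful Soup parses whatever HTML the browser ended up with. The FakeDriver class, the function names and the "/library/" link prefix are assumptions made for illustration only; with a real driver (for example wd from kora.selenium) an explicit wait is usually needed before reading page_source.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(driver, url):
    """Load `url` in a Selenium-style driver and return the rendered HTML."""
    driver.get(url)            # the browser runs the page's JavaScript
    return driver.page_source  # HTML as the browser currently sees it


def extract_hrefs(html, prefix="/library/"):
    """Return unique hrefs starting with `prefix` (prefix is an assumption)."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if href.startswith(prefix) and href not in hrefs:
            hrefs.append(href)
    return hrefs


class FakeDriver:
    """Stand-in for a real webdriver so the sketch runs without a browser."""

    page_source = ""

    def get(self, url):
        # A real driver would fetch and render `url`; here we fake the result.
        self.page_source = (
            '<a href="/library/gear-1">Gear</a>'
            '<a href="/about">About</a>'
            '<a href="/library/gear-1">Gear</a>'
        )


html = fetch_rendered_html(FakeDriver(), "https://example.com/")
links = extract_hrefs(html)
print(links)
```

Keeping the fetching and the parsing in separate functions means the parsing logic can be checked against saved HTML without starting a browser.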
Install and Import required libraries:
!pip install kora -q
!pip install requests
from kora.selenium import wd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import requests
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (2.23.0)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests) (2021.5.30)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests) (1.24.3)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests) (3.0.4)