Project: Web Scraping With Python
Scraping the most popular CAD models in different categories from GrabCAD
GrabCAD (the largest online community of professional engineers, designers and students, who work on and share their CAD models with the community)
GrabCAD is a platform where we can upload or download CAD models to show off our work, with a chance to win exciting prizes too. GrabCAD evolved as a community of engineers, and currently there are about 5.2 million registered users and 3.1 million open-source models available on the website. This vast free CAD model library is very helpful for students and professionals who want to work in CAD-related jobs or research, and for learning different design software such as SolidWorks, CATIA, AutoCAD, Pro/E, etc.
It brings together all the tools engineers need to manage and share CAD files into one platform.
As Data Science Engineers, we aim to collect the all-time most-downloaded design models by parsing information from this website into tabular data, under different knowledge-domain categories such as Machine Design, 3D Printing, Aerospace and Electrical. This lets us learn about the interests of the community and the difficulty level faced in designing the models, and of course helps distribute prizes to the most popular ones.
(In this notebook we limit our objective to scraping the data for each category separately, to keep the dataset manageable. The data for different categories can also be combined, and further analysis and testing on that complete dataset can be done along similar lines.)
The overall steps I'll follow are:
- Understanding the structure of the GrabCAD website
- Install and Import required libraries
- Download the page and extract the URLs from GrabCAD's all-time most-downloaded library page using `kora.selenium`, under different categories (there are 33 categories in total on the page)
- Extract the model links (100 per page) from each URL extracted above, under the required categories among those 33
- Download each model link and parse four fields out of it, i.e. name, downloads, likes, comments
- Combine the extracted data into a dictionary for each category
- Compile all details into a `pandas` DataFrame and create a CSV file
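As an illustration of the parsing step, here is a minimal sketch of pulling the four fields out of one model page with BeautifulSoup. The HTML snippet and class names below are invented stand-ins for this example, not GrabCAD's actual markup, which has to be inspected in the browser first.

```python
from bs4 import BeautifulSoup

# Hypothetical model-page markup; GrabCAD's real tag/class names will differ.
sample_html = """
<div class="model">
  <h1 class="model-name">Stepper Motor Nema 17</h1>
  <span class="downloads">41925</span>
  <span class="likes">575</span>
  <span class="comments">78</span>
</div>
"""

def parse_model_page(html):
    """Extract name, downloads, likes and comments from one model page."""
    doc = BeautifulSoup(html, "html.parser")
    return {
        "name": doc.find("h1", class_="model-name").text.strip(),
        "downloads": int(doc.find("span", class_="downloads").text),
        "likes": int(doc.find("span", class_="likes").text),
        "comments": int(doc.find("span", class_="comments").text),
    }

print(parse_model_page(sample_html))
```

The same function would be called once per downloaded model page, with the real selectors substituted in.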
By the end of the project, we expect to create a CSV file with the following information under the Machine Design category:
name, downloads, likes, comments
Stepper Motor Nema 17, 41925, 575, 78
MQ-1 Predator UAV, 31373, 802, 144
CNC 3-axis, 30116, 994, 175
Planetary Gearbox, 29050, 900, 189
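The last two steps (dictionary to DataFrame to CSV) can be sketched with `pandas`, using the sample rows above; the dictionary keys mirror the four parsed fields.

```python
import pandas as pd

# Sample values taken from the expected output above.
models_dict = {
    "name": ["Stepper Motor Nema 17", "MQ-1 Predator UAV",
             "CNC 3-axis", "Planetary Gearbox"],
    "downloads": [41925, 31373, 30116, 29050],
    "likes": [575, 802, 994, 900],
    "comments": [78, 144, 175, 189],
}

# Each dict key becomes a column; each list entry becomes a row.
models_df = pd.DataFrame(models_dict)
models_df.to_csv("machine-design.csv", index=False)
print(models_df)
```

In the real notebook, `models_dict` is filled by the scraping loop instead of being written out by hand.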
GrabCAD renders much of its content with JavaScript, so we cannot rely on `beautiful soup` alone here. Therefore, the use of `selenium` is preferred for these kinds of websites. That said, we can still use `beautiful soup` after getting the webpage HTML from the webdriver.
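To illustrate that hand-off: after Selenium has rendered a page, `wd.page_source` is just an HTML string that BeautifulSoup can parse as usual. The snippet below demonstrates this on a static string standing in for a live, rendered page (the `model-link` class is an invented example, not GrabCAD's real markup).

```python
from bs4 import BeautifulSoup

# Stand-in for wd.page_source after the page's JavaScript has executed.
rendered_html = (
    "<html><body>"
    "<a class='model-link' href='/library/cnc-3-axis-1'>CNC 3-axis</a>"
    "</body></html>"
)

soup = BeautifulSoup(rendered_html, "html.parser")
link = soup.find("a", class_="model-link")
print(link["href"])  # relative URL of the model page
```

With a live driver you would replace `rendered_html` with `wd.page_source` after `wd.get(url)` has finished loading.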
If you want to code on your local computer, install `selenium` and one of the webdrivers (which one depends on your browser) to extract the page. But if you are coding on a cloud-based service such as Google Colab, you need to install `kora`'s Selenium wrapper instead. Remember that `kora`'s Selenium will not work on Binder and some other platforms, so be aware.
Install and Import required libraries:
!pip install kora -q
!pip install requests
from kora.selenium import wd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import requests
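With the libraries in place, the link-extraction step (100 model links per listing page) can be sketched as below. The listing markup and the `models-list` class are hypothetical stand-ins for this example; the real selectors have to be read off GrabCAD's pages with the browser's inspector.

```python
from bs4 import BeautifulSoup

# Invented stand-in for one page of the all-time most-downloaded listing.
listing_html = """
<div class="models-list">
  <a href="/library/stepper-motor-nema-17-1">Stepper Motor Nema 17</a>
  <a href="/library/mq-1-predator-uav-1">MQ-1 Predator UAV</a>
</div>
"""

def extract_model_links(html, base_url="https://grabcad.com"):
    """Return absolute URLs for every model link on a listing page."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="models-list")
    return [base_url + a["href"] for a in container.find_all("a")]

links = extract_model_links(listing_html)
print(links)
```

In the scraper itself, `listing_html` would come from the webdriver's rendered page source for each category URL.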