Assignment 3 - Web Scraping Practice
Introduction to Programming with Python
In this assignment, you will apply your knowledge of Python and its ecosystem of libraries to scrape information from a website in the list below (or another site of your choice) and create a dataset of CSV file(s). Here are the steps you'll follow:
- Pick a website and describe your objective
  - Pick a site to scrape from the list of websites below (NOTE: you can also pick another site that's not listed):
    - Dataset of Quotes (BrainyQuote): https://www.brainyquote.com/topics
    - Dataset of Movies/TV Shows (TMDb): https://www.themoviedb.org
    - Dataset of Books (BooksToScrape): http://books.toscrape.com
    - Dataset of Quotes (QuotesToScrape): http://quotes.toscrape.com
    - Scrape User Reviews (ConsumerAffairs): https://www.consumeraffairs.com
    - Stock Prices (Yahoo Finance): https://finance.yahoo.com/quote/TWTR
    - Songs Dataset (AZLyrics): https://www.azlyrics.com/f.html
    - Scrape a Popular Blog: https://m.signalvnoise.com/search/
    - Weekly Top Songs (Top 40 Weekly): https://top40weekly.com
    - Video Games Dataset (Steam): https://store.steampowered.com/genre/Free%20to%20Play/
  - Identify the information you'd like to scrape from the site, and decide the format of the output CSV file (see the example below).
  - Summarize your assignment idea in a paragraph using a Markdown cell and outline your strategy.
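For example, if you pick QuotesToScrape, the output file might have one row per quote with three columns. The layout below is purely illustrative:

```
text,author,tags
"The quote text goes here...","Author Name","tag1, tag2"
```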
- Use the requests library to download web pages
  - Inspect the website's HTML source and identify the right URLs to download.
  - Download and save web pages locally using the requests library.
  - Create a function to automate downloading for different topics/search queries, as in the sketch below.
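Here's a minimal sketch of such a download helper, using QuotesToScrape from the list above; the function and file names are illustrative, not required:

```python
import requests

def download_page(url, save_path):
    """Download a web page and save its HTML locally."""
    response = requests.get(url)
    # requests doesn't raise for HTTP errors by default, so check explicitly
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    with open(save_path, 'w', encoding='utf-8') as f:
        f.write(response.text)
    return save_path

# Example: download the first page of quotes
download_page('http://quotes.toscrape.com/page/1/', 'quotes-page-1.html')
```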
- Use Beautiful Soup to parse and extract information
  - Parse and explore the structure of downloaded web pages using Beautiful Soup.
  - Use the right properties and methods to extract the required information.
  - Create functions to extract the information from the page into lists and dictionaries (see the sketch below).
  - (Optional) Use a REST API to acquire additional information if required.
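Continuing the QuotesToScrape example, a parsing function might look like the following sketch. The tag names and CSS classes are specific to quotes.toscrape.com; inspect your own site's HTML to find the right selectors:

```python
from bs4 import BeautifulSoup

def parse_quotes(html):
    """Extract quotes from a QuotesToScrape page into a list of dicts."""
    doc = BeautifulSoup(html, 'html.parser')
    quotes = []
    for div in doc.find_all('div', class_='quote'):
        quotes.append({
            'text': div.find('span', class_='text').get_text(strip=True),
            'author': div.find('small', class_='author').get_text(strip=True),
            'tags': ', '.join(a.get_text() for a in div.find_all('a', class_='tag')),
        })
    return quotes

# Parse the page saved in the previous step
with open('quotes-page-1.html', encoding='utf-8') as f:
    page_quotes = parse_quotes(f.read())
```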
- Create CSV file(s) with the extracted information
  - Create functions for the end-to-end process of downloading, parsing, and saving CSVs (see the sketch below).
  - Execute the function with different inputs to create a dataset of CSV files.
  - Attach the CSV files to your notebook using jovian.commit.
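Putting the two hypothetical helpers above together, an end-to-end sketch might look like this; adapt the URL pattern and column names to your chosen site:

```python
import csv

def scrape_quotes_to_csv(num_pages, csv_path):
    """Download several pages, parse them, and write a single CSV."""
    rows = []
    for page in range(1, num_pages + 1):
        url = 'http://quotes.toscrape.com/page/{}/'.format(page)
        path = download_page(url, 'quotes-page-{}.html'.format(page))
        with open(path, encoding='utf-8') as f:
            rows.extend(parse_quotes(f.read()))
    # Write all rows to a single CSV with a header row
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['text', 'author', 'tags'])
        writer.writeheader()
        writer.writerows(rows)

scrape_quotes_to_csv(num_pages=10, csv_path='quotes.csv')
```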
- Document and share your work
  - Add proper headings and documentation in your Jupyter notebook.
  - Publish your Jupyter notebook to Jovian and make a submission.
  - (Optional) Write a blog post about your project and share it online.
Notes
- Review the evaluation criteria on the "Submit" tab, and look for project ideas under the "Resources" tab below.
- There's no starter notebook for this project. Use the "New" button on Jovian and select "Run on Binder" to get started.
- Ask questions, get help, and share your work on the Slack group. Help others by sharing feedback and answering questions.
- Record snapshots of your notebook from time to time using Ctrl/Cmd + S to ensure that you don't lose any work.
- Websites with dynamic content (fetched after the page loads) cannot be scraped using Beautiful Soup alone. One way to scrape a dynamic website is by using Selenium (see the sketch at the end of these notes).
The "Resume Description" field below should contain a summary of your assignment in no more than 3 points. You'll can use this description to present this assignment as a project on your Resume. Follow this guide to come up with a good description
Evaluation Criteria
Your submission must meet the following criteria to receive a PASS grade in the assignment:
- The Jupyter notebook should run end-to-end without any errors or exceptions
- The Jupyter notebook should contain execution outputs for all the code cells
- The Jupyter notebook should contain proper documentation (headings, sub-headings, summary, future work ideas, references, etc.) in Markdown cells
- Your assignment should involve web scraping of at least two web pages
- Your assignment should use the appropriate libraries for web scraping
- Your submission should include the CSV file generated by scraping
- The submitted CSV file should contain at least 3 columns and 100 rows of data
- The Jupyter notebook should be publicly accessible (not "Private" or "Secret")
- Follow this guide for the "Resume Description" field in the submission form: https://jovian.com/program/jovian-data-science-bootcamp/knowledge/08-presenting-projects-on-your-resume-231