Scraping Bioinformatics Questions Asked On Stackoverflow
Tuan Nguyen
If you're a coder, you are probably familiar with Stackoverflow. Almost any problem you may encounter in your code has already been asked and answered on this website. In this project, I gather information about the most recent questions under specific tags on the site and summarize them in a table.
For each question, the details include:
- What is the question headline?
- Who asked and who answered the question, and what are their reputations (if available)?
- When was the question asked and answered (if available)?
- How popular is that question? (views, number of answers, number of votes for the question and top answer)
- What tags are associated with it?
The project is organized into the following steps:
- Extract raw information from Stackoverflow question pages.
  - Inspect the structure of the webpage and analyze the components of its URL.
  - Define functions: retrieve all data shown on a single page of the website.
  - Checkpoint: scrape the data for the 50 most recent bioinformatics questions on Stackoverflow.
- Draw out the fields of interest from the raw data.
  - Determine the patterns associated with each of the 14 fields of interest: asker's name, asker's reputation, question, tags, time asked, # views, # answers, # votes for the question, has accepted answer, top answerer's name, top answerer's reputation, # votes for the top answer, time of the top answer, link to the question (example below).
  - Define functions: fetch the relevant information from each chunk of raw data and correct its format, if necessary.
  - Checkpoint: parse the 50 most recent questions (scraped in the first step) and pull out the 14 fields.
- Automate the process for multiple pages and organize the results into a table.
  - Define functions: loop over multiple pages of the website to extract and filter the data, then save all results in a CSV file.
  - Checkpoint: create a CSV file containing the 100 most recent bioinformatics questions on Stackoverflow.
- Apply my functions to a different topic.
- Summarize what I have accomplished in this project.
- Ideas for future work.
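To make the outline above more concrete, here is a minimal sketch of the scaffolding around the scraper: building the URL for one page of a tag, looping over pages, and writing the 14 fields to a CSV file. It is not the code used in this project. The query parameters (`tab`, `page`, `pagesize`), the column names in `FIELDS`, and the function names are all assumptions made for illustration, and the `fetch` and `parse` callables are hypothetical placeholders for the functions defined in the later steps.

```python
import csv
import io
from urllib.parse import quote, urlencode

# Assumed URL layout for a tag's question list; verify against the live site.
BASE_URL = "https://stackoverflow.com/questions/tagged/"

# Hypothetical column names for the 14 fields listed in the outline.
FIELDS = [
    "asker_name", "asker_reputation", "question", "tags", "time_asked",
    "views", "answers", "question_votes", "has_accepted_answer",
    "top_answerer_name", "top_answerer_reputation", "top_answer_votes",
    "top_answer_time", "link",
]


def build_page_url(tag, page=1, page_size=50, sort="newest"):
    """Build the URL of one page of questions for a tag."""
    query = urlencode({"tab": sort, "page": page, "pagesize": page_size})
    return f"{BASE_URL}{quote(tag)}?{query}"


def collect_questions(tag, n_pages, fetch, parse):
    """Loop over pages; `fetch` maps a URL to HTML, `parse` maps HTML to dicts."""
    rows = []
    for page in range(1, n_pages + 1):
        html = fetch(build_page_url(tag, page))
        rows.extend(parse(html))
    return rows


def write_questions_csv(rows, out):
    """Write rows to an open text file; missing fields are left blank."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval="")
    writer.writeheader()
    writer.writerows(rows)


# Offline demo with stand-in fetch/parse functions (no network access).
demo_rows = collect_questions(
    "bioinformatics", 2,
    fetch=lambda url: url,                  # placeholder: returns the URL itself
    parse=lambda html: [{"link": html}],    # placeholder: one row per page
)
demo_buffer = io.StringIO()
write_questions_csv(demo_rows, demo_buffer)
```

Passing `fetch` and `parse` in as arguments keeps the loop testable without hitting the website, and the same loop can be reused for a different tag by changing only the first argument.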
Here is a glimpse of what our final output would look like.