
Scraping Tuan Nguyen

Bioinformatics Questions Asked On Stackoverflow


If you're a coder, you are probably familiar with Stackoverflow. Almost any problem you may encounter in your code has already been asked about and answered on this website. In this project, I will gather information about the most recent questions under specific tags on the site and summarize it in a table.

For each question, the details include:

  • What is the question headline?
  • Who asked and answered the question, and what are their reputations (if available)?
  • When was the question asked and answered (if available)?
  • How popular is that question? (views, number of answers, number of votes for the question and top answer)
  • What tags are associated with it?
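To make these details concrete, here is a minimal sketch of how one question could be held in memory. The class name `QuestionRecord` and all field names are hypothetical choices made for illustration, not something taken from the project itself. Fields marked "if available" above default to `None`, since a question may have no answer yet or a user's details may be missing.

```python
from dataclasses import asdict, dataclass, fields
from typing import List, Optional


@dataclass
class QuestionRecord:
    """One scraped question (hypothetical layout; names are illustrative)."""

    # Details that every question is expected to have.
    question: str
    link: str
    tags: List[str]
    time_asked: str
    views: int
    answers: int
    votes: int
    has_accepted_answer: bool
    # Details that may be unavailable, e.g. when nobody has answered yet.
    asker_name: Optional[str] = None
    asker_reputation: Optional[int] = None
    top_answerer_name: Optional[str] = None
    top_answerer_reputation: Optional[int] = None
    top_answer_votes: Optional[int] = None
    top_answer_time: Optional[str] = None


# Made-up values, used only to show the shape of a record.
example = QuestionRecord(
    question="Example headline",
    link="https://stackoverflow.com/questions/0",
    tags=["bioinformatics"],
    time_asked="2021-01-01 00:00:00",
    views=10,
    answers=0,
    votes=1,
    has_accepted_answer=False,
)
row = asdict(example)  # a plain dict, convenient for building a table later
column_names = [f.name for f in fields(QuestionRecord)]
```

Keeping the optional fields as `None` rather than empty strings makes it easy to tell "not available" apart from a genuinely empty value when the table is built.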

Outline

  1. Extract raw information from Stackoverflow question pages.
  • Inspect the structure of the webpage and analyze the components of its URL.
  • Define functions: retrieve all data shown on a single page of the website.
  • Checkpoint: scrape the data for the 50 most recent bioinformatics questions on Stackoverflow.
  2. Draw out the fields of interest from the raw data.
  • Determine the patterns associated with each of the 14 fields of interest: asker's name, asker's reputation, question, tags, time asked, # views, # answers, # votes for the question, has accepted answer, top answerer's name, top answerer's reputation, # votes for the top answer, time of the top answer, link to the question (example below).
  • Define functions: fetch the relevant information from each chunk of raw data and correct its format, if necessary.
  • Checkpoint: parse the 50 most recent questions (scraped in step 1) and pull out the 14 fields.
  3. Automate the process for multiple pages and organize the results into a table.
  • Define functions: loop over multiple pages of the website to extract and filter data, then save all results in a CSV file.
  • Checkpoint: create a CSV file containing the 100 most recent bioinformatics questions on Stackoverflow.
  4. Conclusion
  • Apply my functions to a different topic.
  • Summarize what I have accomplished in this project.
  • Suggest ideas for future work.
  5. References
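Before going into the details, the three main steps of this outline can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not the code used later in the project: the function names (`page_urls`, `parse_count`, `write_csv`) are hypothetical, the URL layout with `tab`, `page`, and `pagesize` parameters is an assumption about how the tagged-question listing is addressed, and the "1.2k"-style counts are an assumed display format. No network request is made here.

```python
import csv
import io
import math
from urllib.parse import quote, urlencode

# Assumed base address of the tagged-question listing.
BASE_URL = "https://stackoverflow.com/questions/tagged/"


def page_urls(tag, n_questions, page_size=50):
    """Step 1: list the page URLs needed to cover n_questions recent questions."""
    n_pages = math.ceil(n_questions / page_size)
    urls = []
    for page in range(1, n_pages + 1):
        query = urlencode({"tab": "newest", "page": page, "pagesize": page_size})
        urls.append(f"{BASE_URL}{quote(tag)}?{query}")
    return urls


def parse_count(text):
    """Step 2: convert a displayed count such as '1,204' or '1.2k' to an int."""
    cleaned = text.strip().lower().replace(",", "")
    multiplier = 1
    if cleaned.endswith("k"):
        multiplier, cleaned = 1_000, cleaned[:-1]
    elif cleaned.endswith("m"):
        multiplier, cleaned = 1_000_000, cleaned[:-1]
    return int(round(float(cleaned) * multiplier))


def write_csv(rows, out):
    """Step 3: write a list of dicts (one per question) as CSV to a file object."""
    if not rows:
        return
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# Usage with made-up rows; a real run would fill them from the scraped pages.
urls = page_urls("bioinformatics", n_questions=100)
rows = [
    {"question": "Example headline", "views": parse_count("1.2k")},
    {"question": "Another headline", "views": parse_count("87")},
]
buffer = io.StringIO()
write_csv(rows, buffer)
```

Splitting the work this way keeps the download, the field clean-up, and the file output independent, so each checkpoint in the outline can be tested on its own.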

Here is a glimpse of what the final output will look like.

ptuan5
Tuan Nguyen, 6 months ago