
Data Cleaning


This notebook goes through a necessary step of any data science project: data cleaning. Data cleaning is time-consuming and often tedious, yet it is a very important task. Keep in mind the adage "garbage in, garbage out": feeding dirty data into a model will give us meaningless results.

Specifically, we'll be walking through:

  1. **Getting the data** - in this case, we'll be scraping data from a website
  2. **Cleaning the data** - we will walk through popular text pre-processing techniques
  3. **Organizing the data** - we will organize the cleaned data in a way that is easy to input into other algorithms
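
To make the cleaning step in the list above more concrete, here is a minimal sketch of some common text pre-processing operations. The function name `clean_text`, the sample sentence, and the particular choice of steps (lowercasing, dropping bracketed notes, punctuation, and tokens containing digits) are assumptions for illustration, not necessarily the exact steps used later in this notebook.

```python
import re
import string

def clean_text(text):
    """Hypothetical helper: apply a few common text pre-processing steps."""
    text = text.lower()                                               # normalize case
    text = re.sub(r"\[.*?\]", " ", text)                              # drop bracketed notes
    text = re.sub("[%s]" % re.escape(string.punctuation), " ", text)  # drop ASCII punctuation
    text = re.sub(r"\w*\d\w*", " ", text)                             # drop tokens containing digits
    return re.sub(r"\s+", " ", text).strip()                          # collapse whitespace

# Made-up sample sentence, used only to show the effect of each step
sample = "The 18 books [note] of the Epic, retold!"
print(clean_text(sample))  # the books of the epic retold
```

Note that `string.punctuation` only covers ASCII punctuation, so accented letters such as those in "Mahābhārata" are left untouched; which steps are appropriate depends on the text and the downstream analysis.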

The output of this notebook will be clean, organized data in two standard text formats:

  1. Corpus - a collection of text
  2. Document-Term Matrix - word counts in matrix format
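
As a rough illustration of these two formats, the sketch below builds a toy corpus and its document-term matrix using only the Python standard library. The document names (`book_a`, `book_b`) and their texts are invented for the example, and a real project would more likely rely on a dedicated library than on hand-rolled counting.

```python
from collections import Counter

# Hypothetical toy corpus: document name -> already-cleaned text
corpus = {
    "book_a": "the king spoke and the sage listened",
    "book_b": "the sage spoke",
}

# Count the words in each document
counts = {name: Counter(text.split()) for name, text in corpus.items()}

# The vocabulary (matrix columns) is every distinct word, in sorted order
vocabulary = sorted(set().union(*counts.values()))

# Document-term matrix: one row per document, one column per vocabulary word
dtm = {name: [c[term] for term in vocabulary] for name, c in counts.items()}

print(vocabulary)  # ['and', 'king', 'listened', 'sage', 'spoke', 'the']
for name, row in dtm.items():
    print(name, row)
# book_a [1, 1, 1, 1, 1, 2]
# book_b [0, 0, 0, 1, 1, 1]
```

Each row is a document and each column is a word, so a word that never appears in a document simply gets a count of 0.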

Problem Statement

As a reminder, our goal is to look at all eighteen books of the Mahābhārata and note their similarities and differences. Specifically, we'd like to know whether the central kernel differs from the more peripheral sections, since the kernel is the part that got us interested in studying the Mahābhārata.